EU AI Act & Copyright: A Practical Guide for Developers

Community Article Published August 16, 2025

You might think copyright is just a lawyer’s headache — but if you’re training or releasing AI models in the EU, it’s now your problem too. The EU AI Act (Regulation (EU) 2024/1689) makes copyright compliance a hard requirement for anyone placing general-purpose AI (GPAI) models on the EU market.

This isn’t just for OpenAI or Anthropic. If you’re fine-tuning, hosting, or even just releasing a model on Hugging Face that qualifies as a GPAI model, you need to care.

The Rules:

The Act says:

Article 53(1)(c):

Providers “shall put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with a reservation of rights expressed by rightsholders pursuant to Article 4(3) of Directive (EU) 2019/790.”

Article 53(1)(d):

Providers must “draw up and make publicly available a sufficiently detailed summary about the content used for training … according to a template provided by the AI Office.”

DSM Directive 2019/790, Article 4(3):

The text and data mining exception applies only “on condition that the use of works … has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means.”

Translation:

You need a copyright policy.

You need to publish a summary of your training data.

If a website says “don’t scrape me” (robots.txt, metadata, machine-readable flags), you legally can’t use it.

What This Means in Practice

  1. Lawful Access Only

OK: using open datasets (e.g. Common Crawl subsets filtered for copyright, Creative Commons datasets, datasets with explicit licenses).

Not OK: bypassing paywalls, scraping journals, or ripping YouTube videos.
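To make “lawful access only” concrete, here is a minimal sketch of filtering a corpus down to records that carry an explicit, allow-listed license. The record fields (text, license, source_url) and the allowlist itself are illustrative assumptions, not the schema of any particular dataset:

```python
# Minimal sketch: keep only records whose license is on an explicit allowlist.
# The record schema ("text", "license", "source_url") is a hypothetical example,
# not a field layout defined by any particular dataset.

ALLOWED_LICENSES = {
    "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "MIT", "Apache-2.0",
}

def keep_record(record: dict) -> bool:
    """Return True only if the record carries an explicit, allow-listed license."""
    return record.get("license") in ALLOWED_LICENSES

corpus = [
    {"text": "sample A", "license": "CC-BY-4.0", "source_url": "https://example.org/a"},
    {"text": "sample B", "license": None, "source_url": "https://example.org/b"},  # unknown license -> dropped
]

filtered = [r for r in corpus if keep_record(r)]
print(f"kept {len(filtered)} of {len(corpus)} records")
```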

  2. Respect Robots.txt and Rights Metadata

The DSM Directive gives rightsholders the ability to opt out of text and data mining. That means:

If a site sets robots.txt: Disallow: /, you can’t just ignore it.

If rights-reservation metadata appears in a page (for example a no-AI meta tag or a TDM reservation signal), treat it as a hard stop; a rough check for both signals is sketched after this list.

Think of it like GDPR’s “do not process” — except for training data.
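Here is a rough sketch of honoring both signals in a crawler, using only the Python standard library. The directive names it looks for (noai, noimageai, tdm-reservation) are examples of machine-readable reservations some sites use, not an authoritative list:

```python
# Minimal sketch of an opt-out check before fetching a page for training data.
# The directive names below are illustrative; the authoritative signals depend
# on AI Office guidance and on each site's own rights-reservation mechanism.

import urllib.robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

USER_AGENT = "my-training-crawler"  # hypothetical crawler name

def robots_allows(url: str) -> bool:
    """Check the site's robots.txt for our user agent before fetching."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(root, "/robots.txt"))
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def meta_reserves_rights(html: str) -> bool:
    """Very rough check for machine-readable opt-out directives in the page."""
    lowered = html.lower()
    return any(flag in lowered for flag in ("noai", "noimageai", "tdm-reservation"))

def fetch_if_permitted(url: str) -> str | None:
    if not robots_allows(url):
        return None  # hard stop: robots.txt disallows our crawler
    html = urlopen(url).read().decode("utf-8", errors="replace")
    if meta_reserves_rights(html):
        return None  # hard stop: rights reservation found in the page itself
    return html
```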

  3. Dataset Tracking & Documentation

The AI Act requires you to publish a summary of your training data. Not the full dataset, but enough for regulators (and possibly rightsholders) to understand what you used.

Practical approach:

Keep an inventory of sources (URLs, datasets, repositories).

Version control your dataset pipelines (git + DVC works well).

Note any filtering you applied (e.g. removing copyrighted works, excluding opt-outs).

Example: Instead of “we trained on the internet,” you’d say:

“Training data includes LAION-5B (filtered for opt-outs), Wikipedia (CC-BY-SA), and CC-licensed Flickr images.”
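One way to back up a statement like that is a machine-readable source inventory maintained alongside the pipeline. A minimal sketch with illustrative field names (the AI Office template, once published, defines the actual required format):

```python
# Minimal sketch of a machine-readable source inventory that can back the
# public training-data summary. Field names are illustrative, not the AI
# Office template.

import json
from dataclasses import dataclass, asdict, field

@dataclass
class DataSource:
    name: str
    url: str
    license: str
    filtering: list[str] = field(default_factory=list)

inventory = [
    DataSource(
        name="Wikipedia (EN dump)",
        url="https://dumps.wikimedia.org/",
        license="CC-BY-SA-4.0",
        filtering=["removed non-article namespaces"],
    ),
    DataSource(
        name="LAION-5B subset",
        url="https://laion.ai/",
        license="varies per image; URLs and metadata only",
        filtering=["dropped samples matching opt-out lists"],
    ),
]

with open("training_data_inventory.json", "w") as f:
    json.dump([asdict(s) for s in inventory], f, indent=2)
```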

  4. Contractual Safety Net for Deployers

If you’re using third-party models, ask your model provider whether they have a copyright policy.

Demand warranties in contracts that outputs won’t infringe third-party rights.

Have fallback processes (e.g. human review of AI-generated ads before publishing).

Example Scenarios:

You scrape Medium.com for text: Not allowed if Medium’s TOS or robots.txt blocks it.

You use Creative Commons music samples: Allowed if license permits, but you must check attribution rules.

Here’s a quick checklist I drafted for a Copyright Policy (Article 53(1)(c)); a sketch of enforcing parts of it in code follows the list.

[ ] Lawful access only – no DRM breaking, no paywall scraping.

[ ] Respect rights reservations – honor robots.txt, no-AI flags, metadata.

[ ] Dataset inventory – document sources, licenses, filtering steps.

[ ] Publish summary – follow AI Office template once available.

[ ] Exclude bad actors – don’t use blacklisted or infringing sources.

[ ] Assign responsibility – who on your team “owns” copyright compliance?
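To make the checklist operational rather than purely documentary, parts of it can run as a gate in the ingestion pipeline. A minimal sketch; the field names and rules are hypothetical, not anything prescribed by the Act:

```python
# Minimal sketch of turning the checklist into a pipeline gate: every source
# must declare lawful access, a license, an opt-out check, and a named owner
# before ingestion. Field names and rules are illustrative assumptions.

REQUIRED_FIELDS = ("name", "url", "license", "lawful_access", "opt_out_respected", "owner")

def validate_source(source: dict) -> list[str]:
    """Return a list of policy violations for one candidate data source."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in source]
    if source.get("lawful_access") is not True:
        problems.append("lawful access not confirmed")
    if source.get("opt_out_respected") is not True:
        problems.append("rights reservation / opt-out check not confirmed")
    return problems

candidate = {
    "name": "CC-licensed Flickr images",
    "url": "https://www.flickr.com/creativecommons/",
    "license": "CC-BY-2.0",
    "lawful_access": True,
    "opt_out_respected": True,
    "owner": "data-team@example.com",  # hypothetical compliance owner
}

issues = validate_source(candidate)
print("OK to ingest" if not issues else f"blocked: {issues}")
```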

Bottom Line

The EU AI Act has turned copyright compliance into a dev requirement, not just a legal clause. If you’re training or releasing models, treat your copyright policy the same way you treat security or privacy by design: baked into your workflow.

