EU AI Act & Copyright: A Practical Guide for Developers
You might think copyright is just a lawyer’s headache — but if you’re training or releasing AI models in the EU, it’s now your problem too. The EU AI Act (Regulation (EU) 2024/1689) makes copyright compliance a hard requirement for anyone placing general-purpose AI (GPAI) models on the EU market.
This isn’t just for OpenAI or Anthropic. If you’re fine-tuning, hosting, or even just releasing a model on Hugging Face that qualifies as a GPAI model, you need to care.
The Rules:
The Act says:
Article 53(1)(c):
Providers “shall put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with a reservation of rights expressed by rightsholders pursuant to Article 4(3) of Directive (EU) 2019/790.”
Article 53(1)(d):
Providers must “draw up and make publicly available a sufficiently detailed summary about the content used for training … according to a template provided by the AI Office.”
DSM Directive 2019/790, Article 4(3):
The text and data mining exception does not apply where the use of works has been “expressly reserved by their rightholders in an appropriate manner, such as machine-readable means.”
Translation:
You need a copyright policy.
You need to publish a summary of your training data.
If a website says “don’t scrape me” (robots.txt, metadata, machine-readable flags), you legally can’t use it.
What This Means in Practice
- Lawful Access Only
OK: using open datasets (e.g. Common Crawl subsets filtered for copyright, Creative Commons datasets, datasets with explicit licenses).
Not OK: bypassing paywalls, scraping journals, or ripping YouTube videos.
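One way to enforce “lawful access only” in a pipeline is an explicit license allowlist: anything without a known permissive license is excluded by default. A minimal sketch; the record fields ("url", "license") are hypothetical and should be adapted to your dataset's actual metadata schema:

```python
# Sketch: keep only records whose license is on an explicit allowlist.
# Unknown or missing licenses are excluded by default (fail closed).

ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "MIT"}

def filter_by_license(records):
    """Yield only records carrying an explicitly permitted license tag."""
    for rec in records:
        if rec.get("license") in ALLOWED_LICENSES:
            yield rec

corpus = [
    {"url": "https://example.org/a", "license": "CC-BY-4.0"},
    {"url": "https://example.org/b", "license": None},  # unknown -> excluded
]
kept = list(filter_by_license(corpus))
```

Failing closed matters here: a record with no license metadata is treated as off-limits, not as fair game.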
- Respect Robots.txt and Rights Metadata
The DSM Directive gives rightsholders the ability to “opt-out” of text/data mining. That means:
If a site sets robots.txt: Disallow: /, you can’t just ignore it.
If rights-reservation metadata appears (for example, a machine-readable no-AI tag), treat it as a hard stop.
Think of it like GDPR’s “do not process” — except for training data.
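Honoring robots.txt can be wired into the crawler itself rather than left to policy documents. A minimal sketch using Python's standard-library parser (the user-agent string is a placeholder; a production crawler would also check per-page rights-reservation metadata):

```python
# Sketch: check robots.txt before fetching a URL for training data.
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt: str, url: str, agent: str = "MyTrainingBot") -> bool:
    """Return True only if robots.txt permits this agent to fetch the URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

robots = """\
User-agent: *
Disallow: /
"""
# A blanket Disallow means the site has opted out entirely.
print(may_fetch(robots, "https://example.com/article"))  # False
```

In a real pipeline you would fetch each site's robots.txt once, cache it, and run every candidate URL through a check like this before it ever enters the corpus.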
- Dataset Tracking & Documentation
The AI Act requires you to publish a summary of your training data. Not the full dataset, but enough for regulators (and possibly rightsholders) to understand what you used.
Practical approach:
Keep an inventory of sources (URLs, datasets, repositories).
Version control your dataset pipelines (git + DVC works well).
Note any filtering you applied (e.g. removing copyrighted works, excluding opt-outs).
Example: Instead of “we trained on the internet,” you’d say:
“Training data includes LAION-5B (filtered for opt-outs), Wikipedia (CC-BY-SA), and CC-licensed Flickr images.”
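That kind of summary is easiest to produce if the inventory is machine-readable from the start. A minimal sketch; the field names are my own assumptions, not the AI Office template, so adapt them once the template is published:

```python
# Sketch: a minimal machine-readable inventory of training sources,
# serialized to JSON so it can feed a published summary later.
import json
from dataclasses import dataclass, asdict

@dataclass
class Source:
    name: str
    url: str
    license: str
    filtering: str  # e.g. "removed opt-outs", "dedup + NSFW filter"

inventory = [
    Source("Wikipedia dump 2024-06", "https://dumps.wikimedia.org",
           "CC-BY-SA-4.0", "removed non-article namespaces"),
    Source("LAION-5B subset", "https://laion.ai",
           "varies (URL list)", "filtered for rights reservations"),
]

summary = json.dumps([asdict(s) for s in inventory], indent=2)
print(summary)
```

Keeping this file under version control alongside the dataset pipeline gives you exactly the audit trail the Act expects: what you used, under what license, and what you filtered out.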
- Contractual Safety Net for Deployers
If you’re using third-party models, ask your model provider whether they have a copyright policy.
Demand warranties in contracts that outputs won’t infringe third-party rights.
Have fallback processes (e.g. human review of AI-generated ads before publishing).
Example Scenarios:
You scrape Medium.com for text: Not allowed if Medium’s TOS or robots.txt blocks it.
You use Creative Commons music samples: Allowed if license permits, but you must check attribution rules.
Here’s a quick checklist for a Copyright Policy (Article 53(1)(c)):
[ ] Lawful access only – no DRM breaking, no paywall scraping.
[ ] Respect rights reservations – honor robots.txt, no-AI flags, metadata.
[ ] Dataset inventory – document sources, licenses, filtering steps.
[ ] Publish summary – follow the AI Office template once available.
[ ] Exclude bad actors – don’t use blacklisted or infringing sources.
[ ] Assign responsibility – who on your team “owns” copyright compliance?
Bottom Line
The EU AI Act has turned copyright compliance into a dev requirement, not just a legal clause. If you’re training or releasing models, treat your copyright policy the same way you treat security or privacy by design: baked into your workflow.