Copyright vs. GDPR: The Tension in AI Training and Data Protection

Scott Dooley
7 min read · Mar 11, 2026

Every dataset scraped from the internet to train an AI model contains two things: copyrighted material and personal data. A news article used for training includes the publisher’s copyrighted text, but it also includes the names, opinions, and sometimes photographs of real people.

Satisfying copyright law does not satisfy data protection law. They are separate legal frameworks with separate obligations, and regulators on both sides are tightening the rules. If your organisation develops, deploys, or procures AI systems, you need to understand how both apply — and why compliance with one does not give you a free pass on the other.

The EU Parliament Draws a Line on AI and Copyright

On 28 January 2026, the European Parliament’s Committee on Legal Affairs (JURI) adopted a report titled “Copyright and generative artificial intelligence — opportunities and challenges.” The vote was 17 in favour, 3 against, with 2 abstentions. A plenary vote is scheduled for March 2026.

The report sets out several key positions. First, EU copyright law should apply to all generative AI systems serving the EU market, regardless of where the training takes place. Second, AI developers should be required to disclose which copyrighted works were used in training. Third, if transparency obligations are not met, courts should be able to apply a rebuttable presumption — meaning they may assume copyrighted works were used unless the developer proves otherwise. Fourth, AI-generated content produced without meaningful human authorship should receive no copyright protection.

It is worth noting that this is an own-initiative report — a political signal, not binding legislation. The rapporteur, Axel Voss, included a suggestion for a flat-rate compensation fee in his accompanying Explanatory Statement, but the operative text of the adopted resolution did not include a mandatory compensation mechanism. The direction of travel, however, is clear: the EU wants transparency and accountability from AI developers regarding copyrighted training data.

The UK’s AI Copyright Stalemate

The UK government’s consultation on AI and copyright (December 2024 to February 2025) received over 11,500 responses. The results were striking: 88% of respondents supported mandatory licensing of copyrighted works for AI training, while only 0.5% backed a broad text and data mining (TDM) exception.

On 6 March 2026, the House of Lords Communications and Digital Committee published its report “AI, copyright and the creative industries”. The Committee called for a “licensing-first” approach and urged ministers to rule out a TDM exception entirely. The stakes are considerable: the UK’s creative industries contributed £124 billion in gross value added to the economy in 2023 and employ 2.4 million people.

Adding a deadline to this debate, the Data (Use and Access) Act 2025 requires, under Sections 135-136, an economic impact assessment and report on copyright works in AI development, due before Parliament by 18 March 2026. The government’s official position remains that it is “considering all options,” but the political pressure is overwhelmingly in one direction.

Where Copyright and GDPR Collide

The Double Compliance Problem

When an AI developer scrapes a news article, they need copyright clearance for the text itself — but the personal data within that article triggers a completely separate set of obligations under the GDPR (or UK GDPR).

Copyright’s TDM exception under the EU Directive (Articles 3-4) does not authorise processing the personal data contained within those works. An organisation that has a valid copyright licence, or that qualifies for a TDM exception, still needs to independently address data protection requirements.

In practice, this means any training dataset built from publicly available content faces four overlapping requirements. You need copyright licensing or TDM compliance. You need a GDPR lawful basis — typically legitimate interests. You need to satisfy purpose limitation, given that the original collection of that data almost certainly did not contemplate AI training. And you need to be able to respond to data subject rights requests, including erasure requests, even where personal data is embedded in model weights.

What Regulators Say About GDPR and AI Training

Data protection regulators have been increasingly specific about how existing rules apply to AI training data.

The European Data Protection Board (EDPB) adopted Opinion 28/2024 on 17 December 2024, confirming that legitimate interests may support AI model training — but only if organisations pass a three-step balancing test. A key factor in that test is whether data subjects would reasonably expect their data to be used in this way. For most people whose personal data appears in scraped web content, the answer is almost certainly no.

The French data protection authority (CNIL) has taken a similar position, stating that AI training on publicly available data can be lawful under legitimate interests — but only with appropriate safeguards in place.

In the UK, the ICO announced a formal investigation into Grok/xAI on 3 February 2026. The investigation is examining whether personal data was lawfully processed for AI training, with particular focus on harmful sexualised content, including content involving children. This signals that the ICO is prepared to take enforcement action where AI training data practices fall short of data protection standards.

At the EU level, the Digital Omnibus proposal published in November 2025 would introduce an explicit legitimate interest basis for AI training under a new Article 88c, with an opt-out model. However, critics argue this is unworkable for scraped data — you cannot meaningfully opt out of processing that has already happened. The EDPB and EDPS raised concerns about this approach in their Joint Opinion 2/2026, particularly around proposals that could narrow the scope of what counts as personal data.

What DPOs and Compliance Teams Should Do

The regulatory picture is moving quickly, but the practical steps are clear:

  1. Do not assume a copyright licence covers data protection. They are separate legal obligations. A licence to use copyrighted text does not authorise processing the personal data within it.
  2. Conduct a Legitimate Interest Assessment before using personal data in AI training. The EDPB’s three-step test is your benchmark, and “reasonable expectations of data subjects” is the critical factor.
  3. Document your lawful basis, purpose, and retention decisions for all training data. If you cannot explain why you hold the data and what you are doing with it, you have a compliance gap.
  4. Monitor the EU JURI proposals and the UK’s 18 March 2026 deadline. The rules are actively changing, and what is permissible today may not be tomorrow.
  5. Conduct due diligence on third-party AI models. If you are deploying a model trained by someone else, the EDPB has made clear that downstream controllers can be affected by upstream unlawful processing. Ask your vendors how their training data was sourced.
  6. Prepare for transparency obligations under both frameworks. The EU’s proposed copyright disclosure requirements and GDPR’s existing transparency obligations (Articles 13-14) both point in the same direction: you will need to explain what data went into your AI systems.
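Step 2 above — the Legitimate Interest Assessment — has a clear sequential structure that a compliance team could encode as a gating check. The sketch below mirrors the EDPB's three-step framework (purpose test, necessity test, balancing test); the function and parameter names are my own illustration, not terminology from Opinion 28/2024, and a real LIA would of course be a documented qualitative assessment, not four booleans.

```python
def lia_gate(purpose_is_legitimate: bool,
             processing_is_necessary: bool,
             expectations_met: bool,
             mitigations_in_place: bool) -> tuple[bool, str]:
    """Gate AI-training use of personal data on a three-step LIA.

    Steps follow the EDPB structure: (1) purpose test, (2) necessity
    test, (3) balancing test. Steps are sequential — a failure at any
    step ends the assessment.
    """
    if not purpose_is_legitimate:
        return False, "Step 1 failed: interest pursued is not legitimate"
    if not processing_is_necessary:
        return False, "Step 2 failed: a less intrusive means would suffice"
    # Step 3: balancing. Scraped web data rarely meets data subjects'
    # reasonable expectations, so safeguards (filtering, opt-outs,
    # transparency measures) usually have to carry the balancing test.
    if not (expectations_met or mitigations_in_place):
        return False, "Step 3 failed: balancing test not satisfied"
    return True, "LIA passed — document the assessment and keep it under review"
```

Note where the hard case lands: for most scraped personal data, `expectations_met` is false, which is precisely why the EDPB and CNIL both condition legitimate interests on safeguards.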

The copyright and data protection frameworks for AI training are developing in parallel but not in sync. Copyright regulators are focused on licensing, compensation, and creative rights. Data protection regulators are focused on lawful basis, purpose limitation, and individual rights. Both sets of rules apply to the same training datasets, and compliance with one framework does not satisfy the other.

For organisations working with AI, the practical requirement is straightforward even if the execution is not: address both copyright and data protection from the outset, track the rapidly evolving regulatory positions in both the EU and UK, and build compliance processes that treat these as two distinct — but equally important — obligations.

Author

  • Scott Dooley is a seasoned entrepreneur and data protection expert with over 15 years of experience in the tech industry. As the founder of Measured Collective and Kahunam, Scott has dedicated his career to helping businesses navigate the complex landscape of data privacy and GDPR compliance.

    With a background in marketing and web development, Scott brings a unique perspective to data protection issues, understanding both the technical and business implications of privacy regulations. His expertise spans from cookie compliance to implementing privacy-by-design principles in software development.

    Scott is passionate about demystifying GDPR and making data protection accessible to businesses of all sizes. Through his blog, he shares practical insights, best practices, and the latest developments in data privacy law, helping readers stay informed and compliant in an ever-changing regulatory environment.
