Perspective

Addressing GDPR’s Shortcomings in AI Training Data Transparency with the AI Act

Ameneh Dehshiri / Jul 31, 2025

Ground Up and Spat Out / Janet Turra & Cambridge Diversity Fund / Better Images of AI

In July 2025, just weeks before the EU Artificial Intelligence Act (AI Act) begins to apply to general-purpose AI (GPAI) models, two long-anticipated transparency tools finally pulled back the curtain on how those models are trained. On July 10, the Model Documentation Form (MDF) was published as part of the voluntary GPAI Code of Practice, and on July 24, the European Commission released the updated Public Summary Template (PST), requiring for the first time public disclosure of basic information about training datasets.

Both instruments are grounded in Article 53 of the EU AI Act and share the same overarching purpose: to enhance transparency about the datasets that power GPAI models. But they are designed for different audiences and involve distinct levels of disclosure, with significant implications for data protection rights.

Two tiers of transparency: Regulatory oversight vs. public accountability

The transparency obligations in the EU AI Act concerning the training data of GPAI models serve several purposes. They help providers of AI systems better understand the capabilities and limitations of these models and comply with their regulatory duties; they support the enforcement of copyright and related rights; and they make it easier to exercise and enforce data protection rights, among others. This article focuses on the last of these, examining what the new transparency rules mean for personal data protection.

When it comes to the training data of GPAI models, the EU AI Act divides transparency into two tracks: a deep, confidential stream for regulators, and a shallow, public stream designed to build trust without giving away the full picture.

The first track operates through the Model Documentation Form (MDF), mandated under Article 53(1)(a) and (b) of the AI Act. It compels providers to disclose detailed technical and operational information, including the provenance, source, and composition of training, validation, and testing datasets. This information is submitted to the AI Office (AIO) and national competent authorities (NCAs) (see Article 3(48) of the AI Act) but remains strictly confidential under Article 78 of the AI Act, protected by trade secret provisions. The MDF is designed primarily as a tool for regulatory oversight: it equips authorities to investigate compliance, detect violations, and impose corrective measures, including fines or market restrictions.

The MDF was first introduced on July 10, 2025, as part of the voluntary GPAI Code of Practice, a soft-law instrument meant to encourage early compliance. Its rollout, however, immediately highlighted the tension between regulators and Big Tech. Meta publicly announced it would not sign the code, arguing that it introduced “measures which go far beyond the scope of the AI Act”.

Yet, legally speaking, such opposition may soon become moot. On August 2, 2025, the AI Act’s rules for GPAI models become applicable, making the transparency obligations legally binding. As a transitional measure, companies that sign the code are granted a one-year grace period, until August 2026, before enforcement begins. But as Meta’s refusal illustrates, some providers may choose to test regulators’ resolve, betting on the slow pace of enforcement and the complexity of cross-border oversight rather than embracing early compliance.

The second track, by contrast, is designed for public-facing accountability. The PST, grounded in Article 53(1)(d), obliges GPAI providers to publish a high-level summary of their training data sources, addressed to individuals, civil society organizations, and rightsholders. Recital 107 of the AI Act explicitly requires the summary to be “simple” and “effective,” accessible even to non-expert audiences. Yet the level of disclosure is deliberately limited to protect trade secrets. Providers must indicate the most relevant domain names crawled and the time period of data collection, and offer a general description of the type of content, including its geographic or linguistic profile. What they do not have to share, at least publicly, are URL-level data, user-specific identifiers, or detailed breakdowns of dataset composition.

Consequently, individuals cannot verify with certainty whether their personal data was used. But for the first time, they can make reasonable inferences: if the PST lists x.com (formerly Twitter) as a major source for the period 2021–2023, a Twitter user active during that time could reasonably suspect their posts were processed, even if they cannot prove it directly.

Wasn’t the GDPR alone enough?

The GDPR already imposes transparency obligations, but it was never designed for the scale and complexity of training data used in general-purpose AI models. Its rules address transparency primarily on a case-by-case basis, making them ill-suited for the vast, scraped datasets that power GPAI.

For example, the record of processing activities (see Article 30 of the GDPR) is an internal document maintained by controllers and disclosed to supervisory authorities only upon request. It does not include detailed information about the provenance or composition of datasets, such as that required by the MDF under the AI Act. The PST introduces, for the first time, a public-facing layer of transparency about the sources of data used to train GPAI models, something not provided for under the GDPR.

Another transparency mechanism is the proactive framework in Articles 13 and 14 of the GDPR. These articles require controllers to inform individuals about the source of the data collected about them at the time of processing. Article 14 even requires disclosure of the source of personal data within a month when the data are collected indirectly, such as through scraping or data brokers. However, AI companies may invoke the “disproportionate effort” exemption under Article 14(5)(b) of the GDPR to avoid notifying millions of affected users. The European Data Protection Board (EDPB), in the report on the work of its “ChatGPT Taskforce,” acknowledged that, in the context of personal data scraped from publicly accessible sources, the Article 14(5)(b) exemption “could apply,” as notifying affected individuals would involve disproportionate effort.

The GDPR’s reactive transparency mechanisms are equally limited. Individuals must actively request access to their data, and Article 12(5) of the GDPR allows controllers to refuse or charge for “manifestly unfounded or excessive” access requests. Additionally, under Article 11(1) and Recital 57 of the GDPR, controllers are not obliged to maintain or obtain additional information solely to identify data subjects when identification is no longer necessary for processing, an argument frequently invoked by AI companies processing de‑identified or aggregated training data.

These structural limitations explain why the AI Act’s transparency regime is viewed not as a competing framework but as a complementary one, designed to close some of the gaps left by the GDPR. As the EDPB emphasized in a March 2024 statement discussing the role of data protection authorities within the AI Act framework, “the AI Act and the Union data protection legislation … need to be, in principle, considered (and coherently interpreted) as complementary and mutually reinforcing instruments.”

From transparency to enforcement: Bridging the AI Act and the GDPR

The EU AI Act marks a significant shift by making GPAI providers legally responsible for publishing basic information about their training datasets, creating for the first time a foundation for systemic, public-facing accountability rather than leaving individuals to pursue answers case by case.

Although Article 53 of the AI Act does not establish individual rights or remedies for data subjects, its transparency requirements can indirectly strengthen the enforcement of existing GDPR obligations, especially when it comes to the use of personal data by big AI companies to train their models. Public summaries may serve as evidence, helping individuals justify data subject access requests under Article 15 of the GDPR or support complaints to Data Protection Authorities (DPAs) under Article 77 of the GDPR. They can also enable collective redress: under Article 80 of the GDPR and the Representative Actions Directive, qualified entities may seek injunctions or damages by using AI Act disclosures to demonstrate unlawful data processing.

The AI Act further reinforces enforcement through institutional cooperation. Article 64(7) of the AI Act requires NCAs to share relevant information, including confidential MDF data, with DPAs. Under Article 58(1)(e) of the GDPR, DPAs are empowered to request any documentation necessary for their investigations, allowing them to build on AI Act disclosures to assess compliance. Member States may even designate DPAs as NCAs or Market Surveillance Authorities (MSAs). The EDPB has encouraged such alignment, noting that combining these roles would streamline oversight and strengthen a rights-based supervisory approach.

Taken together, these developments suggest that the AI Act’s transparency regime, though limited, could become a powerful lever for enforcing existing data protection rules against AI companies that rely heavily on large-scale personal data scraping for training their models—if regulators are willing to use it.
