AI Governance Institute

Practical Governance for Enterprise AI


Question 12 of 34

Is our training data compliant with global privacy laws?


Ensuring you had the right to use data for model training, identifying PII in datasets, and navigating GDPR and EU AI Act data obligations.

If you only do three things, do these:

  1. Before using any dataset for training, document the source, the original collection purpose, and whether training use is compatible with that purpose under GDPR's purpose limitation principle.
  2. Run a PII assessment on every training dataset — datasets that appear anonymized often contain quasi-identifiers that re-identification attacks can exploit.
  3. Retain records of every dataset used to train each model version. You will need this when a data subject makes an erasure request or a regulator conducts an AI inquiry.

The Situation

Who this is for: Data science teams, privacy officers, and legal counsel involved in AI model development

When you need this: Before initiating any AI training run, or when auditing existing models for privacy compliance

The Decision

Do we have a lawful basis to use this data for this training purpose, and what are our obligations to data subjects whose data was used?

The Steps

  1. Inventory every dataset used in model training or fine-tuning, including its source and original collection context
  2. For each dataset, document the legal basis for collection and assess whether training use is compatible with that purpose
  3. Run a PII assessment: identify direct identifiers, quasi-identifiers, and sensitive category data
  4. Apply data minimization: remove unnecessary fields, aggregate where possible, consider differential privacy
  5. Document your approach to data subject rights (erasure, access) for training data, including your position on machine unlearning
  6. Retain provenance records for each training dataset as part of the model's documentation

The Artifacts

  • Training data provenance record template (source, collection basis, training compatibility assessment, PII assessment)
  • PII assessment methodology (direct identifiers, quasi-identifiers, sensitive categories)
  • Data minimization checklist for training datasets
  • Data subject rights response template for training data
  • EU AI Act training data governance checklist (for high-risk systems)

The Output

A documented legal basis for each training dataset, a PII assessment on file, data minimization measures applied, and provenance records retained with the model.

The right to use data for training is not implied

Collecting data lawfully and using it to train an AI model are two different things. GDPR's purpose limitation principle requires that personal data be collected for specified, explicit, and legitimate purposes and not processed in a manner incompatible with those purposes. If you collected customer data to provide a service, using it to train a model that optimizes a different process may not be compatible with the original purpose without a fresh legal basis.

Before using any dataset for AI training, document the source of the data, the legal basis under which it was collected, whether the subjects were informed of or consented to this use, and whether the intended training use is compatible with that legal basis. This analysis should be completed before training begins, not after.
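This documentation can be captured in a structured record. A minimal sketch, assuming Python for a data science team — the field names and the `is_cleared_for_training` rule are illustrative, not a legal standard; your privacy counsel defines what "cleared" actually means:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TrainingDataRecord:
    """Illustrative provenance record for one training dataset."""
    dataset_name: str
    source: str                    # where the data came from
    collection_purpose: str        # purpose stated at collection time
    legal_basis: str               # e.g. "consent", "contract", "legitimate interest"
    subjects_informed: bool        # were data subjects informed of training use?
    training_purpose: str          # what the trained model will be used for
    compatibility_assessment: str  # reasoning: is training compatible with collection purpose?
    assessed_on: date = field(default_factory=date.today)

    def is_cleared_for_training(self) -> bool:
        # Hypothetical gate: cleared only when a compatibility assessment
        # is on file and subjects were informed of the training use.
        return bool(self.compatibility_assessment) and self.subjects_informed
```

Completing a record like this for every dataset, before the training run, is what turns the compatibility analysis from an afterthought into a gate.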

PII identification and minimization

Datasets that appear anonymized often contain re-identifiable information when combined with other data or processed by powerful models. Before using any dataset for training, conduct a PII assessment that covers direct identifiers (names, email addresses, account numbers), quasi-identifiers (combinations of attributes that could identify individuals), and sensitive categories (health, financial, racial or ethnic origin, political opinion).
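The direct-identifier portion of such an assessment can be partially automated. A toy sketch — the two regex patterns below are illustrative only; a real assessment needs a much broader pattern set plus statistical analysis of quasi-identifier combinations:

```python
import re

# Illustrative patterns for two common direct identifiers; a production
# assessment would cover many more (names, account numbers, addresses, ...).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scan_for_pii(records: list[dict]) -> dict[str, set[str]]:
    """Return, per field name, which PII types were detected in any record."""
    findings: dict[str, set[str]] = {}
    for record in records:
        for field_name, value in record.items():
            for pii_type, pattern in PII_PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    findings.setdefault(field_name, set()).add(pii_type)
    return findings
```

A scan like this flags which fields to escalate for human review; it does not replace the quasi-identifier and sensitive-category analysis, which depends on context the regexes cannot see.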

Apply data minimization before training: remove fields that are not necessary for the model's intended function, aggregate or generalize where precision is not required, and consider differential privacy techniques for datasets with high re-identification risk. The EU AI Act requires that training data for high-risk AI systems be subject to appropriate data governance practices, including examination for biases and relevance.
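The first two minimization moves — dropping unneeded fields and generalizing precise values — are mechanical enough to sketch. The allow-list approach and the 10-year age band below are assumptions for illustration; the right granularity depends on your re-identification risk analysis:

```python
def minimize(record: dict, keep: set[str]) -> dict:
    """Drop fields outside the allow-list and coarsen a quasi-identifier."""
    out = {k: v for k, v in record.items() if k in keep}
    # Generalize exact age to a 10-year band (a simple generalization;
    # differential privacy would instead add calibrated noise).
    if "age" in out:
        decade = (out["age"] // 10) * 10
        out["age"] = f"{decade}-{decade + 9}"
    return out
```

Using an allow-list rather than a deny-list means new fields added upstream are excluded by default, which is the safer failure mode for training pipelines.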

Ongoing obligations after training

Privacy obligations do not end when training is complete. Data subject rights, including the right to erasure, apply to personal data used in training. While true machine unlearning remains technically challenging for large models, organizations should document their approach to data subject requests involving training data and be prepared to justify their position to regulators.

Retain records of the datasets used to train each model version, including data provenance, processing steps applied, and the legal basis for use. This documentation is required by the EU AI Act for high-risk systems and is increasingly expected by data protection authorities conducting AI-related inquiries.
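An append-only log keyed by model version is one simple way to retain these records. A minimal sketch, assuming a JSON-lines file as the store — the function names and entry fields are illustrative, and a real system would also capture processing steps and sign or timestamp entries:

```python
import json
from pathlib import Path

def record_provenance(log_path: Path, model_version: str, datasets: list[dict]) -> None:
    """Append a provenance entry for one model version to a JSON-lines log.

    Each dataset dict carries the fields a regulator or data subject
    request may require (e.g. name, source, legal_basis).
    """
    entry = {"model_version": model_version, "datasets": datasets}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def datasets_for(log_path: Path, model_version: str) -> list[dict]:
    """Look up the datasets behind a model version, e.g. for an erasure request."""
    for line in log_path.read_text(encoding="utf-8").splitlines():
        entry = json.loads(line)
        if entry["model_version"] == model_version:
            return entry["datasets"]
    return []
```

The lookup direction matters: when an erasure request or inquiry arrives, the question is "which datasets fed this model version?", so the log must be queryable by model version, not only by dataset.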

Governance Controls

Operational controls that implement the guidance in this playbook.