AI Governance Institute

aigovernance.com — Global AI Regulation & Framework Directory


Question 12 of 24

Is our training data compliant with global privacy laws?

Ensuring you have the right to use data for model training, identifying PII in datasets, and navigating data obligations under the GDPR and the EU AI Act.

The right to use data for training is not implied

Collecting data lawfully and using it to train an AI model are two different things. GDPR's purpose limitation principle requires that personal data be collected for specified, explicit, and legitimate purposes and not processed in a manner incompatible with those purposes. If you collected customer data to provide a service, using it to train a model that optimizes a different process may not be compatible with the original purpose without a fresh legal basis.

Before using any dataset for AI training, document the source of the data, the legal basis under which it was collected, whether the subjects were informed of or consented to this use, and whether the intended training use is compatible with that legal basis. This analysis should be completed before training begins, not after.
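The pre-training review described above can be captured as a structured record so the analysis is documented before any training run. The sketch below is illustrative; the field names and the approval rule are assumptions, not language from the GDPR itself.

```python
from dataclasses import dataclass

# Hypothetical pre-training data-use review record. Field names are
# illustrative; adapt them to your organization's ROPA and DPIA formats.
@dataclass
class DataUseReview:
    dataset_name: str
    source: str                    # where the data came from
    legal_basis: str               # e.g. "consent", "contract", "legitimate interest"
    subjects_informed: bool        # were data subjects told about training use?
    compatible_with_purpose: bool  # is training compatible with the original purpose?

    def approved_for_training(self) -> bool:
        # Simplified gate: training proceeds only if subjects were informed
        # and the use is compatible with the original collection purpose.
        return self.subjects_informed and self.compatible_with_purpose

review = DataUseReview(
    dataset_name="crm_export_2024",
    source="CRM service records",
    legal_basis="contract",
    subjects_informed=False,
    compatible_with_purpose=False,
)
print(review.approved_for_training())  # False: a fresh legal basis is needed
```

Keeping the compatibility decision in a reviewable artifact like this makes it auditable later, when a regulator asks why a dataset was used.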

PII identification and minimization

Datasets that appear anonymized often contain re-identifiable information when combined with other data or processed by powerful models. Before using any dataset for training, conduct a PII assessment that covers direct identifiers (names, email addresses, account numbers), quasi-identifiers (combinations of attributes that could identify individuals), and sensitive categories (health, financial, racial or ethnic origin, political opinions).
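A first-pass scan for the three categories above can be automated, though a real assessment needs far broader pattern coverage plus human review. In this sketch the regex patterns, field names, and the account-number format are all assumptions for illustration.

```python
import re

# Illustrative patterns only; production scanning needs much broader coverage.
DIRECT_ID_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "account_number": re.compile(r"\bACCT-\d{6,}\b"),  # assumed internal format
}
QUASI_IDENTIFIERS = {"zip_code", "birth_date", "gender"}
SENSITIVE_FIELDS = {"health_condition", "ethnicity", "political_affiliation"}

def assess_record(record: dict) -> dict:
    """Flag direct identifiers, quasi-identifiers, and sensitive fields."""
    findings = {"direct": [], "quasi": [], "sensitive": []}
    for field, value in record.items():
        for label, pattern in DIRECT_ID_PATTERNS.items():
            if isinstance(value, str) and pattern.search(value):
                findings["direct"].append((field, label))
        if field in QUASI_IDENTIFIERS:
            findings["quasi"].append(field)
        if field in SENSITIVE_FIELDS:
            findings["sensitive"].append(field)
    return findings

sample = {"notes": "contact jane@example.com", "zip_code": "10115",
          "birth_date": "1990-04-02", "health_condition": "asthma"}
print(assess_record(sample))
```

A record with flagged findings should be routed to minimization or excluded from the training set, not silently passed through.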

Apply data minimization before training: remove fields that are not necessary for the model's intended function, aggregate or generalize where precision is not required, and consider differential privacy techniques for datasets with high re-identification risk. The EU AI Act requires that training data for high-risk AI systems be subject to appropriate data governance practices, including examination for biases and relevance.
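The first two minimization steps (dropping unneeded fields, generalizing precision) can be sketched as a simple transform; differential privacy would require a dedicated library and is omitted here. The allowed-field list and generalization rules below are assumptions for a hypothetical churn model.

```python
# Fields assumed necessary for the model's intended function.
NEEDED_FIELDS = {"zip_code", "birth_date", "purchase_total"}

def minimize(record: dict) -> dict:
    """Drop unneeded fields and coarsen quasi-identifiers before training."""
    out = {k: v for k, v in record.items() if k in NEEDED_FIELDS}
    # Generalize: keep only the first 3 digits of the postal code.
    if "zip_code" in out:
        out["zip_code"] = out["zip_code"][:3] + "**"
    # Generalize: reduce full birth date to birth year.
    if "birth_date" in out:
        out["birth_year"] = out.pop("birth_date")[:4]
    return out

minimized = minimize({"name": "Jane Doe", "zip_code": "10115",
                      "birth_date": "1990-04-02", "purchase_total": 42.5})
print(minimized)
```

Generalization trades precision for lower re-identification risk; how far to coarsen depends on the model's actual accuracy requirements.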

Ongoing obligations after training

Privacy obligations do not end when training is complete. Data subject rights, including the right to erasure, apply to personal data used in training. While true machine unlearning remains technically challenging for large models, organizations should document their approach to data subject requests involving training data and be prepared to justify their position to regulators.
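One way to operationalize the documentation described above is a log entry per data subject request that records the disposition and the rationale you would defend to a regulator. The schema and disposition wording below are assumptions, not a regulatory template.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical log entry for an erasure request touching training data.
@dataclass
class ErasureRequestLog:
    request_id: str
    received: date
    datasets_affected: list
    disposition: str   # what was actually done
    rationale: str     # the position you would justify to a regulator

entry = ErasureRequestLog(
    request_id="DSR-0042",
    received=date(2025, 3, 1),
    datasets_affected=["crm_export_2024"],
    disposition="record removed from source dataset; next retrain excludes it",
    rationale="full unlearning infeasible for the deployed model; "
              "removal at source plus scheduled retrain documented",
)
print(entry.request_id, entry.disposition)
```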

Retain records of the datasets used to train each model version, including data provenance, processing steps applied, and the legal basis for use. This documentation is required by the EU AI Act for high-risk systems and is increasingly expected by data protection authorities conducting AI-related inquiries.
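The per-model-version record described above can be kept as a small, serializable structure tied to each training run. The schema here is an assumption for illustration, not an EU AI Act template; map it to your own model registry.

```python
import json
from dataclasses import dataclass, asdict

# Illustrative provenance record linking a model version to its training data.
@dataclass
class TrainingRecord:
    model_version: str
    datasets: list          # dataset identifiers used in this run
    processing_steps: list  # e.g. ["pii_scan", "minimization"]
    legal_basis: str        # documented basis for using the data

record = TrainingRecord(
    model_version="churn-model-2.1",
    datasets=["crm_export_2024"],
    processing_steps=["pii_scan", "minimization"],
    legal_basis="legitimate interest (documented assessment on file)",
)
# Serialize alongside the model artifact for the audit trail.
print(json.dumps(asdict(record), indent=2))
```

Storing this record with every released model version gives you an answer ready when an authority asks what a model was trained on and why that was lawful.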