Training Data Provenance
Track and document the origin, composition, licensing, and preprocessing history of data used to train or fine-tune AI models.
Objective
Enable accountability for model behavior and compliance with data licensing, copyright, and privacy obligations by maintaining a documented data lineage.
Maturity Levels
Initial
Training data provenance is unknown; datasets are used without documentation of their source or composition.
Developing
Some datasets are documented informally; licensing is checked for major sources but not systematically.
Defined
A data lineage record documents every dataset used in training: source, license, preprocessing steps, known biases, and date of acquisition.
Managed
Provenance records are maintained throughout the model lifecycle and updated when training data changes.
Optimizing
Automated data lineage tracking captures provenance in real time; compliance checks run automatically at dataset ingestion.
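As a rough illustration of the Defined and Optimizing levels, the sketch below models a lineage record with the fields listed above and runs a simple check when a dataset is ingested. The class name `LineageRecord`, the function `check_at_ingestion`, and the `APPROVED_LICENSES` allow-list are assumptions made for this example, not part of the control.

```python
from dataclasses import dataclass, fields
from datetime import date

# Hypothetical allow-list; a real one would come from legal and privacy review.
APPROVED_LICENSES = {"internal-use-only", "CC-BY-4.0", "CC0-1.0"}


@dataclass
class LineageRecord:
    """One entry per dataset, mirroring the fields named at the Defined level."""
    dataset_id: str
    source: str
    license: str
    preprocessing_steps: list[str]
    known_biases: list[str]
    acquired_on: date


def check_at_ingestion(record: LineageRecord) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for f in fields(record):
        value = getattr(record, f.name)
        # Empty strings and empty lists count as undocumented. If no biases
        # were found, record that explicitly (e.g. ["none identified"]).
        if value in ("", [], None):
            problems.append(f"missing field: {f.name}")
    if record.license and record.license not in APPROVED_LICENSES:
        problems.append(f"license not on approved list: {record.license}")
    return problems


record = LineageRecord(
    dataset_id="cs-finetune-v2",
    source="Internal Zendesk ticket exports",
    license="internal-use-only",
    preprocessing_steps=["deduplication", "length filtering", "scrubbing"],
    known_biases=["English-language only"],
    acquired_on=date(2024, 1, 15),  # illustrative date
)
print(check_at_ingestion(record))
```

A check like this could gate an ingestion pipeline: the dataset is rejected, or routed for review, whenever the returned list is non-empty.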
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Data lineage documentation tracing each training dataset to its source, acquisition date, and applicable license or legal basis
- Consent or legal basis records for each data source used in model training
- Results of data quality assessments run before training datasets were approved for use
- Version control records linking each model version to the specific dataset version used in training
- Records of data sources removed or modified in response to rights requests, takedown notices, or compliance requirements
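One way to produce the model-to-dataset link an assessor would look for is to fingerprint the dataset contents and store the hash next to the model version. The sketch below is a minimal example under that assumption; the function names, the manifest layout, and the model version string are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def dataset_fingerprint(root: Path) -> str:
    """SHA-256 over each file's relative path and content hash, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        # read_bytes() loads the whole file; very large files would need
        # chunked reads instead.
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


def write_manifest(model_version: str, dataset_id: str, root: Path, out: Path) -> dict:
    """Write a JSON manifest tying a model version to a dataset fingerprint.

    `out` should live outside `root`, otherwise the manifest itself would
    change the fingerprint on the next run.
    """
    manifest = {
        "model_version": model_version,
        "dataset_id": dataset_id,
        "dataset_sha256": dataset_fingerprint(root),
    }
    out.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Committing the manifest alongside the training configuration gives a version-controlled record; if any file in the dataset changes, the fingerprint changes and the mismatch is visible at audit time.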
Implementation Notes
Key steps
- Create a dataset card for every training dataset: source, license type, applicable use restrictions, known biases, and the processing applied before use.
- Check data licenses before use, not after. Open-source and public web data often carry restrictions on commercial use or derivative works.
- Document the absence of certain data categories (e.g. 'no personal data from EU residents') as carefully as the presence of data, for audit purposes.
- For fine-tuned models, document both the base model provenance and the fine-tuning dataset.
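The license step above could be enforced with a small pre-use gate. The sketch below is one possible shape: `LICENSE_POLICY` is a hypothetical table holding a simplified reading of a few Creative Commons licenses, and `license_blockers` is an illustrative helper. Whether a given training use counts as commercial or derivative is a question for counsel, so a real table would be owned by legal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    commercial_use: bool  # may the data be used for commercial purposes?
    derivatives: bool     # may adapted or derived works be shared?


# Hypothetical policy table with simplified entries; extend after legal review.
LICENSE_POLICY = {
    "CC0-1.0": LicenseTerms(commercial_use=True, derivatives=True),
    "CC-BY-4.0": LicenseTerms(commercial_use=True, derivatives=True),
    "CC-BY-NC-4.0": LicenseTerms(commercial_use=False, derivatives=True),
    "CC-BY-ND-4.0": LicenseTerms(commercial_use=True, derivatives=False),
}


def license_blockers(license_id: str, *, commercial: bool, derivative: bool) -> list[str]:
    """Return reasons the intended use is blocked; an empty list means no blocker."""
    terms = LICENSE_POLICY.get(license_id)
    if terms is None:
        # Unknown licenses are blocked by default rather than assumed permissive.
        return [f"{license_id}: not reviewed; block until legal classifies it"]
    blockers = []
    if commercial and not terms.commercial_use:
        blockers.append(f"{license_id}: commercial use not permitted")
    if derivative and not terms.derivatives:
        blockers.append(f"{license_id}: derivative works not permitted")
    return blockers


print(license_blockers("CC-BY-NC-4.0", commercial=True, derivative=True))
```

For a fine-tuned model, the same check would run twice: once on the base model's license and once on the fine-tuning dataset's license.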
Example Implementation
AI team fine-tuning an LLM on internal customer support transcripts
Dataset Card — Customer Support Fine-Tuning Dataset v2
| Field | Value |
|---|---|
| Dataset ID | cs-finetune-v2 |
| Source | Internal Zendesk ticket exports (2023–2025) |
| Volume | 48,000 ticket/resolution pairs |
| License | Internal use only — no third-party data included |
| Personal data | Contains customer names and email addresses |
| PII handling | Names replaced with [CUSTOMER]; emails with [EMAIL] via automated scrubbing (Presidio v2.2) |
| Scrubbing verified | Manual spot-check of 200 records: 0 residual PII found |
| Known biases | English-language only; enterprise B2B skew; limited representation of SMB customers |
| Excluded categories | Escalations involving legal disputes and conversations flagged as sensitive (both removed) |
| DPO sign-off | Approved for fine-tuning use 2026-02-14 — M. Santos, DPO |
| Preprocessing steps | Deduplication, length filtering (records shorter than 50 tokens removed), PII scrubbing |
| Date of acquisition | 2026-02-01 |
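
The preprocessing steps recorded in the dataset card can be sketched as a small pipeline. This is not the team's actual tooling: a regular expression stands in for Presidio and handles email addresses only (name replacement is omitted), whitespace splitting stands in for a real tokenizer, and all function names are hypothetical.

```python
import random
import re

# Simplified email pattern, used here in place of a dedicated PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MIN_TOKENS = 50  # threshold taken from the dataset card


def scrub_emails(text: str) -> str:
    """Replace every email address with the [EMAIL] placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def preprocess(records: list[str], min_tokens: int = MIN_TOKENS) -> list[str]:
    """Deduplicate, drop short records, then scrub, in the order the card lists."""
    seen = set()
    kept = []
    for text in records:
        if text in seen:  # exact-duplicate removal only
            continue
        seen.add(text)
        if len(text.split()) < min_tokens:  # whitespace tokens, an approximation
            continue
        kept.append(scrub_emails(text))
    return kept


def spot_check_sample(records: list[str], k: int = 200, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample for manual review."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def residual_email_count(records: list[str]) -> int:
    """Count records that still contain something that looks like an email."""
    return sum(1 for r in records if EMAIL_RE.search(r))
```

Fixing the random seed makes the spot-check sample reproducible, so a reviewer can re-draw the same 200 records that were inspected before sign-off.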
Control Details
- Control ID: DGC-001
- Domain: Data Governance
- Typical owner: AI Engineering / Legal / Privacy
- Implementation effort: High
- Agent-relevant: No
