Training Data Provenance

Track and document the origin, composition, licensing, and preprocessing history of data used to train or fine-tune AI models.

Objective

Enable accountability for model behavior and compliance with data licensing, copyright, and privacy obligations by maintaining a documented data lineage.

Maturity Levels

Initial

Training data provenance is unknown; datasets are used without documentation of their source or composition.

Developing

Some datasets are documented informally; licensing is checked for major sources but not systematically.

Defined

A data lineage record documents every dataset used in training: source, license, preprocessing steps, known biases, and date of acquisition.

Managed

Provenance records are maintained throughout the model lifecycle and updated when training data changes.

Optimizing

Automated data lineage tracking captures provenance in real time; compliance checks run automatically at dataset ingestion.
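
As one illustration of the Optimizing level, a compliance check at ingestion could be a small gate that rejects any dataset whose provenance record is incomplete or whose license is not approved for the intended use. The sketch below is a minimal example under stated assumptions: `ProvenanceRecord`, `ALLOWED_LICENSES`, and `check_ingestion` are hypothetical names, and the license-to-use mapping is a placeholder that the organization's legal function would define, not legal guidance.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy table: which license labels are approved for which use.
# Placeholder values for illustration only; the real mapping is a legal decision.
ALLOWED_LICENSES = {
    "commercial": {"internal-only", "cc-by-4.0", "mit"},
    "research": {"internal-only", "cc-by-4.0", "cc-by-nc-4.0", "mit"},
}


@dataclass
class ProvenanceRecord:
    dataset_id: str
    source: Optional[str]
    license: Optional[str]
    acquired_on: Optional[str]  # ISO date string, e.g. "2026-02-01"


def check_ingestion(record: ProvenanceRecord, intended_use: str) -> list[str]:
    """Return a list of violations; an empty list means the dataset may be ingested."""
    violations = []
    if not record.source:
        violations.append("missing source")
    if not record.acquired_on:
        violations.append("missing acquisition date")
    if not record.license:
        violations.append("missing license")
    elif record.license.lower() not in ALLOWED_LICENSES.get(intended_use, set()):
        violations.append(
            f"license '{record.license}' not approved for {intended_use} use"
        )
    return violations
```

Returning the full list of violations, rather than failing on the first one, lets the ingestion log show everything that must be fixed before the dataset is resubmitted.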

Evidence Requirements

What an auditor or assessor would expect to see for this control.

- Data lineage documentation tracing each training dataset to its source, acquisition date, and applicable license or legal basis
- Consent or legal basis records for each data source used in model training
- Results of data quality assessments completed before training datasets were approved for use
- Version control records linking each model version to the specific dataset version used in training
- Records of data sources removed or modified in response to rights requests, takedown notices, or compliance requirements
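
One way to produce the version-linking evidence listed above is to fingerprint the dataset content and store that fingerprint next to the model version. The following is a minimal sketch, not a prescribed method: `dataset_fingerprint`, `lineage_entry`, and the JSON field names are hypothetical, and the dataset is assumed to fit in memory as a list of strings.

```python
import hashlib
import json


def dataset_fingerprint(records: list[str]) -> str:
    """Hash the dataset content so any change yields a different fingerprint."""
    digest = hashlib.sha256()
    for record in records:
        data = record.encode("utf-8")
        # Length-prefix each record so ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def lineage_entry(model_version: str, dataset_id: str, records: list[str]) -> str:
    """Build a JSON line linking a model version to the exact dataset content."""
    entry = {
        "model_version": model_version,
        "dataset_id": dataset_id,
        "dataset_sha256": dataset_fingerprint(records),
        "record_count": len(records),
    }
    return json.dumps(entry, sort_keys=True)
```

The fingerprint is order-sensitive, so a reshuffled dataset counts as a new version. If record order is not meaningful, sorting the records before hashing would be a reasonable variation.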

Implementation Notes

Key steps

Create a dataset card for every training dataset: source, license type, applicable use restrictions, known biases, and the processing applied before use.
Check data licenses before use, not after — open-source and public web data often carry restrictions on commercial use or derivative works.
Document the absence of certain data categories (e.g. 'no personal data from EU residents') as carefully as the presence of data, for audit purposes.
For fine-tuned models, document both the base model provenance and the fine-tuning dataset.
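
The key steps above can be made checkable by storing each dataset card as structured data and rejecting cards with empty fields. The sketch below is one possible shape, assuming a Python-based pipeline. The `DatasetCard` class, its field names, and `missing_fields` are hypothetical and would need to match the organization's own card template.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetCard:
    dataset_id: str
    source: str
    license: str
    use_restrictions: str
    known_biases: list[str] = field(default_factory=list)
    preprocessing_steps: list[str] = field(default_factory=list)
    # Categories deliberately absent from the data, recorded for audit purposes.
    excluded_categories: list[str] = field(default_factory=list)
    # For fine-tuned models: identifier of the base model's own provenance record.
    base_model_provenance: str = ""


def missing_fields(card: DatasetCard, fine_tuned: bool = False) -> list[str]:
    """List the fields left empty, so an incomplete card can be rejected at review."""
    optional = set() if fine_tuned else {"base_model_provenance"}
    return [
        f.name
        for f in fields(card)
        if f.name not in optional and not getattr(card, f.name)
    ]
```

Because empty lists count as missing, a team that has found no biases or exclusions would record an explicit entry such as "none identified" rather than leaving the field blank, which keeps the absence documented.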

Example Implementation

AI team fine-tuning an LLM on internal customer support transcripts

Dataset Card — Customer Support Fine-Tuning Dataset v2

Field	Value
Dataset ID	cs-finetune-v2
Source	Internal Zendesk ticket exports (2023–2025)
Volume	48,000 ticket/resolution pairs
License	Internal use only — no third-party data included
Personal data	Contains customer names and email addresses
PII handling	Names replaced with [CUSTOMER]; emails with [EMAIL] via automated scrubbing (Presidio v2.2)
Scrubbing verified	Manual spot-check of 200 records: 0 residual PII found
Known biases	English-language only; enterprise B2B skew; limited representation of SMB customers
Excluded categories	Escalations involving legal disputes (removed); conversations flagged as sensitive
DPO sign-off	Approved for fine-tuning use 2026-02-14 — M. Santos, DPO
Preprocessing steps	Deduplication, length filtering (< 50 tokens removed), scrubbing
Date of acquisition	2026-02-01
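
The preprocessing steps named in the card (deduplication, length filtering, scrubbing) could be recorded with per-step counts so the card's figures are reproducible. The sketch below is illustrative only and differs from the card in two labeled ways. It substitutes a simple regular expression for Presidio and scrubs emails only, not names. It also approximates tokens by whitespace-separated words, because the card does not say which tokenizer was used. The `preprocess` function and the `stats` keys are hypothetical.

```python
import re

# Simplified email pattern for illustration; not a substitute for a PII tool.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MIN_TOKENS = 50  # records shorter than this are dropped, per the card


def preprocess(
    records: list[str], min_tokens: int = MIN_TOKENS
) -> tuple[list[str], dict[str, int]]:
    """Deduplicate, drop short records, scrub emails; return data plus step counts."""
    seen = set()
    kept = []
    stats = {"input": len(records), "duplicates": 0, "too_short": 0, "emails_scrubbed": 0}
    for text in records:
        if text in seen:
            stats["duplicates"] += 1
            continue
        seen.add(text)
        # Assumption: tokens are approximated by whitespace-separated words.
        if len(text.split()) < min_tokens:
            stats["too_short"] += 1
            continue
        scrubbed, n = EMAIL_RE.subn("[EMAIL]", text)
        stats["emails_scrubbed"] += n
        kept.append(scrubbed)
    stats["output"] = len(kept)
    return kept, stats
```

Storing the returned counts alongside the card gives an assessor a concrete trail from the raw export to the final volume figure.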

Control Details

Control ID: DGC-001
Domain: Data Governance
Typical owner: AI Engineering / Legal / Privacy
Implementation effort: High
Agent-relevant: No

