AI Governance Institute

Practical Governance for Enterprise AI

Data Governance
DGC · Data Governance · DGC-001 · High effort

Training Data Provenance

Track and document the origin, composition, licensing, and preprocessing history of data used to train or fine-tune AI models.

Objective

Enable accountability for model behavior and compliance with data licensing, copyright, and privacy obligations by maintaining a documented data lineage.

Maturity Levels

1. Initial: Training data provenance is unknown; datasets are used without documentation of their source or composition.

2. Developing: Some datasets are documented informally; licensing is checked for major sources but not systematically.

3. Defined: A data lineage record documents every dataset used in training: source, license, preprocessing steps, known biases, and date of acquisition.

4. Managed: Provenance records are maintained throughout the model lifecycle and updated when training data changes.

5. Optimizing: Automated data lineage tracking captures provenance in real time; compliance checks run automatically at dataset ingestion.
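
As an illustration of the level 5 behavior, a compliance check at ingestion could be a small gate that rejects a dataset whose lineage record is incomplete or whose license is not approved. The sketch below assumes the record is a plain dictionary; the field names, the `ingestion_violations` function, and the license allow-list are hypothetical, and the real policy would be set by Legal and Privacy.

```python
# Fields taken from the level 3 lineage record; the key names are illustrative.
REQUIRED_FIELDS = ("source", "license", "preprocessing_steps", "known_biases", "acquired_on")

# Hypothetical allow-list for illustration only; not a recommendation.
APPROVED_LICENSES = {"internal-use-only", "CC0-1.0", "CC-BY-4.0"}


def ingestion_violations(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the dataset may be ingested."""
    violations = []
    for name in REQUIRED_FIELDS:
        # A field counts as missing if it is absent, None, or an empty string.
        if name not in record or record[name] in ("", None):
            violations.append(f"missing field: {name}")
    license_id = record.get("license")
    if license_id and license_id not in APPROVED_LICENSES:
        violations.append(f"license not approved: {license_id}")
    return violations
```

Returning the full list of violations, instead of failing on the first one, lets the ingestion log show everything that has to be fixed before the dataset is resubmitted.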

Evidence Requirements

What an auditor or assessor would expect to see for this control.

  • Data lineage documentation tracing each training dataset to its source, acquisition date, and applicable license or legal basis
  • Consent or legal basis records for each data source used in model training
  • Results of data quality assessments performed before training datasets were approved for use
  • Version control records linking each model version to the specific dataset version used in training
  • Records of data sources removed or modified in response to rights requests, takedown notices, or compliance requirements
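
One possible way to produce the version-linking evidence is to store a content hash of the exact dataset files next to each model version. This is a minimal sketch under stated assumptions: the dataset is a set of local files, and the function names and manifest keys are hypothetical, not part of this control.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(paths):
    """SHA-256 over file names and contents, visited in file-name order."""
    digest = hashlib.sha256()
    for path in sorted((Path(p) for p in paths), key=lambda p: p.name):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")  # separator between the name and the content
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def training_manifest(model_version, dataset_id, dataset_version, paths):
    """Build a record tying one model version to one exact dataset snapshot."""
    return {
        "model_version": model_version,
        "dataset_id": dataset_id,
        "dataset_version": dataset_version,
        "dataset_sha256": dataset_fingerprint(paths),
    }
```

Committing the resulting manifest alongside the training configuration gives an assessor a record they can recompute: if the hash of the stored files no longer matches, the dataset has changed since training.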

Implementation Notes

Key steps

  • Create a dataset card for every training dataset: source, license type, applicable use restrictions, known biases, and the processing applied before use.
  • Check data licenses before use, not after: open-source and public web data often carry restrictions on commercial use or derivative works.
  • Document the absence of certain data categories (e.g. 'no personal data from EU residents') as carefully as the presence of data, for audit purposes.
  • For fine-tuned models, document both the base model provenance and the fine-tuning dataset.
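
The dataset card from the first step can also be kept in machine-readable form so that gaps are caught before a dataset is approved. The following is a sketch only: the `DatasetCard` class and its field names are hypothetical and simply mirror the items listed in the steps above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DatasetCard:
    """Hypothetical machine-readable dataset card; field names are illustrative."""
    dataset_id: str
    source: str
    license_type: str
    use_restrictions: List[str] = field(default_factory=list)
    known_biases: List[str] = field(default_factory=list)
    preprocessing_steps: List[str] = field(default_factory=list)
    # Categories asserted to be absent, e.g. "personal data from EU residents".
    excluded_categories: List[str] = field(default_factory=list)
    # For fine-tuning datasets: a pointer to the base model's own provenance record.
    base_model_ref: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that are empty."""
        required = ("dataset_id", "source", "license_type")
        return [name for name in required if not getattr(self, name).strip()]

    def to_json(self) -> str:
        """Serialize the card with stable key order so diffs stay readable."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

For example, `DatasetCard(dataset_id="cs-finetune-v2", source="Internal Zendesk ticket exports", license_type="").missing_fields()` reports `license_type` as missing, which a review step could treat as a blocker.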

Example Implementation

AI team fine-tuning an LLM on internal customer support transcripts

Dataset Card — Customer Support Fine-Tuning Dataset v2

| Field | Value |
| --- | --- |
| Dataset ID | cs-finetune-v2 |
| Source | Internal Zendesk ticket exports (2023–2025) |
| Volume | 48,000 ticket/resolution pairs |
| License | Internal use only; no third-party data included |
| Personal data | Contains customer names and email addresses |
| PII handling | Names replaced with [CUSTOMER]; emails with [EMAIL] via automated scrubbing (Presidio v2.2) |
| Scrubbing verified | Manual spot-check of 200 records: 0 residual PII found |
| Known biases | English-language only; enterprise B2B skew; limited representation of SMB customers |
| Excluded categories | Escalations involving legal disputes (removed); conversations flagged as sensitive |
| DPO sign-off | Approved for fine-tuning use 2026-02-14; M. Santos, DPO |
| Preprocessing steps | Deduplication, length filtering (< 50 tokens removed), scrubbing |
| Date of acquisition | 2026-02-01 |
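
The preprocessing steps named in the card (deduplication, length filtering, scrubbing) could be chained roughly as follows. This is a simplified sketch, not the team's pipeline: it counts whitespace-separated tokens as an assumption, covers only email addresses with a regular expression as a stand-in for the Presidio-based scrubbing, and does not attempt name replacement, which needs an entity recognizer. All function names are hypothetical.

```python
import random
import re

# Simplified email pattern; a stand-in for the card's automated scrubbing, not Presidio.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MIN_TOKENS = 50  # threshold from the dataset card


def scrub_emails(text: str) -> str:
    """Replace anything that looks like an email address with the [EMAIL] placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def preprocess(records):
    """Deduplicate, drop short records, then scrub emails, in the card's stated order."""
    seen = set()
    kept = []
    for text in records:
        if text in seen:  # exact-match deduplication
            continue
        seen.add(text)
        if len(text.split()) < MIN_TOKENS:  # whitespace tokens: an assumption
            continue
        kept.append(scrub_emails(text))
    return kept


def spot_check_sample(records, sample_size=200, seed=0):
    """Draw a reproducible random sample for the manual spot-check."""
    rng = random.Random(seed)
    return rng.sample(records, min(sample_size, len(records)))
```

Fixing the random seed means the 200 records a reviewer inspected can be regenerated later, which makes the "Scrubbing verified" entry easier to audit.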

Control Details

Control ID: DGC-001
Typical owner: AI Engineering / Legal / Privacy
Implementation effort: High effort
Agent-relevant: No

Tags

data provenance, data lineage, training data, data governance