Sensitive Data Handling in AI Pipelines
Prevent personally identifiable information (PII), credentials, health data, and other sensitive content from inappropriately entering AI models, prompts, or logs.
Objective
Reduce the risk of data exposure through AI systems by enforcing data classification and handling requirements at every stage of the AI pipeline.
Maturity Levels
Initial
No controls prevent sensitive data from entering AI prompts or being stored in logs.
Developing
Engineers are aware of sensitive data risks but handling controls are inconsistent.
Defined
Documented data handling rules specify which data categories may be used in AI inputs, with technical enforcement where feasible.
Managed
Sensitive data flows through AI pipelines are mapped and reviewed; violations are tracked.
Optimizing
Automated data classification and redaction are applied in real time before data reaches AI models.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- AI data flow map documenting what data enters prompts, model APIs, logs, and training pipelines at each stage
- PII detection and redaction configuration with sample test results demonstrating effectiveness
- Vendor contract or BAA confirming data use and retention restrictions with each AI API provider
- Log scrubbing configuration and sample scrubbed log entries demonstrating sensitive content removal
- DPO or Privacy Officer approval records for any exceptions permitting sensitive data in AI training or prompts
Implementation Notes
Key steps
- Map your AI data flows: what data enters prompts, what gets logged, and what is used for fine-tuning. Most organizations underestimate how much PII flows through these channels.
- Apply automatic PII detection and redaction before data reaches model APIs, especially in retrieval-augmented generation (RAG) pipelines.
- Review your vendor contracts: many AI API providers reserve the right to use prompts for model improvement; confirm that sensitive data is not inadvertently being shared, or select a zero-retention tier where available.
- Implement logging policies that scrub sensitive content from AI logs while preserving sufficient context for security monitoring.
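The redaction step above can be sketched as a pre-prompt filter that runs before any text reaches a model API. This is a minimal, regex-based stand-in: a production pipeline would use a dedicated detector such as Microsoft Presidio, and the patterns and placeholder tags here are illustrative assumptions, not a complete detector.

```python
import re

# Illustrative patterns only; a real deployment would use a trained
# PII detector (e.g. Microsoft Presidio) rather than hand-written regexes.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace detected PII with type tags; return (clean_text, findings).

    The findings list feeds monitoring: a spike in detections upstream of
    a model API is itself a signal worth alerting on."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            findings.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, findings

prompt = "Patient reachable at jane.doe@example.com, SSN 123-45-6789."
clean, found = redact(prompt)
# clean == "Patient reachable at [EMAIL], SSN [SSN]."
# found == ["SSN", "EMAIL"]
```

Replacing detections with type tags (rather than deleting them) keeps prompts and logs readable for debugging while removing the sensitive values themselves.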
Example Implementation
Healthcare company using a RAG pipeline to assist clinicians with documentation
Sensitive Data Handling Policy — Clinical Documentation RAG Pipeline
Data classification in scope: PHI, patient identifiers, diagnosis codes, treatment records
Pipeline controls:
| Stage | Control | Implementation |
|---|---|---|
| Document ingestion | PII detection | Automated scan with Microsoft Presidio before indexing |
| Prompt construction | Field-level redaction | Patient name replaced with [PATIENT], DOB with [DATE] in prompts sent to external model API |
| Model API calls | Vendor data use | BAA in place with model provider; zero-retention API tier selected |
| Output logging | Log scrubbing | PII detector runs on all logged outputs; flagged content stored in restricted PHI log store |
| Fine-tuning | Training data review | PHI explicitly prohibited in fine-tuning datasets; DPO sign-off required before any training run |
Exception process: Any case where unredacted PHI must be sent to an external model requires written Privacy Officer approval and is logged separately.
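The output-logging row of the table above can be sketched as a scrubber that tags PHI in log entries and flags them for routing to the restricted PHI log store. The patterns and the `[PATIENT]`/`[DATE]` tags follow the pipeline controls table; both are simplified illustrative assumptions, not a complete PHI detector.

```python
import re

# Simplified stand-ins for a real PHI detector. The tag names match the
# redaction placeholders in the pipeline controls table; the patterns are
# illustrative assumptions.
PHI_PATTERNS = {
    "PATIENT": re.compile(r"\bpatient\s+[A-Z][a-z]+\s+[A-Z][a-z]+"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
}

def scrub_log_entry(entry: str) -> tuple[str, bool]:
    """Return (scrubbed_entry, flagged).

    The scrubbed text keeps enough context for security monitoring;
    entries where flagged is True should additionally be written,
    unmodified, to the restricted PHI log store."""
    flagged = False
    for tag, pattern in PHI_PATTERNS.items():
        if pattern.search(entry):
            flagged = True
            entry = pattern.sub(f"[{tag}]", entry)
    return entry, flagged

raw = "Generated summary for patient Jane Doe, DOB 01/02/1984."
scrubbed, flagged = scrub_log_entry(raw)
# scrubbed == "Generated summary for [PATIENT], DOB [DATE]."
```

Splitting the output into a scrubbed general log plus a restricted store for flagged originals preserves full audit context without exposing PHI to every engineer with log access.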
Control Details
- Control ID: SEC-003
- Domain: Security
- Typical owner: CISO / Privacy / AI Engineering
- Implementation effort: Medium
- Agent-relevant: Yes
