Sensitive Data Handling in AI Pipelines
Prevent personally identifiable information (PII), credentials, health data, and other sensitive content from inappropriately entering AI models, prompts, or logs.
Objective
Reduce the risk of data exposure through AI systems by enforcing data classification and handling requirements at every stage of the AI pipeline.
Maturity Levels
Initial
No controls prevent sensitive data from entering AI prompts or being stored in logs.
Developing
Engineers are aware of sensitive data risks but handling controls are inconsistent.
Defined
Documented data handling rules specify which data categories may be used in AI inputs, with technical enforcement where feasible.
Managed
Sensitive data flows through AI pipelines are mapped and reviewed; violations are tracked.
Optimizing
Automated data classification and redaction are applied in real time before data reaches AI models.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- AI data flow map documenting what data enters prompts, model APIs, logs, and training pipelines at each stage
- PII detection and redaction configuration with sample test results demonstrating effectiveness
- Vendor contract or BAA confirming data use and retention restrictions with each AI API provider
- Log scrubbing configuration and sample scrubbed log entries demonstrating sensitive content removal
- DPO or Privacy Officer approval records for any exceptions permitting sensitive data in AI training or prompts
Implementation Notes
Key steps
- Map your AI data flows: what data enters prompts, what gets logged, and what is used for fine-tuning. Most organizations underestimate how much PII flows through these channels.
- Apply automatic PII detection and redaction before data reaches model APIs, especially in retrieval-augmented generation (RAG) pipelines.
- Review your vendor contracts: many AI API providers reserve the right to use prompts for model improvement; confirm that sensitive data is not inadvertently being shared, or select a zero-retention tier where available.
- Implement logging policies that scrub sensitive content from AI logs while preserving sufficient context for security monitoring.
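The redaction step above can be sketched as a pre-prompt filter that runs before any text reaches a model API. This is a minimal, regex-based stand-in: a production pipeline would use a dedicated detector such as Microsoft Presidio, and the patterns and placeholder tags here are illustrative assumptions, not a complete detector.

```python
import re

# Illustrative patterns only; a real deployment would use a trained
# PII detector (e.g. Microsoft Presidio) rather than hand-written regexes.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace detected PII with type tags; return (clean_text, findings).

    The findings list feeds monitoring: a spike in detections upstream of
    a model API is itself a signal worth alerting on."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            findings.append(label)
            text = pattern.sub(f"[{label}]", text)
    return text, findings

prompt = "Patient reachable at jane.doe@example.com, SSN 123-45-6789."
clean, found = redact(prompt)
# clean == "Patient reachable at [EMAIL], SSN [SSN]."
# found == ["SSN", "EMAIL"]
```

Replacing detections with type tags (rather than deleting them) keeps prompts and logs readable for debugging while removing the sensitive values themselves.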
Example Implementation
Healthcare company using a RAG pipeline to assist clinicians with documentation
Sensitive Data Handling Policy — Clinical Documentation RAG Pipeline
Data classification in scope: PHI, patient identifiers, diagnosis codes, treatment records
Pipeline controls:
| Stage | Control | Implementation |
|---|---|---|
| Document ingestion | PII detection | Automated scan with Microsoft Presidio before indexing |
| Prompt construction | Field-level redaction | Patient name replaced with [PATIENT], DOB with [DATE] in prompts sent to external model API |
| Model API calls | Vendor data use | BAA in place with model provider; zero-retention API tier selected |
| Output logging | Log scrubbing | PII detector runs on all logged outputs; flagged content stored in restricted PHI log store |
| Fine-tuning | Training data review | PHI explicitly prohibited in fine-tuning datasets; DPO sign-off required before any training run |
Exception process: Any case where unredacted PHI must be sent to an external model requires written Privacy Officer approval and is logged separately.
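The output-logging row of the table above can be sketched as a scrubber that tags PHI in log entries and flags them for routing to the restricted PHI log store. The patterns and the `[PATIENT]`/`[DATE]` tags follow the pipeline controls table; both are simplified illustrative assumptions, not a complete PHI detector.

```python
import re

# Simplified stand-ins for a real PHI detector. The tag names match the
# redaction placeholders in the pipeline controls table; the patterns are
# illustrative assumptions.
PHI_PATTERNS = {
    "PATIENT": re.compile(r"\bpatient\s+[A-Z][a-z]+\s+[A-Z][a-z]+"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
}

def scrub_log_entry(entry: str) -> tuple[str, bool]:
    """Return (scrubbed_entry, flagged).

    The scrubbed text keeps enough context for security monitoring;
    entries where flagged is True should additionally be written,
    unmodified, to the restricted PHI log store."""
    flagged = False
    for tag, pattern in PHI_PATTERNS.items():
        if pattern.search(entry):
            flagged = True
            entry = pattern.sub(f"[{tag}]", entry)
    return entry, flagged

raw = "Generated summary for patient Jane Doe, DOB 01/02/1984."
scrubbed, flagged = scrub_log_entry(raw)
# scrubbed == "Generated summary for [PATIENT], DOB [DATE]."
```

Splitting the output into a scrubbed general log plus a restricted store for flagged originals preserves full audit context without exposing PHI to every engineer with log access.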
Control Details
- Control ID: SEC-003
- Domain: Security
- Typical owner: CISO / Privacy / AI Engineering
- Implementation effort: Medium
- Agent-relevant: Yes
