AI Governance Institute

Practical Governance for Enterprise AI

SEC-003 · Security · Medium effort · Agent-relevant

Sensitive Data Handling in AI Pipelines

Prevent personally identifiable information, credentials, health data, and other sensitive content from entering AI models, prompts, or logs inappropriately.

Objective

Reduce the risk of data exposure through AI systems by enforcing data classification and handling requirements at every stage of the AI pipeline.

Maturity Levels

1. Initial: No controls prevent sensitive data from entering AI prompts or being stored in logs.

2. Developing: Engineers are aware of sensitive data risks, but handling controls are inconsistent.

3. Defined: Documented data handling rules specify which data categories may be used in AI inputs, with technical enforcement where feasible.

4. Managed: Sensitive data flows through AI pipelines are mapped and reviewed; violations are tracked.

5. Optimizing: Automated data classification and redaction are applied in real time before data reaches AI models.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

  • AI data flow map documenting what data enters prompts, model APIs, logs, and training pipelines at each stage
  • PII detection and redaction configuration with sample test results demonstrating effectiveness
  • Vendor contract or BAA (Business Associate Agreement) confirming data use and retention restrictions with each AI API provider
  • Log scrubbing configuration and sample scrubbed log entries demonstrating sensitive content removal
  • DPO or Privacy Officer approval records for any exceptions permitting sensitive data in AI training or prompts

Implementation Notes

Key steps

  • Map your AI data flows: what data enters prompts, what gets logged, and what is used for fine-tuning? Most organizations underestimate how much PII flows through these channels.
  • Apply automatic PII detection and redaction before data reaches model APIs, especially in retrieval-augmented generation (RAG) pipelines (see the first sketch after this list).
  • Review your vendor contracts: many AI API providers' default terms allow prompts to be used for model improvement, so ensure sensitive data is not inadvertently shared.
  • Implement logging policies that scrub sensitive content from AI logs while preserving enough context for security monitoring (see the second sketch after this list).
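
As a concrete illustration of the redaction step, the sketch below uses Microsoft Presidio (the library named in the example implementation) to strip detected PII before text is handed to a model API. The redact helper, the placeholder token, and the sample sentence are illustrative, not part of the control; the analyzer and anonymizer calls are Presidio's standard API, but entity lists, confidence thresholds, and the underlying NLP model would need tuning for a real pipeline.

```python
# Minimal Presidio-based redaction step applied before any text reaches a model API.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()      # detects PII spans (names, emails, phone numbers, ...)
anonymizer = AnonymizerEngine()  # rewrites the spans the analyzer flagged


def redact(text: str) -> str:
    """Replace detected PII with a placeholder token before prompt construction."""
    findings = analyzer.analyze(text=text, language="en")
    result = anonymizer.anonymize(
        text=text,
        analyzer_results=findings,
        operators={"DEFAULT": OperatorConfig("replace", {"new_value": "[REDACTED]"})},
    )
    return result.text


# In a RAG pipeline this would run on retrieved documents and user input alike.
safe_context = redact("Patient John Smith (john.smith@example.com) reported chest pain on 2024-03-02.")
```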
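
For the logging bullet, one possible approach is a filter on the pipeline's logger that redacts flagged spans while still emitting the record, so monitoring context survives. This is a minimal sketch assuming Python's standard logging module plus Presidio; the PIIScrubbingFilter class and the ai_pipeline logger name are hypothetical, and a production setup would also cover structured log fields, not just the message string.

```python
# Sketch of a log-scrubbing step: a logging.Filter that redacts PII before any handler writes it.
import logging

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig


class PIIScrubbingFilter(logging.Filter):
    """Runs a PII detector over each log record and redacts flagged spans in place."""

    def __init__(self) -> None:
        super().__init__()
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # folds any %-style args into the final string
        findings = self.analyzer.analyze(text=message, language="en")
        if findings:
            scrubbed = self.anonymizer.anonymize(
                text=message,
                analyzer_results=findings,
                operators={"DEFAULT": OperatorConfig("replace", {"new_value": "[REDACTED]"})},
            )
            record.msg = scrubbed.text
            record.args = ()  # args are already folded into msg
        return True  # keep the (now scrubbed) record so security monitoring still sees the event


logger = logging.getLogger("ai_pipeline")
logger.addFilter(PIIScrubbingFilter())
```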

Example Implementation

Healthcare company using a RAG pipeline to assist clinicians with documentation

Sensitive Data Handling Policy — Clinical Documentation RAG Pipeline

Data classification in scope: PHI, patient identifiers, diagnosis codes, treatment records

Pipeline controls:

Stage | Control | Implementation
Document ingestion | PII detection | Automated scan with Microsoft Presidio before indexing
Prompt construction | Field-level redaction | Patient name replaced with [PATIENT] and DOB with [DATE] in prompts sent to the external model API
Model API calls | Vendor data use | BAA in place with model provider; zero-retention API tier selected
Output logging | Log scrubbing | PII detector runs on all logged outputs; flagged content stored in a restricted PHI log store
Fine-tuning | Training data review | PHI explicitly prohibited in fine-tuning datasets; DPO sign-off required before any training run

Exception process: Any case where unredacted PHI must be sent to an external model requires written Privacy Officer approval and is logged separately.
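
To make the prompt-construction row above more concrete, the sketch below shows how Presidio's per-entity operators could map detected names to [PATIENT] and dates to [DATE], matching the placeholders in the policy table. The build_prompt helper and the OPERATORS mapping are illustrative assumptions rather than part of the policy; PERSON and DATE_TIME are Presidio's built-in entity labels, and a real deployment would add clinical-specific recognizers.

```python
# Illustrative prompt construction for the clinical RAG pipeline: names become [PATIENT], dates [DATE].
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Placeholder tokens mirror the policy table; anything else detected falls back to [REDACTED].
OPERATORS = {
    "PERSON": OperatorConfig("replace", {"new_value": "[PATIENT]"}),
    "DATE_TIME": OperatorConfig("replace", {"new_value": "[DATE]"}),
    "DEFAULT": OperatorConfig("replace", {"new_value": "[REDACTED]"}),
}


def build_prompt(clinical_note: str, question: str) -> str:
    """Redact a retrieved clinical note before it is placed into the prompt for the external model API."""
    findings = analyzer.analyze(text=clinical_note, language="en")
    redacted = anonymizer.anonymize(text=clinical_note, analyzer_results=findings, operators=OPERATORS)
    return f"Context:\n{redacted.text}\n\nQuestion: {question}"
```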

Control Details

Control ID: SEC-003
Domain: Security
Typical owner: CISO / Privacy / AI Engineering
Implementation effort: Medium
Agent-relevant: Yes

Tags

data protection · PII · sensitive data · AI pipeline security