Generative AI Input Data Classification
Establish a classification policy for data entering generative AI systems as inputs (prompts, context windows, retrieved documents, tool outputs, and conversation history). The policy addresses the privacy, confidentiality, and regulatory risks specific to the generative AI input surface that general data classification policies do not cover.
Objective
Prevent regulated, privileged, or sensitive data from entering generative AI systems without appropriate controls, by extending data classification to the AI input surface and establishing handling requirements for each classification level.
Maturity Levels
Initial
The organization has a general data classification policy, but it does not address data entering generative AI systems as inputs. Employees may include regulated or confidential data in prompts without guidance.
Developing
The acceptable use policy for AI tools includes general guidance (e.g., "do not enter client data into AI systems"), but classification criteria and handling requirements are not precisely defined, and coverage is inconsistent across the AI tools in use.
Defined
An AI input data classification policy defines categories of data, handling requirements for each category when used as AI input, and the AI systems for which each category is permitted. The policy covers prompts, context, retrieved documents, tool call inputs and outputs, and conversation history for all AI systems in the inventory.
Managed
Classification requirements are implemented through a combination of policy, training, and technical controls (e.g., system prompt instructions prohibiting certain data types, data loss prevention (DLP) tools configured to detect regulated data in AI API calls). Exceptions requiring use of restricted data are documented and approved.
Optimizing
Input data classification is reviewed when new AI capabilities (multimodal inputs, agentic tool use, ambient data collection) expand the input surface. Classification is enforced by technical controls that minimize reliance on employee judgment for high-risk data types. Classification evidence is produced as part of the continuous assurance function.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- AI input data classification policy defining categories, handling requirements, and permitted AI systems for each category.
- Training records showing employees using AI systems have completed AI input data classification training.
- Technical control evidence (DLP configuration, system prompt instructions, or equivalent) enforcing classification limits for high-risk categories.
- Exception log documenting approved uses of Category 4 or 5 data as AI inputs, with legal basis and approval.
Implementation Notes
Why generative AI requires its own input classification
General data classification policies were designed for data at rest and in transit in traditional systems: files, databases, email, and network communication. They were not designed for the generative AI input surface, which has characteristics that make standard classification insufficient:
The context window as a data aggregator: A generative AI prompt or context window may aggregate data from multiple classification levels in a single input. A RAG system might retrieve a public document, an internal policy document, and a client contract in a single retrieval operation and inject all three into the same context window. Standard classification policies do not address how to handle aggregated inputs where components have different classification levels.
Implicit data entry: Users routinely enter data into AI systems that they would not consciously classify as regulated or sensitive: vendor names in contract summaries, employee feedback in performance review drafts, patient symptoms in medical documentation assistance. The classification policy must address this implicit data entry, not just intentional data uploads.
Conversation history as a data store: Multi-turn AI interactions accumulate a conversation history that may grow to contain sensitive data entered in earlier turns. This history may be retained by the vendor or accessible to other users on shared AI platforms.
Agentic tool call I/O: AI agents calling tools (web search, database queries, email drafts, calendar access) produce input and output data that is not visible in the prompt but is processed by the AI model and may contain regulated data.
Multimodal inputs: AI systems that accept images, audio, or documents as inputs create input classification challenges not addressed by text-focused policies.
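The aggregation problem described above can be made concrete with a short sketch. It assumes every component of a request (prompt, retrieved document, tool output, history turn) already carries one of the five category numbers defined in the next section, and it treats the whole request as having the strictest category present. The `InputComponent` structure and the `source` labels are hypothetical names chosen for illustration, not part of any particular AI platform.

```python
from dataclasses import dataclass
from enum import IntEnum


class Category(IntEnum):
    """AI input categories; a higher value means stricter handling."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    REGULATED = 4
    RESTRICTED = 5


@dataclass(frozen=True)
class InputComponent:
    """One piece of a model request (hypothetical structure)."""
    source: str          # e.g. "prompt", "rag", "tool_output", "history"
    category: Category


def effective_category(components: list[InputComponent]) -> Category:
    """Return the strictest category among all components of a request."""
    if not components:
        # Assumption: an empty request carries no sensitive data.
        return Category.PUBLIC
    return max(c.category for c in components)


request = [
    InputComponent("prompt", Category.INTERNAL),
    InputComponent("rag", Category.PUBLIC),
    InputComponent("rag", Category.REGULATED),    # e.g. a client contract
    InputComponent("history", Category.INTERNAL),
]
print(effective_category(request).name)  # REGULATED
```

Taking the maximum is a deliberately conservative choice: a single Category 4 document in the context window places the entire request under Category 4 handling, regardless of how much public material surrounds it.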
Classification categories for AI inputs
Category 1 — Public: Data that is publicly available or approved for public disclosure. No restrictions on use as AI input. Examples: public regulatory text, published research, marketing materials.
Category 2 — Internal: Non-public organizational data without specific regulatory classification. Permitted as input only in AI systems hosted by approved vendors with an adequate data processing agreement (DPA). Examples: internal policies, project documentation, meeting notes not containing PII.
Category 3 — Confidential: Organizational data with heightened sensitivity: trade secrets, M&A information, strategic plans, competitive intelligence. Permitted as input only in AI systems with specific approval and on-premises or private cloud hosting where possible. Examples: acquisition targets, pricing strategies, unpublished financial projections.
Category 4 — Regulated: Data subject to specific legal obligations: PII (GDPR/CCPA), PHI (HIPAA), financial records (GLBA, SOX), attorney-client privileged communications. Permitted as AI input only with specific use case approval, a documented legal basis, a completed data protection impact assessment (DPIA), and a vendor DPA covering AI-specific processing.
Category 5 — Restricted: Data that should not enter AI systems under most circumstances: biometric data without explicit consent, data subject to active litigation hold, classified government information, data with contractual prohibitions on AI processing. Prohibited as AI input except with executive approval and documented exceptional justification.
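One way to make these handling requirements checkable is to encode them as a lookup table. The sketch below paraphrases the requirements from the five definitions above; the requirement strings and the `missing_requirements` helper are illustrative assumptions, and a real implementation would likely tie each requirement to a record in an approval or contract system.

```python
from enum import IntEnum


class Category(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    REGULATED = 4
    RESTRICTED = 5


# Requirements paraphrased from the category definitions above.
# The exact wording of each string is an assumption for illustration.
HANDLING: dict[Category, list[str]] = {
    Category.PUBLIC: [],
    Category.INTERNAL: ["approved vendor", "adequate DPA"],
    Category.CONFIDENTIAL: [
        "specific approval",
        "on-premises or private cloud hosting where possible",
    ],
    Category.REGULATED: [
        "use case approval",
        "documented legal basis",
        "completed DPIA",
        "vendor DPA covering AI-specific processing",
    ],
    Category.RESTRICTED: [
        "executive approval",
        "documented exceptional justification",
    ],
}


def missing_requirements(category: Category, satisfied: set[str]) -> list[str]:
    """List the requirements for a category that are not yet satisfied."""
    return [req for req in HANDLING[category] if req not in satisfied]


gaps = missing_requirements(
    Category.REGULATED, {"use case approval", "completed DPIA"}
)
print(gaps)
# ['documented legal basis', 'vendor DPA covering AI-specific processing']
```

An empty result means the documented prerequisites for that category are in place; it does not by itself show that the controls are operating effectively.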
Applying classification to the full input surface
For each AI system in the inventory, document:
- Which input types are in use: direct prompts, context injection, RAG retrieval, tool call I/O, file uploads, multimodal inputs.
- The maximum classification level permitted for each input type.
- Technical or procedural controls enforcing classification limits.
- Any exceptions approved and their expiry.
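The four items above map naturally onto one inventory record per AI system. The following is a minimal sketch under stated assumptions: the field names, the input type labels, the system name, and the rule that an undocumented input type is denied by default are all hypothetical choices, not requirements taken from this control.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import IntEnum


class Category(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    REGULATED = 4
    RESTRICTED = 5


@dataclass
class ApprovedException:
    input_type: str
    category: Category   # highest category the exception allows
    expires: date


@dataclass
class AISystemRecord:
    name: str
    max_category: dict[str, Category]            # input type -> ceiling
    controls: list[str] = field(default_factory=list)
    exceptions: list[ApprovedException] = field(default_factory=list)

    def is_permitted(self, input_type: str, category: Category, today: date) -> bool:
        ceiling = self.max_category.get(input_type)
        if ceiling is None:
            return False  # assumption: undocumented input types are denied
        if category <= ceiling:
            return True
        # Above the ceiling: allowed only under an unexpired exception.
        return any(
            e.input_type == input_type
            and category <= e.category
            and today <= e.expires
            for e in self.exceptions
        )


record = AISystemRecord(
    name="contract-review-assistant",  # hypothetical system
    max_category={
        "direct_prompt": Category.INTERNAL,
        "rag_retrieval": Category.CONFIDENTIAL,
    },
    controls=["DLP on AI API gateway"],
    exceptions=[
        ApprovedException("direct_prompt", Category.REGULATED, date(2025, 6, 30))
    ],
)
print(record.is_permitted("direct_prompt", Category.REGULATED, date(2025, 6, 1)))  # True
print(record.is_permitted("direct_prompt", Category.REGULATED, date(2025, 7, 1)))  # False
```

Storing the expiry date on each exception lets the same check reject a use automatically once the approval lapses, rather than relying on someone to remove the exception by hand.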
Example Implementation
AI Input Data Classification — Quick Reference Card
| Category | Examples | Permitted AI systems | Requires |
|---|---|---|---|
| 1 — Public | Public regulatory text, published standards, marketing copy | All approved AI tools | Nothing additional |
| 2 — Internal | Internal policies, project plans, meeting notes (no PII) | Approved vendor AI tools with DPA | Standard employee attestation |
| 3 — Confidential | M&A targets, pricing, strategic plans | On-premises or private cloud AI only; no public SaaS LLMs | Manager approval; log the session |
| 4 — Regulated | PII, PHI, financial records, privileged communications | Only AI systems with specific use case approval + DPIA + AI DPA | Use case approval + privacy counsel sign-off |
| 5 — Restricted | Biometric data, litigation hold, contractual AI-prohibited data | Prohibited | Executive approval required for exceptions |
Common questions:
Can I paste a client contract into an AI writing assistant? Client contracts typically contain Category 4 data (PII) alongside Category 3 confidential business terms, so treat the contract as Category 4. Check whether the AI system has an approved use case for contract review and an adequate DPA. If not, do not paste the contract; first write a manual summary that leaves out the regulated and confidential details.
Can I use AI to help write an employee performance review? Performance data is Category 4 (PII). It is permitted only in AI systems approved for HR use with a completed DPIA. Check the approved AI tools list for HR.
The RAG system pulls documents automatically — how do I classify the inputs? RAG inputs are classified by the highest-category document that could be retrieved. If your retrieval corpus contains Category 4 documents, the RAG system operates under Category 4 restrictions.
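The corpus-level rule in the last answer can be sketched as follows. The sketch assumes each document in the retrieval corpus has already been labeled with a category; the file names and the `documents_above` helper, which lists what would have to leave the corpus for the system to run under a lower category, are hypothetical additions for illustration.

```python
from enum import IntEnum


class Category(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    REGULATED = 4
    RESTRICTED = 5


def corpus_category(document_categories: dict[str, Category]) -> Category:
    """Category a RAG system operates under: the strictest document
    that could be retrieved, not just what a given query returned."""
    if not document_categories:
        raise ValueError("corpus is empty")
    return max(document_categories.values())


def documents_above(
    document_categories: dict[str, Category], ceiling: Category
) -> list[str]:
    """Documents whose category exceeds the given ceiling."""
    return sorted(
        doc for doc, cat in document_categories.items() if cat > ceiling
    )


corpus = {
    "public-faq.md": Category.PUBLIC,
    "travel-policy.pdf": Category.INTERNAL,
    "client-msa.pdf": Category.REGULATED,
}
print(corpus_category(corpus).name)                 # REGULATED
print(documents_above(corpus, Category.INTERNAL))   # ['client-msa.pdf']
```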
