Harmful Content Filtering
Apply input and output filtering to prevent AI systems from generating or acting on harmful, toxic, illegal, or policy-violating content.
Objective
Protect users, third parties, and the organization from harm caused by AI-generated content that violates safety or policy standards.
Maturity Levels
Initial
No content filtering is applied; harmful outputs reach users without interception.
Developing
Basic content filtering is applied using provider-default settings without customization for enterprise context.
Defined
Content filtering is configured for the enterprise context with documented allowed/prohibited content categories and a defined response to filtered content.
Managed
Filter effectiveness is measured through red team testing; false positive and false negative rates are tracked.
Optimizing
Filters are continuously updated based on emerging harms; filtering decisions are explainable and auditable.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Content filtering policy document specifying prohibited content categories, filter configuration, and escalation process
- Filter test results confirming the filter correctly blocks prohibited content and does not over-block legitimate content
- Production filter event logs showing content blocked or flagged over a defined period, with category breakdown
- False positive rate records and tuning history showing the filter is calibrated to minimize both misses and over-blocking
- Periodic policy review records confirming filter categories remain appropriate given the current threat landscape and use cases
Implementation Notes
Key steps
- Configure content filters beyond provider defaults: most AI providers offer adjustable content policies, so review and configure them for your specific enterprise context and user population (a pipeline sketch follows this list).
- Apply input filtering as well as output filtering: blocking harmful prompts before they reach the model reduces inference cost and prevents harmful outputs from being generated at all.
- Define a response policy for filtered content: decide what the system tells the user when content is filtered, and ensure each filter event is logged for review (see the policy-table sketch below).
- Test filters against your specific user population and use cases: general-purpose filters may over-block legitimate professional content in domains such as medicine, law, and security research (a small test harness follows the sketches below).
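The first two steps might look like the following two-stage pipeline. This is a minimal sketch assuming the OpenAI Python SDK and its moderation endpoint; the model names, refusal text, and logger wiring are placeholders, and a production deployment would configure provider filter categories rather than rely on defaults.

```python
import logging

from openai import OpenAI

log = logging.getLogger("content_filter")
client = OpenAI()

REFUSAL = "I can't help with that."

def moderate(text: str) -> tuple[bool, list[str]]:
    """Run text through the provider moderation endpoint."""
    result = client.moderations.create(
        model="omni-moderation-latest", input=text
    ).results[0]
    hits = [name for name, hit in result.categories.model_dump().items() if hit]
    return result.flagged, hits

def answer(prompt: str) -> str:
    # Input filter: block harmful prompts before they reach the model,
    # saving inference cost and preventing harmful completions outright.
    flagged, hits = moderate(prompt)
    if flagged:
        log.warning("input blocked: %s", hits)
        return REFUSAL

    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # Output filter: re-check the generated text before delivering it.
    flagged, hits = moderate(reply)
    if flagged:
        log.warning("output blocked: %s", hits)
        return REFUSAL
    return reply
```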
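The response policy from the third step can be expressed as a category-to-action table so that filtered content always produces a consistent user message and an auditable log record. The category names, actions, and messages below are illustrative, not prescribed by this control.

```python
import hashlib
import json
import time
from dataclasses import dataclass

@dataclass
class FilterRule:
    action: str               # "block", "disclaimer", or "log_only"
    user_message: str | None  # what the user sees, if anything

# Illustrative policy table; categories mirror the example configuration below.
RESPONSE_POLICY: dict[str, FilterRule] = {
    "self_harm": FilterRule("block",
        "I can't help with that. The HR emergency line is available 24/7."),
    "employee_pii": FilterRule("block",
        "I can't share personal information about other employees."),
    "legal_advice": FilterRule("disclaimer",
        "For legal advice, consult your HR Business Partner or employment counsel."),
    "competitor": FilterRule("log_only", None),
}

def log_filter_event(direction: str, category: str, text: str) -> FilterRule:
    """Record every filter decision in a structured, auditable form."""
    rule = RESPONSE_POLICY.get(category, FilterRule("block", "I can't help with that."))
    print(json.dumps({
        "ts": time.time(),
        "direction": direction,  # "input" or "output"
        "category": category,
        "action": rule.action,
        # Store a digest rather than raw content to limit log sensitivity.
        "text_sha256": hashlib.sha256(text.encode()).hexdigest(),
    }))
    return rule
```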
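For the testing step, a small harness run against a curated corpus of known-legitimate prompts gives a concrete over-blocking measure. This reuses the `moderate` helper from the pipeline sketch above; the sample HR prompts are invented stand-ins for a real review corpus.

```python
# Known-legitimate HR prompts that naive filters sometimes over-block.
BENIGN_HR_CORPUS = [
    "How do I report a harassment complaint confidentially?",
    "What does the policy say about medical leave after surgery?",
    "Can my manager see my disability accommodation request?",
]

def false_positive_rate(corpus: list[str]) -> float:
    """Share of known-legitimate prompts the filter wrongly blocks."""
    blocked = sum(1 for text in corpus if moderate(text)[0])
    return blocked / len(corpus)

if __name__ == "__main__":
    print(f"False positive rate on benign HR corpus: "
          f"{false_positive_rate(BENIGN_HR_CORPUS):.1%}")
```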
Example Implementation
Employee-facing AI assistant deployed in an HR software platform
Content Filter Configuration — HR Assistant
Input filters (applied before prompt reaches model):
| Category | Action | Rationale |
|---|---|---|
| Instructions to ignore system prompt | Block + log | Injection defense |
| Requests for personal data about other employees | Block + log | Privacy policy |
| Threats or self-harm language | Block; route to HR emergency line info | User safety |
| Competitor product discussion | Allow (log only) | Not a safety issue; monitor |
Output filters (applied before response delivered to user):
| Category | Threshold | Action |
|---|---|---|
| Hate speech / discrimination | Provider filter: medium | Block; log; serve "I can't help with that" |
| Personal data about other employees | Custom NER check (sketched below) | Block if names and sensitive attributes co-occur |
| Legal advice ("you should sue", "file a complaint with EEOC") | Custom classifier | Append disclaimer: "For legal advice, consult your HR Business Partner or employment counsel" |
| Inaccurate HR policy claims | N/A — human review | High-confidence factual claims about company policy routed for async review |
Calibration: quarterly review of a sample of blocked outputs to tune the false positive rate (currently 1.2%)
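One possible shape for the "custom NER check" row above: flag output only when a person's name co-occurs with a sensitive attribute. This sketch assumes spaCy with its en_core_web_sm model installed; the sensitive-term lexicon is illustrative and would be tuned during the quarterly calibration review.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Illustrative lexicon of HR-sensitive attributes; tune per calibration review.
SENSITIVE_TERMS = {
    "salary", "pregnancy", "disability", "termination",
    "performance improvement plan", "medical leave",
}

def blocks_employee_pii(text: str) -> bool:
    """Block if a PERSON entity and a sensitive attribute co-occur."""
    doc = nlp(text)
    has_person = any(ent.label_ == "PERSON" for ent in doc.ents)
    lowered = text.lower()
    has_sensitive = any(term in lowered for term in SENSITIVE_TERMS)
    return has_person and has_sensitive

# Example: blocks_employee_pii("Maria Chen is on a performance improvement plan")
# returns True, while a generic policy question returns False.
```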
Control Details
- Control ID
- SAF-005
- Domain
- Safety & Reliability
- Typical owner
- AI Engineering / Trust & Safety / Legal
- Implementation effort
- Medium effort
- Agent-relevant
- Yes
