
Harmful Content Filtering

Apply input and output filtering to prevent AI systems from generating or acting on harmful, toxic, illegal, or policy-violating content.

Objective

Protect users, third parties, and the organization from harm caused by AI-generated content that violates safety or policy standards.

Maturity Levels

1. Initial: No content filtering is applied; harmful outputs reach users without interception.
2. Developing: Basic content filtering is applied using provider-default settings without customization for enterprise context.
3. Defined: Content filtering is configured for the enterprise context with documented allowed/prohibited content categories and a defined response to filtered content.
4. Managed: Filter effectiveness is measured through red team testing; false positive and false negative rates are tracked.
5. Optimizing: Filters are continuously updated based on emerging harms; filtering decisions are explainable and auditable.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

  • Content filtering policy document specifying prohibited content categories, filter configuration, and escalation process
  • Filter test results confirming the filter correctly blocks prohibited content and does not over-block legitimate content
  • Production filter event logs showing content blocked or flagged over a defined period, with category breakdown
  • False positive rate records and tuning history showing the filter is calibrated to minimize both misses and over-blocking
  • Periodic policy review records confirming filter categories remain appropriate given current threat landscape and use cases

Implementation Notes

Key steps

  • Configure content filters beyond provider defaults: most AI providers offer adjustable content policies — review and configure them for your specific enterprise context and user population.
  • Apply input filtering as well as output filtering: blocking harmful prompts before they reach the model reduces cost and prevents harmful outputs (see the pipeline sketch after this list).
  • Define a response policy for filtered content: what does the system tell the user when content is filtered, and is this logged for review?
  • Test filters against your specific user population and use cases — general-purpose filters may over-block legitimate professional content in some domains (medical, legal, security research).
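As a concrete illustration of the request path these steps describe, here is a minimal sketch in Python. The `moderate` callable is a hypothetical stand-in for whatever moderation model or provider endpoint you actually use, and the rule structure, thresholds, and messages are illustrative, not prescriptive.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a real moderation model or provider endpoint;
# returns a score in [0, 1] for each content category.
ModerationFn = Callable[[str], dict[str, float]]

@dataclass
class FilterRule:
    category: str
    threshold: float
    action: str                       # "block" or "flag"
    user_message: str = "I can't help with that."

def apply_filters(text, rules, moderate, event_log):
    """Return (blocked, text_or_message); every filter event is logged."""
    scores = moderate(text)
    for rule in rules:
        score = scores.get(rule.category, 0.0)
        if score < rule.threshold:
            continue
        # Log blocked *and* flagged events so calibration reviews have data.
        event_log.append({"category": rule.category, "score": score,
                          "action": rule.action})
        if rule.action == "block":
            return True, rule.user_message
    return False, text

def handle_request(prompt, model_call, input_rules, output_rules,
                   moderate, event_log):
    # Input filtering: block harmful prompts before they reach the model;
    # this also avoids paying for a model call that would be discarded.
    blocked, message = apply_filters(prompt, input_rules, moderate, event_log)
    if blocked:
        return message
    # Output filtering: check the response before it reaches the user.
    _, message = apply_filters(model_call(prompt), output_rules,
                               moderate, event_log)
    return message
```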

Example Implementation

Employee-facing AI assistant deployed in an HR software platform

Content Filter Configuration — HR Assistant

Input filters (applied before prompt reaches model):

| Category | Action | Rationale |
| --- | --- | --- |
| Instructions to ignore system prompt | Block + log | Injection defense |
| Requests for personal data about other employees | Block + log | Privacy policy |
| Threats or self-harm language | Block; route to HR emergency line info | User safety |
| Competitor product discussion | Allow (log only) | Not a safety issue; monitor |
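As a sketch only, the table above could be encoded with the hypothetical `FilterRule` structure from the pipeline sketch earlier on this page. The category names and thresholds are assumptions; in practice thresholds would be tuned against a labeled sample.

```python
# Hypothetical encoding of the input-filter table; thresholds are
# placeholders to be tuned, not recommendations.
INPUT_RULES = [
    FilterRule("prompt_injection", threshold=0.7, action="block"),
    FilterRule("coworker_personal_data", threshold=0.6, action="block"),
    FilterRule("self_harm_or_threats", threshold=0.3, action="block",
               user_message="If you or someone else is at risk, please "
                            "contact the HR emergency line."),
    # Allowed through but logged, so usage can be monitored over time.
    FilterRule("competitor_discussion", threshold=0.5, action="flag"),
]
```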

Output filters (applied before response delivered to user):

| Category | Threshold | Action |
| --- | --- | --- |
| Hate speech / discrimination | Provider filter: medium | Block; log; serve "I can't help with that" |
| Personal data about other employees | Custom NER check | Block if names + sensitive attributes co-occur |
| Legal advice ("you should sue", "file a complaint with EEOC") | Custom classifier | Append disclaimer: "For legal advice, consult your HR Business Partner or employment counsel" |
| Inaccurate HR policy claims | N/A (human review) | High-confidence factual claims about company policy routed for async review |
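The "custom NER check" in the second row could be implemented along the following lines. This is a sketch: the spaCy model and the sensitive-term list are assumptions, and a production version would need a vetted source of employee names and a reviewed attribute taxonomy.

```python
import spacy

# Assumed NER pipeline; any model that tags PERSON entities would work.
nlp = spacy.load("en_core_web_sm")

# Illustrative, not exhaustive: attributes considered sensitive when they
# co-occur with a named person in the response.
SENSITIVE_TERMS = {"salary", "medical", "disability", "pregnancy",
                   "performance review", "disciplinary"}

def leaks_personal_data(response: str) -> bool:
    """Block condition: a person's name and a sensitive attribute co-occur."""
    has_person = any(ent.label_ == "PERSON" for ent in nlp(response).ents)
    lowered = response.lower()
    return has_person and any(term in lowered for term in SENSITIVE_TERMS)
```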

Calibration: quarterly review of a sample of blocked outputs to tune the false positive rate; current false positive rate: 1.2%
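One common operationalization of that number, assuming each reviewed block event carries a human verdict (the event schema here is hypothetical):

```python
def false_positive_rate(reviewed_events: list[dict]) -> float:
    """Share of blocked outputs a human reviewer judged legitimate.

    Assumed event shape: {"action": "block", "verdict": "legitimate" | "harmful"}.
    """
    blocked = [e for e in reviewed_events if e["action"] == "block"]
    if not blocked:
        return 0.0
    return sum(e["verdict"] == "legitimate" for e in blocked) / len(blocked)
```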

Control Details

Control ID: SAF-005
Typical owner: AI Engineering / Trust & Safety / Legal
Implementation effort: Medium
Agent-relevant: Yes

Tags

content filtering · safety · toxic content · trust and safety