Harmful Content Filtering
Apply input and output filtering to prevent AI systems from generating or acting on harmful, toxic, illegal, or policy-violating content.
Objective
Protect users, third parties, and the organization from harm caused by AI-generated content that violates safety or policy standards.
Maturity Levels
Initial
No content filtering is applied; harmful outputs reach users without interception.
Developing
Basic content filtering is applied using provider-default settings without customization for enterprise context.
Defined
Content filtering is configured for the enterprise context with documented allowed/prohibited content categories and a defined response to filtered content.
Managed
Filter effectiveness is measured through red team testing; false positive and false negative rates are tracked.
Optimizing
Filters are continuously updated based on emerging harms; filtering decisions are explainable and auditable.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Content filtering policy document specifying prohibited content categories, filter configuration, and escalation process
- Filter test results confirming the filter correctly blocks prohibited content and does not over-block legitimate content
- Production filter event logs showing content blocked or flagged over a defined period, with category breakdown
- False positive rate records and tuning history showing the filter is calibrated to minimize both misses and over-blocking
- Periodic policy review records confirming filter categories remain appropriate given the current threat landscape and use cases
Implementation Notes
Key steps
- Configure content filters beyond provider defaults: most AI providers offer adjustable content policies, so review and configure them for your specific enterprise context and user population (a pipeline sketch follows this list).
- Apply input filtering as well as output filtering: blocking harmful prompts before they reach the model reduces inference cost and prevents harmful outputs from being generated at all.
- Define a response policy for filtered content: decide what the system tells the user when content is filtered, and ensure each filter event is logged for review (see the policy-table sketch below).
- Test filters against your specific user population and use cases: general-purpose filters may over-block legitimate professional content in domains such as medicine, law, and security research (a small test harness follows the sketches below).
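The first two steps might look like the following two-stage pipeline. This is a minimal sketch assuming the OpenAI Python SDK and its moderation endpoint; the model names, refusal text, and logger wiring are placeholders, and a production deployment would configure provider filter categories rather than rely on defaults.

```python
import logging

from openai import OpenAI

log = logging.getLogger("content_filter")
client = OpenAI()

REFUSAL = "I can't help with that."

def moderate(text: str) -> tuple[bool, list[str]]:
    """Run text through the provider moderation endpoint."""
    result = client.moderations.create(
        model="omni-moderation-latest", input=text
    ).results[0]
    hits = [name for name, hit in result.categories.model_dump().items() if hit]
    return result.flagged, hits

def answer(prompt: str) -> str:
    # Input filter: block harmful prompts before they reach the model,
    # saving inference cost and preventing harmful completions outright.
    flagged, hits = moderate(prompt)
    if flagged:
        log.warning("input blocked: %s", hits)
        return REFUSAL

    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    # Output filter: re-check the generated text before delivering it.
    flagged, hits = moderate(reply)
    if flagged:
        log.warning("output blocked: %s", hits)
        return REFUSAL
    return reply
```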
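The response policy from the third step can be expressed as a category-to-action table so that filtered content always produces a consistent user message and an auditable log record. The category names, actions, and messages below are illustrative, not prescribed by this control.

```python
import hashlib
import json
import time
from dataclasses import dataclass

@dataclass
class FilterRule:
    action: str               # "block", "disclaimer", or "log_only"
    user_message: str | None  # what the user sees, if anything

# Illustrative policy table; categories mirror the example configuration below.
RESPONSE_POLICY: dict[str, FilterRule] = {
    "self_harm": FilterRule("block",
        "I can't help with that. The HR emergency line is available 24/7."),
    "employee_pii": FilterRule("block",
        "I can't share personal information about other employees."),
    "legal_advice": FilterRule("disclaimer",
        "For legal advice, consult your HR Business Partner or employment counsel."),
    "competitor": FilterRule("log_only", None),
}

def log_filter_event(direction: str, category: str, text: str) -> FilterRule:
    """Record every filter decision in a structured, auditable form."""
    rule = RESPONSE_POLICY.get(category, FilterRule("block", "I can't help with that."))
    print(json.dumps({
        "ts": time.time(),
        "direction": direction,  # "input" or "output"
        "category": category,
        "action": rule.action,
        # Store a digest rather than raw content to limit log sensitivity.
        "text_sha256": hashlib.sha256(text.encode()).hexdigest(),
    }))
    return rule
```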
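For the testing step, a small harness run against a curated corpus of known-legitimate prompts gives a concrete over-blocking measure. This reuses the `moderate` helper from the pipeline sketch above; the sample HR prompts are invented stand-ins for a real review corpus.

```python
# Known-legitimate HR prompts that naive filters sometimes over-block.
BENIGN_HR_CORPUS = [
    "How do I report a harassment complaint confidentially?",
    "What does the policy say about medical leave after surgery?",
    "Can my manager see my disability accommodation request?",
]

def false_positive_rate(corpus: list[str]) -> float:
    """Share of known-legitimate prompts the filter wrongly blocks."""
    blocked = sum(1 for text in corpus if moderate(text)[0])
    return blocked / len(corpus)

if __name__ == "__main__":
    print(f"False positive rate on benign HR corpus: "
          f"{false_positive_rate(BENIGN_HR_CORPUS):.1%}")
```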
Example Implementation
Employee-facing AI assistant deployed in an HR software platform
Content Filter Configuration — HR Assistant
Input filters (applied before prompt reaches model):
| Category | Action | Rationale |
|---|---|---|
| Instructions to ignore system prompt | Block + log | Injection defense |
| Requests for personal data about other employees | Block + log | Privacy policy |
| Threats or self-harm language | Block; route to HR emergency line info | User safety |
| Competitor product discussion | Allow (log only) | Not a safety issue; monitor |
Output filters (applied before response delivered to user):
| Category | Threshold | Action |
|---|---|---|
| Hate speech / discrimination | Provider filter: medium | Block; log; serve "I can't help with that" |
| Personal data about other employees | Custom NER check (sketched below) | Block if names and sensitive attributes co-occur |
| Legal advice ("you should sue", "file a complaint with EEOC") | Custom classifier | Append disclaimer: "For legal advice, consult your HR Business Partner or employment counsel" |
| Inaccurate HR policy claims | N/A — human review | High-confidence factual claims about company policy routed for async review |
Calibration: quarterly review of a sample of blocked outputs to tune the false positive rate (currently 1.2%)
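One possible shape for the "custom NER check" row above: flag output only when a person's name co-occurs with a sensitive attribute. This sketch assumes spaCy with its en_core_web_sm model installed; the sensitive-term lexicon is illustrative and would be tuned during the quarterly calibration review.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Illustrative lexicon of HR-sensitive attributes; tune per calibration review.
SENSITIVE_TERMS = {
    "salary", "pregnancy", "disability", "termination",
    "performance improvement plan", "medical leave",
}

def blocks_employee_pii(text: str) -> bool:
    """Block if a PERSON entity and a sensitive attribute co-occur."""
    doc = nlp(text)
    has_person = any(ent.label_ == "PERSON" for ent in doc.ents)
    lowered = text.lower()
    has_sensitive = any(term in lowered for term in SENSITIVE_TERMS)
    return has_person and has_sensitive

# Example: blocks_employee_pii("Maria Chen is on a performance improvement plan")
# returns True, while a generic policy question returns False.
```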
Control Details
- Control ID
- SAF-005
- Domain
- Safety & Reliability
- Typical owner
- AI Engineering / Trust & Safety / Legal
- Implementation effort
- Medium effort
- Agent-relevant
- Yes
