MGV · Model & Program Governance · MGV-006 · High effort · Agent-relevant

RAI Benchmark-Aligned Evaluation Framework

Map internal AI system evaluations to published responsible AI benchmarks and standards (HELM Safety, AIR-Bench, FACTS, and equivalents) so that regulators, auditors, and enterprise customers can interpret the resulting evidence against an independent external standard.

Objective

Strengthen the credibility and completeness of internal AI evaluations by grounding them in published responsible AI benchmark methodologies, producing evidence that demonstrates not just that evaluation occurred but that it meets an externally verifiable standard.

Maturity Levels

Initial

Internal AI evaluations are designed ad hoc and are not mapped to published benchmarks. Evaluation coverage and rigor vary by team and use case. Evaluation evidence cannot be compared against external standards.

Developing

The organization is aware of published RAI benchmarks and may reference them in documentation, but internal evaluations are not systematically mapped to benchmark methodologies or coverage areas.

Defined

A benchmark alignment mapping document identifies which internal evaluation procedures correspond to which published benchmark dimensions for each AI system type in the organization's portfolio. New AI deployments select applicable benchmarks from the organization's framework during the intake process.

Managed

Internal evaluations are conducted using documented procedures that are explicitly aligned to benchmark methodologies. Evaluation results are recorded in a format that enables comparison with published benchmark results for equivalent models. Material gaps between internal evaluation coverage and benchmark coverage are documented and tracked.

Optimizing

The organization participates in or adopts published benchmark evaluation cycles where practical, in addition to internal evaluation. Benchmark selection is reviewed annually as the benchmark landscape evolves. Evaluation evidence is packaged for regulatory and customer assurance use in formats that reference the underlying benchmark standards.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

- Benchmark alignment mapping document matching internal evaluation procedures to published benchmark dimensions for each AI system type.
- Evaluation records showing results on benchmark-aligned dimensions for AI systems evaluated in the past 12 months.
- Documentation of evaluation methodology divergences from benchmark standards, with justification.
- Evidence that evaluation coverage gaps relative to benchmark standards are tracked and addressed in the assurance function.
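The 12-month recency expectation in the second item can be checked mechanically before an assessment. The following is a minimal sketch, assuming evaluation records can be reduced to a "system name, date of latest evaluation" pair; the function name `stale_systems`, the 365-day window, and the system names are illustrative assumptions, not part of this control.

```python
from datetime import date, timedelta


def stale_systems(last_evaluated, today, max_age_days=365):
    """Return systems whose latest benchmark-aligned evaluation is too old.

    last_evaluated maps a system name to the date of its most recent
    evaluation record. The 365-day window is an assumed reading of
    "past 12 months"; adjust it to the assessor's definition.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, when in last_evaluated.items() if when < cutoff)


# Hypothetical systems and dates, for illustration only.
records = {
    "support-chatbot": date(2025, 6, 1),
    "contract-summarizer": date(2024, 11, 30),
}
print(stale_systems(records, today=date(2026, 1, 15)))
# ['contract-summarizer']
```

Passing `today` explicitly, rather than reading the system clock inside the function, keeps the check reproducible when the same evidence pack is re-run at a later date.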

Implementation Notes

Why benchmark alignment matters for governance credibility

The core problem with purely internal AI evaluations is that their credibility depends entirely on the evaluator's methodology, which external audiences cannot assess without detailed documentation. When a regulator or enterprise customer asks "how did you evaluate this AI system's safety?", the answer "we ran our internal evaluation" raises the obvious follow-up: "by what standard?"

Benchmark-aligned evaluation answers this question by grounding internal evaluations in methodologies that external audiences can verify and compare. When the organization says "our evaluation covers the HELM Safety dimensions, and our model's refusal rate on the HarmBench prompts was X%", it is making an auditable, comparable claim.

This is increasingly a practical requirement. The EU AI Act (Article 15) requires high-risk AI systems to achieve appropriate levels of accuracy, robustness, and cybersecurity, which in practice means demonstrating those properties through documented evaluation. Enterprise procurement requirements increasingly ask vendors to provide evaluation evidence referenced to published methodologies. The shift from "we evaluated it" to "we evaluated it against [standard]" is the shift from unverifiable assertion to auditable evidence.

Key benchmark frameworks to know

HELM Safety (Holistic Evaluation of Language Models, Safety): Provides structured evaluation of language model behavior across safety-relevant risk categories such as violence, fraud, discrimination, sexual content, harassment, and deception, drawing on component benchmarks that include BBQ, HarmBench, and XSTest. Developed by Stanford CRFM. Suitable for evaluating both internal models and vendor models.

AIR-Bench (AI Risk Benchmark): Evaluates AI systems against a risk taxonomy derived from government regulations and company policies. AIR-Bench 2024 organizes its prompts into a multi-level hierarchy with 314 granular risk categories. Useful for grounding internal evaluations in the risk taxonomy used by emerging regulatory frameworks.

FACTS (Factuality, Attribution, Calibration, Truthfulness, and Safety): Focuses on factual accuracy, grounding, and truthfulness in AI outputs. Particularly relevant for RAG systems and AI systems producing factual claims.

METR Autonomy Evaluations: Evaluates agentic AI systems on their ability to complete autonomous tasks. Relevant for organizations deploying AI agents and needing to characterize autonomy levels.

NIST AISIC Evaluation Results: As the AI Safety Institute Consortium matures, AISIC evaluation results may become a reference point for U.S. vendor evaluations. Until such results are published, treat this entry as one to monitor rather than one to map against.

Red-Team Benchmarks: Evaluation of model resistance to adversarial prompting, jailbreaking, and capability elicitation. Several published datasets and methodologies exist; the HarmBench dataset is a commonly referenced starting point.

Building the alignment mapping

For each AI system type in the portfolio, document:

What properties need to be evaluated (from intake risk classification and regulatory obligations).
Which published benchmark dimensions correspond to those properties.
Whether the organization's internal evaluation methodology aligns to the benchmark methodology, and where it diverges.
What score or pass/fail result the system achieves on the aligned evaluation, and how this compares to published benchmark results for comparable models.

A complete mapping does not require running every published benchmark in full. It requires that, for each evaluated property, the organization can point to an external standard that its evaluation aligns to.
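The four documented items above can be held as one record per evaluated property, which makes the completeness rule testable. The following is a minimal sketch under stated assumptions: the `AlignmentEntry` class, its field names, and the `unanchored_properties` helper are hypothetical, and the sample rows reuse values from the example map in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlignmentEntry:
    """One row of the alignment mapping for a single AI system type."""
    evaluated_property: str                 # what must be evaluated (from intake)
    internal_eval: str                      # name of the internal procedure
    benchmark: Optional[str] = None         # published benchmark, if any
    dimension: Optional[str] = None         # benchmark dimension aligned to
    divergence: Optional[str] = None        # where the method departs from the benchmark
    result: Optional[str] = None            # score or pass/fail on the aligned eval
    reference_result: Optional[str] = None  # published result for comparable models


def unanchored_properties(entries):
    """Return properties that cannot yet point to an external standard."""
    return [e.evaluated_property for e in entries
            if not (e.benchmark and e.dimension)]


mapping = [
    AlignmentEntry("Refusal on harmful requests", "Refusal rate test",
                   benchmark="HELM Safety", dimension="HarmBench refusal rate",
                   result="91.7% refusal"),
    AlignmentEntry("Code security", "(none yet)"),
]
print(unanchored_properties(mapping))
# ['Code security']
```

An empty list from `unanchored_properties` corresponds to the completeness test described above: every evaluated property points to an external standard, whether or not the full benchmark was run.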

Example Implementation

RAI Benchmark Alignment Map — Text Generation Systems (excerpt)

Property to evaluate	Internal evaluation name	Aligned benchmark	Benchmark dimension	Our result	Benchmark reference result
Toxicity / harmful output	Output harmlessness eval	HELM Safety	Toxicity (ToxiGen, BBQ)	Pass rate: 94.2%	Claude 3.5 Sonnet (HELM, 2025): 96.1%
Refusal on harmful requests	Refusal rate test	HELM Safety	HarmBench refusal rate	91.7% refusal	Model-specific; documented per vendor
Factual accuracy on domain content	Domain fact-check eval	FACTS	Factuality dimension	87.3% accuracy on legal domain test set	Not directly comparable; documented methodology
Bias: demographic representation	Representation eval	AIR-Bench	Category 5 (Discrimination)	Low-bias classification: pass	Pass threshold defined in AIR-Bench rubric
Adversarial prompt resistance	Red-team resistance eval	HarmBench	Direct request, roleplay, context manipulation attack types	Resistance: 88.4% (direct), 79.1% (roleplay)	Disclosed per vendor red-team summary
Agentic task boundary adherence	Scope limit test (for agent deployments)	METR Autonomy	Sandbagging / capability elicitation	Scope escape rate: <2% in test harness	—

Coverage gaps identified:

Code security evaluation not yet mapped to a benchmark. CyberSecEval (from Meta's Purple Llama project) covers this dimension; the internal eval currently does not. Remediation: add CyberSecEval-aligned prompts to internal eval set by Q3 2026.
Calibration (confidence vs. accuracy) not covered. FACTS includes calibration; internal eval does not. Low priority for current use cases; flagged for future review.
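Gaps like the two above can be derived from the alignment map instead of being maintained by hand. The following is a minimal sketch, assuming each benchmark's dimensions and the internally covered dimensions are available as plain lists; the `coverage_gaps` function is hypothetical, and the dimension names follow the wording of this excerpt, not any benchmark's official schema.

```python
def coverage_gaps(benchmark_dimensions, covered_dimensions):
    """Map each benchmark to the dimensions the internal eval does not cover."""
    gaps = {}
    for benchmark, dimensions in benchmark_dimensions.items():
        covered = set(covered_dimensions.get(benchmark, ()))
        missing = sorted(set(dimensions) - covered)
        if missing:
            gaps[benchmark] = missing
    return gaps


# Dimension names follow the wording of this excerpt, not an official schema.
benchmark_dimensions = {
    "FACTS": ["factuality", "calibration"],
    "HarmBench": ["direct request", "roleplay", "context manipulation"],
}
covered_dimensions = {
    "FACTS": ["factuality"],
    "HarmBench": ["direct request", "roleplay", "context manipulation"],
}
print(coverage_gaps(benchmark_dimensions, covered_dimensions))
# {'FACTS': ['calibration']}
```

Each reported gap still needs a human decision on priority and remediation date, as in the two entries above; the sketch only identifies what is missing.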
