AI Safety Index and Benchmark Monitoring

Track external AI safety indices, benchmark ratings, and third-party evaluation results for AI vendors and models used by the organization, and incorporate material findings into the vendor risk assessment and re-assessment cycle.

Objective

Supplement internal vendor assessments with independent, externally produced safety intelligence so that the organization's vendor risk posture reflects the current state of published safety research and does not depend solely on vendor self-reporting.

Maturity Levels

Initial

The organization does not systematically track external AI safety indices or benchmark publications. Vendor risk is assessed entirely through vendor self-reporting and internal testing.

Developing

Individual team members may monitor safety benchmark publications informally, but findings are not systematically incorporated into the vendor risk program.

Defined

A defined list of external safety indices and benchmark sources is monitored on a regular cadence. Material findings about vendors or models in the organization's portfolio are reviewed by the AI governance function and routed to the vendor risk assessment process when they indicate a change in the risk profile.

Managed

Safety index findings are documented in the vendor risk register. New publications are reviewed within a defined window. Material adverse findings trigger a re-assessment notification under PRC-008. The organization maintains a current view of how its vendors and models perform relative to the external safety landscape.

Optimizing

Safety index monitoring is integrated into the vendor scorecard. The organization uses safety benchmark performance as a factor in model selection. Safety benchmark trends across the AI market inform the organization's AI risk appetite review.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

- Defined monitoring list specifying external safety indices and benchmarks to be tracked, with review frequency.

- Monitoring log documenting publications reviewed, material findings, and disposition for the past 12 months.

- Evidence that material adverse findings were routed to the vendor re-assessment or risk register update process.
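
As an illustration of the first evidence item, the monitoring list can be kept as structured data so that overdue reviews are easy to detect. The sketch below is a minimal example under stated assumptions: the `MonitoredSource` type, the `overdue_sources` helper, and the review intervals (365 and 90 days) are hypothetical and are not prescribed by this control.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class MonitoredSource:
    name: str               # index or benchmark being tracked
    review_every_days: int  # review frequency (illustrative values only)
    last_reviewed: date     # date of the most recent logged review


def overdue_sources(sources, today):
    """Return the names of sources whose review interval has lapsed."""
    return [
        s.name
        for s in sources
        if today - s.last_reviewed > timedelta(days=s.review_every_days)
    ]


# Hypothetical monitoring list; intervals are assumptions, not requirements.
monitoring_list = [
    MonitoredSource("Stanford HAI AI Index", 365, date(2026, 1, 15)),
    MonitoredSource("Future of Life Institute AI Safety Index", 90, date(2026, 3, 8)),
    MonitoredSource("METR Autonomy Evaluations", 90, date(2026, 4, 22)),
]

print(overdue_sources(monitoring_list, today=date(2026, 7, 1)))
```

Run with `today=date(2026, 7, 1)`, only the Future of Life Institute entry is reported, because its last logged review is more than 90 days old.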

Implementation Notes

Why external safety intelligence matters for vendor governance

Vendor due diligence (PRC-001) and safety commitment verification (PRC-006) rely primarily on vendor self-reporting and contractual commitments. Vendors have an inherent incentive to present safety information favorably. External indices and benchmarks provide independent signals that vendors cannot control.

Key external sources to monitor:

Stanford HAI AI Index: Annual report with capability and safety trend data across frontier models. Published each spring. Relevant for understanding the broad safety landscape and identifying patterns across the model ecosystem.

Future of Life Institute AI Safety Index: Evaluates major AI labs on safety practices, transparency, and governance across multiple dimensions. Published periodically. High relevance for enterprise vendor governance: provides a consistent framework for comparing vendor safety practices.

MLCommons benchmarks (MLPerf, AILuminate): Quantitative benchmarks for model performance (MLPerf) and, increasingly, safety-relevant behavior (AILuminate). Useful for capability tracking alongside safety.

METR Autonomy Evaluations: Model evaluations specifically focused on agentic capability and autonomy. Relevant for organizations deploying AI agents or evaluating agentic vendors.

NIST AISIC Evaluation Results: As the AI Safety Institute Consortium matures its evaluation programs, published results will become authoritative external references for U.S. vendor assessments.

AIR-Bench and HELM Safety: Academic benchmarks covering safety-relevant model behaviors including refusal, bias, and robustness. These change frequently and should be tracked at version level.

Publisher red-team disclosures: When AI vendors publish red-team reports or safety evaluation summaries for their own models, treat these as important signals even though they are vendor-produced rather than independent, because they reflect the vendor's own assessment of its models' risk profile.

What to watch for

When reviewing external safety intelligence, flag:

Material changes in a vendor's safety ranking or score relative to prior periods.
Significant adverse findings about a specific model's behavior in the relevant capability domain.
Safety evaluation methodologies that suggest the vendor's prior self-assessments were incomplete.
New evaluation frameworks that cover capability dimensions not addressed in the organization's existing vendor assessments.
Capability uplift findings suggesting a model can assist in domains the organization has not assessed (for example, CBRN, cyber offense, or manipulation).
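
The first item in this list, a material change in a vendor's rating relative to prior periods, can be checked mechanically once ratings are recorded per period. The sketch below assumes a letter-grade scale and treats a decline of one notch or more as material. The grade ordering, the function names, and the one-notch threshold are all illustrative assumptions and should be set by the AI governance function.

```python
# Assumed letter-grade scale, ordered from worst to best.
GRADE_ORDER = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]
GRADE_RANK = {grade: rank for rank, grade in enumerate(GRADE_ORDER)}


def rating_change(previous, current):
    """Signed number of grade notches moved; negative means a decline."""
    return GRADE_RANK[current] - GRADE_RANK[previous]


def is_material_decline(previous, current, threshold=1):
    """True when the rating fell by at least `threshold` notches (assumed policy)."""
    return rating_change(previous, current) <= -threshold


print(rating_change("A-", "B+"))        # -1
print(is_material_decline("A-", "B+"))  # True
print(is_material_decline("B+", "A-"))  # False
```

Numeric scores can be handled the same way by replacing the rank lookup with a percentage or absolute difference.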

Incorporating findings into vendor governance

Monitoring is only valuable if findings are acted upon. Define clear escalation criteria:

Finding indicates a model has material safety-relevant capabilities not disclosed by the vendor: trigger re-assessment notification (PRC-008).
Finding indicates the vendor's safety practices are rated materially worse than disclosed: escalate to vendor safety commitment verification (PRC-006).
Finding indicates industry-wide safety concern for a model category the organization uses: incorporate into AI risk register and risk appetite review.
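
The three escalation criteria above can be expressed as a simple lookup so that every logged finding resolves to a defined action. This is a minimal sketch: the `FindingType` names and the `route_findings` helper are hypothetical, and only the control identifiers (PRC-006, PRC-008) and action wording come from the criteria themselves.

```python
from enum import Enum


class FindingType(Enum):
    UNDISCLOSED_CAPABILITY = "undisclosed_capability"
    PRACTICES_WORSE_THAN_DISCLOSED = "practices_worse_than_disclosed"
    INDUSTRY_WIDE_CONCERN = "industry_wide_concern"


# One action per escalation criterion listed above.
ESCALATION = {
    FindingType.UNDISCLOSED_CAPABILITY:
        "Trigger re-assessment notification (PRC-008)",
    FindingType.PRACTICES_WORSE_THAN_DISCLOSED:
        "Escalate to vendor safety commitment verification (PRC-006)",
    FindingType.INDUSTRY_WIDE_CONCERN:
        "Incorporate into AI risk register and risk appetite review",
}


def route_findings(finding_types):
    """Return escalation actions for a publication's findings, in order, without duplicates."""
    actions = []
    for finding_type in finding_types:
        action = ESCALATION[finding_type]
        if action not in actions:
            actions.append(action)
    return actions


for action in route_findings([
    FindingType.UNDISCLOSED_CAPABILITY,
    FindingType.INDUSTRY_WIDE_CONCERN,
]):
    print(action)
```

A publication with no material findings yields an empty list, which corresponds to a "no action required" disposition in the monitoring log.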

Example Implementation

AI Safety Index Monitoring Log (excerpt)

| Date | Source | Publication | Vendors/models in portfolio covered | Material findings | Action taken |
| --- | --- | --- | --- | --- | --- |
| 2026-01-15 | Stanford HAI | AI Index 2026 | All major frontier model vendors | No material adverse findings for portfolio vendors. Noted trend toward improved refusal on CBRN queries across ecosystem. | No action required. Added trend note to vendor scorecard. |
| 2026-03-08 | Future of Life Institute | AI Safety Index Q1 2026 | Anthropic (A), OpenAI (B+), Google DeepMind (A-) | OpenAI rating declined from A- to B+ due to reduced transparency on pre-deployment evaluations. | Vendor monitoring flag added to OpenAI entry. PRC-008 re-assessment for GPT-5.5 deployment accelerated. |
| 2026-04-22 | METR | Autonomy Evaluation Report | Claude 3.8, GPT-5.5 (both in portfolio) | Both models showed increased agentic capability scores. Neither exceeded METR's critical autonomy threshold. | Noted in agentic AI deployment readiness assessments. No re-assessment required. |
| 2026-05-30 | Academic (arXiv) | Benchmark: multi-turn jailbreak robustness | All frontier models | Unscheduled publication. One portfolio model showed materially lower multi-turn robustness than vendor-published figures. | Immediate review. Vendor notification issued requesting updated red-team data. Re-assessment initiated under PRC-008. |
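
A log like the one above is easier to produce as audit evidence if each row is stored as a record with the same six columns. The sketch below is one possible shape, not a required schema: the `LogEntry` type and the `entries_in_window` helper are hypothetical, and the 365-day window is an approximation of the 12-month evidence period.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class LogEntry:
    reviewed_on: date
    source: str
    publication: str
    coverage: str
    material_findings: str
    action_taken: str


def entries_in_window(log, as_of, days=365):
    """Return entries reviewed within the `days` preceding `as_of`, inclusive."""
    start = as_of - timedelta(days=days)
    return [entry for entry in log if start <= entry.reviewed_on <= as_of]


# Two rows from the excerpt, abbreviated for illustration.
log = [
    LogEntry(date(2026, 1, 15), "Stanford HAI", "AI Index 2026",
             "All major frontier model vendors",
             "No material adverse findings for portfolio vendors.",
             "No action required."),
    LogEntry(date(2026, 5, 30), "Academic (arXiv)",
             "Benchmark: multi-turn jailbreak robustness",
             "All frontier models",
             "One portfolio model showed materially lower multi-turn robustness.",
             "Re-assessment initiated under PRC-008."),
]

for entry in entries_in_window(log, as_of=date(2027, 2, 1)):
    print(entry.reviewed_on, entry.source, entry.action_taken)
```

With `as_of=date(2027, 2, 1)`, only the 2026-05-30 entry falls inside the window, since the 2026-01-15 review is more than 365 days old.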
