AI Governance Institute

Practical Governance for Enterprise AI

PRC · Procurement · PRC-012 · Low effort

AI Safety Index and Benchmark Monitoring

Track external AI safety indices, benchmark ratings, and third-party evaluation results for AI vendors and models used by the organization, and incorporate material findings into the vendor risk assessment and re-assessment cycle.

Objective

Supplement internal vendor assessments with independent, externally produced safety intelligence so that the organization's vendor risk posture reflects the current state of published safety research and does not depend solely on vendor self-reporting.

Maturity Levels

Level 1: Initial

The organization is not systematically tracking external AI safety indices or benchmark publications. Vendor risk is assessed entirely through vendor self-reporting and internal testing.

Level 2: Developing

Individual team members may monitor safety benchmark publications informally, but findings are not systematically incorporated into the vendor risk program.

Level 3: Defined

A defined list of external safety indices and benchmark sources is monitored on a regular cadence. Material findings about vendors or models in the organization's portfolio are reviewed by the AI governance function and routed to the vendor risk assessment process when they indicate a change in the risk profile.

Level 4: Managed

Safety index findings are documented in the vendor risk register. New publications are reviewed within a defined window. Material adverse findings trigger a re-assessment notification under PRC-008. The organization has a view of how its current vendors and models perform relative to the external safety landscape.

Level 5: Optimizing

Safety index monitoring is integrated into the vendor scorecard. The organization uses safety benchmark performance as a factor in model selection. Safety benchmark trends across the AI market inform the organization's AI risk appetite review.

Evidence Requirements

What an auditor or assessor would expect to see for this control.

  • Defined monitoring list specifying external safety indices and benchmarks to be tracked, with review frequency.
  • Monitoring log documenting publications reviewed, material findings, and disposition for the past 12 months.
  • Evidence that material adverse findings were routed to the vendor re-assessment or risk register update process.

Implementation Notes

Why external safety intelligence matters for vendor governance

Vendor due diligence (PRC-001) and safety commitment verification (PRC-006) rely primarily on vendor self-reporting and contractual commitments. Vendors have an inherent incentive to present safety information favorably. External indices and benchmarks provide independent signals that vendors cannot control.

Key external sources to monitor:

Stanford HAI AI Index: Annual report with capability and safety trend data across frontier models. Published each spring. Relevant for understanding the broad safety landscape and identifying patterns across the model ecosystem.

Future of Life Institute AI Safety Index: Evaluates major AI labs on safety practices, transparency, and governance across multiple dimensions. Published periodically. High relevance for enterprise vendor governance: provides a consistent framework for comparing vendor safety practices.

MLCommons Benchmarks (MLPerf, AILuminate): Quantitative benchmarks from MLCommons. MLPerf measures training and inference performance; AILuminate, the consortium's safety benchmark, covers safety-relevant model behavior. Useful for capability tracking alongside safety.

METR Autonomy Evaluations: Model evaluations specifically focused on agentic capability and autonomy. Relevant for organizations deploying AI agents or evaluating agentic vendors.

NIST AISIC Evaluation Results: As the AI Safety Institute Consortium matures its evaluation programs, published results will become authoritative external references for U.S. vendor assessments.

AIR-Bench and HELM Safety: Academic benchmarks covering safety-relevant model behaviors including refusal, bias, and robustness. These change frequently and should be tracked at version level.

Vendor red-team disclosures: When AI vendors publish red-team reports or safety evaluation summaries for their own models, treat them as useful signals even though they are vendor-produced, because they reflect the vendor's own assessment of its model's risk profile. Weigh them with the same caution as other vendor self-reporting.

What to watch for

When reviewing external safety intelligence, flag:

  • Material changes in a vendor's safety ranking or score relative to prior periods.
  • Significant adverse findings about a specific model's behavior in the relevant capability domain.
  • Safety evaluation methodologies that suggest the vendor's prior self-assessments were incomplete.
  • New evaluation frameworks that cover capability dimensions not addressed in the organization's existing vendor assessments.
  • Capability uplift findings suggesting a model can assist in domains the organization has not assessed (for example, CBRN, cyber offense, or manipulation).
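
The first flag above, a material change in a vendor's rating between periods, can be checked mechanically once ratings are recorded. The following is a minimal sketch, assuming ratings are expressed as letter grades. The grade scale, the one-step threshold, and the function names are hypothetical choices for illustration and do not come from any index's published methodology.

```python
# Hypothetical ordered grade scale, best to worst (an assumption for illustration).
GRADE_SCALE = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]


def rating_change(previous: str, current: str) -> int:
    """Return the number of grade steps moved; a positive value means the rating got worse."""
    return GRADE_SCALE.index(current) - GRADE_SCALE.index(previous)


def is_material_decline(previous: str, current: str, threshold: int = 1) -> bool:
    """Flag a decline of at least `threshold` steps.

    The threshold is an assumed policy choice; each organization would set its own.
    """
    return rating_change(previous, current) >= threshold
```

What counts as "material" remains a governance judgment; a check like this only surfaces candidates for human review.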

Incorporating findings into vendor governance

Monitoring is only valuable if findings are acted upon. Define clear escalation criteria:

  • Finding indicates a model has material safety-relevant capabilities not disclosed by the vendor: trigger re-assessment notification (PRC-008).
  • Finding indicates the vendor's safety practices are rated materially worse than disclosed: escalate to vendor safety commitment verification (PRC-006).
  • Finding indicates industry-wide safety concern for a model category the organization uses: incorporate into AI risk register and risk appetite review.
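
The three escalation criteria above amount to a lookup from finding type to destination process. A minimal sketch of that mapping follows, assuming findings are classified by a reviewer before routing; the enum members and the `route_finding` function are hypothetical names introduced for illustration.

```python
from enum import Enum


class FindingType(Enum):
    """Hypothetical classification of a material finding, assigned by the reviewer."""
    UNDISCLOSED_CAPABILITY = "undisclosed_capability"
    PRACTICES_WORSE_THAN_DISCLOSED = "practices_worse_than_disclosed"
    INDUSTRY_WIDE_CONCERN = "industry_wide_concern"


# Destinations mirror the escalation criteria listed above.
ESCALATION_ROUTES = {
    FindingType.UNDISCLOSED_CAPABILITY: "PRC-008 re-assessment notification",
    FindingType.PRACTICES_WORSE_THAN_DISCLOSED: "PRC-006 safety commitment verification",
    FindingType.INDUSTRY_WIDE_CONCERN: "AI risk register and risk appetite review",
}


def route_finding(finding_type: FindingType) -> str:
    """Return the process a classified finding should be routed to."""
    return ESCALATION_ROUTES[finding_type]
```

Keeping the routes in one table makes the escalation criteria easy to audit against the written policy.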

Example Implementation

AI Safety Index Monitoring Log (excerpt)

Monitoring cadence: Quarterly review of scheduled publications; ad hoc review for unscheduled publications within 10 business days.
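
The 10-business-day window for unscheduled publications can be turned into a concrete due date. This is a minimal sketch that skips weekends only; public holidays are not handled, and the function name `add_business_days` is a hypothetical helper for illustration.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Saturdays and Sundays are skipped. Public holidays are NOT accounted for,
    which is a simplifying assumption of this sketch.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


# Usage: review due date for a publication that appeared on a given day.
review_due = add_business_days(date(2026, 5, 30), 10)
```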

| Date | Source | Publication | Vendors/models in portfolio covered | Material findings | Action taken |
|---|---|---|---|---|---|
| 2026-01-15 | Stanford HAI | AI Index 2026 | All major frontier model vendors | No material adverse findings for portfolio vendors. Noted trend toward improved refusal on CBRN queries across ecosystem. | No action required. Added trend note to vendor scorecard. |
| 2026-03-08 | Future of Life Institute | AI Safety Index Q1 2026 | Anthropic (A), OpenAI (B+), Google DeepMind (A-) | OpenAI rating declined from A- to B+ due to reduced transparency on pre-deployment evaluations. | Vendor monitoring flag added to OpenAI entry. PRC-008 re-assessment for GPT-5.5 deployment accelerated. |
| 2026-04-22 | METR | Autonomy Evaluation Report | Claude 3.8, GPT-5.5 (both in portfolio) | Both models showed increased agentic capability scores. Neither exceeded METR's critical autonomy threshold. | Noted in agentic AI deployment readiness assessments. No re-assessment required. |
| 2026-05-30 | Academic (arXiv) | Benchmark: multi-turn jailbreak robustness | All frontier models | Unscheduled publication. One portfolio model showed materially lower multi-turn robustness than vendor-published figures. | Immediate review. Vendor notification issued requesting updated red-team data. Re-assessment initiated under PRC-008. |
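
A log with these six columns can also be kept as structured records, which makes the evidence checks straightforward to automate. The sketch below assumes one record per reviewed publication and adds a hypothetical `material` flag set by the reviewer; the class, field, and function names are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class LogEntry:
    reviewed_on: date
    source: str
    publication: str
    coverage: str           # vendors/models in portfolio covered
    material_findings: str
    action_taken: str
    material: bool          # hypothetical flag: reviewer judged the finding material


def entries_missing_disposition(
    entries: list[LogEntry], as_of: date, window_days: int = 365
) -> list[LogEntry]:
    """Return material entries from the past `window_days` with no recorded action."""
    cutoff = as_of - timedelta(days=window_days)
    return [
        e for e in entries
        if e.reviewed_on >= cutoff and e.material and not e.action_taken.strip()
    ]
```

An empty result for the past 12 months would support the evidence requirement that material findings were dispositioned; it does not by itself show the dispositions were appropriate.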

Active monitoring sources:

  • Stanford HAI AI Index (annual, spring)
  • Future of Life Institute AI Safety Index (periodic; checked at each quarterly review)
  • METR Autonomy Evaluations (periodic)
  • NIST AISIC Evaluation Results (as published)
  • Vendor red-team disclosures (as published; subscribed to vendor RSS feeds and blogs)
  • arXiv AI safety preprints (weekly digest via monitoring service)

Control Details

Control ID: PRC-012
Typical owner: Chief AI Officer / Chief Risk Officer / AI Governance Committee
Implementation effort: Low effort
Agent-relevant: No

Tags

AI safety benchmarks, safety indices, external evaluation, vendor monitoring, third-party assessment, benchmark tracking