AI Safety Index and Benchmark Monitoring
Track external AI safety indices, benchmark ratings, and third-party evaluation results for AI vendors and models used by the organization, and incorporate material findings into the vendor risk assessment and re-assessment cycle.
Objective
Supplement internal vendor assessments with independent, externally-produced safety intelligence so that the organization's vendor risk posture reflects the current state of published safety research and does not depend solely on vendor self-reporting.
Maturity Levels
Initial
The organization does not systematically track external AI safety indices or benchmark publications. Vendor risk is assessed entirely through vendor self-reporting and internal testing.
Developing
Individual team members may monitor safety benchmark publications informally, but findings are not systematically incorporated into the vendor risk program.
Defined
A defined list of external safety indices and benchmark sources is monitored on a regular cadence. Material findings about vendors or models in the organization's portfolio are reviewed by the AI governance function and routed to the vendor risk assessment process when they indicate a change in the risk profile.
Managed
Safety index findings are documented in the vendor risk register. New publications are reviewed within a defined window. Material adverse findings trigger a re-assessment notification under PRC-008. The organization maintains a current view of how its vendors and models perform relative to the external safety landscape.
Optimizing
Safety index monitoring is integrated into the vendor scorecard. The organization uses safety benchmark performance as a factor in model selection. Safety benchmark trends across the AI market inform the organization's AI risk appetite review.
Evidence Requirements
What an auditor or assessor would expect to see for this control.
- Defined monitoring list specifying external safety indices and benchmarks to be tracked, with review frequency.
- Monitoring log documenting publications reviewed, material findings, and disposition for the past 12 months.
- Evidence that material adverse findings were routed to the vendor re-assessment or risk register update process.
Implementation Notes
Why external safety intelligence matters for vendor governance
Vendor due diligence (PRC-001) and safety commitment verification (PRC-006) rely primarily on vendor self-reporting and contractual commitments. Vendors have an inherent incentive to present safety information favorably. External indices and benchmarks provide independent signals that vendors cannot control.
Key external sources to monitor:
Stanford HAI AI Index: Annual report with capability and safety trend data across frontier models. Published each spring. Relevant for understanding the broad safety landscape and identifying patterns across the model ecosystem.
Future of Life Institute AI Safety Index: Evaluates major AI labs on safety practices, transparency, and governance across multiple dimensions. Published periodically. High relevance for enterprise vendor governance: provides a consistent framework for comparing vendor safety practices.
MLCommons AILuminate safety benchmark: Quantitative benchmark of model responses across safety-relevant hazard categories, published by the same consortium that maintains the MLPerf performance benchmarks. Useful for tracking safety alongside capability.
METR Autonomy Evaluations: Model evaluations specifically focused on agentic capability and autonomy. Relevant for organizations deploying AI agents or evaluating agentic vendors.
NIST AISIC Evaluation Results: As the AI Safety Institute Consortium matures its evaluation programs, published results will become authoritative external references for U.S. vendor assessments.
AIR-Bench and HELM Safety: Academic benchmarks covering safety-relevant model behaviors including refusal, bias, and robustness. These change frequently and should be tracked at version level.
Vendor red-team disclosures: Red-team reports and safety evaluation summaries that AI vendors publish for their own models. Although vendor-produced, they are worth monitoring because they reflect the vendor's own assessment of its model's risk profile.
What to watch for
When reviewing external safety intelligence, flag:
- Material changes in a vendor's safety ranking or score relative to prior periods.
- Significant adverse findings about a specific model's behavior in the relevant capability domain.
- Safety evaluation methodologies that suggest the vendor's prior self-assessments were incomplete.
- New evaluation frameworks that cover capability dimensions not addressed in the organization's existing vendor assessments.
- Capability uplift findings suggesting a model can assist in domains the organization has not assessed (for example, CBRN, cyber offense, or manipulation).
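The first flag in the list above, a material change in a vendor's rating between periods, can be checked mechanically when an index reports letter grades. The following is a minimal sketch under that assumption. The grade scale `GRADE_ORDER`, the helper names, and the one-step default threshold are all hypothetical and are not taken from any index's published methodology.

```python
# Hypothetical ordinal scale, lowest to highest. Real indices may grade differently.
GRADE_ORDER = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]


def grade_delta(previous: str, current: str) -> int:
    """Return the change in grade steps; a negative value means a decline."""
    return GRADE_ORDER.index(current) - GRADE_ORDER.index(previous)


def is_material_change(previous: str, current: str, threshold_steps: int = 1) -> bool:
    """Flag a change of at least `threshold_steps` in either direction.

    The threshold is an assumption; each organization should set its own.
    """
    return abs(grade_delta(previous, current)) >= threshold_steps


if __name__ == "__main__":
    # A decline from A- to B+ is one step down on this scale.
    print(grade_delta("A-", "B+"), is_material_change("A-", "B+"))
```

With the default threshold, a one-step decline such as A- to B+ is flagged for review. Raising `threshold_steps` would suppress single-step movements if the organization considers them noise.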
Incorporating findings into vendor governance
Monitoring is only valuable if findings are acted upon. Define clear escalation criteria:
- Finding indicates a model has material safety-relevant capabilities not disclosed by the vendor: trigger re-assessment notification (PRC-008).
- Finding indicates the vendor's safety practices are rated materially worse than disclosed: escalate to vendor safety commitment verification (PRC-006).
- Finding indicates industry-wide safety concern for a model category the organization uses: incorporate into AI risk register and risk appetite review.
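The three escalation criteria above can be expressed as a simple lookup so that every logged finding receives a consistent disposition. This is an illustrative sketch only: the `FindingType` categories, the `Finding` record, and the `route_finding` helper are hypothetical names, and the route descriptions simply restate the criteria listed above.

```python
from dataclasses import dataclass
from enum import Enum


class FindingType(Enum):
    """Hypothetical categories mirroring the three escalation criteria."""
    UNDISCLOSED_CAPABILITY = "undisclosed_capability"
    PRACTICES_WORSE_THAN_DISCLOSED = "practices_worse_than_disclosed"
    INDUSTRY_WIDE_CONCERN = "industry_wide_concern"


# Each finding type maps to exactly one downstream process.
ESCALATION_ROUTES = {
    FindingType.UNDISCLOSED_CAPABILITY: "PRC-008 re-assessment notification",
    FindingType.PRACTICES_WORSE_THAN_DISCLOSED: "PRC-006 safety commitment verification",
    FindingType.INDUSTRY_WIDE_CONCERN: "AI risk register and risk appetite review",
}


@dataclass(frozen=True)
class Finding:
    source: str
    summary: str
    finding_type: FindingType


def route_finding(finding: Finding) -> str:
    """Return the process a material finding should be escalated to."""
    return ESCALATION_ROUTES[finding.finding_type]


if __name__ == "__main__":
    example = Finding(
        source="Example index",
        summary="Safety practices rated worse than vendor disclosure",
        finding_type=FindingType.PRACTICES_WORSE_THAN_DISCLOSED,
    )
    print(route_finding(example))
```

Keeping the mapping in one table makes it easy for an assessor to confirm that every finding category has a defined route, and a new category without a route fails loudly with a `KeyError` rather than being silently dropped.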
Example Implementation
AI Safety Index Monitoring Log (excerpt)
Monitoring cadence: Quarterly review of scheduled publications; ad hoc review for unscheduled publications within 10 business days.
| Date | Source | Publication | Vendors/models in portfolio covered | Material findings | Action taken |
|---|---|---|---|---|---|
| 2026-01-15 | Stanford HAI | AI Index 2026 | All major frontier model vendors | No material adverse findings for portfolio vendors. Noted trend toward improved refusal on CBRN queries across ecosystem. | No action required. Added trend note to vendor scorecard. |
| 2026-03-08 | Future of Life Institute | AI Safety Index Q1 2026 | Anthropic (A), OpenAI (B+), Google DeepMind (A-) | OpenAI rating declined from A- to B+ due to reduced transparency on pre-deployment evaluations. | Vendor monitoring flag added to OpenAI entry. PRC-008 re-assessment for GPT-5.5 deployment accelerated. |
| 2026-04-22 | METR | Autonomy Evaluation Report | Claude 3.8, GPT-5.5 (both in portfolio) | Both models showed increased agentic capability scores. Neither exceeded METR's critical autonomy threshold. | Noted in agentic AI deployment readiness assessments. No re-assessment required. |
| 2026-05-30 | Academic (arXiv) | Benchmark: multi-turn jailbreak robustness | All frontier models | Unscheduled publication. One portfolio model showed materially lower multi-turn robustness than vendor-published figures. | Immediate review. Vendor notification issued requesting updated red-team data. Re-assessment initiated under PRC-008. |
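The 10-business-day window for unscheduled publications can be verified from the log by computing a review deadline for each entry. The sketch below is a simplified illustration: it counts Monday through Friday as business days and ignores public holidays, and the function names `add_business_days` and `review_is_on_time` are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, business_days: int) -> date:
    """Return the date `business_days` working days after `start`.

    Assumes a Monday-to-Friday week and no holiday calendar.
    """
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def review_is_on_time(published: date, reviewed: date, window: int = 10) -> bool:
    """Check whether a publication was reviewed within the defined window."""
    return reviewed <= add_business_days(published, window)


if __name__ == "__main__":
    published = date(2026, 5, 18)  # illustrative publication date, a Monday
    print(add_business_days(published, 10))
    print(review_is_on_time(published, date(2026, 5, 29)))
```

An organization that observes public holidays would need to extend the weekday check with its own holiday calendar before relying on the result as audit evidence.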
Active monitoring sources:
- Stanford HAI AI Index (annual, spring)
- Future of Life Institute AI Safety Index (periodic)
- METR Autonomy Evaluations (periodic)
- NIST AISIC Evaluation Results (as published)
- Vendor red-team disclosures (as published; subscribed to vendor RSS feeds and blogs)
- arXiv AI safety track (weekly digest via monitoring service)
