Sources · Snapshots · Receipts
Where the numbers come from
Every score on this site traces back to a logged run, a versioned task spec, and a snapshot you can audit.
What this page is
- What you'll find: the four data sources behind every number, with audit paths.
- Who it's for: people who want to know why they should trust a leaderboard.
- How to switch: toggle to Nerds mode for the 9-field canonical key and the copy-paste reproduce command.
The trust ledger
Four sources. Each one names what we collected, when we verified it, and how you can audit it.
Run records
3,465 rows from the v0.1 sweep (snapshot 2026-06-20). Every row carries a sha256 of the prompt, the agent's stdout, and the verifier's grade.
Audit path: results/published. Each JSONL trace replays end-to-end with no network access.
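A minimal sketch of how one might spot-check the prompt hashes in a trace file. It assumes each JSONL row stores the prompt text and its digest under the hypothetical field names "prompt" and "prompt_sha256", and that the digest is taken over the UTF-8 bytes; none of these details are confirmed by the published schema.

```python
import hashlib
import json


def prompt_digest(prompt: str) -> str:
    """Hex sha256 of a prompt. UTF-8 encoding is an assumption."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def check_rows(jsonl_text: str) -> list[int]:
    """Return 1-based line numbers whose stored hash does not match the prompt.

    "prompt" and "prompt_sha256" are hypothetical field names.
    """
    mismatches = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines between records
        row = json.loads(line)
        if prompt_digest(row["prompt"]) != row["prompt_sha256"]:
            mismatches.append(lineno)
    return mismatches


# Two made-up rows: the first is consistent, the second has a wrong digest.
good = {"prompt": "fix the bug", "prompt_sha256": prompt_digest("fix the bug")}
bad = {"prompt": "fix the bug", "prompt_sha256": "0" * 64}
sample = json.dumps(good) + "\n" + json.dumps(bad) + "\n"
print(check_rows(sample))  # [2]
```

The check only reads text that is already on disk, so it runs with no network access, like the traces themselves.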
Task catalogue
25 task specs across 3 categories (integrity, mutations, polyglot). Each task pins a language, a setup script, and a grader.
Audit path: site/src/data/tasks.json. Open any /task/[slug] page for the full spec.
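To confirm the catalogue counts yourself, something like the sketch below could work. It assumes tasks.json is a top-level JSON array and that each task has a "category" field; both are guesses about the file layout, not its documented schema.

```python
import json
from collections import Counter


def tasks_per_category(tasks_json: str) -> Counter:
    """Count tasks by category.

    Assumes a top-level list of objects with a hypothetical "category" key.
    """
    tasks = json.loads(tasks_json)
    return Counter(task["category"] for task in tasks)


# Made-up three-task catalogue, for illustration only.
sample = json.dumps([
    {"slug": "a", "category": "integrity"},
    {"slug": "b", "category": "mutations"},
    {"slug": "c", "category": "integrity"},
])
counts = tasks_per_category(sample)
print(sum(counts.values()), len(counts))  # 3 2
```

Run against the real file, the two printed numbers should be 25 and 3 if the layout matches these assumptions.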
Exploit ledger
An honest-loss catalogue of attempts that broke the bench, which we call exploits. Each entry logs the trigger, the grader call, and the date we patched it.
Audit path: results/exploits/_catalog.yaml. The /security page lists the live ones.
Methodology pin
Every number on this site is stamped with methodology_version 0.1 and snapshot_date 2026-06-20. Change either field and the numbers may shift.
Audit path: site/src/data/meta.json. The /methodology page documents the version history.
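Because numbers are only comparable under the same pin, a script that consumes them may want to fail fast when the pin moves. The sketch below assumes meta.json is a flat JSON object holding the two field names quoted above; the function name and the return shape are illustrative.

```python
import json

# The pin this page is stamped with.
EXPECTED_PIN = {"methodology_version": "0.1", "snapshot_date": "2026-06-20"}


def pin_mismatches(meta_text: str, expected: dict[str, str] = EXPECTED_PIN) -> dict[str, tuple]:
    """Return {field: (expected, found)} for every pin field that differs.

    Assumes meta.json is a flat object; values are compared as strings.
    """
    meta = json.loads(meta_text)
    diffs = {}
    for field, want in expected.items():
        found = meta.get(field)
        if found is None or str(found) != want:
            diffs[field] = (want, found)
    return diffs


same = '{"methodology_version": "0.1", "snapshot_date": "2026-06-20"}'
print(pin_mismatches(same))  # {}
```

An empty result means the local data matches the pin; any entry names a field where numbers may not be comparable.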
Signed receipt
This page is pinned to one snapshot
- methodology_version: 0.1
- snapshot_date: 2026-06-20
- source: results/published
- row_count: 3,465
When we publish a new snapshot, we bump snapshot_date and, if any field semantics change, methodology_version as well. Old snapshots stay reachable on /methodology.
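One way to check the receipt's row_count against the published rows is sketched below. It assumes the rows are available as JSONL text with one record per non-blank line; how results/published is split into files is not specified here, so the helper takes text rather than a path.

```python
def parse_count(text: str) -> int:
    """Parse a count written with thousands separators, e.g. "3,465"."""
    return int(text.replace(",", ""))


def count_rows(jsonl_text: str) -> int:
    """Count non-blank lines, treating each one as a single record."""
    return sum(1 for line in jsonl_text.splitlines() if line.strip())


def receipt_matches(receipt: dict[str, str], jsonl_text: str) -> bool:
    """True when the receipt's row_count equals the number of rows found."""
    return parse_count(receipt["row_count"]) == count_rows(jsonl_text)


print(parse_count("3,465"))  # 3465
```

If the rows span several files, concatenating their contents before calling receipt_matches gives the same count under these assumptions.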