Sources · Snapshots · Receipts

Where the numbers come from

Name: CodingAgentBench Sweep Results
Published: 2026-06-20
License: https://creativecommons.org/licenses/by/4.0/

Every score on this site traces back to a logged run, a versioned task spec, and a snapshot you can audit.

What this page is

What you'll find: the four data sources behind every number, with audit paths.
Who it's for: people who want to know why they should trust a leaderboard.
How to switch: toggle to Nerds mode for the 9-field canonical key and the copy-paste reproduce command.

The trust ledger

Four sources. Each one names what we collected, when we verified it, and how you can audit it.

Run records

3,465 rows from the v0.1 sweep (snapshot 2026-06-20). Every row carries a sha256 of the prompt, the agent's stdout, and the verifier's grade.

Audit path: results/published. Each JSONL trace replays end-to-end with no network access.
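A minimal sketch of what a row-level audit could look like: recompute the prompt digest for each JSONL row and compare it to the stored value. The field names `prompt` and `prompt_sha256` are assumptions made for illustration, and the sample rows are invented; check a real trace under results/published for the actual schema.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    """Return the hex sha256 digest of UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_rows(jsonl_text: str) -> list[int]:
    """Return 1-based line numbers of rows whose stored digest does not match."""
    mismatches = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        row = json.loads(line)
        # `prompt` and `prompt_sha256` are hypothetical field names.
        if sha256_hex(row["prompt"]) != row["prompt_sha256"]:
            mismatches.append(line_no)
    return mismatches


# Two made-up rows: the first is consistent, the second has a wrong digest.
sample = "\n".join([
    json.dumps({"prompt": "write two_sum",
                "prompt_sha256": sha256_hex("write two_sum")}),
    json.dumps({"prompt": "write fizzbuzz",
                "prompt_sha256": "0" * 64}),
])

print(audit_rows(sample))
```

An empty list would mean every stored digest matched its prompt. The same pattern would extend to the agent's stdout or the grade if those are hashed as well.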

Task catalogue

25 task specs across 3 categories (integrity, mutations, polyglot). Each task pins a language, a setup script, and a grader.

Audit path: site/src/data/tasks.json. Open any /task/[slug] page for the full spec.
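One way to sanity-check the catalogue counts is to group task ids by category. The sketch below assumes tasks.json is a list of objects with an `id` field, and that the category is the part of the id before the slash (as in polyglot/python-two-sum); both are assumptions, and every id except that one is a made-up placeholder.

```python
import json
from collections import Counter


def tasks_per_category(tasks_json: str) -> Counter:
    """Count tasks per category, taking the category from the id prefix."""
    tasks = json.loads(tasks_json)
    # Assumes ids look like "<category>/<name>"; adjust if the real schema differs.
    return Counter(task["id"].split("/", 1)[0] for task in tasks)


# Hypothetical catalogue excerpt; only the first id appears on this page.
sample = json.dumps([
    {"id": "polyglot/python-two-sum"},
    {"id": "polyglot/example-b"},
    {"id": "integrity/example-a"},
    {"id": "mutations/example-c"},
])

counts = tasks_per_category(sample)
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

Run against the real file, the totals should add up to the 25 specs and 3 categories stated above.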

Exploit ledger

Honest-loss catalogue of attempts that broke the bench (we call them exploits). Each entry logs the trigger, the grader call, and the date we patched it.

Audit path: results/exploits/_catalog.yaml. The /security page lists the live ones.

Methodology pin

Every number on this site is stamped with methodology_version 0.1 and snapshot_date 2026-06-20. Change either field and the numbers may shift.

Audit path: site/src/data/meta.json. The /methodology page documents the version history.

Signed receipt

This page is pinned to one snapshot

methodology_version: 0.1
snapshot_date: 2026-06-20
source: results/published
row_count: 3,465

When we publish a new snapshot, we bump snapshot_date and, if the semantics of any field change, methodology_version too. Old snapshots stay reachable on /methodology.
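The receipt can be checked mechanically. The sketch below parses the key: value lines shown above and compares them against a metadata dict and an observed row count. It assumes meta.json exposes `methodology_version` and `snapshot_date` as strings, which this page does not confirm; the function names are illustrative only.

```python
def parse_receipt(text: str) -> dict[str, str]:
    """Parse 'key: value' lines into a dict, skipping lines without a colon."""
    receipt = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        receipt[key.strip()] = value.strip()
    return receipt


def receipt_matches(receipt: dict[str, str], meta: dict[str, str],
                    row_count: int) -> bool:
    """True when the receipt agrees with the metadata pins and the row count."""
    expected_rows = int(receipt["row_count"].replace(",", ""))  # "3,465" -> 3465
    return (
        receipt["methodology_version"] == meta["methodology_version"]
        and receipt["snapshot_date"] == meta["snapshot_date"]
        and expected_rows == row_count
    )


RECEIPT = """\
methodology_version: 0.1
snapshot_date: 2026-06-20
source: results/published
row_count: 3,465
"""

receipt = parse_receipt(RECEIPT)
# Assumed shape of site/src/data/meta.json, reduced to the two pinned fields.
meta = {"methodology_version": "0.1", "snapshot_date": "2026-06-20"}

print(receipt_matches(receipt, meta, row_count=3465))
```

In a real audit, `row_count` would come from counting the rows under results/published rather than from a literal.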

Read the full methodology

Cell provenance

Every CodingAgentBench leaderboard row is uniquely identified by a 9-field canonical key. Five fields name what was tested; four pin the exact bytes that produced the score. The icon at the end of each leaderboard row opens a popover with all nine values, copyable.

Key fields

Identify the cell uniquely — change any of these and you have a different cell.

tui: TUI. Coding-agent CLI under test (e.g. opencode, aider).
behavior_mode: Behavior mode. Agent personality preset (conv = conversational baseline).
model: Model. Open-weight host model id, e.g. llama-3.3-70b-instruct.
task: Task. Task id from the task registry, e.g. polyglot/python-two-sum.
plugin_stack: Plugin stack. Pinned plugin bundle that ran inside the TUI.

Provenance fields

Pin reproducibility — exact bytes that produced the scores you see.

tui_image_sha: TUI image sha256. Container image digest of the TUI for this run.
model_endpoint_sha: Model endpoint sha256. Digest of the model build / endpoint manifest.
sampling_profile: Sampling profile. Sampling preset id (bundles temp / top_p / seed).
task_added_date: Task added date. Day the task entered the registry — trust-window anchor.

Reproduce any cell

Open the per-row popover on /leaderboard, copy the command, and paste it into your shell. The command pins every dimension of the run — same TUI image, same model endpoint, same sampling preset — so you can reproduce a leaderboard cell byte-for-byte. Example:

codingagentbench run \
  --tui 'opencode' \
  --behavior-mode 'conv' \
  --model 'llama-3.3-70b-instruct' \
  --task 'polyglot/python-two-sum' \
  --plugin-stack 'conv-vanilla' \
  --tui-image-sha 'sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef' \
  --model-endpoint-sha 'nim:meta/llama-3.3-70b-instruct' \
  --sampling-profile 'default-t0.2-tp0.95' \
  --task-added-date '2026-05-01'

Field order matches the popover: tui · behavior_mode · model · task · plugin_stack · tui_image_sha · model_endpoint_sha · sampling_profile · task_added_date.
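To make the mapping from key fields to flags concrete, here is a small sketch that assembles the same command from a dict of the nine values, in popover order. The flag names are taken from the example above (underscores become hyphens); the helper `reproduce_argv` is hypothetical and not part of the CLI.

```python
import shlex

# Popover order: five key fields, then four provenance fields.
FIELD_ORDER = [
    "tui", "behavior_mode", "model", "task", "plugin_stack",
    "tui_image_sha", "model_endpoint_sha", "sampling_profile",
    "task_added_date",
]


def reproduce_argv(cell: dict[str, str]) -> list[str]:
    """Build the argument vector for one cell; refuse to guess missing fields."""
    missing = [name for name in FIELD_ORDER if name not in cell]
    if missing:
        raise KeyError(f"missing key fields: {missing}")
    argv = ["codingagentbench", "run"]
    for name in FIELD_ORDER:
        argv += ["--" + name.replace("_", "-"), cell[name]]
    return argv


cell = {
    "tui": "opencode",
    "behavior_mode": "conv",
    "model": "llama-3.3-70b-instruct",
    "task": "polyglot/python-two-sum",
    "plugin_stack": "conv-vanilla",
    "tui_image_sha": "sha256:" + "0123456789abcdef" * 4,
    "model_endpoint_sha": "nim:meta/llama-3.3-70b-instruct",
    "sampling_profile": "default-t0.2-tp0.95",
    "task_added_date": "2026-05-01",
}

# shlex.join quotes only where the shell needs it, so the output may show
# fewer quotes than the popover's command while meaning the same thing.
print(shlex.join(reproduce_argv(cell)))
```

Building an argument list and quoting it at the end avoids hand-escaping values, and raising on a missing field keeps a partial key from silently producing a different run.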

Missing fields

A field that isn’t recorded for a given cell is shown as —. That signals an honest gap, not an error, usually because the cell predates the methodology version that introduced the field. See /methodology for the per-version field timeline.
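A sketch of the display rule, assuming an unrecorded field is either absent from the cell or stored as None (the page does not say which). The function name `popover_rows` and the sample cell are hypothetical.

```python
MISSING = "—"  # the glyph shown for a field that was not recorded

FIELD_ORDER = [
    "tui", "behavior_mode", "model", "task", "plugin_stack",
    "tui_image_sha", "model_endpoint_sha", "sampling_profile",
    "task_added_date",
]


def popover_rows(cell: dict) -> list[tuple[str, str]]:
    """Return (field, display value) pairs in popover order."""
    rows = []
    for name in FIELD_ORDER:
        value = cell.get(name)  # absent and None are both treated as a gap
        rows.append((name, MISSING if value is None else str(value)))
    return rows


# Hypothetical older cell with no provenance fields recorded.
older_cell = {
    "tui": "opencode",
    "behavior_mode": "conv",
    "model": "llama-3.3-70b-instruct",
    "task": "polyglot/python-two-sum",
    "plugin_stack": "conv-vanilla",
}

for name, value in popover_rows(older_cell):
    print(f"{name}: {value}")
```

Keeping the gap as an explicit placeholder, rather than dropping the field, means all nine positions stay visible and a reader can tell a missing value from an empty one.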