CodingAgentBench

Pairwise effects · 95% CI · BH-FDR

Which differences are real

Two CLIs can look different on the leaderboard and still be a coin flip. We test every pair and tell you which gaps survive.
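To make "which gaps survive" concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure named in the page header (BH-FDR). It assumes you already have one p-value per pairwise comparison. The function name `benjamini_hochberg` and the `alpha=0.05` default are illustrative assumptions, not the site's actual code or settings.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if that comparison survives BH-FDR.

    Illustrative sketch only; name and default alpha are assumptions.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Rank comparisons from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Every comparison ranked at or below the cutoff is declared a real gap.
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            survives[idx] = True
    return survives


# Hypothetical p-values for four pairwise comparisons.
flags = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Because the rule is step-up, a comparison with a middling p-value can still survive when a larger-ranked one clears its threshold, which is why the cutoff is searched over all ranks rather than stopping at the first failure.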

What this page is

  • What you'll find: the three sharpest head-to-head gaps we can defend with statistics, not vibes.
  • Who it's for: people choosing between two CLIs and asking "is the difference real."
  • How to go deeper: switch to Nerds mode for the full forest plot of every pairwise comparison.

The three sharpest gaps


Fork-lineage DAG

Forest

7 roots · 1 fork

Every first-party TUI that CodingAgentBench tracks is a tree root. Community forks attach as branches, indexed by lineage from their parent SHA. Each fork carries a paired-bootstrap Δ-PBS against its parent, an AST-divergence measurement, and an admission tier (T0 Sniff → T3 Champion). Submission details live on the Submit page.
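As a rough illustration of how a paired-bootstrap Δ with a 95% CI can be computed, the sketch below resamples per-task score differences between a fork and its parent. It assumes both were scored on the same tasks in the same order and uses a simple percentile interval. The function name, the resample count, and the percentile method are assumptions for illustration, not the site's documented pipeline.

```python
import random
import statistics


def paired_bootstrap_delta(fork_scores, parent_scores, n_resamples=2000, seed=0):
    """Return (mean_delta, ci_low, ci_high) for fork minus parent.

    Illustrative sketch: percentile bootstrap over paired per-task differences.
    """
    if len(fork_scores) != len(parent_scores):
        raise ValueError("fork and parent must be scored on the same tasks")
    if not fork_scores:
        raise ValueError("need at least one paired task")
    # Pairing: work with per-task differences, not two independent samples.
    diffs = [f - p for f, p in zip(fork_scores, parent_scores)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile 95% interval: 2.5th and 97.5th percentiles of the resampled means.
    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[min(n_resamples - 1, int(0.975 * n_resamples))]
    return statistics.fmean(diffs), ci_low, ci_high


# Hypothetical per-task scores for a fork and its parent.
delta, low, high = paired_bootstrap_delta([0.61, 0.58, 0.70, 0.44], [0.60, 0.59, 0.71, 0.45])
```

An interval that straddles zero, like the one shown on the example fork chip, means the fork cannot be distinguished from its parent at this confidence level.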

T0 · Sniff 0
T1 · Quarantine 1
T2 · Public 0
T3 · Champion 0

aider

1 fork
  • _example · branch · T1 · Quarantine · Δ -0.003 (-0.019, +0.014) · AST 12.7%

    An example fork manifest used by the test-suite and the site's

crush

0 forks
No community forks of crush yet. Be the first to submit one.

goose

0 forks
No community forks of goose yet. Be the first to submit one.

opencode

0 forks
No community forks of opencode yet. Be the first to submit one.

openhands

0 forks
No community forks of openhands yet. Be the first to submit one.

plandex

0 forks
No community forks of plandex yet. Be the first to submit one.

qwen-code

0 forks
No community forks of qwen-code yet. Be the first to submit one.
Legend. Δ-PBS chips show the paired-bootstrap mean difference against the parent, with the 95% CI in parentheses. AST ribbon width is the mean AST divergence, capped at 100%. Bands: patch < 5%, branch 5–25%, hard > 25%.
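The band thresholds in the legend can be sketched as a small lookup. The function name `ast_band` is hypothetical, and treating exactly 5% and exactly 25% as "branch" is one reading of the ranges given in the legend.

```python
def ast_band(divergence_pct):
    """Map a mean AST divergence (in percent) to a band name.

    Illustrative sketch; boundary handling follows the legend's ranges
    (patch < 5%, branch 5-25%, hard > 25%).
    """
    if divergence_pct < 0:
        raise ValueError("divergence cannot be negative")
    # The legend caps displayed divergence at 100%.
    capped = min(divergence_pct, 100.0)
    if capped < 5.0:
        return "patch"
    if capped <= 25.0:
        return "branch"
    return "hard"


# The example fork above reports an AST divergence of 12.7%.
band = ast_band(12.7)
```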