CodingAgentBench

CodingAgentBench · voice charter

We test in public. We will never rank in private.

CodingAgentBench is a measurement project. The five promises below are not aspirations; they are guardrails. If we break one, we publish what broke and what we changed.

Five "we will never" lines

  1. We will never accept payment to score a model or agent higher.
  2. We will never use an LLM as a judge; every scorer is a runner, a linter, or a diff.
  3. We will never silently re-rank; every change to methodology gets a version bump and a dated changelog.
  4. We will never publish a number without a sha256 lineage that lets a stranger rerun the row.
  5. We will never recommend a closed-weight model we cannot test under our own pinned container.

The 30-second elevator pitch

CodingAgentBench is the open benchmark for command-line coding agents. We test ten coding agents on fourteen open-weight models across twenty-five real coding tasks. The agents are Aider, Codex, Copilot, Crush, Goose, OpenHands, opencode, Pi, Plandex, and Qwen-Code.

We rank on three plain signals: Ships, Stays, Costs. Every cell ships with a sha256 receipt, a pinned container, and a public sweep log. You can rerun any row on your own laptop.

No LLM judges. No paid placement. No silent re-ranking. If you are choosing a coding agent for your team, start at codingagentbench.com, pick your stack, and see what actually finishes.
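
A receipt can be as simple as a sha256 digest over a canonical encoding of the cell. The sketch below assumes a cell is a small JSON record. The `receipt_for` helper and every field name are hypothetical, not the project's actual schema.

```python
import hashlib
import json


def receipt_for(cell: dict) -> str:
    """Return a sha256 hex digest over a canonical JSON encoding of the cell."""
    # Sort keys and fix separators so the same cell always hashes the same way.
    canonical = json.dumps(cell, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical cell; every field name and value here is illustrative.
cell = {
    "agent": "Aider",
    "model": "example-open-weight-model",
    "task": "task-001",
    "container": "sha256:<pinned image digest>",
    "passed": True,
}

receipt = receipt_for(cell)

# Same fields in a different order: the receipt should not change.
reordered = dict(reversed(list(cell.items())))
```

Hashing a canonical encoding means a stranger who rebuilds the same record gets the same digest, whatever order the fields arrive in. Change any value and the digest changes.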

Voice contract

  • Sentences cap at twenty-two words. Break the sentence before you break the cap.
  • Tense is present and imperative. We test. You pick.
  • No emoji in headers. No exclamation marks. No all-caps shouting.
  • Numbers carry units and a baseline. "9 points higher pass rate vs. Aider+DeepSeek" beats "9% faster".
  • Plain phrase first, codename in parens. "The instruction-fidelity check (L15)."
  • Words we use: discovery, picks, evidence, sweep, lineage, receipts, ledger, matrix, cell, rerun, diff, hold-out, pinned, hermetic, integrity, blast radius, refusal.
  • Words we avoid: revolutionary, unlock, democratize, 10x, magical, AGI, game-changer, seamless, leverage as a verb, empower.
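
Three of the rules above are mechanical enough to check in code. The sketch below is a minimal linter, assuming plain-text copy and a naive sentence split. The names `lint_voice`, `MAX_WORDS`, and `AVOID` are hypothetical, and the word list matches whole words only.

```python
import re

MAX_WORDS = 22

# Copied from the avoid list above. "leverage" is flagged in any role here,
# although the charter only bans it as a verb. Inflected forms are not caught.
AVOID = [
    "revolutionary", "unlock", "democratize", "10x", "magical",
    "agi", "game-changer", "seamless", "leverage", "empower",
]


def lint_voice(text: str) -> list[str]:
    """Return a list of findings for one block of copy; empty means clean."""
    findings = []

    if "!" in text:
        findings.append("exclamation mark")

    # Naive split after ., ? or ! followed by whitespace.
    # Abbreviations such as "vs." split early, so counts can run low.
    for sentence in re.split(r"(?<=[.?!])\s+", text.strip()):
        count = len(sentence.split())
        if count > MAX_WORDS:
            findings.append(f"sentence has {count} words (cap {MAX_WORDS})")

    lowered = text.lower()
    for word in AVOID:
        # Whole-word match, so "magical" does not also trigger "agi".
        if re.search(rf"\b{re.escape(word)}\b", lowered):
            findings.append(f"avoided word: {word}")

    return findings
```

Run it on a draft and fix each finding before you publish. The tense, units, and codename rules still need a human reader.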