Common questions
Short answers. Switch to Nerds mode for the full methodology debate.
What is CodingAgentBench, in one sentence?
CodingAgentBench is an independent, open-source benchmark that pairs every popular open-source coding CLI with every credible open-weight model and runs them through the same 25 tasks on the same hardware. We publish the raw matrix so you can see which combination ships a clean diff on the first try.
Is this affiliated with Anthropic / OpenAI / GitHub / any AI vendor?
No. CodingAgentBench is built by 88plug AI Lab, an independent measurement project. We take no vendor money in our first year. No sponsored placement, no paid certification, no comped enterprise seats. Vendors will never pay for ranking position. The full policy is in commitment 2 of the manifesto.
Can I trust the numbers?
Every cell is pinned to a Docker image digest, a model build ID, and a run date. Each task runs 3 times so we report a 95% confidence interval, not a single point. Half the tasks are mutation-generated to resist contamination. All raw traces ship as a public dataset. If we change a rule, the methodology version increments.
- 3 runs per cell, 95% CI shown
- Docker sha256 + model build ID per row
- Raw traces published on HuggingFace
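As a rough illustration of what three runs per cell can support, the sketch below computes a two-sided 95% interval from three scores using Student's t with two degrees of freedom. This is an assumption made for illustration only: the FAQ does not state which interval method the benchmark uses, and the function name and sample scores are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 2 degrees of freedom (n = 3 runs).
T_CRIT_95_DF2 = 4.303


def ci95_three_runs(scores):
    """Return (mean, low, high) for exactly three run scores.

    Assumption: a plain t-interval; the benchmark's actual method may differ.
    """
    if len(scores) != 3:
        raise ValueError("this sketch is written for exactly 3 runs")
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(scores)
    half_width = T_CRIT_95_DF2 * sd / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


# Hypothetical scores for one cell, not real benchmark data.
mean, low, high = ci95_three_runs([0.80, 0.84, 0.88])
print(f"mean={mean:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

With only three runs the interval is wide under this method, so two cells whose intervals overlap should probably not be read as a clear win for either.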
How do I use this to pick a tool?
Start with the 60-second quiz. It asks 3 questions about your workload, then points to the top combination for your case. If you prefer to browse, the leaderboard's Normal view sorts by a single quality score (we call it PBS) and shows Ships, Stays, and Costs as plain columns.
Do I need to install all 10 CLIs to use the bench?
No. You install nothing on your host. Each CLI runs inside its own Docker container that we publish. Your host needs Docker and a network path to a model endpoint. The 10 CLIs in the v0.1 sweep are Aider, Codex, Copilot, Crush, Goose, opencode, OpenHands, Pi, Plandex, and Qwen Code.
Does CodingAgentBench cost me anything?
Reading the site is free. Running the bench yourself costs whatever your model endpoint charges. A full 3,500-cell sweep (10 TUIs × 14 models × 25 tasks) takes roughly a day on one H100. If you use a hosted endpoint, expect single-digit dollars per cell at current open-weight prices. The runner code, tasks, and traces are Apache-2.0 and CC-BY-4.0.
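The sweep size follows directly from the matrix dimensions. The sketch below enumerates the cells and applies the 3-runs-per-cell rule mentioned earlier. The CLI names come from this FAQ; the model and task identifiers are placeholders, not the real catalog.

```python
from itertools import product

# CLI names come from the FAQ; model and task IDs are placeholders.
CLIS = ["Aider", "Codex", "Copilot", "Crush", "Goose",
        "opencode", "OpenHands", "Pi", "Plandex", "Qwen Code"]
MODELS = [f"model-{i:02d}" for i in range(1, 15)]   # 14 hypothetical model IDs
TASKS = [f"task-{i:02d}" for i in range(1, 26)]     # 25 hypothetical task IDs
RUNS_PER_CELL = 3

# One cell is one (CLI, model, task) combination.
cells = list(product(CLIS, MODELS, TASKS))
total_runs = len(cells) * RUNS_PER_CELL

print(len(cells))    # 3500
print(total_runs)    # 10500
```

If you pay a hosted endpoint, the number to budget against is the run count, since every cell is executed three times.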
Which model should I bring?
The current sweep covers open-weight models served keyless via NVIDIA NIM (Llama 3.3, GPT-OSS, Qwen3.5, GLM, Nemotron and more). You can plug in any model that exposes an OpenAI-compatible endpoint. The model catalog lists parser flags, serving stacks, and known compatibility issues for each.
What if my favorite CLI isn't on the list?
The v0.1 MVP covers 10 CLIs. Phase 1 candidates and skipped tools are tracked in the TUI catalog with a public reason per skip. You can submit a new CLI via the submission protocol. It must be runnable in Docker and accept a configurable OpenAI-compatible endpoint.
How often do you re-run the tests?
We re-run the sweep on a public cadence and whenever a tracked CLI ships a new release. New runs land as new rows, anchored to a new Docker digest. The previous run stays online for comparison. We never silently overwrite a result. The methodology document records every cadence change.
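One way to picture the "new rows, never overwritten" rule is an append-only log keyed by the pinned identifiers. This is a minimal sketch under stated assumptions: the class names, field names, digests, dates, and scores are all hypothetical and are not taken from the project's code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultRow:
    """One published row; field names are hypothetical."""
    tui: str
    model_build_id: str
    task: str
    image_digest: str
    run_date: str
    score: float

    def key(self):
        return (self.tui, self.model_build_id, self.task,
                self.image_digest, self.run_date)


class AppendOnlyResults:
    """Rows can be added but never replaced or deleted."""

    def __init__(self):
        self._rows = []

    def add(self, row):
        if any(existing.key() == row.key() for existing in self._rows):
            raise ValueError("row already published; add a new dated row instead")
        self._rows.append(row)

    def history(self, tui, model_build_id, task):
        # Every run for this combination, oldest first.
        return [r for r in self._rows
                if (r.tui, r.model_build_id, r.task) == (tui, model_build_id, task)]


# Placeholder digests, dates, and scores for illustration only.
log = AppendOnlyResults()
log.add(ResultRow("Aider", "build-a", "task-01", "sha256:aaa", "2025-01-01", 0.6))
log.add(ResultRow("Aider", "build-a", "task-01", "sha256:bbb", "2025-02-01", 0.7))
print(len(log.history("Aider", "build-a", "task-01")))  # 2
```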
I'm a researcher. Where's the rigorous version?
Switch the page mode to Nerds. You get the 14-layer breakdown (L1 through L14), the 7 Morin operators, the manifesto-defense FAQ, the Pareto fronts, raw confidence intervals, and the v0.1 methodology with image digests. The full corpus and run traces ship under Apache-2.0 and CC-BY-4.0.
Manifesto FAQ
The CodingAgentBench MANIFESTO.md makes choices that will get pushback. This is the long-form defense of the choices that draw the most fire.
This document is a stable target. Edits land at methodology version bumps; the rationale does not drift between releases.
1. Why exclude Claude Code, Codex CLI, and Gemini CLI?
Short answer: They are closed-source, vendor-locked, and model-bundled. The controlled-matrix experiment fails on every axis if these are included.
Longer answer:
CodingAgentBench measures operational performance: whether a TUI finishes a task with a small, on-plan, test-green diff on the first try, at a defensible cost, when paired with a given host model. The unit of measurement is the cross-product TUI × host_model × task. That unit is only meaningful when:
- The TUI is the variable being measured.
- The host model is the variable being controlled.
- Both are observable, swappable, and pinnable.
A vendor TUI that is hard-wired to a single proprietary model is not a TUI in the CodingAgentBench sense — it is a product surface. Asking "how would Claude Code do with an open-weight model underneath?" is not a question the tool will let you answer. The experiment cannot be run. Including such tools forces us to compare (Claude Code + Claude-3.x) against (Aider + an open-weight model), which conflates the TUI axis with the model axis and tells the reader nothing about either.
Closed-source TUIs are also unpinnable in our sense. We cannot freeze a source tree, build a Docker image with a sha256: digest, and reproduce it years later. Vendors update their products silently. Manifesto commitment #4 (pinned versions, dated runs, no silent updates) is structurally incompatible.
We may publish a separate "vendor reference" column in a future methodology version, where Claude Code / Codex CLI / Gemini CLI are run as-is, against their bundled models, on the same task corpus. That column will be explicitly not comparable to the open-vs-open matrix.
2. Why only open-weight host models?
Short answer: Reproducibility, cost, and alignment with the BYO-model community CodingAgentBench serves.
Longer answer:
CodingAgentBench exists to inform practitioners who run their own model endpoints. That community is, in practice, the open-weight community. The decisions an open-weight-model operator needs to make ("which TUI should I pair with my self-hosted endpoint?") are not answered by benchmarks against frontier closed models.
Open weights also give us:
- Build-ID pinning. A HuggingFace commit hash is a permanent reference. A "claude-3-7-sonnet-20250219" string is a vendor's promise.
- Cost predictability. Sweeps that cost a fortune are not run by independent labs. A self-hosted open-weight model on a single H100 is within the budget of a one-room operation; sweeping closed APIs at scale is not.
- Community reproduction. Anyone with the listed model and a comparable GPU can reproduce our numbers. Closed models are reproducible only by paying the vendor again.
This is also the manifesto commitment to honest reporting. If closed models outperform open models on a given task, we can report that — but the comparison must come from a separate, clearly-labelled column with its own reproducibility caveats. Mixing them in the headline matrix would mislead readers about what is measurable, repeatable, and re-runnable.
3. Why Docker for every TUI?
Short answer: Sandbox discipline, host isolation, and deterministic image digests are non-negotiable.
Longer answer:
Coding agents execute arbitrary code. Many TUIs are happy to run pip install, npm install, rm -rf, or shell commands directly against the host they are invoked on. Running such tools natively on a benchmark machine is a security failure waiting to happen, and is not how a professional operator would deploy them in production either.
Docker gives us:
- Capability dropping. --cap-drop=ALL strips Linux capabilities; the agent's container cannot escalate.
- Network isolation. A user-defined bridge with no default route means the agent reaches the model endpoint and nothing else. No DNS exfiltration, no curl | sh from a random host.
- Filesystem isolation. --read-only + a tmpfs overlay means damage is bounded; the agent cannot persist anything between cells.
- Reproducibility. A sha256: image digest is a permanent fingerprint of the entire userland environment the TUI ran in. A score with a digest can be re-run years later.
- Equality of treatment. Every TUI runs under the same sandbox. The Aider container has the same protections as the OpenHands container. No tool gets a privileged exception "because it needs network."
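The flags named in this list can be combined into a single docker run invocation. The sketch below only builds the argument list and does not start a container. The helper name, the --rm flag, the /tmp mount point, and the example image and network names are assumptions added for illustration. Creating a bridge network with no default route is out of scope here, and the project's real harness may be wired differently.

```python
def sandboxed_run_argv(image_ref, network, tui_args=()):
    """Build a `docker run` argv with the sandbox flags named above.

    image_ref and network are supplied by the caller; this sketch does not
    invent a real digest or network name.
    """
    if "@sha256:" not in image_ref:
        raise ValueError("image must be pinned by digest, not by tag")
    return [
        "docker", "run", "--rm",
        "--cap-drop=ALL",        # drop all Linux capabilities
        "--read-only",           # root filesystem is read-only
        "--tmpfs", "/tmp",       # scratch space that vanishes with the container
        "--network", network,    # user-defined network created beforehand
        image_ref,
        *tui_args,
    ]


# Dummy image reference and network name, for illustration only.
example = sandboxed_run_argv(
    "example.invalid/tui@sha256:" + "0" * 64,
    "bench-net",
    ["--help"],
)
print(" ".join(example))
```

Rejecting tag-only references in the helper keeps the digest requirement from depending on operator discipline.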
This is also why no CodingAgentBench document instructs anyone to install a TUI on their host machine. The instruction is always docker pull codingagentbench/<tui>@sha256:.... The host needs Docker, a network path to the model endpoint, and nothing else.
The cost is real — Docker adds container build time, image storage, occasional CI complexity. We accept the cost. The alternative is a benchmark in which the harness is itself a vector for the agent under test to interfere with measurement. That is unacceptable.
4. Why no single Elo score?
Short answer: Goodhart's Law, the LMArena Llama-4 precedent, and the manifesto's multi-axis commitment.
Longer answer:
Single composite scores are inevitable targets for optimization. Once a number is on a leaderboard, every vendor builds toward that number, and the number stops measuring what it once measured. Goodhart's law, in Marilyn Strathern's widely quoted phrasing, "when a measure becomes a target, it ceases to be a good measure," describes the dominant failure mode of every benchmark old enough to be gamed.
The recent precedent is LMArena and Llama 4. In early 2025 a Meta-affiliated submission of a Llama 4 variant to LMArena's leaderboard turned out to be a model finetuned specifically for human-preference voting on that leaderboard, with output style and behavior tuned to the LMArena Elo signal. The leaderboard's headline number went up; the underlying capability of the public-weights model did not match. The lesson is structural: a single composite Elo is the easiest target to game.
CodingAgentBench publishes per-axis numbers as the canonical view:
- pass-rate
- tokens-per-correct-task
- wall-clock-to-merge
- blast-radius
- refusal-of-bad-task
- integrity-probe resistance
We publish a composite (METHODOLOGY §6) for ranking convenience, but the manifesto names it as a derivative view. The leaderboard layout enforces this: a reader cannot see only the composite; the per-axis breakdown is always shown alongside it.
The "we recommend / we do not recommend" editorial calls (manifesto commitment #7) are also a hedge against composite-tyranny. A composite says "this is rank 3." Editorial nerve says "we recommend this tool if you care about token cost and can tolerate higher blast-radius." Those are different statements. The benchmark provides both.
5. Why no vendor sponsorship in year one?
Short answer: Every measurement organization in adjacent fields that took vendor money has had its credibility damaged. We are not novel; we are paying attention.
Longer answer:
Three independent precedents from the broader measurement-industry world:
- G2. Software-review marketplace. The site's review-volume signal and "leader" badges are widely understood by buyers to be influenced by vendor spend on the platform. The signal-to-noise ratio is degraded for budget-conscious software selection.
- DXOMARK. Smartphone-camera benchmark. Once a respected technical reference, now widely perceived to be influenced by paid partnerships with the phone manufacturers they review. The scores' credibility eroded as the funding model became known.
- Gartner. Industry analyst. Magic Quadrant placement is the canonical example of a metric whose movement is widely (and often correctly) attributed to vendor engagement spend rather than pure capability assessment.
These organizations did useful work. They also accepted money from the vendors they ranked, and their public credibility paid for it. We do not believe CodingAgentBench would be immune. We believe the only durable defense is to refuse the money in the early years, build a reputation for independence, and then introduce a tiered vendor advisory program after the independence is established, structured so that vendors cannot pay for placement.
"After year one" is therefore a hard line. In year one, no vendor money. After year one, a tiered advisory program may exist; vendors can fund operations but cannot fund ranking. The structural separation is documented in advance so the rules cannot be quietly bent.
This is also why the project is housed at 88plug AI Lab and not at a vendor, an accelerator, or an industry consortium. The lab is independent. Year one is bootstrapped from the founder's own time and the community's contributions.
6. Why is CodingAgentBench's scoring so complicated?
Short answer: Real engineering is complicated. Simplification is the problem we are trying to solve, not the goal.
Longer answer:
The complaint comes in two flavors. The friendly version is "I just want to know which agent is best." The unfriendly version is "you are overcomplicating this to feel important." Both are addressed by the same answer.
CodingAgentBench's scoring has fourteen layers (L1–L14) and a composite (CodingAgentBenchScore) because it adopts Edgar Morin's framework of complex thought as its explicit epistemological foundation. See morin-framework.md for the full exposition. The short version: coding-agent measurement has been dominated by single-judge benchmarks (SWE-bench, Aider Polyglot, Terminal-Bench, LMArena) that all share one structural flaw — they assume agent quality is separable, that one truth and one judge can deliver one verdict. That assumption is wrong, and the wrongness shows up in real practitioner experience: an agent that scores high on a single-judge benchmark is often the agent you would never deploy.
Real engineering work is irreducibly complex. The agent that passes the tests may have written tests too weak to discriminate the fix from a wrong fix. The agent that fails may have detected a defect in the test. The agent that "wins" may behave differently once it notices it is being evaluated. None of these realities is captured by a single number. Morin's framework was built between 1977 and 2004, decades before this benchmark existed, precisely for measuring systems where reduction loses the signal. We adopted it because the alternative is to publish a number that misleads readers, and the manifesto (commitment 3, honest reporting) does not allow that.
CodingAgentBench's complexity is therefore not a bug. The benchmark is as simple as honest measurement permits, and not one operator simpler. If you want a single number anyway — read FAQ #8.
7. Isn't this just academic posturing?
Short answer: No. The Morin framework forces a concrete divergence in the leaderboard that no other benchmark surfaces — and that divergence is exactly what practitioners need to make deployment decisions.
Longer answer:
Academic posturing is when a framework adds vocabulary without changing the output. The Morin framework changes the output. Here is the concrete case:
Consider two real agents in our v0.2 sweep. Agent A scores RSI=0.95 (passes the hidden tests, survives mutation, no security findings) and MCI=0.20 (writes tests that don't kill its own mutants, ignores feedback loops, behaves wildly differently when it knows it's being watched). Agent B scores RSI=0.70 and MCI=0.85. A traditional single-judge leaderboard ranks A first and B second; the engineer who deploys A discovers a week later that A's tests pass green but the code is full of behaviors no team member would have accepted in code review. The pass-rate did not lie — it just did not measure what mattered.
The divergence between RSI and MCI is the truth. CodingAgentBenchScore composites the two such that neither can rescue the other; the leaderboard surfaces both axes side-by-side with the composite; the editorial calls (manifesto commitment 7) explain what the divergence means for the reader's specific deployment context. Without the Morin framework, this divergence is invisible. With it, the divergence is the headline.
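To make the ranking flip concrete, the sketch below ranks the two agents described above by RSI alone and then by a non-compensatory composite. The geometric mean is used purely as a stand-in, an assumption for illustration: it is not the published CodingAgentBenchScore formula, which is defined in the methodology document.

```python
import math

# Index values are the ones quoted in the example above.
agents = {
    "A": {"rsi": 0.95, "mci": 0.20},
    "B": {"rsi": 0.70, "mci": 0.85},
}


def geometric_composite(rsi, mci):
    """Stand-in composite: geometric mean of the two indices.

    Assumption: NOT the published CodingAgentBenchScore formula.
    """
    return math.sqrt(rsi * mci)


by_rsi = sorted(agents, key=lambda name: agents[name]["rsi"], reverse=True)
by_composite = sorted(
    agents,
    key=lambda name: geometric_composite(**agents[name]),
    reverse=True,
)

print(by_rsi)        # ['A', 'B']
print(by_composite)  # ['B', 'A']

# A perfect score on one index cannot rescue a zero on the other.
print(geometric_composite(1.0, 0.0))  # 0.0
```

Any multiplicative composite has this property; the choice of geometric mean here is only the simplest one to read.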
If that's posturing, it is posturing that visibly changes which agents practitioners deploy. We will take the trade.
8. What if I just want a single number?
Short answer: CodingAgentBenchScore is a single number. It is derived from simultaneous performance on two orthogonal indices, so gaming one index degrades the other and the composite with it.
Longer answer:
We hear the request and we do publish a headline composite. CodingAgentBenchScore is a scalar in [0, 1]. It appears in every leaderboard row. You can sort by it.
The reason the underlying machinery is complex is that we built the composite to be Goodhart-defeating. The two component indices — Reality Survival Index (RSI, from layers L1–L9) and Morin Coherence Index (MCI, from layers L10–L14) — measure orthogonal aspects of agent behavior. Gaming RSI (by overfitting to the hidden test format) breaks MCI (because the agent's test-mutation kill rate, observer-stability, and continuation behavior expose the overfit). Gaming MCI (by performing for the observer or writing surface-level idiomatic code) breaks RSI (because the underlying correctness, security, and downstream-caller checks fail). The composite is intentionally constructed so that gaming any single thing forces a degradation on at least one other thing.
This is the systemic operator from Morin's framework made operational. The composite has the structural property that no agent can rise on it by gaming a sub-axis. The single number you sort by is therefore meaningful in a way that single-judge scores have historically failed to be.
If you want to dig deeper, every CodingAgentBenchScore is published alongside the full per-axis breakdown — fourteen layer scores, the RSI, the MCI, and the composite. The leaderboard layout enforces this: you cannot see only the composite. But the headline column is one number, and you can rank by it. That request is honored.
See morin-framework.md for the full framework, and ../METHODOLOGY.md §15 for the CodingAgentBenchScore formula.
Questions not in this FAQ
If your question about a CodingAgentBench choice is not answered here, open a discussion or issue on the project's GitHub. If the question recurs, it lands in this FAQ at the next methodology version bump. The FAQ grows by accretion; entries are not removed once added — they remain in the historical record of which questions the project has had to answer.