CodingAgentBench · about
Who runs CodingAgentBench
CodingAgentBench is the open benchmark for command-line coding agents on open-weight models. We rank 10 CLIs across 14 models on 25 held-out tasks. Every cell ships with a receipt.
The lab
CodingAgentBench is built and operated by 88plug AI Lab. Andrew Mello leads the project. Contact: [email protected]. The lab self-funds infrastructure today; see funding disclosure for the full ledger.
We are not a vendor. We do not sell a CLI. We do not sell a model. We measure other people's tools and publish what we find.
What we do not take
Money buys nothing on CodingAgentBench. We refuse every form of vendor leverage over rankings or methodology.
- Payment to score a tool higher.
- Paid placement on the leaderboard.
- Embargoed favorable coverage.
- Vendor influence over methodology version bumps.
- Closed-weight models we cannot test in our own pinned container.
What we publish
Everything that drives a number is in the open. If a claim cannot be traced to an artifact in two clicks, it does not ship.
- Code: Apache-2.0
- Dataset: CC-BY-4.0
- Sweep ledger: Public log of every run
- Container digests: sha256 per row
- Citation handle: Zenodo DOI
Who CodingAgentBench is for
Three audiences, three paths.
- The Practitioner: Engineer or tech lead shopping for a CLI plus model combo. Land at the stack picker and pick in sixty seconds.
- The Researcher: Paper author or evals lead. Start at methodology and cite via /cite.
- The Lab: Model release team. Submit a sweep and embed a CodingAgentBench cell on your model card on launch day.