CodingAgentBench · about

Who runs CodingAgentBench

CodingAgentBench is the open benchmark for command-line coding agents on open-weight models. We rank 10 CLIs across 14 models on 25 held-out tasks. Every cell ships with a receipt.

The lab

CodingAgentBench is built and operated by 88plug AI Lab. Andrew Mello leads the project. Contact: [email protected]. The lab self-funds its infrastructure today; see the funding disclosure for the full ledger.

We are not a vendor. We do not sell a CLI. We do not sell a model. We measure other people's tools and publish what we find.

What we do not take

Money buys nothing on CodingAgentBench. We refuse every form of vendor leverage over rankings or methodology.

Payment to score a tool higher.
Paid placement on the leaderboard.
Embargoed favorable coverage.
Vendor influence over methodology version bumps.
Closed-weight models we cannot test in our own pinned container.

What we publish

Everything that drives a number is in the open. If a claim cannot be traced to an artifact in two clicks, it does not ship.

Code: Apache-2.0
Dataset: CC-BY-4.0
Sweep ledger: Public log of every run
Container digests: sha256 per row
Citation handle: Zenodo DOI

Who CodingAgentBench is for

Three audiences, three paths.

The Practitioner

Engineer or tech lead shopping for a CLI plus model combo. Land at the stack picker and pick in sixty seconds.

The Researcher

Paper author or evals lead. Start at methodology and cite via /cite.

The Lab

Model release team. Submit a sweep and embed a CodingAgentBench cell on your model card on launch day.
