CodingAgentBench

Who runs CodingAgentBench

CodingAgentBench is the open benchmark for command-line coding agents on open-weight models. We rank 10 CLIs across 14 models on 25 held-out tasks. Every cell ships with a receipt.

The lab

CodingAgentBench is built and operated by 88plug AI Lab. Andrew Mello leads the project. Contact: [email protected]. The lab self-funds infrastructure today; see funding disclosure for the full ledger.

We are not a vendor. We do not sell a CLI. We do not sell a model. We measure other people's tools and publish what we find.

What we do not take

Money buys nothing on CodingAgentBench. We refuse every form of vendor leverage over rankings or methodology.

  • Payment to score a tool higher.
  • Paid placement on the leaderboard.
  • Embargoed favorable coverage.
  • Vendor influence over methodology version bumps.
  • Closed-weight models we cannot test in our own pinned container.

What we publish

Everything that drives a number is in the open. If a claim cannot be traced to an artifact in two clicks, it does not ship.

  • Code: Apache-2.0
  • Dataset: CC-BY-4.0
  • Sweep ledger: public log of every run
  • Container digests: sha256 per row
  • Citation handle: Zenodo DOI

Who CodingAgentBench is for

Three audiences, three paths.

  • The Practitioner

    Engineer or tech lead shopping for a CLI plus model combo. Land at the stack picker and pick in sixty seconds.

  • The Researcher

    Paper author or evals lead. Start at methodology and cite via /cite.

  • The Lab

    Model release team. Submit a sweep and embed a CodingAgentBench cell on your model card on launch day.