CodingAgentBench

88plug AI Lab Methodology v0.1

Which CLI codes best with your model?

Open benchmark for open coding agents on open models — same sandbox, real test runners, no LLM judges. Stop choosing your coding agent from screenshots.

We grade ten CLI harnesses against each other on 25 real tasks. Pick what matters — speed, cost, or accuracy — and see your pick in five seconds.

  • 3465 cells
  • 15 layers
  • 4 axes
  • BYO model
  • Open methodology

Which CLI ships a defensible diff on the first try?

We grade CLI harnesses against each other under a controlled matrix: held-out tasks, pinned containers, 15 deterministic layers, and a 4-axis Pareto comparison. Bring your own model endpoint and see which agents finish.
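
A minimal sketch of what a 4-axis Pareto comparison could look like. The page does not name the four axes, so the ones used here (pass rate, tokens per correct solution, latency, blast radius) are an assumption borrowed from the metrics listed under commitment 6. The `Cell` type, the harness names, and every number are hypothetical illustration data, not benchmark results.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """One scored cell. Field names are assumptions, not the benchmark's schema."""
    harness: str               # hypothetical harness label
    pass_rate: float           # higher is better
    tokens_per_correct: float  # lower is better
    latency_s: float           # lower is better
    blast_radius: float        # lower is better


def costs(cell: Cell) -> tuple:
    # Negate pass_rate so that "lower is better" holds on every axis.
    return (-cell.pass_rate, cell.tokens_per_correct,
            cell.latency_s, cell.blast_radius)


def dominates(a: Cell, b: Cell) -> bool:
    """a dominates b if it is no worse on every axis and better on at least one."""
    ca, cb = costs(a), costs(b)
    no_worse = all(x <= y for x, y in zip(ca, cb))
    better_somewhere = any(x < y for x, y in zip(ca, cb))
    return no_worse and better_somewhere


def pareto_front(cells: list) -> list:
    """Keep every cell that no other cell dominates."""
    return [c for c in cells if not any(dominates(o, c) for o in cells)]


# Made-up numbers for illustration only.
cells = [
    Cell("agent-a", 0.80, 1200, 40, 3),
    Cell("agent-b", 0.70, 1500, 55, 4),  # worse than agent-a on all four axes
    Cell("agent-c", 0.60, 800, 30, 2),   # less accurate, but cheaper and faster
]
front = pareto_front(cells)
print([c.harness for c in front])  # ['agent-a', 'agent-c']
```

In this toy data no single harness wins on every axis, so two remain on the front. That is why a Pareto view reports a set of defensible picks rather than one ranked winner.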

The seven commitments

  1. Open weights, open methodology, open data

    Apache-2.0 code, CC-BY-4.0 data, versioned methodology, pre-announced changes.

  2. Independence from vendors

    No vendor money in year one. No paid placement. Ever.

  3. Honest reporting regardless of which way the numbers fall

    If a popular tool is broken on local models, we publish that. The leaderboard wins by being correct.

  4. Pinned versions, dated runs, no silent updates

    Every score is anchored to a commit hash, model build, and run date. No silent reranking.

  5. Contamination-resistant tasks

    Half the corpus is AST-mutated from real codebases with rotating private seeds. CVE-style embargo for benchmark integrity.

  6. Multi-axis, not single Elo

    Pass-rate, tokens-per-correct, latency, blast-radius, refusal, integrity. Composites are derivative views, not the truth.

  7. Editorial nerve

    Per quarter, per persona: a 'we recommend' or 'we do not recommend' call when the data supports one.
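
To make commitment 5 concrete, here is a toy sketch of a seeded AST mutation using Python's standard `ast` module (`ast.unparse` requires Python 3.9+). It only renames function parameters, which changes the surface text of a task while preserving behavior. The benchmark's actual mutation operators and seed handling are not described on this page, so `RenameParams`, `mutate`, and the naming scheme are hypothetical.

```python
import ast
import random


class RenameParams(ast.NodeTransformer):
    """Rename function parameters to seed-dependent names (illustrative only)."""

    def __init__(self, seed: int):
        # One tag per seed: the same seed always yields the same names.
        self.tag = f"{random.Random(seed).randrange(16**6):06x}"
        self.mapping = {}

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Parameters are visited before the function body, so the mapping
        # is complete by the time the body's Name nodes are reached.
        if node.arg not in self.mapping:
            self.mapping[node.arg] = f"v{len(self.mapping)}_{self.tag}"
        node.arg = self.mapping[node.arg]
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Rewrite uses of renamed parameters; leave other names untouched.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node


def mutate(source: str, seed: int) -> str:
    """Parse, rename parameters deterministically for this seed, and unparse."""
    tree = RenameParams(seed).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


original = "def add(a, b):\n    return a + b\n"
mutated = mutate(original, seed=7)

# The mutated code still behaves like the original.
namespace = {}
exec(mutated, namespace)
print(namespace["add"](2, 3))  # 5
```

This sketch keeps one name mapping for the whole module and ignores keyword-argument call sites, so it is only safe on small, self-contained snippets. A real mutator would need scope analysis, and keeping seeds private means the published text of a task cannot simply be memorized.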

10 open TUIs in the matrix
14 open-weight host models
3465 scored cells