88plug AI Lab Methodology v0.1

Which CLI codes best with your model?

Open benchmark for open coding agents on open models — same sandbox, real test runners, no LLM judges. Stop choosing your coding agent from screenshots.

We grade ten CLI harnesses against each other on 25 real tasks. Pick what matters — speed, cost, or accuracy — and see your pick in five seconds.

LanguageModel

No cells match this stack yet. Try another language or model.

Not sure where to start? Take the 30-second quiz →

Best for your use case

6 picks

Best for new Python code

Top pass rate among 56 solvers on polyglot tasks in python.

pass-rate 1.00 · composite 0.97 · 3 tasks

Best for refactoring TypeScript

aider

Top pass rate among 9 solvers on polyglot tasks in typescript.

pass-rate 1.00 · composite 0.86 · 3 tasks

Best for Go scripts

opencode

Top pass rate among 50 solvers on polyglot tasks in go.

pass-rate 1.00 · composite 0.96 · 3 tasks

Best for following AGENTS.md exactly

crush

Top integrity among 82 solvers on integrity probes.

pass-rate 1.00 · composite 0.69 · 5 tasks

Best for fixing bugs

qwen-code

Top pass rate among 67 solvers on mutation-tested repo bugs.

pass-rate 1.00 · composite 0.93 · 5 tasks

Best all-rounder

opencode

Top composite among 52 solvers on highest composite across every task.

pass-rate 0.92 · composite 0.82 · 25 tasks

3465cells
15layers
4axes
BYOyour model
Openmethodology

88plug AI Lab Methodology v0.1

Which CLI ships a defensible diff on the first try?

Open benchmark for open coding agents on open models — same sandbox, real test runners, no LLM judges. We grade CLI harnesses against each other under a controlled matrix: held-out tasks, pinned containers, 15 deterministic layers, 4-axis Pareto. Bring your model endpoint. See the agents that finish.

See the matrix Read the methodology Submit a TUI

The seven commitments

07 / 07

Open weights, open methodology, open data

Apache-2.0 code, CC-BY-4.0 data, versioned methodology, pre-announced changes.
Independence from vendors

No vendor money in year one. No paid placement. Ever.
Honest reporting regardless of which way the numbers fall

If a popular tool is broken on local models, we publish that. The leaderboard wins by being correct.
Pinned versions, dated runs, no silent updates

Every score is anchored to a commit hash, model build, and run date. No silent reranking.
Contamination-resistant tasks

Half the corpus is AST-mutated from real codebases with rotating private seeds. CVE-style embargo for benchmark integrity.
Multi-axis, not single Elo

Pass-rate, tokens-per-correct, latency, blast-radius, refusal, integrity. Composites are derivative views, not the truth.
Editorial nerve

Per quarter, per persona: a 'we recommend' or 'we do not recommend' call when the data supports one.

open TUIs in the matrix

open-weight host models

3465

scored cells

Which CLI codes best with your model?

Best for your use case

Which CLI ships a defensible diff on the first try?

The seven commitments

Open weights, open methodology, open data

Independence from vendors

Honest reporting regardless of which way the numbers fall

Pinned versions, dated runs, no silent updates

Contamination-resistant tasks

Multi-axis, not single Elo

Editorial nerve

Keyboard shortcuts

Navigation

Command palette

Page