88plug AI Lab Methodology v0.1
Which CLI codes best with your model?
Open benchmark for open coding agents on open models — same sandbox, real test runners, no LLM judges. Stop choosing your coding agent from screenshots.
We grade ten CLI harnesses against each other on 25 real tasks. Pick what matters — speed, cost, or accuracy — and see your pick in five seconds.
Best for your use case
6 picks
Top pass rate among 56 solvers on polyglot tasks in python.
pass-rate 1.00 · composite 0.97 · 3 tasks
Top pass rate among 9 solvers on polyglot tasks in typescript.
pass-rate 1.00 · composite 0.86 · 3 tasks
Top pass rate among 50 solvers on polyglot tasks in go.
pass-rate 1.00 · composite 0.96 · 3 tasks
Top integrity among 82 solvers on integrity probes.
pass-rate 1.00 · composite 0.69 · 5 tasks
Top pass rate among 67 solvers on mutation-tested repo bugs.
pass-rate 1.00 · composite 0.93 · 5 tasks
Top composite among 52 solvers across every task.
pass-rate 0.92 · composite 0.82 · 25 tasks
- 3465 cells
- 15 layers
- 4 axes
- BYO your model
- Open methodology
Which CLI ships a defensible diff on the first try?
We grade CLI harnesses against each other under a controlled matrix: held-out tasks, pinned containers, 15 deterministic layers, and a 4-axis Pareto comparison. Bring your model endpoint. See the agents that finish.
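As a rough illustration of what a 4-axis Pareto comparison means, the sketch below keeps every harness that no other harness beats on all axes at once. The choice of four axes (pass rate, tokens per correct solution, latency, blast radius), the `Cell` record, and the sample numbers are all assumptions made for this example; they are not taken from the published methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One hypothetical (harness, model) result. Field names are illustrative."""
    name: str
    pass_rate: float           # higher is better
    tokens_per_correct: float  # lower is better
    latency_s: float           # lower is better
    blast_radius: float        # lower is better


def _key(c: Cell) -> tuple:
    # Flip signs so that "larger is better" holds on every axis.
    return (c.pass_rate, -c.tokens_per_correct, -c.latency_s, -c.blast_radius)


def dominates(a: Cell, b: Cell) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    ka, kb = _key(a), _key(b)
    return all(x >= y for x, y in zip(ka, kb)) and any(x > y for x, y in zip(ka, kb))


def pareto_front(cells: list) -> list:
    """Keep the cells that no other cell dominates."""
    return [c for c in cells if not any(dominates(o, c) for o in cells)]


# Made-up sample data, for illustration only.
cells = [
    Cell("agent-a", 0.92, 40_000, 95.0, 3.0),
    Cell("agent-b", 0.80, 25_000, 60.0, 2.0),
    Cell("agent-c", 0.78, 30_000, 70.0, 2.5),
]
front = pareto_front(cells)
print([c.name for c in front])
```

Here agent-c drops out because agent-b is at least as good on every axis, while agent-a and agent-b both stay: one is more accurate, the other cheaper and faster. A single composite score would have to pick between them.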
The seven commitments
- Open weights, open methodology, open data
  Apache-2.0 code, CC-BY-4.0 data, versioned methodology, pre-announced changes.
- Independence from vendors
  No vendor money in year one. No paid placement. Ever.
- Honest reporting regardless of which way the numbers fall
  If a popular tool is broken on local models, we publish that. The leaderboard wins by being correct.
- Pinned versions, dated runs, no silent updates
  Every score is anchored to a commit hash, model build, and run date. No silent reranking.
- Contamination-resistant tasks
  Half the corpus is AST-mutated from real codebases with rotating private seeds. CVE-style embargo for benchmark integrity.
- Multi-axis, not single Elo
  Pass-rate, tokens-per-correct, latency, blast-radius, refusal, integrity. Composites are derivative views, not the truth.
- Editorial nerve
  Per quarter, per persona: a 'we recommend' or 'we do not recommend' call when the data supports one.
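To make the "AST-mutated with rotating private seeds" commitment more concrete, here is a minimal sketch of seeded AST mutation using Python's standard `ast` module. It is not the lab's actual mutator: the `CompareMutator` class, the `mutate` helper, the operator swap table, and the `rate` parameter are hypothetical names chosen for this example. The point is only that the same seed always yields the same mutated task, while a new seed yields a different variant of the same source.

```python
import ast
import random

# Hypothetical swap table: each comparison flips to its off-by-one neighbour.
SWAPS = {ast.Lt: ast.LtE, ast.LtE: ast.Lt, ast.Gt: ast.GtE, ast.GtE: ast.Gt}


class CompareMutator(ast.NodeTransformer):
    """Flip comparison operators, choosing which ones with a seeded RNG."""

    def __init__(self, seed: int, rate: float = 0.5):
        self.rng = random.Random(seed)  # private seed: rotate it to get new variants
        self.rate = rate
        self.mutations = 0

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)  # mutate nested comparisons first
        new_ops = []
        for op in node.ops:
            swap = SWAPS.get(type(op))
            if swap is not None and self.rng.random() < self.rate:
                new_ops.append(swap())
                self.mutations += 1
            else:
                new_ops.append(op)
        node.ops = new_ops
        return node


def mutate(source: str, seed: int, rate: float = 0.5) -> tuple:
    """Return (mutated source, number of mutations) for the given seed."""
    tree = ast.parse(source)
    mutator = CompareMutator(seed, rate)
    tree = mutator.visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree), mutator.mutations  # ast.unparse needs Python 3.9+


SRC = "def f(a, b):\n    return a < b\n"
mutated, count = mutate(SRC, seed=1, rate=1.0)
print(mutated)
```

With `rate=1.0` every eligible operator is flipped, so `a < b` becomes `a <= b`. With a lower rate, the seed decides which comparisons change, which is what lets a private seed hide the exact bug from anyone who has seen the original repository.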