CodingAgentBench · preview
The real sweep ships soon
You are looking at a sample preview. The numbers render, but they are synthetic and are not the final rankings. We email you once when the real sweep lands. Nothing else.
One email when it lands
We send one email, ever, and never share the list. No newsletter, no drip, no upsell.
In the sample preview today
- Seven CLIs and three models wired into the matrix.
- Thirty-five tasks live in the corpus, but seeds are not yet rotated.
- Pareto, leaderboard, and cube pages render synthetic preview cells.
- Drafts of the Methodology v0.1 paper and the 15-layer Reality Engine spec.
- All trust artifacts: code, dataset schema, container digest format.
In the real sweep
- Five-hundred-plus cells from a full pinned-container sweep.
- Confidence intervals from k=3 reruns per cell, BCa bootstrap.
- Public sha256 lineage for every row, container, and seed.
- Held-out task seeds with a CVE-style embargo schedule.
- Day-1 model launch numbers in lab partner blog posts.
- Zenodo DOI cut against the v0.1 methodology pin.
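As a rough illustration of what "sha256 lineage for every row, container, and seed" could look like, here is a minimal sketch. It is an assumption and not the published scheme: the `row_lineage` function, the field names, and the canonical-JSON encoding are all hypothetical. The idea is that each result row gets a digest bound to the container digest and seed that produced it, so anyone can recompute and compare it.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def row_lineage(row: dict, container_digest: str, seed: int) -> str:
    """Hypothetical lineage hash for one result row.

    Serializes the row together with its container digest and seed as
    canonical JSON (sorted keys, compact separators) so the same inputs
    always hash to the same value, regardless of dict key order.
    """
    payload = json.dumps(
        {"row": row, "container": container_digest, "seed": seed},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return sha256_hex(payload)


if __name__ == "__main__":
    # Illustrative values only; not real benchmark data.
    container = "sha256:" + sha256_hex(b"example-container-image")
    row = {"cli": "example-cli", "model": "example-model", "task": "t001", "pass": True}
    print(row_lineage(row, container, seed=7))
```

Changing the row, the container digest, or the seed changes the digest, which is the property a reader would rely on when checking a published row against its claimed inputs.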