CodingAgentBench

OLAP cube · 3,465 cells

Slice the data any way you want

All 3,465 cells. Pick a TUI, a model, a task family. Get instant answers.

What this page is

  • What you'll find: every (CLI × model × task) cell scored on quality, latency, and cost.
  • Who it's for: people picking a CLI and model combo that wins on their workload.
  • How to switch: toggle to Nerds mode for the full pivot table, heatmap, and brush filters.

Top three winners

  1. opencode + openai/gpt-oss-120b wins 8/25 tasks at $0.0000 per task and 14.8s wall time (strongest on polyglot: 7).
  2. pi + openai/gpt-oss-120b wins 5/25 tasks at $0.0000 per task and 12.6s wall time (strongest on polyglot: 4).
  3. copilot + mistralai/mistral-small-4-119b-2603 wins 2/25 tasks at $0.0000 per task and 10.7s wall time (strongest on mutations: 1).

A combo wins a task when its composite score (we call it PBS) is the highest of any CLI on that task. Cost is a proxy, not a billed amount: tokens-per-correct priced at a flat $1.0 / MTok (we call it the flat-rate roll-up). Real per-token prices land via harness/tracing/cost.py.
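The win rule and the flat-rate proxy can be sketched in a few lines of Python. This is an illustration only, not the code in harness/tracing/cost.py: the function names, the tuple layout, and the sample scores are hypothetical, and tie handling (first-seen combo keeps the task) is an assumption the page does not state.

```python
from collections import Counter

# The $1.0 / MTok proxy rate stated above.
FLAT_RATE_USD_PER_MTOK = 1.0


def flat_rate_cost(tokens_per_correct: float) -> float:
    """Proxy cost in USD: tokens-per-correct priced at the flat rate."""
    return tokens_per_correct / 1_000_000 * FLAT_RATE_USD_PER_MTOK


def count_wins(cells):
    """Count task wins per (cli, model) combo.

    cells: iterable of (cli, model, task, pbs) tuples (hypothetical layout).
    A combo wins a task when its PBS is the highest seen for that task.
    """
    best = {}  # task -> (pbs, (cli, model))
    for cli, model, task, pbs in cells:
        # Assumption: strict '>' means the first-seen combo keeps a tie.
        if task not in best or pbs > best[task][0]:
            best[task] = (pbs, (cli, model))
    return Counter(combo for _, combo in best.values())


# Made-up scores, for illustration only.
sample = [
    ("opencode", "openai/gpt-oss-120b", "task-a", 0.91),
    ("pi", "openai/gpt-oss-120b", "task-a", 0.84),
    ("pi", "openai/gpt-oss-120b", "task-b", 0.77),
]
wins = count_wins(sample)
```

Because the rate is flat, the proxy ranks combos by token efficiency alone; it says nothing about what a given provider would actually charge.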

Where the wins land

[Heatmap: task wins per (CLI, model) pair. Rows (CLIs): Aider, Crush, Goose, opencode, OpenHands, Plandex, Qwen-Code. Columns (models): Qwen3.5, GPT-OSS 120B, Llama 3.3 70B. Summary: 7 CLIs, 3 models, 25 tasks, peak 1.]

Each cell is one (CLI, model) pair. Darker cells win more tasks. The full cube has 25 tasks across three categories (polyglot, mutations, integrity).
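One way to turn per-pair win counts into heatmap shading is to scale each cell by the peak count. The sketch below assumes linear scaling, which the page does not confirm; the function name and the sample counts are hypothetical.

```python
def heatmap_grid(wins, clis, models):
    """Return one row per CLI of shading intensities in [0, 1].

    wins: dict mapping (cli, model) -> number of tasks won.
    Assumption: intensity is wins divided by the peak count (linear scale).
    """
    peak = max(wins.values(), default=0)
    return [
        [wins.get((cli, model), 0) / peak if peak else 0.0 for model in models]
        for cli in clis
    ]


# Made-up counts, for illustration only.
example = heatmap_grid(
    {("Aider", "Qwen3.5"): 2, ("Goose", "GPT-OSS 120B"): 1},
    clis=["Aider", "Goose"],
    models=["Qwen3.5", "GPT-OSS 120B"],
)
```

Pairs with no wins map to 0.0, so an empty row stays blank instead of raising an error.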


OLAP cube · DuckDB-Wasm · 3,465 cells

Cube Explorer

Slice the benchmark cube on TUI × model × task × plugin-stack. Switch between table, heatmap, scatter, and parallel coordinates. The URL is the source of truth, so every view is shareable.
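Treating the URL as the source of truth means the whole view state round-trips through the query string. A minimal sketch using Python's standard library follows; the parameter names (tui, model, category, chart) and the base URL are hypothetical, not the explorer's actual scheme.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def view_to_url(base: str, view: dict) -> str:
    """Serialize a view into a URL; keys are sorted so equal views give equal URLs."""
    return f"{base}?{urlencode(sorted(view.items()))}"


def url_to_view(url: str) -> dict:
    """Recover the view from a URL, keeping the first value of each parameter."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


# Hypothetical view state and base URL.
view = {
    "tui": "opencode",
    "model": "openai/gpt-oss-120b",
    "category": "polyglot",
    "chart": "heatmap",
}
url = view_to_url("https://example.invalid/cube", view)
restored = url_to_view(url)
```

Sorting the keys keeps links canonical, so two people looking at the same slice end up sharing the same URL.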

v0.1 · 2026-06-20
Methodology v0.1 | Pinned to image digests as of 2026-06-20
3465 rows · 10 TUIs · 14 models · 3 categories