Field guide

The brains we tested

14 open-weight models. You bring one; we plug every agent into it and report what happened — same prompts, same harness.

Gpt-Oss-120b

huggingface.co/openai

OpenAI's open-weight model for general coding tasks.

120B-class openai's open-weight gpt model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Step-3.5-Flash

huggingface.co/stepfun-ai

StepFun's open-weight model for everyday coding tasks.

open-weight open-weight model evaluated on codingagentbench, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Qwen3-Next-80b

huggingface.co/Qwen

Alibaba's open-weight model for multilingual coding.

80B-class alibaba's open-weight qwen model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Qwen3.5-397b

huggingface.co/Qwen

Alibaba's open-weight model for multilingual coding.

397B-class alibaba's open-weight qwen model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Minimax-M2.7

huggingface.co/minimaxai

minimaxai's open-weight model for everyday coding tasks.

open-weight open-weight model evaluated on codingagentbench, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Nemotron-3-Ultra-550b

huggingface.co/nvidia

NVIDIA's open-weight model for reasoning-heavy coding tasks.

550B-class nvidia's open-weight reasoning model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Nemotron-3-Super-120b

huggingface.co/nvidia

NVIDIA's open-weight model for reasoning-heavy coding tasks.

120B-class nvidia's open-weight reasoning model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Mistral-Small-4-119b-2603

huggingface.co/mistralai

Mistral AI's open-weight model for everyday coding tasks.

119B-class mistral's open-weight model, served keyless via NVIDIA NIM and evaluated across 230 (TUI×task) cells on the board.

Llama-3.3-70b

huggingface.co/meta-llama

Meta's open-weight model for everyday coding tasks.

70B-class meta's open-weight llama model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Minimax-M3

huggingface.co/minimaxai

minimaxai's open-weight model for everyday coding tasks.

open-weight open-weight model evaluated on codingagentbench, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Llama-4-Maverick-17b-128e

huggingface.co/meta-llama

Meta's open-weight model for everyday coding tasks.

17B-class meta's open-weight llama model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Glm-5.1

huggingface.co/zai-org

Zhipu AI's open-weight model for multilingual coding.

open-weight zhipu's open-weight glm model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Llama-3.3-Nemotron-Super-49b-V1

huggingface.co/nvidia

NVIDIA's open-weight model for reasoning-heavy coding tasks.

49B-class nvidia's open-weight reasoning model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.

Qwen3.5-122b

huggingface.co/Qwen

Alibaba's open-weight model for multilingual coding.

122B-class alibaba's open-weight qwen model, served keyless via NVIDIA NIM and evaluated across 235 (TUI×task) cells on the board.

Host Model Catalog

CodingAgentBench scores the ten CLI coding agents head-to-head on a shared matrix of open-weight host models. Every model below is open-weight per manifesto commitment #1. Closed-weight models may appear in a future "reference" column but never as a row in the open-vs-open matrix.

All host models in the current sweep are served keyless through NVIDIA NIM, fronted by a local LiteLLM proxy so every TUI talks to one OpenAI-compatible endpoint. That keeps the serving layer identical across agents — the only thing that varies within a model column is the CLI wrapper, which is the comparison CodingAgentBench exists to make.

For each model the board records, per (TUI × task) cell:

Pass rate — fraction of held-out tasks whose runner/lint/diff checks pass. Never an LLM judge.
Composite — the derivative 4-axis score (quality, cost, latency, blast-radius).
Evidence — how many cells that model appears in. Models with more cells carry more of the board.

Numbers below are read directly from the live leaderboard and refresh on every methodology version bump.

Host models in the current sweep

The board spans 14 open-weight models from DeepSeek, Meta, Mistral AI, Moonshot AI, NVIDIA, Alibaba (Qwen), StepFun, and Zhipu AI — each present in every (TUI × task) cell. Because coverage is uniform, every column is equally load-bearing.

Model	Provider	Class
`deepseek-ai/deepseek-v4-flash`	DeepSeek	MoE, fast
`deepseek-ai/deepseek-v4-pro`	DeepSeek	MoE, flagship
`meta/llama-3.3-70b-instruct`	Meta	Dense 70B
`meta/llama-4-maverick-17b-128e-instruct`	Meta	MoE
`mistralai/mistral-small-4-119b-2603`	Mistral AI	Dense
`moonshotai/kimi-k2.6`	Moonshot AI	MoE
`nvidia/llama-3.3-nemotron-super-49b-v1`	NVIDIA	Dense, reasoning
`nvidia/nemotron-3-super-120b-a12b`	NVIDIA	MoE
`openai/gpt-oss-120b`	OpenAI	Dense
`qwen/qwen3-next-80b-a3b-instruct`	Alibaba (Qwen)	MoE
`qwen/qwen3.5-122b-a10b`	Alibaba (Qwen)	MoE
`stepfun-ai/step-3.5-flash`	StepFun	MoE, fast
`stepfun-ai/step-3.7-flash`	StepFun	MoE
`z-ai/glm-5.1`	Zhipu AI	MoE

The live leaderboard is the source of truth for every model's measured pass rate and composite. Provider home pages: DeepSeek, Meta, Mistral AI, Moonshot AI, NVIDIA, Qwen, StepFun, Zhipu AI.

Pass rate alone does not crown a model here — CodingAgentBench compares agents, holding the model fixed down each column. A model with a modest pass rate is still a valid, fair substrate for asking "which CLI got the most out of it."

Why these, and why served this way

One serving layer, many agents. Routing every TUI through the same NIM-backed proxy removes serving-stack variance as a confound. Differences within a model column are attributable to the agent, not to a different vLLM flag or a different host.
Architectural spread. The roster mixes dense and Mixture-of-Experts models across several labs so the matrix does not collapse onto a single family — single-family overfit cannot dominate the ranking.
Keyless and reproducible. Because the host endpoint is keyless NIM, a community runner can reproduce a cell without negotiating private credentials; the per-cell receipt records the model id and endpoint identity it ran against.

Candidates evaluated and held

Self-hosted very-large coders (400B+ class)

Too expensive to serve for the published sweep at the cadence CodingAgentBench reruns cells. May appear in a community-run row where the runner absorbs the cost. (An earlier 480B-class Qwen-coder host was retired from the roster when its NIM endpoint reached end-of-life.)

Closed-weight frontier models

Out of scope for the open-vs-open matrix by manifesto commitment #1. A separate "reference" column may be added later, clearly labeled and never mixed into the open ranking.

Models without a documented tool-call format

The TUIs in the matrix expect structured tool calls. Forcing a model into "imitate JSON tool calls in free-text" mode degrades operational performance in ways that are not the agent's fault and corrupts the controlled-experiment design. Only models with a parser-supported tool-call format are hosted.

How to nominate a new host model

Open an issue with the model-request label including:

HuggingFace slug or other canonical URL
License (must be open-weight)
Parameter count and architecture class (dense / MoE)
Tool-call format + parser availability
Why it should be added (architectural diversity / capability ceiling / cost frontier)

Methodology version bumps are the merge window for new host models. See METHODOLOGY.md on versioning.

Model facts in this catalog are re-verified at each methodology version bump; the prior snapshot is archived alongside the methodology version it was scored under. Measured pass rates and composites always come from the live board, not from this page.

The brains we tested

Gpt-Oss-120b

Step-3.5-Flash

Qwen3-Next-80b

Qwen3.5-397b

Minimax-M2.7

Nemotron-3-Ultra-550b

Nemotron-3-Super-120b

Mistral-Small-4-119b-2603

Llama-3.3-70b

Minimax-M3

Llama-4-Maverick-17b-128e

Glm-5.1

Llama-3.3-Nemotron-Super-49b-V1

Qwen3.5-122b

Host Model Catalog

Host models in the current sweep

Why these, and why served this way

Candidates evaluated and held

Self-hosted very-large coders (400B+ class)

Closed-weight frontier models

Models without a documented tool-call format

How to nominate a new host model

Keyboard shortcuts

Navigation

Command palette

Page