Gpt-Oss-120b
huggingface.co/openai
OpenAI's open-weight model for general coding tasks.
120B-class openai's open-weight gpt model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
Field guide
14 open-weight models. You bring one; we plug every agent into it and report what happened — same prompts, same harness.
huggingface.co/openai
OpenAI's open-weight model for general coding tasks.
120B-class openai's open-weight gpt model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/stepfun-ai
StepFun's open-weight model for everyday coding tasks.
open-weight open-weight model evaluated on codingagentbench, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/Qwen
Alibaba's open-weight model for multilingual coding.
80B-class alibaba's open-weight qwen model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/Qwen
Alibaba's open-weight model for multilingual coding.
397B-class alibaba's open-weight qwen model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/minimaxai
minimaxai's open-weight model for everyday coding tasks.
open-weight open-weight model evaluated on codingagentbench, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/nvidia
NVIDIA's open-weight model for reasoning-heavy coding tasks.
550B-class nvidia's open-weight reasoning model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/nvidia
NVIDIA's open-weight model for reasoning-heavy coding tasks.
120B-class nvidia's open-weight reasoning model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/mistralai
Mistral AI's open-weight model for everyday coding tasks.
119B-class mistral's open-weight model, served keyless via NVIDIA NIM and evaluated across 230 (TUI×task) cells on the board.
huggingface.co/meta-llama
Meta's open-weight model for everyday coding tasks.
70B-class meta's open-weight llama model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/minimaxai
minimaxai's open-weight model for everyday coding tasks.
open-weight open-weight model evaluated on codingagentbench, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/meta-llama
Meta's open-weight model for everyday coding tasks.
17B-class meta's open-weight llama model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/zai-org
Zhipu AI's open-weight model for multilingual coding.
open-weight zhipu's open-weight glm model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/nvidia
NVIDIA's open-weight model for reasoning-heavy coding tasks.
49B-class nvidia's open-weight reasoning model, served keyless via NVIDIA NIM and evaluated across 250 (TUI×task) cells on the board.
huggingface.co/Qwen
Alibaba's open-weight model for multilingual coding.
122B-class alibaba's open-weight qwen model, served keyless via NVIDIA NIM and evaluated across 235 (TUI×task) cells on the board.
CodingAgentBench scores the ten CLI coding agents head-to-head on a shared matrix of open-weight host models. Every model below is open-weight per manifesto commitment #1. Closed-weight models may appear in a future "reference" column but never as a row in the open-vs-open matrix.
All host models in the current sweep are served keyless through NVIDIA NIM, fronted by a local LiteLLM proxy so every TUI talks to one OpenAI-compatible endpoint. That keeps the serving layer identical across agents — the only thing that varies within a model column is the CLI wrapper, which is the comparison CodingAgentBench exists to make.
For each model the board records, per (TUI × task) cell:
Numbers below are read directly from the live leaderboard and refresh on every methodology version bump.
The board spans 14 open-weight models from DeepSeek, Meta, Mistral AI, Moonshot AI, NVIDIA, Alibaba (Qwen), StepFun, and Zhipu AI — each present in every (TUI × task) cell. Because coverage is uniform, every column is equally load-bearing.
| Model | Provider | Class |
|---|---|---|
deepseek-ai/deepseek-v4-flash | DeepSeek | MoE, fast |
deepseek-ai/deepseek-v4-pro | DeepSeek | MoE, flagship |
meta/llama-3.3-70b-instruct | Meta | Dense 70B |
meta/llama-4-maverick-17b-128e-instruct | Meta | MoE |
mistralai/mistral-small-4-119b-2603 | Mistral AI | Dense |
moonshotai/kimi-k2.6 | Moonshot AI | MoE |
nvidia/llama-3.3-nemotron-super-49b-v1 | NVIDIA | Dense, reasoning |
nvidia/nemotron-3-super-120b-a12b | NVIDIA | MoE |
openai/gpt-oss-120b | OpenAI | Dense |
qwen/qwen3-next-80b-a3b-instruct | Alibaba (Qwen) | MoE |
qwen/qwen3.5-122b-a10b | Alibaba (Qwen) | MoE |
stepfun-ai/step-3.5-flash | StepFun | MoE, fast |
stepfun-ai/step-3.7-flash | StepFun | MoE |
z-ai/glm-5.1 | Zhipu AI | MoE |
The live leaderboard is the source of truth for every model's measured pass rate and composite. Provider home pages: DeepSeek, Meta, Mistral AI, Moonshot AI, NVIDIA, Qwen, StepFun, Zhipu AI.
Pass rate alone does not crown a model here — CodingAgentBench compares agents, holding the model fixed down each column. A model with a modest pass rate is still a valid, fair substrate for asking "which CLI got the most out of it."
Too expensive to serve for the published sweep at the cadence CodingAgentBench reruns cells. May appear in a community-run row where the runner absorbs the cost. (An earlier 480B-class Qwen-coder host was retired from the roster when its NIM endpoint reached end-of-life.)
Out of scope for the open-vs-open matrix by manifesto commitment #1. A separate "reference" column may be added later, clearly labeled and never mixed into the open ranking.
The TUIs in the matrix expect structured tool calls. Forcing a model into "imitate JSON tool calls in free-text" mode degrades operational performance in ways that are not the agent's fault and corrupts the controlled-experiment design. Only models with a parser-supported tool-call format are hosted.
Open an issue with the model-request label including:
Methodology version bumps are the merge window for new host models. See METHODOLOGY.md on versioning.
Model facts in this catalog are re-verified at each methodology version bump; the prior snapshot is archived alongside the methodology version it was scored under. Measured pass rates and composites always come from the live board, not from this page.