CodingAgentBench

Field guide

Meet the contenders

Ten coding agents — nine open-source, one proprietary with BYO-model support. We tested them all on the same tasks, with the same models, and recorded what each one did.

Aider

aider.chat

Aider-AI's terse, git-aware editor for clean, diff-by-diff commits.

Asks before it edits, then stages a tidy commit. The diff-first agent that treats your repo like a co-author would.

opencode

github.com/sst/opencode

sst's polished TypeScript TUI for interactive multi-file refactors.

A modern terminal cockpit with first-class MCP support and a BYO-model surface. Active maintenance, sensible defaults.

Goose

github.com/block/goose

block's MCP-native Rust agent for tool-heavy automation workflows.

Funded by Block. Speaks MCP fluently and treats external tools as first-class citizens, not afterthoughts.

OpenHands

github.com/All-Hands-AI/OpenHands

All-Hands-AI's sandboxed runtime for long-horizon, end-to-end coding tasks.

Runs the agent inside a real sandbox with browser, shell, and editor. Built to finish multi-hour jobs unattended.

Crush

github.com/charmbracelet/crush

charmbracelet's slick keyboard-first TUI for quick interactive edits.

The Charm studio's coding agent. Lean, opinionated, and visually delightful. Built for the terminal you already live in.

Plandex

github.com/plandex-ai/plandex

plandex-ai's plan-first agent for sweeping multi-step changes you can review.

Drafts an explicit plan before touching files. Branches, reviews, and rollbacks are first-class, not bolted on.

Qwen Code

github.com/QwenLM/qwen-code

QwenLM's reference CLI for the Qwen coder model family.

Alibaba's own harness for their coder models. A useful calibration baseline when comparing community TUIs head to head.

Codex CLI

github.com/openai/codex

openai's open-source agentic CLI for BYO-model autonomous one-shot tasks.

OpenAI's own Apache-2.0 coding agent. BYO any OpenAI-compatible endpoint via config overrides — no OpenAI auth required. Runs an autonomous exec loop that plans, edits, and verifies.

GitHub Copilot CLI

github.com/features/copilot

github's Copilot CLI for BYOK-powered open-weight model testing.

GitHub's official terminal coding agent. As of the April 2026 BYOK release, any OpenAI-compatible provider plugs in via env vars. The largest installed-base terminal assistant, now testable on open-weight models.

Pi

github.com/badlogic/pi-mono

badlogic's minimal, token-efficient harness for hackable BYO-model runs.

The smallest, clearest coding-agent harness you can adapt. Open-source (MIT), token-efficient, and built to be forked and hacked on.

TUI Catalog

Long-form evaluation log of every coding-agent TUI / CLI that CodingAgentBench has considered. Each entry includes the GitHub URL, license, primary implementation language, an approximate star count (as of 2026-05, in ~K notation — we do not claim precision), BYO-model and MCP support flags, the CodingAgentBench decision (MVP / Phase 1 / Phase 2 / Skip), and a one-sentence rationale.

Star counts are point-in-time snapshots and will drift. Where a count is heavily rounded, it is because finer precision would not add signal.
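
As an illustration of the entry fields listed above, here is a minimal sketch of one catalog record as a Python data structure. The class and field names (CatalogEntry, Decision, stars_k, and so on) are hypothetical and chosen for this example; they are not taken from the CodingAgentBench codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    """The four catalog decisions named in this document."""
    MVP = "MVP"
    PHASE_1 = "Phase 1"
    PHASE_2 = "Phase 2"
    SKIP = "Skip"


@dataclass(frozen=True)
class CatalogEntry:
    # All field names are assumptions made for this sketch.
    name: str
    url: str
    license: str
    language: str
    stars_k: Optional[int]  # approximate count in thousands; None when not applicable
    byo_model: bool
    mcp_support: str        # free text, e.g. "Yes", "Limited", "Roadmapped"
    decision: Decision
    rationale: str

    def stars_label(self) -> str:
        """Render the count in the catalog's ~K notation."""
        return "N/A" if self.stars_k is None else f"~{self.stars_k}K"


# Values copied from the sst/opencode entry in this catalog.
opencode = CatalogEntry(
    name="sst/opencode",
    url="https://github.com/sst/opencode",
    license="MIT",
    language="TypeScript",
    stars_k=50,
    byo_model=True,
    mcp_support="Yes",
    decision=Decision.MVP,
    rationale="Most-starred modern open TUI with a clean BYO-model surface.",
)
```

Storing the star count as a rounded integer in thousands keeps the record honest about its precision, and a None value covers entries such as GitHub Copilot CLI where no count applies.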

For graveyard entries (archived, abandoned, pivoted, vendor-retired) see graveyard.md. For closed-source vendor TUIs, see the FAQ.


MVP — Phase 0 sweep (v0.1, May 2026)

Ten TUIs. All support a configurable OpenAI-compatible endpoint and are Dockerizable. Nine are open-source; GitHub Copilot CLI is proprietary but BYO-model capable via its April 2026 BYOK release.

sst/opencode

  • URL: https://github.com/sst/opencode
  • License: MIT
  • Language: TypeScript
  • Stars (~2026-05): ~50K
  • BYO-model: Yes (OpenAI-compatible endpoint, configurable model name)
  • MCP support: Yes
  • Decision: MVP
  • Rationale: Most-starred modern open TUI, active maintenance, clean BYO-model surface — the obvious anchor entry.

Aider — Aider-AI/aider

  • URL: https://github.com/Aider-AI/aider
  • License: Apache-2.0
  • Language: Python
  • Stars (~2026-05): ~40K
  • BYO-model: Yes (any LiteLLM-compatible backend including local OpenAI-compatible)
  • MCP support: Limited (third-party plugins; native support roadmapped)
  • Decision: MVP
  • Rationale: The canonical "edit-in-place" CLI agent; widely-cited reference harness in prior benchmarks (Aider-Polyglot). Mature, well-documented, predictable.

block/goose

  • URL: https://github.com/block/goose
  • License: Apache-2.0
  • Language: Rust
  • Stars (~2026-05): ~20K
  • BYO-model: Yes (OpenAI-compatible + Ollama + Anthropic + Bedrock)
  • MCP support: Yes (first-class)
  • Decision: MVP
  • Rationale: Block-funded, MCP-native, distinct architecture from the JS/Python pack — important for axis diversity in the matrix.

All-Hands-AI/OpenHands

  • URL: https://github.com/All-Hands-AI/OpenHands
  • License: MIT
  • Language: Python
  • Stars (~2026-05): ~50K
  • BYO-model: Yes (LiteLLM under the hood)
  • MCP support: Yes
  • Decision: MVP
  • Rationale: Heavyweight autonomous-agent category leader, strong SWE-bench results historically — needs to be in the matrix to quantify operational performance at the heavy-agent end.

charmbracelet/crush

  • URL: https://github.com/charmbracelet/crush
  • License: FSL-1.1-MIT
  • Language: Go
  • Stars (~2026-05): ~10K (and rising fast)
  • BYO-model: Yes
  • MCP support: Yes
  • Decision: MVP
  • Rationale: From the Bubble Tea / Charm ecosystem, a modern Go TUI with strong UX engineering. Source-available FSL license is acceptable for our purposes; flagged as "source-available, not OSI" in results.

plandex-ai/plandex

  • URL: https://github.com/plandex-ai/plandex
  • License: MIT (server components) / AGPL-3.0 (CLI client)
  • Language: Go
  • Stars (~2026-05): ~12K
  • BYO-model: Yes (OpenAI-compatible)
  • MCP support: Roadmapped
  • Decision: MVP
  • Rationale: Multi-step planning specialization is distinct from the edit-in-place pattern. Important for measuring whether planning improves operational performance or wastes tokens.

QwenLM/qwen-code

  • URL: https://github.com/QwenLM/qwen-code
  • License: Apache-2.0
  • Language: TypeScript (fork of Gemini CLI lineage)
  • Stars (~2026-05): ~12K
  • BYO-model: Yes (default tuned for Qwen-Coder, accepts arbitrary OpenAI-compatible endpoints)
  • MCP support: Yes
  • Decision: MVP
  • Rationale: Vendor-tuned for Qwen models but model-agnostic in practice, which makes it important for a like-for-like test of the question "does the vendor's matched harness ship better diffs from their own model than competing TUIs do?"

openai/codex

  • URL: https://github.com/openai/codex
  • License: Apache-2.0
  • Language: TypeScript
  • Stars (~2026-05): ~30K
  • BYO-model: Yes — custom model provider injected via -c model_providers.* config overrides; points at any OpenAI-compatible endpoint including the harness LiteLLM proxy. No OpenAI auth required in BYO mode.
  • MCP support: Yes (tool-use via Responses API; multi-agent / browser / computer-use plugins disabled for the benchmark run)
  • Decision: MVP (elevated from Skip when Apache-2.0 + BYO-model support shipped in 2026)
  • Rationale: OpenAI's own reference local coding agent, now fully open-source and model-agnostic. Indispensable for the matrix: if the vendor's own harness underperforms another TUI on an OpenAI-family model, that is a signal worth measuring.
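
The -c override mechanism mentioned in the BYO-model line can be sketched as a small helper that turns dotted config keys into repeated -c key=value arguments. This is only a sketch: the provider id "bench", the base_url sub-key, and the localhost URL are assumptions for illustration and should be checked against the Codex configuration reference before use.

```python
import json
from typing import Dict, List


def config_overrides(settings: Dict[str, str]) -> List[str]:
    """Turn dotted config keys into repeated `-c key=value` arguments."""
    args: List[str] = []
    for key, value in settings.items():
        # json.dumps yields a double-quoted string, which also reads as a
        # TOML basic string for simple values like names and URLs.
        args += ["-c", f"{key}={json.dumps(value)}"]
    return args


# Hypothetical provider id, sub-key, and endpoint; not taken from the source.
overrides = config_overrides({
    "model_provider": "bench",
    "model_providers.bench.base_url": "http://localhost:4000/v1",
})
```

The resulting list would be appended to whatever command line the harness uses to launch the agent; that command line is deliberately left out here because the source does not specify it.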

github/copilot-cli

  • URL: https://github.com/github/copilot-cli
  • License: Proprietary (GitHub / Microsoft)
  • Language: TypeScript (npm: @github/copilot)
  • Stars (~2026-05): N/A (distributed as CLI tool, not a starred OSS repo)
  • BYO-model: Yes. As of the April 7, 2026 BYOK release, any OpenAI-compatible provider can be wired in via COPILOT_PROVIDER_TYPE / COPILOT_PROVIDER_BASE_URL / COPILOT_PROVIDER_API_KEY / COPILOT_MODEL. With COPILOT_OFFLINE=true there is no GitHub auth or telemetry, and the harness endpoint is the only egress.
  • MCP support: No. The custom agents / MCP surface emits a no-op warning in benchmark mode.
  • Decision: MVP (elevated from Skip when BYO-provider support shipped April 2026)
  • Rationale: The largest installed-base terminal coding assistant in the world. Evaluating it on the open-weight matrix answers a question many developers have: how does Copilot CLI perform when you swap its hosted model for an open alternative?
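
The environment variables named in the BYO-model line could be assembled as in the sketch below. The variable names come from the entry above; the helper name copilot_byok_env, the default provider type value "openai", and the example endpoint and model name are assumptions for illustration only.

```python
from typing import Dict


def copilot_byok_env(
    base_url: str,
    api_key: str,
    model: str,
    provider_type: str = "openai",  # assumed value; the source names only the variable
    offline: bool = True,
) -> Dict[str, str]:
    """Build the environment overrides for a BYOK Copilot CLI run."""
    env = {
        "COPILOT_PROVIDER_TYPE": provider_type,
        "COPILOT_PROVIDER_BASE_URL": base_url,
        "COPILOT_PROVIDER_API_KEY": api_key,
        "COPILOT_MODEL": model,
    }
    if offline:
        # Per the entry above, offline mode means no GitHub auth or telemetry.
        env["COPILOT_OFFLINE"] = "true"
    return env


# Hypothetical endpoint and model name.
env = copilot_byok_env("http://localhost:4000/v1", "sk-local", "my-open-model")
```

A harness would typically merge this dictionary over a copy of os.environ before starting the container or subprocess, so that nothing else in the environment is lost.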

badlogic/pi-mono

  • URL: https://github.com/badlogic/pi-mono
  • License: MIT
  • Language: TypeScript (npm: @mariozechner/pi-coding-agent)
  • Stars (~2026-05): ~3K
  • BYO-model: Yes — custom provider block in ~/.pi/agent/models.json; openai-completions API wired to any OpenAI Chat-Completions endpoint. No Responses API, no telemetry.
  • MCP support: No (minimal by design — "smallest, clearest harness you can adapt")
  • Decision: MVP
  • Rationale: Deliberate minimalism is a distinct design axis. Pi's claim is that a stripped-down, token-efficient harness extracts more signal per dollar. Including it lets the matrix test that claim directly.

Phase 1 — candidates for the next sweep

Real projects, viable on paper, that need a smoke pass and Dockerfile before they enter the matrix.

Toad — Will McGugan (Textualize)

  • URL: https://github.com/Textualize/toad (placeholder; project may live under willmcgugan/toad depending on final namespace)
  • License: Expected MIT or Apache-2.0
  • Language: Python
  • Stars (~2026-05): Early/pre-launch
  • BYO-model: Designed as a universal frontend with a pluggable backend.
  • MCP support: TBD
  • Decision: Phase 1
  • Rationale: Will McGugan's positioning of Toad as a "universal frontend" for coding agents is structurally interesting: if it can drive Aider, opencode, or goose underneath, it is a meta-TUI category that CodingAgentBench will need to handle. It will be added when the first stable release lands.

Kilo Code — Kilo-Org/kilocode

  • URL: https://github.com/Kilo-Org/kilocode
  • License: Apache-2.0 (verify on submission)
  • Language: TypeScript
  • Stars (~2026-05): ~10K (inheriting Cline/Roo lineage)
  • BYO-model: Yes
  • MCP support: Yes
  • Decision: Phase 1
  • Rationale: Successor to Roo Code (see graveyard.md) and continuation of the Cline lineage. Large existing user base. Needs Dockerfile + adapter; will enter at next sweep.

DeepSeek-Reasonix (single-backend specialist)

  • URL: Placeholder — project is a specialized harness with cache pre-warming.
  • License: Expected open-source.
  • Language: Python/Go (TBD)
  • Stars (~2026-05): Pre-launch / niche
  • BYO-model: Single-backend by design (specialist)
  • MCP support: TBD
  • Decision: Phase 1, with caveat
  • Rationale: Single-backend specialists violate the "configurable model endpoint" rule in spirit, but are interesting as an operational-performance upper bound — what is achievable when the harness can pre-warm caches and pre-tune prompts to a specific model? Will be benchmarked but flagged with a "specialist" tag so its numbers are not directly compared to general TUIs.

Phase 2 — future consideration

Tools we are aware of but have not yet evaluated in depth. No commitment.

MetaGPT — geekan/MetaGPT

Multi-agent role-playing framework. May or may not fit the controlled-matrix model. On the watchlist (see graveyard.md).

Cline — cline/cline

Originally a VS Code extension; CLI-mode maturity is uncertain. Lineage continued through Kilo Code (Phase 1).

Cursor headless mode

Closed-source. See Skip section.


Skip — explicitly out of scope

Closed-source vendor TUIs

The following are vendor-locked, model-bundled, closed-source, or all three. They are out of scope by manifesto commitment #1 and the controlled-experiment design (see manifesto-faq.md).

  • Claude Code (Anthropic) — closed-source, model-bundled. Not benchmarkable in the controlled cross-product.
  • Gemini CLI (Google) — closed-source, model-bundled, and retired by vendor 2026-05-19. See graveyard.md.
  • Cursor headless (Anysphere) — closed-source, model-bundled.

(Note: Codex CLI and GitHub Copilot CLI were originally in this list. Both shipped BYO-model/BYOK support in early 2026 — see their MVP entries above.)

These tools may be excellent. They are simply not what CodingAgentBench measures. A separate "vendor reference" column may appear in future versions for context.

Dead / archived / pivoted

See graveyard.md for full entries on Roo Code, Open Interpreter, AutoGPT, GPT-Pilot, Devika, Continue, Smol Developer, and others.


CodingAgentBench v0.2 adopts Edgar Morin's framework of complex thought as its epistemological foundation, with five additional scoring layers (L10–L14) that operationalize the framework's operators. See morin-framework.md for the full exposition and ../METHODOLOGY.md §14 for the methodology-level cross-reference.

TUIs do not need to know any of this. The Morin probes operate at the eval layer, not the harness layer. A TUI's responsibility ends where it always has — accept a workdir, accept a task description, accept an OpenAI-compatible endpoint, produce a final commit. Everything Morin-related happens after the container exits.

Concretely:

  • L10 (test-mutation probe) runs out-of-container against the agent's diff.
  • L11 (N-run continuation) drives the TUI through additional passes using the same standard harness invocation as the first pass — no new APIs.
  • L12 (style coherence) is a static analysis over the workdir; the TUI is not involved.
  • L13 (score-feedback injection) modifies the task prompt between runs; the TUI sees a normal task input.
  • L14 (observer probe) modifies the codingagentbench:eval notice in the prompt; again, normal task input from the TUI's perspective.
  • MCP server tier classification (autonomy-dependence operator) is computed from observed MCP call traces in the run JSONL; the TUI does not need to declare its MCP posture.
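
To make the division of labor concrete, the sketch below models the TUI contract (workdir, task description, endpoint in; final commit out) and shows how L11-style continuation and L13-style score feedback could be layered on top without any new TUI API. Every name here (Invocation, run_passes, the feedback wording) is hypothetical; the actual eval layer is described in ../METHODOLOGY.md, not reproduced from it.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Invocation:
    """What a TUI is handed on every pass (hypothetical shape)."""
    workdir: str
    task: str
    endpoint: str  # OpenAI-compatible base URL


# A TUI adapter takes an Invocation and returns the id of its final commit.
RunTui = Callable[[Invocation], str]
# A scorer runs outside the container and maps a commit id to a score.
Scorer = Callable[[str], float]


def run_passes(
    first: Invocation,
    run_tui: RunTui,
    n_runs: int,
    score: Optional[Scorer] = None,
) -> List[str]:
    """Drive the same invocation n_runs times, optionally injecting feedback."""
    commits: List[str] = []
    inv = first
    for _ in range(n_runs):
        commit = run_tui(inv)  # identical call shape on every pass
        commits.append(commit)
        if score is not None:
            # Only the task text changes; the TUI still sees a normal task input.
            note = f"\n\nPrevious pass scored {score(commit):.2f}."
            inv = replace(first, task=first.task + note)
    return commits


# Usage with a stand-in adapter that records the prompts it receives.
seen: List[str] = []


def fake_tui(inv: Invocation) -> str:
    seen.append(inv.task)
    return f"c{len(seen)}"


commits = run_passes(
    Invocation("/work", "Fix bug", "http://localhost:4000/v1"),
    fake_tui,
    n_runs=2,
    score=lambda commit: 0.5,
)
```

The point of the sketch is that the adapter signature never changes between passes: continuation and feedback are expressed purely through the task string, which matches the claim that TUIs need no Morin-specific integration.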

This is by design. The Morin framework is a methodology-side commitment, not a TUI integration requirement. The criteria for inclusion in the MVP sweep remain what they have always been: open-source, OpenAI-compatible endpoint, Dockerizable (see §2 of ../METHODOLOGY.md). No new integration work is asked of any TUI listed in this catalog.

If a future TUI wishes to explicitly cooperate with Morin probes (for example, by exposing its own self-mutation hooks or by structuring its commit history for cleaner L11 continuation), we welcome the cooperation but do not require it. The benchmark is designed to work with TUIs as they are.


How to nominate

If your favorite TUI is not in this catalog and you believe it should be, open an issue with the tui-request label or submit a PR per docs/submitting.md. Nominations are evaluated on a rolling basis and decisions land in this document.


Catalog maintained for methodology v0.1, May 2026. Star counts are as of 2026-05 and will not be silently updated; they are refreshed at each methodology version bump, and the prior values are archived.