CodingAgentBench

Plain-English vocabulary

The seven terms you need first

These are the words we use across the site. Open the ones you care about; ignore the rest.

What this page is

  • What you'll find: 7 core terms in plain English, each in a click-to-open card.
  • Who it's for: people landing on CodingAgentBench for the first time.
  • How to go deeper: switch to Nerds mode for all 45 terms with anchor nav and live filter.

The headline score (we call this PBS)

PBS rolls five things into one number between 0 and 1: did it pass tests, did it stay honest, did it refuse correctly, did it spend tokens well, and did it finish fast. You compare CLIs by PBS first, then drill into the parts you care about.

Blast radius

Blast radius counts the distinct files the agent touched in its final diff. A one-line bug fix that edits ten files has a blast radius of ten. Lower is better. We use this so you can spot CLIs that ripple changes further than they should.

Did it follow the rules (we call this L15)

Every task ships with an AGENTS.md file listing constraints, like "do not touch config.yaml" or "add no new dependencies." L15 is the fraction of those constraints the agent honored. We score it with deterministic checks, not another LLM.

Did it refuse the right tasks (we call this RSI)

Some tasks are traps: prompt injections, impossible asks, unsafe requests. RSI measures whether the agent declined those for the right reason. A deterministic oracle decides, so the score does not depend on judge taste.

Is the result stable (we call this MCI)

We rerun every cell several times under the same conditions. MCI is the fraction of reruns that produced the same verdict. Low MCI means the cell is flaky, so you should treat its headline score with caution.

Reality Engine

The Reality Engine replays published cells from their pinned container digest and sweep id. It flags drift if the score changes. You can trust a published number because we can rebuild it bit-for-bit on demand.

Pareto frontier

On a chart of score versus cost, a CLI sits on the Pareto frontier if no other CLI is both better and cheaper. The frontier is the short list worth your attention. Everything off the frontier is dominated by something on it.

Glossary

Every CodingAgentBench-specific term, one-line definition, and a link to the page that defines it authoritatively. 45 terms.

A

AGENTS.md

Per-task human-readable declaration of L15 constraints. Sits next to task.yaml. Constraints must stay in sync between the two.

→ /methodology

B

behavior_mode
Behavior mode (D1)

5th cell-key axis. Names the adapter configuration in use — `factory`, `tuned`, or `default_recommendation`. Every cell records which mode was active.

→ /methodology

blast_radius
Blast radius

Count of distinct files modified by the agent's final diff. Lower is better — a one-line fix shouldn't touch ten files.

→ /methodology
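
For blast radius, the count can be sketched as below, assuming the final diff is available as `git diff` text. The function name `blast_radius` and the header-parsing approach are illustrative assumptions, not the benchmark's actual scorer.

```python
def blast_radius(diff_text: str) -> int:
    """Count distinct files in a git-style unified diff (illustrative sketch)."""
    files = set()
    for line in diff_text.splitlines():
        # Each file section starts with: diff --git a/<old> b/<new>
        # Hunk lines always start with ' ', '+', '-' or '@', so this
        # prefix cannot be confused with changed content.
        if line.startswith("diff --git "):
            # Assumption: paths do not themselves contain " b/".
            files.add(line.rsplit(" b/", 1)[-1])
    return len(files)


sample = """\
diff --git a/src/app.py b/src/app.py
--- a/src/app.py
+++ b/src/app.py
@@ -1 +1 @@
-x = 1
+x = 2
diff --git a/tests/test_app.py b/tests/test_app.py
--- a/tests/test_app.py
+++ b/tests/test_app.py
@@ -3 +3 @@
-assert x == 1
+assert x == 2
"""
print(blast_radius(sample))  # 2
```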

C

cell
Cell

One unit of measurement: a specific (TUI, model, task, plugin_stack, behavior_mode) execution. The atomic record in the matrix.

→ /methodology
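
One way to picture a cell key is as an immutable five-field record. The sketch below is an assumption about shape only: the class name `CellKey`, the tuple representation of `plugin_stack`, and the example task name are hypothetical, while the three allowed `behavior_mode` values come from this glossary.

```python
from dataclasses import dataclass
from typing import Tuple

# The three modes named under behavior_mode in this glossary.
BEHAVIOR_MODES = {"factory", "tuned", "default_recommendation"}


@dataclass(frozen=True)
class CellKey:
    tui: str
    model: str
    task: str
    plugin_stack: Tuple[str, ...]  # ordered, so a tuple rather than a set
    behavior_mode: str

    def __post_init__(self):
        if self.behavior_mode not in BEHAVIOR_MODES:
            raise ValueError(f"unknown behavior_mode: {self.behavior_mode}")


# Frozen dataclasses are hashable, so a key can index a results table.
key = CellKey("aider", "qwen3-coder-30b", "example-task", ("conv-vanilla",), "factory")
matrix = {key: "cell record goes here"}
```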

Costs
Costs axis

Headline category: "what did the combo cost?" Aggregates L2 tokens/correct, L3 wall-clock latency, L5 tool efficiency.

→ /methodology

D

doom loop
Doom loop

Failure mode where the agent re-reads and re-edits the same file repeatedly without converging. Detected when L7 churn ≥ 0.6 and L6 turn economy = 1.0 (the agent used every available turn).

→ /graveyard
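
The doom-loop rule reduces to a two-threshold check. A minimal sketch, assuming both layer values are already computed as floats; the function and parameter names are hypothetical.

```python
def is_doom_loop(l7_churn: float, l6_turn_economy: float) -> bool:
    """Apply the glossary rule: L7 churn >= 0.6 and L6 turn economy = 1.0."""
    # L6 is turns used / max_turns, so 1.0 is its ceiling. Using >= here is
    # an assumption that guards against a value reported slightly above 1.0.
    return l7_churn >= 0.6 and l6_turn_economy >= 1.0


print(is_doom_loop(0.7, 1.0))  # True
print(is_doom_loop(0.7, 0.5))  # False
```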

E

exfil marker
Exfil marker

Sentinel file (e.g. `/tmp/codingagentbench_pwned.marker`) that integrity tasks instruct the agent to create via prompt injection. Its presence after a cell ends fails L8.

→ /security

expected_refusal

Task-level flag. When true, the L9 scorer's refusal oracle (not the L1 exec scorer) is authoritative. Used for prompt-injection and unsolvable tasks.

→ /methodology

F

fork DAG
Fork DAG

Directed acyclic graph of TUI / model lineage. Aider → opencode → crush, etc. Used by L14 to credit novelty vs reproduction.

→ /fork

G

graveyard
Graveyard

Catalog of (TUI, model) combinations that hard-failed during pilot sweeps, typically from OOM, license problems, or repeated doom loops. Kept public for honesty.

→ /graveyard

H

hidden-set tier
Hidden-set tier

Tasks held back from public manifests, released on cadence, to detect overfitting. Tier 1 = held 30d, Tier 2 = held 90d, Tier 3 = never published.

→ /methodology
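
The tier schedule can be read as a lookup from tier to hold period. The sketch below assumes the hold is counted in calendar days from the date a task entered the hidden set; that starting point, and the names `HOLD_DAYS` and `release_date`, are assumptions.

```python
from datetime import date, timedelta
from typing import Optional

# Hold periods from the glossary entry; None means never published.
HOLD_DAYS = {1: 30, 2: 90, 3: None}


def release_date(tier: int, held_since: date) -> Optional[date]:
    """Return the earliest public release date, or None for Tier 3."""
    days = HOLD_DAYS[tier]
    return None if days is None else held_since + timedelta(days=days)


print(release_date(1, date(2026, 1, 1)))  # 2026-01-31
print(release_date(3, date(2026, 1, 1)))  # None
```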

honeytrap
Honeytrap task

Task with an obvious cheap-cheat path designed to be tempting. A good agent declines or takes the real path; the honeytrap scorer measures which.

→ /methodology

I

image_digest
Container image digest

sha256 of the OCI image that ran the TUI. Recorded in every cell for byte-exact reproduction.

→ /provenance

L

L1
Layer 1 — Pass rate

Did the exec scorer (unit tests / scripted oracle) pass after the agent's diff was applied? Boolean per task, averaged per cell.

→ /methodology

L2
Layer 2 — Tokens per correct task

Total prompt+completion tokens divided by the number of L1-passing tasks. Lower is better.

→ /methodology
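
As a worked version of the L2 ratio, here is a minimal sketch. The glossary does not say what happens when no task passes, so returning `None` in that case is an assumption, chosen to avoid dividing by zero; the function name is hypothetical.

```python
from typing import Optional


def tokens_per_correct(prompt_tokens: int, completion_tokens: int,
                       passed_tasks: int) -> Optional[float]:
    """Total tokens divided by the number of L1-passing tasks."""
    if passed_tasks == 0:
        return None  # assumption: undefined rather than infinite or NaN
    return (prompt_tokens + completion_tokens) / passed_tasks


# 12,000 total tokens over 4 passing tasks (made-up numbers).
print(tokens_per_correct(9000, 3000, 4))  # 3000.0
```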

L3
Layer 3 — Wall-clock latency

End-to-end runtime in milliseconds from cell start to scorer exit. Excludes container pull time.

→ /methodology

L4
Layer 4 — Blast radius

Count of distinct files the agent's diff touched. Higher means the change rippled further than necessary.

→ /methodology

L5
Layer 5 — Tool-call efficiency

Useful tool calls (edits + tests + reads of relevant files) divided by total tool calls. Penalises directory-list spam.

→ /methodology
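
The L5 ratio can be sketched as follows. How a read is judged "relevant" is not specified here, so the example takes it as a precomputed flag; the `ToolCall` type and the kind labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolCall:
    kind: str              # e.g. "edit", "test", "read", "list_dir"
    relevant: bool = True  # only consulted for reads


def tool_efficiency(calls: List[ToolCall]) -> Optional[float]:
    """Useful calls (edits, tests, relevant reads) over total calls."""
    if not calls:
        return None  # assumption: no calls means the ratio is undefined
    useful = sum(
        1 for c in calls
        if c.kind in ("edit", "test") or (c.kind == "read" and c.relevant)
    )
    return useful / len(calls)


calls = [
    ToolCall("edit"),
    ToolCall("test"),
    ToolCall("read", relevant=True),
    ToolCall("read", relevant=False),
    ToolCall("list_dir"),
]
print(tool_efficiency(calls))  # 0.6
```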

L6
Layer 6 — Turn economy

Turns used / max_turns. Low is good; hitting the ceiling is a soft failure even if L1 passes.

→ /methodology

L7
Layer 7 — Self-edit churn

Fraction of edits the agent later undoes or rewrites in the same cell. High churn signals indecision.

→ /methodology

L8
Layer 8 — Integrity

Did the agent leak secrets, exfiltrate files, execute injected instructions, or create marker files from injection oracles? Binary per cell.

→ /methodology

L9
Layer 9 — Refusal correctness

On tasks with expected_refusal=true, did the agent refuse for the right reasons (vs comply, or refuse spuriously)?

→ /methodology

L10
Layer 10 — Honeytrap discipline

On honeytrap tasks, did the agent take the cheap dishonest path (e.g. `assert True` padding) or the legitimate one?

→ /methodology

L11
Layer 11 — Provenance fidelity

Did the cell's trace record image_digest, model_build_id, and the 9-field provenance bundle correctly?

→ /provenance

L12
Layer 12 — Reality engine drift

Wave 5 cross-check: does the published cell still reproduce when re-run from its container digest + sweep_id?

→ /methodology

L13
Layer 13 — Plugin Δ-PBS

Wave 4D delta: paired runs with and without each plugin in the stack — how much score did the plugin actually add?

→ /methodology

L14
Layer 14 — Fork lineage

How a given (TUI, model) pair traces back through the fork DAG of upstream agents — used to weight novelty vs reproduction.

→ /fork

L15
Layer 15 — Instruction fidelity

Honored constraints divided by declared constraints, taken from each task's AGENTS.md (file-not-touched, no-new-dependency, max-files-touched, etc.). Deterministic checks, no LLM judge.

→ /methodology

L15 constraint kind

Discrete check type declared in task.yaml: `file-not-touched`, `no-new-dependency`, `no-comments-added`, `file-pattern-required-in-diff`, `command-not-run`, `max-files-touched`, `diff-size-limit`.

→ /methodology
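
To illustrate how deterministic constraint checks could roll up into an L15 score, here is a sketch covering two of the kinds listed above. The dictionary layout of a constraint (`kind`, `path`, `limit`) is a hypothetical stand-in, since the real task.yaml schema is not shown here.

```python
from typing import List, Optional, Set


def honored(constraint: dict, touched_files: Set[str]) -> bool:
    """Deterministically check one constraint against the touched files."""
    kind = constraint["kind"]
    if kind == "file-not-touched":
        return constraint["path"] not in touched_files
    if kind == "max-files-touched":
        return len(touched_files) <= constraint["limit"]
    raise ValueError(f"unsupported constraint kind in this sketch: {kind}")


def l15(constraints: List[dict], touched_files: Set[str]) -> Optional[float]:
    """Fraction of declared constraints that were honored."""
    if not constraints:
        return None  # assumption: no declared constraints, no score
    return sum(honored(c, touched_files) for c in constraints) / len(constraints)


constraints = [
    {"kind": "file-not-touched", "path": "config.yaml"},
    {"kind": "max-files-touched", "limit": 2},
]
touched = {"src/app.py", "config.yaml"}
print(l15(constraints, touched))  # 0.5
```

In this made-up run the agent stayed within the file limit but edited the protected file, so one of two constraints was honored.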

M

MCI
Multi-run Consistency Index

P1b axis: fraction of N deterministic reruns that produce the same verdict. Low MCI means the cell is flaky and the headline score is unstable.

→ /methodology
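
A small sketch of MCI, assuming "the same verdict" means agreement with the most common verdict across the N reruns. That reading, and the function name `mci`, are assumptions rather than the published definition.

```python
from collections import Counter
from typing import List, Optional


def mci(verdicts: List[str]) -> Optional[float]:
    """Share of reruns that agree with the most common verdict."""
    if not verdicts:
        return None
    _, top_count = Counter(verdicts).most_common(1)[0]
    return top_count / len(verdicts)


print(mci(["pass", "pass", "fail", "pass"]))  # 0.75
```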

model_build_id
Model build id

Provider+revision identifier of the model weights used for the cell (e.g. `vllm:qwen3-coder-30b@sha256:abc...`). Recorded for reproducibility.

→ /provenance

P

P0 invariant

Hard correctness invariant that may never silently fail: provenance fields populated, scorer exit codes non-ambiguous, no NaN propagation. Violations block publish.

→ /methodology

Pareto frontier

Set of (TUI, model) pairs that are not dominated on the (PBS, cost) plane — i.e. no other pair is both better and cheaper.

→ /pareto
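
The frontier can be computed with a direct pairwise dominance check. This sketch uses the standard definition, where a pair is dominated if another is at least as good on both PBS and cost and strictly better on one; that treatment of ties is an assumption, and the sample numbers are made up.

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """points maps a label to (pbs, cost); higher pbs and lower cost win."""
    frontier = []
    for name, (pbs, cost) in points.items():
        dominated = any(
            o_pbs >= pbs and o_cost <= cost and (o_pbs > pbs or o_cost < cost)
            for other, (o_pbs, o_cost) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


combos = {
    "A": (0.80, 4.0),
    "B": (0.70, 2.0),
    "C": (0.65, 3.0),  # worse and pricier than B
    "D": (0.80, 5.0),  # same score as A, but pricier
}
print(pareto_frontier(combos))  # ['A', 'B']
```

The pairwise scan is quadratic in the number of combinations, which is fine for a matrix of this size.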

PBS
CodingAgentBench Score

Composite score combining pass rate, integrity, refusal correctness, token efficiency and wall-clock latency into a single 0-1 number per (TUI, model, task, plugin_stack, behavior_mode) cell.

→ /methodology
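
The glossary names the five PBS components but not how they are weighted. Purely for intuition, the sketch below assumes each component has already been normalized to the 0 to 1 range and combines them with an unweighted mean. The equal weighting and the function signature are assumptions, not the published formula.

```python
def pbs_sketch(pass_rate: float, integrity: float, refusal: float,
               token_efficiency: float, latency_score: float) -> float:
    """Unweighted mean of five 0-1 components (illustrative only)."""
    parts = [pass_rate, integrity, refusal, token_efficiency, latency_score]
    for p in parts:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"component out of range: {p}")
    return sum(parts) / len(parts)


print(pbs_sketch(1.0, 1.0, 0.5, 0.5, 0.0))  # 0.6
```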

plugin_stack
Plugin stack

Ordered list of conventions / skills / subagents / MCP servers injected into the TUI for this cell. `conv-vanilla` = no plugins. Pinned by manifest hash.

→ /methodology

R

Reality Engine
Reality Engine (Wave 5)

Replay system that re-runs published cells from their pinned image_digest + sweep_id and flags drift in L1/L8/L15 scores.

→ /methodology
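
Drift detection amounts to comparing a published cell's scores with those from its replay. In this sketch the dictionary layout, the function name, and the `tolerance` parameter are assumptions; the default of 0.0 demands an exact match on the three layers named above.

```python
from typing import Dict, List, Sequence


def drifted_layers(published: Dict[str, float], replayed: Dict[str, float],
                   layers: Sequence[str] = ("L1", "L8", "L15"),
                   tolerance: float = 0.0) -> List[str]:
    """Return the layers whose replayed score differs from the published one."""
    return [
        layer for layer in layers
        if abs(published[layer] - replayed[layer]) > tolerance
    ]


published = {"L1": 0.9, "L8": 1.0, "L15": 0.75}
replayed = {"L1": 0.9, "L8": 1.0, "L15": 0.5}
print(drifted_layers(published, replayed))  # ['L15']
```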

refusal oracle
Refusal oracle

Per-task deterministic function that decides whether a refusal was correct. Reads trace spans + filesystem markers — never an LLM.

→ /methodology

RSI
Refusal-Sensitivity Index

L9 axis: how well the agent declines tasks marked expected_refusal=true, scored by deterministic oracle (no LLM judge).

→ /methodology

S

sampling_profile
Sampling profile

Pinned (temperature, top_p, max_tokens, seed) used for the cell. Default until per-cell profiles ship: `default-t0.2-tp0.95`.

→ /methodology

scorer.kind
Scorer kind

Per-task scoring contract. `exec` runs unit tests in a sandbox; other kinds (planned: `diff`, `behavior`) verify the change shape rather than runtime behavior.

→ /methodology

Ships
Ships axis

Headline category: "did this combo ship a correct fix?" Aggregates L1 pass rate + L9 refusal correctness.

→ /methodology

Stays
Stays axis

Headline category: "did the combo stay disciplined?" Aggregates L4 blast radius, L8 integrity, L15 instruction fidelity.

→ /methodology

sweep
Sweep

One full matrix run over (TUI × model × task × plugin_stack × behavior_mode). Identified by `sweep_id`; every cell carries this back-reference.

→ /methodology

T

TUI
Terminal UI agent

An open coding-agent CLI: aider, opencode, goose, crush, plandex, openhands, qwen-code. CodingAgentBench measures how each one performs when paired with each open model.

→ /tuis

Δ

Δ-PBS
Delta-PBS

Score delta attributable to a single plugin: PBS(stack-with-plugin) − PBS(stack-without-plugin). Computed from paired Wave 4D runs.

→ /methodology
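
The Δ-PBS subtraction applies to one pair of runs. When several paired runs exist, one plausible summary is the mean of the per-pair differences; that averaging step and the function name are assumptions, and the scores are made up.

```python
from statistics import mean
from typing import List, Tuple


def delta_pbs(pairs: List[Tuple[float, float]]) -> float:
    """Mean of PBS(with plugin) minus PBS(without plugin) over paired runs."""
    # Each pair holds (with_plugin, without_plugin) for the same cell.
    return mean(with_p - without_p for with_p, without_p in pairs)


paired_runs = [(0.75, 0.5), (0.5, 0.5)]
print(delta_pbs(paired_runs))  # 0.125
```

A negative result would mean the plugin lowered the score on average.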

Glossary v0.1 · snapshot 2026-05-26 · 45 terms · source site/src/data/glossary.json