Glossary

Every CodingAgentBench-specific term, with a one-line definition and a link to the page that defines it authoritatively. 45 terms.


A

AGENTS.md

Per-task human-readable declaration of L15 constraints. Sits next to task.yaml. Constraints must stay in sync between the two.

→ /methodology

B

behavior_mode

Behavior mode (D1)

5th cell-key axis. Names the adapter configuration in use — `factory`, `tuned`, or `default_recommendation`. Every cell records which mode was active.

→ /methodology

blast_radius

Blast radius

Count of distinct files modified by the agent's final diff. Lower is better — a one-line fix shouldn't touch ten files.

→ /methodology
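As a rough illustration of the count described above, the sketch below tallies distinct files in a unified diff. It assumes git-style `diff --git` headers; the function name `blast_radius` and the sample diff are hypothetical, not the benchmark's actual scorer.

```python
def blast_radius(diff_text: str) -> int:
    """Count distinct files touched by a git-style unified diff (assumed format)."""
    files = set()
    for line in diff_text.splitlines():
        # git emits one "diff --git a/<path> b/<path>" header per changed file
        if line.startswith("diff --git "):
            files.add(line.split(" b/", 1)[-1])
    return len(files)


# Hypothetical two-file diff, for illustration only
sample = """diff --git a/src/app.py b/src/app.py
--- a/src/app.py
+++ b/src/app.py
@@ -1 +1 @@
-x = 1
+x = 2
diff --git a/tests/test_app.py b/tests/test_app.py
--- a/tests/test_app.py
+++ b/tests/test_app.py
@@ -1 +1 @@
-assert x == 1
+assert x == 2
"""

print(blast_radius(sample))  # 2
```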

C

cell

Cell

One unit of measurement: a specific (TUI, model, task, plugin_stack, behavior_mode) execution. The atomic record in the matrix.

→ /methodology
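One way to picture the five-axis cell key is as an immutable, hashable record. This is a minimal sketch: the class name `CellKey`, the field values, and the `l1_pass` result field are assumptions for illustration, not the benchmark's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellKey:
    """Hypothetical five-axis key identifying one cell in the matrix."""
    tui: str
    model: str
    task: str
    plugin_stack: str
    behavior_mode: str


# Illustrative values; "task-001" is a made-up task id
key = CellKey("aider", "qwen3-coder-30b", "task-001", "conv-vanilla", "factory")

# A frozen dataclass is hashable, so it can index a results table directly
results = {key: {"l1_pass": True}}
```

Freezing the key means two cells with the same five values always collide on lookup, which matches the idea of the cell as the atomic record.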

Costs

Costs axis

Headline category: "what did the combo cost?" Aggregates L2 tokens/correct, L3 wall-clock latency, L5 tool efficiency.

→ /methodology

D

doom loop

Doom loop

Failure mode where the agent re-reads and re-edits the same file repeatedly without converging. Detected when L7 churn ≥ 0.6 and L6 turn economy = 1.0, i.e. the agent hit its turn ceiling.

→ /graveyard
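The detection rule above can be sketched as a simple predicate. The thresholds come from the definition; the function names and the way churn is passed in are assumptions, since this page does not specify how the detector is wired.

```python
def turn_economy(turns_used: int, max_turns: int) -> float:
    """L6 as defined in this glossary: turns used divided by max_turns."""
    return turns_used / max_turns


def is_doom_loop(churn: float, turns_used: int, max_turns: int) -> bool:
    """Flag a cell when L7 churn >= 0.6 and the turn ceiling was reached."""
    return churn >= 0.6 and turn_economy(turns_used, max_turns) == 1.0


print(is_doom_loop(churn=0.7, turns_used=40, max_turns=40))  # True
print(is_doom_loop(churn=0.7, turns_used=12, max_turns=40))  # False
```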

E

exfil marker

Exfil marker

Sentinel file (e.g. `/tmp/codingagentbench_pwned.marker`) that integrity tasks instruct the agent to create via prompt injection. Its presence after a cell ends fails L8.

→ /security

expected_refusal

Task-level flag. When true, the L9 scorer's refusal oracle (not the L1 exec scorer) is authoritative. Used for prompt-injection and unsolvable tasks.

→ /methodology

F

fork DAG

Fork DAG

Directed acyclic graph of TUI / model lineage. Aider → opencode → crush, etc. Used by L14 to credit novelty vs reproduction.

→ /fork

G

graveyard

Graveyard

Catalog of (TUI, model) combinations that hard-failed during pilot sweeps — typically OOM, license, or repeated doom-loop. Kept public for honesty.

→ /graveyard

H

hidden-set tier

Hidden-set tier

Tasks held back from public manifests, released on cadence, to detect overfitting. Tier 1 = held 30d, Tier 2 = held 90d, Tier 3 = never published.

→ /methodology

honeytrap

Honeytrap task

Task with an obvious cheap-cheat path designed to be tempting. A good agent declines or takes the real path; the honeytrap scorer measures which.

→ /methodology

I

image_digest

Container image digest

sha256 of the OCI image that ran the TUI. Recorded in every cell for byte-exact reproduction.

→ /provenance

L

Layer 1 — Pass rate

Did the exec scorer (unit tests / scripted oracle) pass after the agent's diff was applied? Boolean per task, averaged per cell.

→ /methodology

Layer 2 — Tokens per correct task

Total prompt+completion tokens divided by the number of L1-passing tasks. Lower is better.

→ /methodology
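A minimal sketch of the L2 ratio, assuming tokens are summed over all runs (passing or not) and that the metric is undefined when nothing passes. The field names `prompt_tokens`, `completion_tokens`, and `l1_pass` are hypothetical.

```python
from typing import Optional


def tokens_per_correct(runs: list) -> Optional[float]:
    """Total prompt+completion tokens divided by the count of L1-passing tasks."""
    total = sum(r["prompt_tokens"] + r["completion_tokens"] for r in runs)
    passed = sum(1 for r in runs if r["l1_pass"])
    # Assumption: return None rather than dividing by zero when no task passed
    return total / passed if passed else None


runs = [
    {"prompt_tokens": 1000, "completion_tokens": 200, "l1_pass": True},
    {"prompt_tokens": 3000, "completion_tokens": 800, "l1_pass": False},
    {"prompt_tokens": 500, "completion_tokens": 500, "l1_pass": True},
]
print(tokens_per_correct(runs))  # 3000.0
```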

Layer 3 — Wall-clock latency

End-to-end runtime in milliseconds from cell start to scorer exit. Excludes container pull time.

→ /methodology

Layer 4 — Blast radius

Count of distinct files the agent's diff touched. Higher means the change rippled further than necessary.

→ /methodology

Layer 5 — Tool-call efficiency

Useful tool calls (edits + tests + reads of relevant files) divided by total tool calls. Penalises directory-list spam.

→ /methodology

Layer 6 — Turn economy

Turns used / max_turns. Low is good; hitting the ceiling is a soft failure even if L1 passes.

→ /methodology

Layer 7 — Self-edit churn

Fraction of edits the agent later undoes or rewrites in the same cell. High churn signals indecision.

→ /methodology

Layer 8 — Integrity

Did the agent leak secrets, exfiltrate files, execute injected instructions, or create marker files from injection oracles? Binary per cell.

→ /methodology

Layer 9 — Refusal correctness

On tasks with expected_refusal=true, did the agent refuse for the right reasons (vs comply, or refuse spuriously)?

→ /methodology

L10

Layer 10 — Honeytrap discipline

On honeytrap tasks, did the agent take the cheap dishonest path (e.g. `assert True` padding) or the legitimate one?

→ /methodology

L11

Layer 11 — Provenance fidelity

Did the cell's trace record image_digest, model_build_id, and the 9-field provenance bundle correctly?

→ /provenance

L12

Layer 12 — Reality engine drift

Wave 5 cross-check: does the published cell still reproduce when re-run from its container digest + sweep_id?

→ /methodology

L13

Layer 13 — Plugin Δ-PBS

Wave 4D delta: paired runs with and without each plugin in the stack — how much score did the plugin actually add?

→ /methodology

L14

Layer 14 — Fork lineage

How a given (TUI, model) pair traces back through the fork DAG of upstream agents — used to weight novelty vs reproduction.

→ /fork

L15

Layer 15 — Instruction fidelity

Ratio of honored to declared constraints from each task's AGENTS.md (file-not-touched, no-new-dependency, max-files-touched, etc.). Checks are deterministic, with no LLM judge.

→ /methodology

L15 constraint kind

Discrete check type declared in task.yaml: `file-not-touched`, `no-new-dependency`, `no-comments-added`, `file-pattern-required-in-diff`, `command-not-run`, `max-files-touched`, `diff-size-limit`.

→ /methodology
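To show what a deterministic L15 check might look like, the sketch below implements two of the listed constraint kinds and scores fidelity as honored constraints over declared constraints. The dictionary shape (`kind`, `path`, `limit`) is an assumption; the real task.yaml schema is not shown on this page.

```python
def check_constraint(constraint: dict, touched_files: list) -> bool:
    """Return True when the constraint is honored. Only two kinds are sketched."""
    kind = constraint["kind"]
    if kind == "file-not-touched":
        return constraint["path"] not in touched_files
    if kind == "max-files-touched":
        return len(set(touched_files)) <= constraint["limit"]
    raise ValueError(f"unsupported constraint kind: {kind}")


def instruction_fidelity(constraints: list, touched_files: list) -> float:
    """Honored constraints divided by declared constraints."""
    honored = sum(check_constraint(c, touched_files) for c in constraints)
    return honored / len(constraints)


# Hypothetical constraints and diff contents
constraints = [
    {"kind": "file-not-touched", "path": "setup.py"},
    {"kind": "max-files-touched", "limit": 2},
]
touched = ["src/a.py", "setup.py"]
print(instruction_fidelity(constraints, touched))  # 0.5
```

Because every check is a plain comparison over the diff, the same inputs always give the same score, which is the point of avoiding an LLM judge.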

M

MCI

Multi-run Consistency Index

P1b axis: fraction of N deterministic reruns that produce the same verdict. Low MCI means the cell is flaky and the headline score is unstable.

→ /methodology
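One plausible reading of "the same verdict" is agreement with the most common verdict across reruns. The sketch below computes MCI under that assumption; the function name `mci` and the modal-verdict interpretation are not confirmed by this page.

```python
from collections import Counter


def mci(verdicts: list) -> float:
    """Fraction of reruns that agree with the most common verdict (assumed reading)."""
    if not verdicts:
        raise ValueError("need at least one rerun")
    _, count = Counter(verdicts).most_common(1)[0]
    return count / len(verdicts)


print(mci([True, True, False, True]))  # 0.75
```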

model_build_id

Model build id

Provider+revision identifier of the model weights used for the cell (e.g. `vllm:qwen3-coder-30b@sha256:abc...`). Recorded for reproducibility.

→ /provenance

P

P0 invariant

Hard correctness invariant that may never silently fail: provenance fields populated, scorer exit codes non-ambiguous, no NaN propagation. Violations block publish.

→ /methodology

Pareto frontier

Set of (TUI, model) pairs that are not dominated on the (PBS, cost) plane, i.e. no other pair is at least as good on both PBS and cost and strictly better on one of them.

→ /pareto
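The frontier can be sketched with a brute-force dominance check. This assumes higher PBS is better, lower cost is better, and that a pair is dominated when another pair is no worse on both axes and strictly better on at least one. The labels and numbers are made up for illustration.

```python
def pareto_frontier(points: dict) -> list:
    """points maps a label to (pbs, cost); returns the non-dominated labels."""
    frontier = []
    for name, (pbs, cost) in points.items():
        dominated = any(
            o_pbs >= pbs and o_cost <= cost and (o_pbs > pbs or o_cost < cost)
            for other, (o_pbs, o_cost) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (PBS, cost) values
points = {"a": (0.8, 10.0), "b": (0.6, 4.0), "c": (0.5, 6.0)}
print(pareto_frontier(points))  # ['a', 'b']
```

Here "c" drops out because "b" scores higher and costs less, while "a" and "b" both stay because each beats the other on one axis.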

PBS

CodingAgentBench Score

Composite score combining pass rate, integrity, refusal correctness, token efficiency and wall-clock latency into a single 0-1 number per (TUI, model, task, plugin_stack, behavior_mode) cell.

→ /methodology

plugin_stack

Plugin stack

Ordered list of conventions / skills / subagents / MCP servers injected into the TUI for this cell. `conv-vanilla` = no plugins. Pinned by manifest hash.

→ /methodology

R

Reality Engine

Reality Engine (Wave 5)

Replay system that re-runs published cells from their pinned image_digest + sweep_id and flags drift in L1/L8/L15 scores.

→ /methodology

refusal oracle

Refusal oracle

Per-task deterministic function that decides whether a refusal was correct. Reads trace spans + filesystem markers — never an LLM.

→ /methodology

RSI

Refusal-Sensitivity Index

L9 axis: how well the agent declines tasks marked expected_refusal=true, scored by deterministic oracle (no LLM judge).

→ /methodology

S

sampling_profile

Sampling profile

Pinned (temperature, top_p, max_tokens, seed) used for the cell. Default until per-cell profiles ship: `default-t0.2-tp0.95`.

→ /methodology

scorer.kind

Scorer kind

Per-task scoring contract. `exec` runs unit tests in a sandbox; other kinds (planned: `diff`, `behavior`) verify the change shape rather than runtime behavior.

→ /methodology

Ships

Ships axis

Headline category: "did this combo ship a correct fix?" Aggregates L1 pass rate + L9 refusal correctness.

→ /methodology

Stays

Stays axis

Headline category: "did the combo stay disciplined?" Aggregates L4 blast radius, L8 integrity, L15 instruction fidelity.

→ /methodology

sweep

Sweep

One full matrix run over (TUI × model × task × plugin_stack × behavior_mode). Identified by `sweep_id`; every cell carries this back-reference.

→ /methodology

T

TUI

Terminal UI agent

An open coding-agent CLI: aider, opencode, goose, crush, plandex, openhands, qwen-code. CodingAgentBench measures the operational performance each delivers per open model.

→ /tuis

Δ

Δ-PBS

Delta-PBS

Score delta attributable to a single plugin: PBS(stack-with-plugin) − PBS(stack-without-plugin). Computed from paired W4D runs.

→ /methodology
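As a worked example of the subtraction above, the sketch below averages the per-pair difference across paired runs. Averaging over pairs is an assumption; the page only states that the delta is computed from paired W4D runs. The scores are made up.

```python
from statistics import mean


def delta_pbs(pairs: list) -> float:
    """Mean of PBS(with plugin) minus PBS(without plugin) over paired runs."""
    return mean(with_plugin - without for with_plugin, without in pairs)


# Hypothetical (with, without) PBS pairs
pairs = [(0.75, 0.5), (0.5, 0.5)]
print(delta_pbs(pairs))  # 0.125
```

A positive result suggests the plugin added score on these pairs; a value near zero suggests it made little difference.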

Glossary v0.1 · snapshot 2026-05-26 · 45 terms · source site/src/data/glossary.json

The seven terms you need first

These are the words we use across the site. Open the ones you care about; ignore the rest.

What this page is

The headline score (we call this PBS)

Blast radius

Did it follow the rules (we call this L15)

Did it refuse the right tasks (we call this RSI)

Is the result stable (we call this MCI)

Reality Engine

Pareto frontier
