Plain-English vocabulary
The seven terms you need first
These are the words we use across the site. Open the ones you care about; ignore the rest.
What this page is
- What you'll find: 7 core terms in plain English, each in a click-to-open card.
- Who it's for: people landing on CodingAgentBench for the first time.
- How to go deeper: switch to Nerds mode for all 45 terms with anchor nav and live filter.
The headline score (we call this PBS)
PBS rolls five things into one number between 0 and 1: did it pass tests, did it stay honest, did it refuse correctly, did it spend tokens well, and did it finish fast. You compare CLIs by PBS first, then drill into the parts you care about.
Blast radius
Blast radius counts the distinct files the agent touched in its final diff. A bug fix that needed one line but edits ten files has a blast radius of ten. Lower is better. We use this so you can spot CLIs that ripple changes further than they should.
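If you want to see the counting spelled out, here is a minimal sketch. It assumes you already have the list of changed paths, one per line, such as the output of `git diff --name-only`. The function name `blast_radius` is ours for illustration and is not the site's actual implementation.

```python
def blast_radius(changed_paths: str) -> int:
    """Count distinct file paths in a newline-separated list.

    Assumption: the input looks like `git diff --name-only` output,
    one path per line. Blank lines and repeated paths are ignored.
    """
    paths = {line.strip() for line in changed_paths.splitlines() if line.strip()}
    return len(paths)


# Hypothetical example: three lines, but only two distinct files.
sample = "src/app.py\nsrc/util.py\nsrc/app.py\n"
print(blast_radius(sample))
```

Counting distinct paths rather than lines is the point: a file edited in five places still counts once.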
Did it follow the rules (we call this L15)
Every task ships with an AGENTS.md file listing constraints, like "do not touch config.yaml" or "add no new dependencies." L15 is the fraction of those constraints the agent honored. We score it with deterministic checks, not another LLM.
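As a rough picture of "deterministic checks," the sketch below treats each constraint as a plain yes/no function over the set of files the agent changed. Everything here is hypothetical: the function name `rule_compliance`, the check shapes, and the use of requirements.txt as a stand-in for "new dependencies" are assumptions for illustration, not the real scorer.

```python
from typing import Callable, Dict, Set

# A check takes the set of changed file paths and returns True if honored.
Check = Callable[[Set[str]], bool]


def rule_compliance(changed_files: Set[str], checks: Dict[str, Check]) -> float:
    """Return the fraction of constraints honored, between 0 and 1."""
    if not checks:
        # Assumption: a task with no constraints counts as fully compliant.
        return 1.0
    honored = sum(1 for check in checks.values() if check(changed_files))
    return honored / len(checks)


# Hypothetical checks for the two example constraints.
checks = {
    "do not touch config.yaml": lambda files: "config.yaml" not in files,
    # Stand-in only: treats an edited requirements.txt as a new dependency.
    "add no new dependencies": lambda files: "requirements.txt" not in files,
}

changed = {"src/app.py", "requirements.txt"}
print(rule_compliance(changed, checks))
```

Because each check is ordinary code, running it twice on the same diff gives the same answer, which is what separates it from an LLM judge.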
Did it refuse the right tasks (we call this Stays)
Some tasks are traps: prompt injections, impossible asks, unsafe requests. Stays measures whether the agent declined those for the right reason. A deterministic oracle decides, so the score does not depend on judge taste.
Is the result stable (we call this Costs)
We rerun every cell several times under the same conditions. Costs is the fraction of reruns that produced the same verdict. Low Costs means the cell is flaky, so you should treat its headline score with caution.
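Here is one way the arithmetic could look. The sketch assumes "the same verdict" means agreeing with the most common verdict across the reruns; the page does not spell that out, so treat it as our reading. The function name `stability` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def stability(verdicts: Sequence[str]) -> float:
    """Fraction of reruns that agree with the most common verdict.

    Assumption: agreement is measured against the modal verdict.
    """
    if not verdicts:
        raise ValueError("need at least one rerun")
    _, top_count = Counter(verdicts).most_common(1)[0]
    return top_count / len(verdicts)


# Hypothetical cell rerun four times: three passes, one fail.
print(stability(["pass", "pass", "fail", "pass"]))
```

A cell that always lands on the same verdict scores 1.0; an even split between two verdicts scores 0.5.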
Reality Engine
The Reality Engine replays published cells from their pinned container digest and sweep id. It flags drift if the score changes. You can trust a published number because we can rebuild it bit-for-bit on demand.
Pareto frontier
On a chart of score versus cost, a CLI sits on the Pareto frontier if no other CLI scores at least as well for no more cost while beating it on one of the two. The frontier is the short list worth your attention. Everything off the frontier is dominated by something on it.
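To make "dominated" concrete, the sketch below uses the standard dominance test: one entry dominates another if it is at least as good on both score and cost and strictly better on at least one. The `Cell` type, its field names, and the sample numbers are made up for illustration.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    name: str
    score: float  # higher is better
    cost: float   # lower is better


def dominates(a: Cell, b: Cell) -> bool:
    """True if a is no worse than b on both axes and better on one."""
    no_worse = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(cells: List[Cell]) -> List[Cell]:
    """Keep every cell that no other cell dominates."""
    return [c for c in cells if not any(dominates(o, c) for o in cells)]


# Hypothetical data, not real benchmark results.
cells = [
    Cell("alpha", 0.9, 5.0),
    Cell("beta", 0.8, 2.0),
    Cell("gamma", 0.7, 3.0),  # beta scores higher and costs less
    Cell("delta", 0.9, 6.0),  # alpha matches the score for less
]
print([c.name for c in pareto_frontier(cells)])
```

The pairwise comparison is quadratic in the number of entries, which is fine for a leaderboard-sized list.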