CodingAgentBench

Submit a CLI, task, or fork

Maintainer of a coding agent? Want a new task tested? Adding a fork? Here’s how.

Each submission type has its own protocol. Switch to Nerds mode for the schemas plus the GitHub-issue templates.

Submission Protocol

This is the long-form companion to CONTRIBUTING.md. It covers, in detail, how to submit a new TUI, a new task, or a community-run result to CodingAgentBench.

All submissions are made by pull request against main. There is no email path, no private channel, no "vendor liaison" path. The pull request is the protocol.


What you can submit

There are three kinds of submission. Each has its own checklist.

  1. A new TUI — you want a coding agent added to the matrix.
  2. A new task — you have a real-world bug or polyglot exercise the corpus should include.
  3. A new run — you ran the existing harness on your own hardware and want your numbers on the community leaderboard.

Submissions outside these three categories (new metrics, new host models, methodology changes) are handled via the methodology issue label first. See MANIFESTO.md commitment #4 — methodology changes are never silent.


1. Submitting a new TUI

CodingAgentBench measures the cross-product of open TUIs and open host models. Adding a TUI is structural: every sweep gains a full row of cells, one per host model. Treat the submission seriously.

Eligibility

The TUI must satisfy all of:

  • Open-source. A verifiable OSI-approved license file in the repo root. Source-available licenses (BSL, FSL, ELv2) are accepted on a case-by-case basis and noted in docs/tui-catalog.md.
  • Configurable model endpoint. The TUI must support pointing at an arbitrary OpenAI-compatible (or comparable) endpoint via configuration. If it is hard-wired to a single vendor's API, it is out of scope. (See manifesto-faq.md on why we exclude Claude Code / Codex CLI / Gemini CLI.)
  • Headless or scriptable. The TUI must run unattended. Pure interactive TUIs that require human keystrokes cannot be benchmarked. Many "TUIs" in our scope ship a --prompt or --task invocation mode; that is sufficient.
  • Dockerizable. It must be possible to install the TUI and its full runtime inside a Docker image. No host-side install steps. No global npm, cargo, go, pip, or shell modifications outside the container.

Files to add

For a TUI named foobar:

  1. docker/foobar/Dockerfile

Base image of your choice (we recommend debian:bookworm-slim or the official upstream image if one exists). Install the TUI and pin its version. Set WORKDIR /work. Set a default ENTRYPOINT that the adapter expects.

  • Must build offline-friendly: no curl | sh from random hosts, no unpinned downloads. Use pinned apt versions or vendored archives.
  • Must run as a non-root user inside the container.
  • Must not assume internet egress at runtime; only the model endpoint will be reachable.

  2. harness/tui_adapters/foobar.py

A Python module implementing the adapter contract (see harness/README.md for the abstract interface — Wave 1A owns the canonical contract). At minimum the adapter exposes a run_task(task_manifest, workdir, model_endpoint) -> RunResult function.

  • The adapter knows how to translate CodingAgentBench's task manifest into the TUI's invocation surface (CLI args, config file, environment variables).
  • The adapter knows how to extract the trace, tokens, latency, and final workdir state from the TUI's output.
  • The adapter is responsible for the per-task timeout enforcement at the in-container layer too.

  3. docker/foobar/README.md

A short README explaining:

  • Which upstream commit / version is pinned in the Dockerfile.
  • Any TUI-specific quirks (e.g., "this TUI requires --no-confirm or it stalls").
  • Known compatibility issues with any host model.
  • A link back to the upstream repo.

  4. An entry in docs/tui-catalog.md under MVP or Phase 1.
  5. A row in docs/manifesto-faq.md only if the TUI's inclusion will surprise readers (e.g., a niche tool with a controversial license).
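
As a rough illustration of the adapter module described above (harness/tui_adapters/foobar.py), the sketch below shows one way run_task might translate a manifest into a CLI call and enforce the per-task timeout. Only the run_task(task_manifest, workdir, model_endpoint) signature comes from this document. The RunResult fields, the run_with_timeout helper, and the --endpoint flag of the hypothetical foobar TUI are assumptions made for the example; the canonical contract is in harness/README.md. Token extraction is omitted because it depends on each TUI's output format.

```python
from __future__ import annotations

import subprocess
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunResult:
    """Hypothetical result shape; the canonical one is defined by the harness."""
    exit_code: int
    timed_out: bool
    latency_seconds: float
    trace: str


def build_command(task_manifest: dict, model_endpoint: str) -> list[str]:
    # Assumed CLI surface for the hypothetical `foobar` TUI.
    return ["foobar", "--prompt", task_manifest["description"],
            "--endpoint", model_endpoint]


def run_with_timeout(cmd: list[str], workdir: Path, timeout_seconds: float) -> RunResult:
    """Run cmd inside workdir, killing it once timeout_seconds has elapsed."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True,
                              text=True, timeout=timeout_seconds)
        exit_code, timed_out, trace = proc.returncode, False, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may be bytes, str, or None depending on the platform.
        out = exc.stdout or ""
        trace = out.decode(errors="replace") if isinstance(out, bytes) else out
        exit_code, timed_out = -1, True
    return RunResult(exit_code, timed_out, time.monotonic() - start, trace)


def run_task(task_manifest: dict, workdir: Path, model_endpoint: str) -> RunResult:
    # 1800 seconds mirrors the manifest's documented default timeout.
    timeout = task_manifest.get("timeout_seconds", 1800)
    cmd = build_command(task_manifest, model_endpoint)
    return run_with_timeout(cmd, workdir, timeout)
```

Keeping the command construction separate from the timeout wrapper makes the TUI-specific part small and lets the timeout logic be exercised without the TUI installed.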

What happens after PR

  • A maintainer pulls your branch, builds the image locally, and runs three smoke tasks (one polyglot, one mutation, one integrity probe) against a host model from the roster. Each smoke task must complete within the 30-minute per-task budget.
  • If smoke passes, the maintainer kicks off a full sweep on the active corpus. Sweep results land in results/runs/<sweep-id>.jsonl and are merged in the same PR.
  • If smoke fails, the failure mode is documented on the PR. You can fix and re-request review.

We do not promise SLAs on review. The project is independent and small.

Vendor disclosure

If you are affiliated with the TUI's upstream project — maintainer, employee of the company that funds it, contractor — disclose it in the PR description per CONTRIBUTING.md. Disclosed vendor PRs are welcome. Hidden vendor PRs, if discovered later, get the TUI flagged in docs/tui-catalog.md and may result in a re-run with the disclosure noted.


2. Submitting a new task

The task corpus is the heart of the benchmark. Tasks must satisfy our contamination-resistance commitment (manifesto commitment #5).

Eligibility

  • The task is original — not lifted verbatim from another benchmark (SWE-bench, HumanEval, MBPP, Aider-Polyglot, etc.). Inspired-by is fine; copy-pasted is not.
  • The task has a verifiably correct fix that you have personally executed and confirmed passes the scorer.
  • The source repository (for mutation tasks) is permissively licensed (MIT, Apache-2.0, BSD, ISC). Copyleft sources are out of scope for tasks because the answer set would inherit the copyleft.
  • The task is bounded: a competent agent + model combination can reasonably attempt it in under 30 minutes of wall-clock time.

Files to add

For a task named fix-nullable-iter in the polyglot bucket:

tasks/v0.1/polyglot/fix-nullable-iter/
├── task.yaml           # Manifest: id, description, language, scoring rubric
├── workdir/            # The broken starting state the agent sees
│   └── ... source files ...
├── answer/             # The verified-correct end state (private during active quarter)
│   └── ... source files ...
└── scorer/             # Hidden tests + scorer script
    ├── test_*.py       # or test_*.go, test_*.rs etc.
    └── score.py        # Returns 0 (pass) or non-zero (fail)
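
For orientation, a score.py along these lines would satisfy the "0 (pass) or non-zero (fail)" convention shown in the tree. This is a minimal sketch, not the project's scorer: the function names, the 600-second test timeout, the argument order, and the choice of pytest as the test runner are all assumptions.

```python
import subprocess
import sys
from pathlib import Path


def score(test_command: list[str], workdir: Path, timeout_seconds: int = 600) -> int:
    """Return 0 if the test command succeeds inside workdir, 1 otherwise."""
    try:
        proc = subprocess.run(test_command, cwd=workdir, timeout=timeout_seconds)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable test run counts as a failure.
        return 1
    return 0 if proc.returncode == 0 else 1


def main(argv: list[str]) -> int:
    # Assumed invocation: score.py <workdir> <scorer-dir>
    workdir, scorer_dir = Path(argv[1]), Path(argv[2])
    return score([sys.executable, "-m", "pytest", "-q", str(scorer_dir)], workdir)
```

A real script would end with sys.exit(main(sys.argv)) so that the process exit status carries the result.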

Wave 1D owns the canonical task.yaml schema. At minimum the manifest contains:

  • id (string, unique across the corpus)
  • bucket (polyglot | mutations | integrity)
  • language (python | go | rust | typescript | c | polyglot)
  • description (the prompt the agent will see)
  • scope (the list of files/directories the agent is expected to touch — used to compute blast-radius)
  • timeout_seconds (default 1800)
  • integrity_category (only for integrity probes: prompt_injection | scope_creep | poisoned_dep | bad_task)
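
The minimum fields listed above can be checked before opening a PR. The following is a sketch of such a check over an already-parsed manifest (a plain dict). The function name and error messages are illustrative, and the canonical task.yaml schema may impose rules not listed here.

```python
BUCKETS = ("polyglot", "mutations", "integrity")
LANGUAGES = ("python", "go", "rust", "typescript", "c", "polyglot")
INTEGRITY_CATEGORIES = ("prompt_injection", "scope_creep", "poisoned_dep", "bad_task")
DEFAULT_TIMEOUT_SECONDS = 1800


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the minimum fields look sane."""
    errors = []
    for key in ("id", "description"):
        value = manifest.get(key)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{key}: must be a non-empty string")
    if manifest.get("bucket") not in BUCKETS:
        errors.append(f"bucket: must be one of {BUCKETS}")
    if manifest.get("language") not in LANGUAGES:
        errors.append(f"language: must be one of {LANGUAGES}")
    scope = manifest.get("scope")
    if not isinstance(scope, list) or not scope or not all(isinstance(p, str) for p in scope):
        errors.append("scope: must be a non-empty list of paths")
    timeout = manifest.get("timeout_seconds", DEFAULT_TIMEOUT_SECONDS)
    if isinstance(timeout, bool) or not isinstance(timeout, int) or timeout <= 0:
        errors.append("timeout_seconds: must be a positive integer")
    category = manifest.get("integrity_category")
    if manifest.get("bucket") == "integrity":
        if category not in INTEGRITY_CATEGORIES:
            errors.append(f"integrity_category: must be one of {INTEGRITY_CATEGORIES}")
    elif category is not None:
        errors.append("integrity_category: only valid for integrity probes")
    return errors
```

Uniqueness of id across the corpus is not checked here, since that requires scanning every other manifest.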

Mutation tasks

For mutation-bucket tasks the workflow is different. You don't write the bug by hand; you generate it via the public mutation tool:

python -m tools.mutate \
    --source-repo <permissive-OSS-repo> \
    --seed $(date +%Y%m%d)-$(uuidgen) \
    --output tasks/v0.1/mutations/<name>/

You must record the seed in the task manifest. On merge, maintainers commit the seed to the private seed registry. The active-quarter seed is hash-committed publicly per manifesto commitment #5; the raw seed is sealed until the next rotation.
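
To make the hash-commitment idea concrete: a digest of the seed is published now, and anyone can later check the revealed seed against it. The sketch below assumes SHA-256 over the UTF-8 seed string. This document does not specify the hash function or encoding the project actually uses, so treat both, and the function names, as illustrative.

```python
import hashlib
import hmac


def commit_seed(seed: str) -> str:
    """Return the hex digest that would be published while the seed stays sealed."""
    return hashlib.sha256(seed.encode("utf-8")).hexdigest()


def verify_seed(revealed_seed: str, published_commitment: str) -> bool:
    """Check a revealed seed against the commitment published earlier."""
    expected = commit_seed(revealed_seed)
    return hmac.compare_digest(expected, published_commitment.lower())
```

A bare hash is only a safe commitment when the seed is hard to guess. The uuidgen component of the seed format shown above provides that unpredictability.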

Integrity probes

Integrity probes are the most reviewer-intensive. They require:

  • A clear description of the adversarial mechanism (what is the agent being tempted to do wrong?).
  • A pass condition that is unambiguous (e.g., "the agent must not modify file X" or "the agent must surface the discrepancy in its final message").
  • A discussion in the PR of how a good agent should behave. This becomes part of the public archive when the probe rotates out.
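
An unambiguous pass condition such as "the agent must not modify file X" can be checked mechanically. The sketch below compares content digests of protected files before and after the agent runs. The helper names are hypothetical and this is not the project's scorer, only one way such a condition might be expressed.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def snapshot(workdir: Path, protected: list[str]) -> dict[str, str | None]:
    """Map each protected relative path to its SHA-256 digest (None if absent)."""
    digests: dict[str, str | None] = {}
    for rel in protected:
        path = workdir / rel
        digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None
    return digests


def modified_files(before: dict[str, str | None], after: dict[str, str | None]) -> list[str]:
    """Return protected paths whose content changed, appeared, or disappeared."""
    return sorted(rel for rel in before if before[rel] != after.get(rel))
```

The probe passes when modified_files returns an empty list. Edits to files outside the protected set are ignored by this check.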

3. Submitting a community run

You have hardware, you ran a CodingAgentBench sweep, and you want your numbers on the community leaderboard.

Eligibility

  • You ran the current harness against the current active corpus. Old-corpus runs are accepted but go to the archive leaderboard.
  • You ran the standard sandbox protocol (see METHODOLOGY §7). No relaxed flags, no host-network, no skipped timeouts.
  • You captured the full JSONL trace per cell — not just the final scores.

Files to add

results/community/<YYYYMMDD>-<your-handle>/
├── meta.yaml           # Who ran it, hardware, model serving stack, dates
├── runs.jsonl          # One row per cell, harness-emitted
└── traces/             # Per-cell full transcripts (large; can be a HuggingFace pointer)

meta.yaml minimum fields:

  • submitter — GitHub handle, optional real name
  • affiliations — optional, but required if you are affiliated with a TUI/model vendor (manifesto commitment #2)
  • hardware — GPU model, CPU, RAM, kernel
  • model_serving — vllm 0.x.x, --tool-call-parser qwen3_coder, etc.
  • harness_commit — the CodingAgentBench commit hash you ran
  • corpus_version — the corpus version you ran
  • sweep_started_utc and sweep_ended_utc
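
A submitter could sanity-check meta.yaml before opening the PR. The sketch below works on the parsed file as a plain dict. The function name, the vendor_affiliated flag, and the assumption that timestamps are ISO 8601 strings are illustrative, since this document does not fix a timestamp format.

```python
from datetime import datetime

REQUIRED_FIELDS = (
    "submitter", "hardware", "model_serving", "harness_commit",
    "corpus_version", "sweep_started_utc", "sweep_ended_utc",
)


def check_meta(meta: dict, vendor_affiliated: bool) -> list[str]:
    """Return a list of problems found in a parsed meta.yaml."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if not meta.get(key)]
    if vendor_affiliated and not meta.get("affiliations"):
        problems.append("affiliations: required for vendor-affiliated submitters")
    start, end = meta.get("sweep_started_utc"), meta.get("sweep_ended_utc")
    if start and end:
        try:
            if datetime.fromisoformat(start) > datetime.fromisoformat(end):
                problems.append("sweep_started_utc is later than sweep_ended_utc")
        except (TypeError, ValueError):
            # Non-string values or unparseable strings end up here.
            problems.append("sweep timestamps: expected ISO 8601 strings")
    return problems
```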

Verification

Maintainers will pick a random 10% of your cells, re-run them on lab hardware against the same image digests and model build IDs, and compare. If results are within run-to-run noise (currently ±2 percentage points on pass rate and ±10% on tokens-per-correct), the PR is merged and the rows appear on the community leaderboard with a "community-verified" badge.
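
The spot-check above amounts to sampling cells and comparing two metrics against tolerances. A minimal sketch follows, assuming pass rate is stored as a fraction between 0 and 1 and tokens-per-correct is a positive number. The function names, the dict layout, and the choice of the rerun as the reference for the relative token comparison are assumptions, not the maintainers' actual tooling.

```python
from __future__ import annotations

import math
import random


def sample_cells(cell_ids: list[str], fraction: float = 0.10, seed: int | None = None) -> list[str]:
    """Pick a random subset of cells to re-run (at least one if any exist)."""
    if not cell_ids:
        return []
    k = max(1, math.ceil(len(cell_ids) * fraction))
    return sorted(random.Random(seed).sample(cell_ids, k))


def within_noise(submitted: dict, rerun: dict,
                 pass_rate_pp: float = 2.0, tokens_rel: float = 0.10) -> bool:
    """Compare aggregate metrics against the stated run-to-run tolerances."""
    # Pass rates are fractions, so multiply by 100 to get percentage points.
    pass_delta_pp = abs(submitted["pass_rate"] - rerun["pass_rate"]) * 100
    # Relative difference, measured against the maintainers' rerun.
    tokens_delta = (abs(submitted["tokens_per_correct"] - rerun["tokens_per_correct"])
                    / rerun["tokens_per_correct"])
    return pass_delta_pp <= pass_rate_pp and tokens_delta <= tokens_rel
```

Passing a fixed seed makes the sample reproducible, which may help when a divergence needs to be documented on the PR.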

If results diverge beyond run-to-run noise, the divergence is documented on the PR and the submission is rejected. We do not publish unverified community numbers.


What we will reject

We will close PRs that:

  • Add a closed-source TUI (manifesto commitment #1 and the FAQ).
  • Add a host-side install step (CONTRIBUTING.md is explicit on this — never instruct anyone to install a TUI on their host machine).
  • Add a task lifted verbatim from another public benchmark.
  • Add a task whose hidden tests check for the verbatim implementation rather than behavior (Goodhart pitfall).
  • Submit a community run that was performed under modified sandbox flags.
  • Submit a vendor-affiliated PR without disclosure.

Rejection is documented in the PR thread and is not personal. We are protecting the integrity of the measurement.


Questions

Open a GitHub issue with the appropriate label (submission, methodology, tui-request, task-request). We do not use Discord, Slack, or email for submission triage. The PR and the issue tracker are the channels of record.