Submit a CLI, task, or fork
Maintainer of a coding agent? Want a new task tested? Adding a fork? Here’s how.
Each submission type has its own protocol. Switch to Nerds mode for the schemas plus the GitHub-issue templates.
This is the long-form companion to CONTRIBUTING.md. It covers, in detail, how to submit a new TUI, a new task, or a community-run result to CodingAgentBench.
All submissions are made by pull request against main. There is no email path, no private channel, no "vendor liaison" path. The pull request is the protocol.
There are three kinds of submission. Each has its own checklist.
Submissions outside these three categories (new metrics, new host models, methodology changes) are handled via the methodology issue label first. See MANIFESTO.md commitment #4 — methodology changes are never silent.
CodingAgentBench measures the cross-product of open TUIs and open host models. Adding a TUI is structural: it adds an entire row of cells per host model per sweep. Treat the submission seriously.
The TUI must satisfy all of:
- docs/tui-catalog.md.
- manifesto-faq.md on why we exclude Claude Code / Codex CLI / Gemini CLI.)
- --prompt or --task invocation mode; that is sufficient.
- npm, cargo, go, pip, or shell modifications outside the container.

For a TUI named foobar:
docker/foobar/Dockerfile
Base image of your choice (we recommend debian:bookworm-slim or the official upstream image if one exists). Install the TUI and pin its version. Set WORKDIR /work. Set a default ENTRYPOINT that the adapter expects.
- No curl | sh from random hosts, no unpinned downloads. Use pinned apt versions or vendored archives.
- Must run as a non-root user inside the container.
- Must not assume internet egress at runtime; only the model endpoint will be reachable.

harness/tui_adapters/foobar.py
A Python module implementing the adapter contract (see harness/README.md for the abstract interface; Wave 1A owns the canonical contract). At minimum the adapter exposes a run_task(task_manifest, workdir, model_endpoint) -> RunResult function.
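As a rough illustration of the adapter contract, here is a minimal sketch for the foobar example. It is not the canonical interface, which lives in harness/README.md. The RunResult fields, the image tag, the MODEL_ENDPOINT environment variable, and the build_command helper are all assumptions made for this sketch; only the run_task signature, the /work directory, the --prompt mode, and the 1800-second default timeout come from this document.

```python
from __future__ import annotations

import subprocess
import time
from dataclasses import dataclass
from pathlib import Path

# Hypothetical image tag; a real adapter would reference a pinned digest.
IMAGE = "codingagentbench/foobar:pinned"


@dataclass
class RunResult:
    """Hypothetical result shape; the real one is defined by the harness."""
    exit_code: int
    stdout: str
    stderr: str
    wall_seconds: float
    timed_out: bool


def build_command(task_manifest: dict, workdir: str, model_endpoint: str) -> list[str]:
    """Build the docker invocation that runs the TUI against one task."""
    return [
        "docker", "run", "--rm",
        # Mount the task's starting state at the container's WORKDIR.
        "-v", f"{Path(workdir).resolve()}:/work",
        # Assumed variable name for handing the model endpoint to the TUI.
        "-e", f"MODEL_ENDPOINT={model_endpoint}",
        IMAGE,
        # Non-interactive mode: pass the task description as the prompt.
        "--prompt", task_manifest["description"],
    ]


def run_task(task_manifest: dict, workdir: str, model_endpoint: str) -> RunResult:
    """Run one task in the container and capture its outcome."""
    command = build_command(task_manifest, workdir, model_endpoint)
    timeout = task_manifest.get("timeout_seconds", 1800)
    started = time.monotonic()
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Partial output is dropped here to keep the sketch simple.
        return RunResult(-1, "", "", time.monotonic() - started, True)
    return RunResult(
        completed.returncode,
        completed.stdout,
        completed.stderr,
        time.monotonic() - started,
        False,
    )
```

Keeping command construction in its own function makes the adapter checkable without starting a container; run_task itself needs Docker and the built image.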
docker/foobar/README.md
A short README explaining:
- --no-confirm or it stalls").
- Known compatibility issues with any host model.
- A link back to the upstream repo.

- docs/tui-catalog.md under MVP or Phase 1.
- docs/manifesto-faq.md only if the TUI's inclusion will surprise readers (e.g., a niche tool with a controversial license).
- results/runs/<sweep-id>.jsonl and are merged in the same PR.

We do not promise SLAs on review. The project is independent and small.
If you are affiliated with the TUI's upstream project — maintainer, employee of the company that funds it, contractor — disclose it in the PR description per CONTRIBUTING.md. Disclosed vendor PRs are welcome. Hidden vendor PRs, if discovered later, get the TUI flagged in docs/tui-catalog.md and may result in a re-run with the disclosure noted.
The task corpus is the heart of the benchmark. Tasks must satisfy our contamination-resistance commitment (manifesto commitment #5).
For a task named fix-nullable-iter in the polyglot bucket:
tasks/v0.1/polyglot/fix-nullable-iter/
├── task.yaml # Manifest: id, description, language, scoring rubric
├── workdir/ # The broken starting state the agent sees
│ └── ... source files ...
├── answer/ # The verified-correct end state (private during active quarter)
│ └── ... source files ...
└── scorer/ # Hidden tests + scorer script
    ├── test_*.py # or test_*.go, test_*.rs etc.
    └── score.py # Returns 0 (pass) or non-zero (fail)
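The exit-code convention for score.py could be sketched as follows. This is one possible shape under stated assumptions, not the project's actual scorer: the score function name and the default pytest command are hypothetical, and a Go or Rust task would substitute its own test command.

```python
import subprocess
import sys

# Assumed default: run pytest with the current interpreter.
DEFAULT_TEST_COMMAND = [sys.executable, "-m", "pytest", "-q"]


def score(workdir, test_command=None) -> int:
    """Run the hidden tests in workdir; return 0 on pass, 1 on any failure."""
    command = list(test_command) if test_command is not None else DEFAULT_TEST_COMMAND
    completed = subprocess.run(command, cwd=workdir)
    # Collapse every non-zero test exit status into a single failure code.
    return 0 if completed.returncode == 0 else 1
```

A score.py built on this would end with sys.exit(score(path_to_agent_workdir)), so the harness only ever has to inspect the process exit status.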
Wave 1D owns the canonical task.yaml schema. At minimum the manifest contains:
- id (string, unique across the corpus)
- bucket (polyglot | mutations | integrity)
- language (python | go | rust | typescript | c | polyglot)
- description (the prompt the agent will see)
- scope (the list of files/directories the agent is expected to touch; used to compute blast-radius)
- timeout_seconds (default 1800)
- integrity_category (only for integrity probes: prompt_injection | scope_creep | poisoned_dep | bad_task)

For mutation-bucket tasks the workflow is different. You don't write the bug by hand; you generate it via the public mutation tool:
python -m tools.mutate \
--source-repo <permissive-OSS-repo> \
--seed $(date +%Y%m%d)-$(uuidgen) \
--output tasks/v0.1/mutations/<name>/
Record the seed in the task manifest. Maintainers commit it to the private seed registry on merge. The active-quarter seed is hash-committed publicly per manifesto commitment #5; the raw seed is sealed until the next rotation.
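Before opening a PR, a contributor might sanity-check a manifest against the minimum fields listed earlier, including the recorded seed for mutation tasks. The sketch below is only an illustration: the canonical schema belongs to Wave 1D, the validate_manifest function is hypothetical, and the seed key name is an assumption, since this document does not say how the seed is stored in the manifest.

```python
from __future__ import annotations

BUCKETS = {"polyglot", "mutations", "integrity"}
LANGUAGES = {"python", "go", "rust", "typescript", "c", "polyglot"}
INTEGRITY_CATEGORIES = {"prompt_injection", "scope_creep", "poisoned_dep", "bad_task"}
DEFAULT_TIMEOUT_SECONDS = 1800


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the minimum fields look sane."""
    errors = []
    for key in ("id", "bucket", "language", "description", "scope"):
        if not manifest.get(key):
            errors.append(f"missing required field: {key}")

    bucket = manifest.get("bucket")
    if bucket is not None and bucket not in BUCKETS:
        errors.append(f"unknown bucket: {bucket}")

    language = manifest.get("language")
    if language is not None and language not in LANGUAGES:
        errors.append(f"unknown language: {language}")

    # timeout_seconds is optional and falls back to the documented default.
    timeout = manifest.get("timeout_seconds", DEFAULT_TIMEOUT_SECONDS)
    if not isinstance(timeout, int) or timeout <= 0:
        errors.append("timeout_seconds must be a positive integer")

    category = manifest.get("integrity_category")
    if bucket == "integrity":
        if category not in INTEGRITY_CATEGORIES:
            errors.append("integrity probes need a valid integrity_category")
    elif category is not None:
        errors.append("integrity_category is only valid for integrity probes")

    # Assumption: the mutation seed is stored under a top-level 'seed' key.
    if bucket == "mutations" and not manifest.get("seed"):
        errors.append("mutation tasks must record the seed")

    return errors
```

Returning a list of messages rather than raising on the first problem lets a contributor fix everything in one pass.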
Integrity probes are the most reviewer-intensive. They require:
You have hardware, you ran a CodingAgentBench sweep, and you want your numbers on the community leaderboard.
results/community/<YYYYMMDD>-<your-handle>/
├── meta.yaml # Who ran it, hardware, model serving stack, dates
├── runs.jsonl # One row per cell, harness-emitted
└── traces/ # Per-cell full transcripts (large; can be a HuggingFace pointer)
meta.yaml minimum fields:
- submitter: GitHub handle, optional real name
- affiliations: optional, but required if you are affiliated with a TUI/model vendor (manifesto commitment #2)
- hardware: GPU model, CPU, RAM, kernel
- model_serving: vllm 0.x.x, --tool-call-parser qwen3_coder, etc.
- harness_commit: the CodingAgentBench commit hash you ran
- corpus_version: the corpus version you ran
- sweep_started_utc and sweep_ended_utc

Maintainers will pick a random 10% of your cells, re-run them on lab hardware against the same image digests and model build IDs, and compare. If results are within run-to-run noise (currently ±2pp pass-rate, ±10% tokens-per-correct), the PR is merged and the rows appear on the community leaderboard with a "community-verified" badge.
If results fall outside that noise band, the divergence is documented on the PR and the submission is rejected. We do not publish unverified community numbers.
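The spot-check described above can be pictured with a small sketch. It is an illustration under assumptions, not the maintainers' tooling: the function names are hypothetical, pass rates are taken as percentages, and the ±10% tokens-per-correct band is measured relative to the lab value, which the text does not specify.

```python
import random

PASS_RATE_TOLERANCE_PP = 2.0      # percentage points, per the current noise band
TOKENS_TOLERANCE_FRACTION = 0.10  # relative tolerance on tokens-per-correct


def sample_cells(cell_ids, fraction=0.10, seed=None):
    """Pick a random subset of cells to re-run (at least one if any exist)."""
    cell_ids = list(cell_ids)
    if not cell_ids:
        return []
    count = max(1, round(len(cell_ids) * fraction))
    return random.Random(seed).sample(cell_ids, count)


def within_noise(community_pass_rate, lab_pass_rate,
                 community_tokens_per_correct, lab_tokens_per_correct):
    """Return True when both metrics agree within the run-to-run noise band."""
    pass_ok = abs(community_pass_rate - lab_pass_rate) <= PASS_RATE_TOLERANCE_PP
    # Assumption: the relative band is anchored on the lab measurement.
    tokens_ok = (
        abs(community_tokens_per_correct - lab_tokens_per_correct)
        <= TOKENS_TOLERANCE_FRACTION * lab_tokens_per_correct
    )
    return pass_ok and tokens_ok
```

For a 50-cell submission, sample_cells selects 5 cells; the submission passes only if within_noise holds for the re-run comparison.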
We will close PRs that:
- CONTRIBUTING.md is explicit on this: never instruct anyone to install a TUI on their host machine).

Rejection is documented in the PR thread and is not personal. We are protecting the integrity of the measurement.
Open a GitHub issue with the appropriate label (submission, methodology, tui-request, task-request). We do not use Discord, Slack, or email for submission triage. The PR and the issue tracker are the channels of record.