Measures whether inline comments steer AI coding agents on small surgical edits.
Comment Bench showed when comments matter. Comment Policy tells agents how to write them. Comment Audit keeps them from rotting.
This repo ships three things:
- the benchmark scenarios
- `comment-policy.md`, a drop-in policy for agent-heavy repos
- `comment-audit`, a skill for auditing stale, vague, or low-signal comments
Blog post: *Comments only matter when the code doesn't*. What the benchmark probes: `RESULTS.md`.
```bash
curl -O https://raw.githubusercontent.com/caiopizzol/comment-bench/main/comment-policy.md
```

Import it from your CLAUDE.md / AGENTS.md, or paste the contents in.
Copy the skill into your local agent skills directory.
```bash
# Codex
mkdir -p ~/.codex/skills
cp -R skills/comment-audit ~/.codex/skills/

# Claude Code
mkdir -p ~/.claude/skills
cp -R skills/comment-audit ~/.claude/skills/
```

Then ask your agent:

```
Use comment-audit to audit changed files.
```
The skill reads your local `comment-policy.md`, `CLAUDE.md`, `AGENTS.md`, and project conventions before judging comments. It reports keep, delete, update, or investigate; it does not edit files unless you ask. The Codex/OpenAI manifest lives at `skills/comment-audit/agents/openai.yaml`.
Model-agnostic. Two commands: prep a workspace for the agent to edit, score the result against hidden tests. The middle step is yours.
```bash
git clone https://github.com/caiopizzol/comment-bench
cd comment-bench
bun install
```

```bash
# List scenarios
bun benchmark.ts list

# Prep a workspace for a (scenario, treatment) pair
bun benchmark.ts prep refund_window_branch human_why_inline --out /tmp/work

# Run YOUR agent on /tmp/work however you want. The agent reads task.md
# and edits the file(s) listed under `editable` in meta.ts.
#   Claude Code:  cd /tmp/work && claude
#   Cursor:       open /tmp/work, point Cursor at task.md
#   Codex CLI:    codex exec --cd /tmp/work
#   Aider:        cd /tmp/work && aider
#   Manual edit:  edit /tmp/work/src/refunds.ts yourself for a sanity check

# Score the result
bun benchmark.ts score refund_window_branch /tmp/work
```

`score` exits 0 if both task and invariant passed, 1 otherwise.
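Because `prep` and `score` are plain commands with meaningful exit codes, sweeping every treatment for a scenario is easy to script. A minimal sketch in Bun-flavored TypeScript, assuming the scenario and treatment names above; the agent invocation in the middle is a placeholder you would replace with your own tool:

```ts
// sweep.ts - hypothetical driver script; run with `bun sweep.ts`.
import { spawnSync } from "bun";

const scenario = "refund_window_branch";
const treatments = [
  "none", "what_paraphrase", "human_why_inline", "human_file_header",
  "aidev_anchor", "ai_generated_comment", "stale_misleading",
];

for (const treatment of treatments) {
  const workdir = `/tmp/work-${scenario}-${treatment}`;

  // Prep a fresh workspace for this (scenario, treatment) pair.
  spawnSync(["bun", "benchmark.ts", "prep", scenario, treatment, "--out", workdir]);

  // Run your agent here, e.g. `codex exec --cd ${workdir}` or an editor session.
  // (placeholder - substitute your own agent invocation)

  // Score: exit code 0 means both the feature test and the invariant passed.
  const { exitCode } = spawnSync(["bun", "benchmark.ts", "score", scenario, workdir]);
  console.log(`${treatment}: ${exitCode === 0 ? "PASS" : "FAIL"}`);
}
```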
```
scenarios/<id>/
  meta.ts              metadata: agent_visible files, editable files, trap, invariant
  task.md              agent-visible task description
  src/                 (optional) read-only files visible to the agent
  treatments/          seven comment payload variants
    <treatment>/src/<file>.ts
  oracle/              hidden tests, never shown to the agent
    feature.test.ts    did the new feature work
    invariant.test.ts  did the protected invariant hold
```
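As a rough illustration of those `meta.ts` fields (the shipped schema in each scenario is the source of truth; the field shapes and values below are guesses):

```ts
// scenarios/<id>/meta.ts - illustrative sketch only, not the canonical schema.
export default {
  agent_visible: ["task.md", "src/refunds.ts"], // files copied into the prepped workspace
  editable: ["src/refunds.ts"],                 // files the agent is allowed to change
  trap: "widening the refund window for every product type, gift cards included",
  invariant: "gift-card refunds stay capped at 24 hours",
};
```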
The seven treatments (only the comment payload varies):
- `none` - no comment
- `what_paraphrase` - restates what the code does
- `human_why_inline` - precise intent at point of use
- `human_file_header` - module-level rule listing
- `aidev_anchor` - tagged as `// AIDEV-NOTE:`
- `ai_generated_comment` - generic plausible AI-style docstring
- `stale_misleading` - comment contradicts the code or rule
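To make the contrast concrete, here is a rough sketch of how two payloads might read around the same line of a refund function; the wording and surrounding code are illustrative, not copied from the scenario files:

```ts
type Order = { productType: "gift_card" | "physical" | "digital" };

// human_why_inline (illustrative): states the intent at the point of use -
//   "Gift cards are capped at 24h; the processor voids the card after that.
//    Do not widen this when adjusting tier windows."
// stale_misleading (illustrative): asserts the opposite of the rule -
//   "Gift cards follow the same refund window as every other product."
export function refundWindowHours(order: Order, tierWindowHours: number): number {
  if (order.productType === "gift_card") {
    return Math.min(tierWindowHours, 24); // the 24-hour gift-card cap
  }
  return tierWindowHours;
}
```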
All four scenarios use the gift-card refund domain. They differ in where the 24-hour gift-card cap is enforced.
| id | where the cap lives |
|---|---|
| `refund_window_branch` | inside an `if (order.productType === "gift_card")` branch |
| `refund_window_accumulator` | applied via `Math.min(window, processorCap)` after tier selection |
| `refund_window_helper` | inside a sibling helper (`capRefundWindow`) imported from `./processor_rules` |
| `refund_window_comment_only` | nowhere in the code; lives only in the comment payload |
The first three test "comments did nothing": the cap is in the code, and the agent preserves it regardless of the comment. The fourth tests "comments decided everything": the cap exists only in the comment, so the comment payload determines whether the agent honors it.
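For example, in `refund_window_comment_only` the editable source might look roughly like this: nothing in the code enforces the cap, so only the injected comment payload carries the rule (the snippet is an illustration, not the actual scenario file):

```ts
type GiftOrder = { productType: string; tier: "standard" | "premium" };

// [comment payload injected here by the treatment - e.g. the human_why_inline
//  variant states that gift-card refunds must never exceed 24 hours]
export function refundWindow(order: GiftOrder): number {
  // Note: nothing below enforces the gift-card cap; only the comment states the rule.
  return order.tier === "premium" ? 72 : 48;
}
```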
To add a scenario:

- Copy any directory under `scenarios/`.
- Edit `task.md`, the seven treatments, and the oracle tests (a sketch of an oracle test follows this list).
- Update `meta.ts`: list every agent-visible file under `agent_visible`, every editable file under `editable`.
- The trap should fire on a tempting wrong fix and pass on the right one. Validate the canonical (`none`) treatment plus a manually-applied wrong fix against the oracle before launching trial sweeps.
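Oracle tests are ordinary Bun tests. A minimal sketch of an invariant test for the gift-card cap, assuming a `refundWindowHours` export like the earlier illustration (match the import path and signature to your scenario's real source):

```ts
// oracle/invariant.test.ts - illustrative sketch, not the shipped oracle.
import { describe, expect, test } from "bun:test";
import { refundWindowHours } from "../src/refunds"; // hypothetical path and export

describe("protected invariant", () => {
  test("gift-card refunds stay capped at 24 hours", () => {
    // Even with a generous 72-hour tier window, gift cards must stay at <= 24h.
    expect(refundWindowHours({ productType: "gift_card" }, 72)).toBeLessThanOrEqual(24);
  });
});
```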
- Single-shot edits in small workspaces. Multi-turn sessions and large repos are out of scope.
- Tests are TypeScript via Bun. The benchmark idea is language-agnostic; port the scenarios to run against Python or Go.
MIT. See LICENSE.