Measures whether inline comments steer AI coding agents on small surgical edits.
Comment Bench showed when comments matter. Comment Policy tells agents how to write them. Comment Audit keeps them from rotting.
This repo ships three things:
- the benchmark scenarios
- `comment-policy.md`, a drop-in policy for agent-heavy repos
- `comment-audit`, a skill for auditing stale, vague, or low-signal comments
Blog post: *Comments only matter when the code doesn't*. What the benchmark probes: `RESULTS.md`.
```bash
curl -O https://raw.githubusercontent.com/caiopizzol/comment-bench/main/comment-policy.md
```

Import it from your CLAUDE.md / AGENTS.md, or paste the contents in.
Copy the skill into your local agent skills directory.
```bash
# Codex
mkdir -p ~/.codex/skills
cp -R skills/comment-audit ~/.codex/skills/

# Claude Code
mkdir -p ~/.claude/skills
cp -R skills/comment-audit ~/.claude/skills/
```

Then ask your agent:

```
Use comment-audit to audit changed files.
```
The skill reads your local `comment-policy.md`, `CLAUDE.md`, `AGENTS.md`, and project conventions before judging comments. It reports keep, delete, update, or investigate; it does not edit files unless you ask. The Codex/OpenAI manifest lives at `skills/comment-audit/agents/openai.yaml`.
Model-agnostic. Two commands: prep a workspace for the agent to edit, score the result against hidden tests. The middle step is yours.
```bash
git clone https://github.com/caiopizzol/comment-bench
cd comment-bench
bun install
```

```bash
# List scenarios
bun benchmark.ts list

# Prep a workspace for a (scenario, treatment) pair
bun benchmark.ts prep refund_window_branch human_why_inline --out /tmp/work

# Run YOUR agent on /tmp/work however you want. The agent reads task.md
# and edits the file(s) listed under `editable` in meta.ts.
#   Claude Code:  cd /tmp/work && claude
#   Cursor:       open /tmp/work, point Cursor at task.md
#   Codex CLI:    codex exec --cd /tmp/work
#   Aider:        cd /tmp/work && aider
#   Manual edit:  edit /tmp/work/src/refunds.ts yourself for a sanity check

# Score the result
bun benchmark.ts score refund_window_branch /tmp/work
```

`score` exits 0 if both task and invariant passed, 1 otherwise.
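Because `prep` and `score` are plain commands with meaningful exit codes, sweeping every treatment for a scenario is easy to script. A minimal sketch in Bun-flavored TypeScript, assuming the scenario and treatment names above; the agent invocation in the middle is a placeholder you would replace with your own tool:

```ts
// sweep.ts - hypothetical driver script; run with `bun sweep.ts`.
import { spawnSync } from "bun";

const scenario = "refund_window_branch";
const treatments = [
  "none", "what_paraphrase", "human_why_inline", "human_file_header",
  "aidev_anchor", "ai_generated_comment", "stale_misleading",
];

for (const treatment of treatments) {
  const workdir = `/tmp/work-${scenario}-${treatment}`;

  // Prep a fresh workspace for this (scenario, treatment) pair.
  spawnSync(["bun", "benchmark.ts", "prep", scenario, treatment, "--out", workdir]);

  // Run your agent here, e.g. `codex exec --cd ${workdir}` or an editor session.
  // (placeholder - substitute your own agent invocation)

  // Score: exit code 0 means both the feature test and the invariant passed.
  const { exitCode } = spawnSync(["bun", "benchmark.ts", "score", scenario, workdir]);
  console.log(`${treatment}: ${exitCode === 0 ? "PASS" : "FAIL"}`);
}
```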
```
scenarios/<id>/
  meta.ts              metadata: agent_visible files, editable files, trap, invariant
  task.md              agent-visible task description
  src/                 (optional) read-only files visible to the agent
  treatments/          seven comment payload variants
    <treatment>/src/<file>.ts
  oracle/              hidden tests, never shown to the agent
    feature.test.ts    did the new feature work
    invariant.test.ts  did the protected invariant hold
```
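As a rough illustration of those `meta.ts` fields (the shipped schema in each scenario is the source of truth; the field shapes and values below are guesses):

```ts
// scenarios/<id>/meta.ts - illustrative sketch only, not the canonical schema.
export default {
  agent_visible: ["task.md", "src/refunds.ts"], // files copied into the prepped workspace
  editable: ["src/refunds.ts"],                 // files the agent is allowed to change
  trap: "widening the refund window for every product type, gift cards included",
  invariant: "gift-card refunds stay capped at 24 hours",
};
```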
The seven treatments (only the comment payload varies):
- `none` - no comment
- `what_paraphrase` - restates what the code does
- `human_why_inline` - precise intent at point of use
- `human_file_header` - module-level rule listing
- `aidev_anchor` - tagged as `// AIDEV-NOTE:`
- `ai_generated_comment` - generic plausible AI-style docstring
- `stale_misleading` - comment contradicts the code or rule
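To make the contrast concrete, here is a rough sketch of how two payloads might read around the same line of a refund function; the wording and surrounding code are illustrative, not copied from the scenario files:

```ts
type Order = { productType: "gift_card" | "physical" | "digital" };

// human_why_inline (illustrative): states the intent at the point of use -
//   "Gift cards are capped at 24h; the processor voids the card after that.
//    Do not widen this when adjusting tier windows."
// stale_misleading (illustrative): asserts the opposite of the rule -
//   "Gift cards follow the same refund window as every other product."
export function refundWindowHours(order: Order, tierWindowHours: number): number {
  if (order.productType === "gift_card") {
    return Math.min(tierWindowHours, 24); // the 24-hour gift-card cap
  }
  return tierWindowHours;
}
```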
All four scenarios use the gift-card refund domain. They differ in where the 24-hour gift-card cap is enforced.
| id | where the cap lives |
|---|---|
| `refund_window_branch` | inside an `if (order.productType === "gift_card")` branch |
| `refund_window_accumulator` | applied via `Math.min(window, processorCap)` after tier selection |
| `refund_window_helper` | inside a sibling helper (`capRefundWindow`) imported from `./processor_rules` |
| `refund_window_comment_only` | nowhere in the code; lives only in the comment payload |
The first three test "comments did nothing": the cap is in the code, and the agent preserves it regardless of the comment. The fourth tests "comments decided everything": the cap exists only in the comment, so the comment payload determines whether the agent honors it.
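For example, in `refund_window_comment_only` the editable source might look roughly like this: nothing in the code enforces the cap, so only the injected comment payload carries the rule (the snippet is an illustration, not the actual scenario file):

```ts
type GiftOrder = { productType: string; tier: "standard" | "premium" };

// [comment payload injected here by the treatment - e.g. the human_why_inline
//  variant states that gift-card refunds must never exceed 24 hours]
export function refundWindow(order: GiftOrder): number {
  // Note: nothing below enforces the gift-card cap; only the comment states the rule.
  return order.tier === "premium" ? 72 : 48;
}
```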
To add a scenario:

- Copy any directory under `scenarios/`.
- Edit `task.md`, the seven treatments, and the oracle tests (a sketch of an oracle test follows this list).
- Update `meta.ts`: list every agent-visible file under `agent_visible`, every editable file under `editable`.
- The trap should fire on a tempting wrong fix and pass on the right one. Validate the canonical (`none`) treatment plus a manually-applied wrong fix against the oracle before launching trial sweeps.
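Oracle tests are ordinary Bun tests. A minimal sketch of an invariant test for the gift-card cap, assuming a `refundWindowHours` export like the earlier illustration (match the import path and signature to your scenario's real source):

```ts
// oracle/invariant.test.ts - illustrative sketch, not the shipped oracle.
import { describe, expect, test } from "bun:test";
import { refundWindowHours } from "../src/refunds"; // hypothetical path and export

describe("protected invariant", () => {
  test("gift-card refunds stay capped at 24 hours", () => {
    // Even with a generous 72-hour tier window, gift cards must stay at <= 24h.
    expect(refundWindowHours({ productType: "gift_card" }, 72)).toBeLessThanOrEqual(24);
  });
});
```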
- Single-shot edits in small workspaces. Multi-turn sessions and large repos are out of scope.
- Tests are TypeScript via Bun. The benchmark idea is language-agnostic; port the scenarios to run against Python or Go.
MIT. See LICENSE.