
Replace CLI orchestration loops with agent workflow prompts #118

Open
alanzabihi wants to merge 202 commits into main from hybrid-agent-cli

Conversation

Contributor

@alanzabihi alanzabihi commented Apr 22, 2026

Summary

  • Replace the multi-step state machines in contribute.rs (443 lines) and lead.rs (367 lines) with agent workflow prompts that call CLI subcommands as shell tools
  • Rewrite experiment.md from 9 lines to 43 lines with iteration support, harness-only metric rules, and explicit leave-changes-in-place instructions
  • Add contribute-workflow.md (75 lines) and lead-workflow.md (63 lines) as the agent outer loops
  • Add spawn_workflow_agent to agent.rs for spawning workflow agents
  • Net: 272 insertions, 627 deletions (-355 lines)

Architecture change

The CLI no longer contains orchestration loops. polyresearch contribute and polyresearch lead now:

  1. Run setup (clone, init, preflight)
  2. Spawn a workflow agent with the appropriate prompt
  3. The agent calls CLI subcommands (polyresearch claim, polyresearch submit, etc.) as tools

All 15+ CLI subcommands stay exactly as-is. The agent composes them into workflows adaptively.

Lead/contribute independence

Lead and contribute are now fully independent agent sessions:

  • Lead stays on the repo root (default branch). Does sync, decide, policy-check, generate.
  • Contributor creates worktrees for each thesis. Does claim, experiment, attempt, submit, release.
  • Each runs git operations sequentially. No concurrent async loops sharing git state.
  • Strict role separation enforced in prompts: contributor never runs lead commands, lead never runs contributor commands.

This eliminates the race condition class (bugs 11, 20, 28, 29, 35, 37, 38) by design.

Motivation

See agent-vs-cli-balance.md for the full analysis. In short: agents fail at precision, CLIs fail at adaptation. The CLI keeps the precision-critical protocol primitives. The agent handles the adaptive multi-step workflows where the combinatorial state space caused 42 bugs in the rigid Rust loops.

Test plan

  • 175 unit tests pass (cargo test --lib)
  • Clean build, zero warnings
  • Manual test: polyresearch contribute on a real project
  • Manual test: polyresearch lead on a real project
  • Manual test: both running simultaneously from the same directory

Note

Medium Risk
Large behavioral refactor of the CLI’s core lead/contribute execution path plus new git/GitHub automation (sync retries, worktree management, staging/commit). Failures could block or mis-sequence protocol actions, so end-to-end manual runs on real repos are important.

Overview
Shifts polyresearch lead and polyresearch contribute from Rust orchestration loops to prompt-driven workflow agents. Adds new contribute-workflow.md and lead-workflow.md prompts (plus a rewritten experiment.md) and introduces agent::spawn_workflow_agent plus prompt helpers to dynamically inject --once, sleep, max-parallel, and capacity guidance.

Introduces new protocol primitives to support the agent-driven flow. Adds polyresearch resume to recreate/verify a thesis worktree, sync .polyresearch-node.toml, and rewrite .polyresearch/thesis.md; adds polyresearch commit to stage only editable-surface changes and block protected paths before committing; claim/batch-claim now gate on a new duties::claim_gate and seed worktrees with thesis context.

Hardens repo hygiene and coordination. Bootstrap now tracks untracked files created by the setup agent, normalizes PROGRAM/PREPARE line endings, ensures .gitignore ignores .polyresearch-node.toml, and can force-add agent-created helper files. Sync is rewritten to pull/rebase safely and retry pushes on non-fast-forward races, with logic to discard “sync-only” local commits when needed. Thesis generation rejects duplicate titles via normalization, prune can remove worktrees for resolved/rejected theses, submit refuses PRs with no diff vs default branch, and GitHub handling adds enable_issues plus improved rate-limit retry classification/backoff.
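The duplicate-title check mentioned above could look roughly like the following. This is a minimal sketch, not the code in `polyresearch generate`: the normalization rule (lowercase plus collapsed whitespace) and the helper names `normalize_title` and `first_duplicate` are assumptions for illustration.

```rust
use std::collections::HashSet;

/// Assumed normalization: lowercase every word and collapse runs of whitespace.
fn normalize_title(title: &str) -> String {
    title
        .split_whitespace()
        .map(|word| word.to_lowercase())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Returns the first proposed title that collides with an existing title
/// (or with an earlier proposed one) after normalization.
fn first_duplicate<'a>(existing: &[&str], proposed: &[&'a str]) -> Option<&'a str> {
    let mut seen: HashSet<String> = existing.iter().map(|t| normalize_title(t)).collect();
    // `insert` returns false when the normalized title was already present.
    proposed
        .iter()
        .copied()
        .find(|t| !seen.insert(normalize_title(t)))
}
```

Comparing normalized strings rather than raw titles means that differences in case or spacing alone do not let a duplicate through.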

Adds a --once cycle guard. New cycle_guard enforces “exactly one thesis cycle” runs by preventing additional claims after a release/submit marks the guard done.
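As a rough illustration of the guard's behavior, here is an in-memory sketch. The real `cycle_guard` presumably persists its state between CLI invocations; the `CycleGuard` struct and its method names are hypothetical.

```rust
/// Hypothetical in-memory stand-in for the once-cycle guard.
#[derive(Default)]
struct CycleGuard {
    done: bool,
}

impl CycleGuard {
    /// Called after a release or submit completes the thesis cycle.
    fn mark_done(&mut self) {
        self.done = true;
    }

    /// In `--once` mode, no further claims are allowed once the cycle is done.
    /// Continuous mode is never blocked by the guard.
    fn allow_claim(&self, once: bool) -> bool {
        !(once && self.done)
    }
}
```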

Reviewed by Cursor Bugbot for commit 305dc9e.

Folds all coordination logic into three high-level commands per the v2 spec:
- `bootstrap <url>`: clone/fork, write templates, init node, spawn setup agent
- `lead`: sync ledger, policy-check PRs, decide PRs, generate theses
- `contribute [url]`: auto-submit, hardware-aware parallelism, claim/resume, dispatch workers

New modules: agent.rs (agent runner + recovery), worker.rs (ThesisWorker lifecycle with
setup/run/record/cleanup phases). Updated NodeConfig with [agent] section, ProtocolConfig
with default_branch, main.rs with deferred setup for bootstrap/contribute.

159 tests passing (98 unit + 61 e2e).
The v2 CLI encodes the full coordination protocol as deterministic
behavior in bootstrap, lead, and contribute. Agents no longer need
the protocol spec or skill file.
…sha None handling

Root cause 1: Recovery functions (recover_from_logs, run_harness_directly)
returned ExperimentResult with fabricated observations. Now they return
RecoveredMetric (raw data only) and the worker classifies using
MetricDirection from WorkerContext. Log recovery without a baseline is
conservatively classified as no_improvement.

Root cause 2: contribute passed the deferred-setup placeholder AppContext
to duties::check, which read default config values. Now contribute builds
a local_ctx with the real ProtocolConfig and ProgramSpec after loading them.

Standalone: env_sha comparison in both decide.rs and lead.rs used
filter_map to skip None values, treating None and Some("x") as equal.
Fixed to compare Option<String> directly so mixed environments trigger
Disagreement.
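The difference between the two comparisons can be shown with a small sketch. The function names and the slice-of-options input are assumptions for illustration, not the actual code in decide.rs or lead.rs.

```rust
/// Old shape: `filter_map` drops `None`, so `[None, Some("x")]` looks uniform.
fn envs_agree_skipping_none(shas: &[Option<String>]) -> bool {
    let mut present = shas.iter().filter_map(|s| s.as_deref());
    match present.next() {
        Some(first) => present.all(|s| s == first),
        None => true,
    }
}

/// Fixed shape: compare the `Option<String>` values directly, so a missing
/// env_sha never matches a present one.
fn envs_agree(shas: &[Option<String>]) -> bool {
    match shas.split_first() {
        Some((first, rest)) => rest.iter().all(|s| s == first),
        None => true,
    }
}
```

With a mixed input such as `[None, Some("x")]`, the first function reports agreement and the second does not, which is the case that should produce Disagreement.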

166 tests passing (103 unit + 63 e2e).
…tional log recovery

contribute <url>: after cloning, re-discover RepoRef and rebuild
GitHubClient so API calls use the correct owner/name. Only done when
a URL is provided; without a URL the existing ctx is already correct.

commit_editable_surface: reset user-configured protected_globs from
PROGRAM.md in addition to the four hardcoded runtime paths. Previously
only .polyresearch/, .polyresearch-node.toml, PROGRAM.md, and
PREPARE.md were reset, silently allowing commits to user-declared
protected paths.

recover_from_logs: sort log files by name and take the metric from the
last file instead of always picking the max. This avoids encoding a
directional assumption (max is wrong for lower_is_better projects).
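A minimal sketch of the "last file by name" rule, assuming the logs have already been parsed into (file name, metric) pairs. The helper name `metric_from_last_log` and that input shape are hypothetical.

```rust
/// Returns the metric from the log file whose name sorts last,
/// without assuming whether higher or lower values are better.
fn metric_from_last_log(mut logs: Vec<(String, f64)>) -> Option<f64> {
    // Sort by file name only; the metric plays no part in the ordering.
    logs.sort_by(|a, b| a.0.cmp(&b.0));
    logs.last().map(|(_, metric)| *metric)
}
```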

169 tests passing (104 unit + 65 e2e).
…neys

New test infrastructure:
- ScenarioGitHub: stateful mock that mutates in response to API calls so
  multi-step flows see the effects of prior steps
- mock_agent.sh: deterministic agent controlled by MOCK_AGENT_RESULT env
  var (improved, no_improvement, crashed, fail)

7 scenario tests covering complete user journeys:
- scenario_bootstrap_fresh: templates + node config created with goal text
- scenario_bootstrap_idempotent: existing PROGRAM.md preserved, missing
  sections appended
- scenario_contribute_improved: claim + worker dispatch with mock agent
- scenario_contribute_no_improvement: full flow with no_improvement result
- scenario_contribute_agent_failure: agent exit 1 handled gracefully
- scenario_lead_accept_pr: sync + decide accepted + merge + close thesis
- scenario_lead_reject_non_improvement: decide non_improvement + close PR,
  thesis stays open

Also fixes bootstrap clone_if_needed to not hard-fail on git fetch when
no remote exists (best-effort sync).

176 tests passing (104 unit + 65 e2e + 7 scenarios).
…arallelism contract

Root cause: lead.rs reimplemented compute_decision and post-decision
actions from decide.rs. The two had already diverged (NonImprovement
handling, helper function usage). Extracted execute_decision as a shared
pub function in decide.rs, made decide_without_peer_review and
decide_with_peer_review pub(crate). lead.rs now calls the shared
functions instead of maintaining its own copy. decide.rs also uses
execute_decision for its own run().

fork_and_clone: detect whether fork_owner matches the current gh user.
Only pass --org when targeting an org account; personal forks use
plain gh repo fork without --org.

calculate_parallelism: removed .max(1) so the function returns 0 when
available_work is 0, matching the documented contract. The caller
already guards the zero case.
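The contract can be sketched as follows. The two-argument signature is an assumption; the real calculate_parallelism may take more inputs (for example hardware information).

```rust
/// Hypothetical signature: cap parallelism by node capacity and by available work.
fn calculate_parallelism(capacity: usize, available_work: usize) -> usize {
    // No trailing `.max(1)`: when there is no work, the answer is 0
    // and the caller decides what to do with it.
    capacity.min(available_work)
}
```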

181 tests passing (105 unit + 65 e2e + 11 scenarios).
…uard, and deduplicate node init

ExperimentResult.baseline is now Option<f64> so the log-recovery path no
longer fabricates a 0.0 baseline that could mislead the decision system.
find_harness returns a relative path resolved per work_dir, so the baseline
measurement uses its own copy of the harness script instead of the candidate's.
Lead loop uses a shared is_pr_decidable helper from decide.rs that includes
the maintainer_rejected check. Node initialization extracted to a single
ensure_node_config in commands/mod.rs.
…e variant

Auto-submit in contribute.rs now logs a warning and leaves submitted_any
unset when push or PR creation fails, instead of silently swallowing errors.
commit_editable_surface uses git diff --cached --quiet to detect staged
changes only, avoiding false positives from unstaged protected-file
modifications. execute() returns WorkerOutcome::Failed for crashed and
infra_failure results instead of misclassifying them as NoImprovement.
…f, and validate auto-submit branch

decide_ready_prs loads the ledger once before the loop and skips the
iteration if it is stale in zero-conf mode, matching the guard used by
the decide CLI command. policy_check_open_prs skips PRs that have no
thesis_number instead of silently falling through both action branches.
Auto-submit validates the worktree branch starts with the expected
thesis/{issue}- prefix before pushing.
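The prefix check might look like this. The helper name `branch_matches_thesis` and the `u64` issue type are assumptions for illustration.

```rust
/// True when `branch` starts with `thesis/{issue}-` for the given issue number.
fn branch_matches_thesis(branch: &str, issue: u64) -> bool {
    // The trailing dash keeps issue 42 from matching a branch for issue 421.
    let prefix = format!("thesis/{issue}-");
    branch.starts_with(prefix.as_str())
}
```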
…ueue depth

Policy check and sync may close PRs or push commits, so re-derive
repository state before decide_ready_prs to avoid acting on stale
snapshots. Use saturating_sub for min_queue_depth arithmetic to
eliminate any possibility of usize underflow.
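For illustration, the arithmetic in question behaves as below. The helper name `queue_shortfall` and its parameters are hypothetical.

```rust
/// How many theses are still missing to reach the minimum queue depth.
fn queue_shortfall(min_queue_depth: usize, open_theses: usize) -> usize {
    // A plain `min_queue_depth - open_theses` would panic in debug builds
    // (and wrap in release builds) when the queue is already over-full.
    min_queue_depth.saturating_sub(open_theses)
}
```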
CLI v0.5.0: bootstrap, lead, and contribute orchestration commands
Bump cli_version in PROGRAM.md to 0.5.0
Root README now walks through the actual usage flow: bootstrap a
project (which creates the coordination files), run the lead, run
contributors. Removes references to the deleted POLYRESEARCH.md and
skill file. CLI README adds a quick-start section with the three
high-level commands and organizes the command summary by role.
Update READMEs for v0.5.0 bootstrap/lead/contribute workflow
Tighten README for first-time visitors
…m the repo URL

Both commands were cloning directly into cwd when run outside a git repo,
which fails if the directory already exists. Now they behave like git clone
and create a child directory named after the repository.
Fix bootstrap and contribute to clone into repo-named subdirectory
- Add --capacity, --api-budget, --request-delay, --agent-command flags
  to contribute, lead, init, and bootstrap via shared NodeOverrides struct
- For contribute/lead these are pure runtime overrides (no file writes)
- For init/bootstrap they write initial values to .polyresearch-node.toml
- Bootstrap now auto-forks when the user lacks push access to the target
  repo; add --no-fork to skip the check and clone directly
- Add RepoRef::parse_url for URL-based owner/repo extraction
- Update both READMEs with new flag documentation

Closes #65
- Deduplicate URL stripping between parse_remote and parse_url into a
  shared strip_github_url function
- Add conflicts_with = "fork" to --no-fork so clap rejects contradictory
  flags at parse time
Move trim_end_matches('/') to the end of the strip_github_url chain so
it runs after prefix stripping, not before. The previous ordering ate the
slash that is part of the https://github.com/ prefix pattern, causing the
prefix strip to fail on inputs like https://github.com/owner/repo/.
Replace strip_github_url (trim_start_matches, returns &str) with
strip_github_prefix (strip_prefix, returns Option). The old helper
silently passed through non-GitHub URLs; the new one returns None when
no known prefix matches. parse_url now also rejects URLs with extra
path segments (e.g. /tree/main) and owner-only URLs.
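The resulting behavior can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the prefix list, the `.git` suffix handling, and the `(String, String)` return type are guesses, and the real parse_url presumably returns a RepoRef.

```rust
/// Known URL prefixes; this list is an assumption for the sketch.
const GITHUB_PREFIXES: [&str; 3] = [
    "https://github.com/",
    "git@github.com:",
    "ssh://git@github.com/",
];

/// Returns the part after a known GitHub prefix, or `None` if none matches.
fn strip_github_prefix(url: &str) -> Option<&str> {
    GITHUB_PREFIXES.iter().find_map(|p| url.strip_prefix(*p))
}

/// Extracts (owner, repo), rejecting owner-only URLs and extra path segments.
fn parse_url(url: &str) -> Option<(String, String)> {
    let rest = strip_github_prefix(url.trim())?;
    // Trailing-slash cleanup runs after prefix stripping, not before.
    let rest = rest.trim_end_matches('/');
    let rest = rest.strip_suffix(".git").unwrap_or(rest);
    let mut parts = rest.split('/');
    let owner = parts.next().filter(|s| !s.is_empty())?;
    let repo = parts.next().filter(|s| !s.is_empty())?;
    if parts.next().is_some() {
        return None; // e.g. owner/repo/tree/main
    }
    Some((owner.to_string(), repo.to_string()))
}
```

Returning `Option` from the prefix helper is what lets non-GitHub URLs fail loudly instead of passing through unchanged.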
Add CLI flags for node config overrides and auto-fork in bootstrap
Simplify README usage examples to one command per section
Move all agent prompts and document templates into separate .md files
under cli/prompts/, embedded at compile time via include_str!. Enrich
prompt content with autoresearch quality principles: simplicity
criterion, history awareness, trust boundary framing, crash judgment.
…through

Fix lead policy-pass without decide follow-through
…solved

Add regression coverage for resolved thesis claims
Fix secondary rate limit misclassification and shorten retry backoff
…ets a focused retry and a real failure signal.
Fix lead queue refill when generation is skipped
Comment thread cli/src/commands/lead.rs
eprintln!("Warning: could not verify queue depth after agent run: {err}");
}
}
break;

Lead and contribute loops exit after agent success

High Severity

In continuous mode (once=false), both the lead and contribute outer loops break when the workflow agent exits successfully. The agent prompt says "LOOP FOREVER," but agents will inevitably exit with code 0 due to context window limits. When that happens, the process silently terminates instead of restarting the agent for the next iteration. In contribute.rs, Ok(()) => break exits immediately; in lead.rs, the break at the end of the Ok arm has the same effect once post-checks pass. The Err path correctly restarts, but the Ok path does not, making continuous operation impossible once the agent's context window is exhausted.
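One possible shape for a fix is sketched below. It is not the code in contribute.rs or lead.rs: `run_agent` stands in for spawning the workflow agent, and the `max_starts` bound exists only to keep the sketch finite and testable.

```rust
/// Hypothetical outer loop. Returns how many times the agent was started.
fn run_outer_loop<F>(once: bool, max_starts: usize, mut run_agent: F) -> usize
where
    F: FnMut() -> Result<(), String>,
{
    let mut starts = 0;
    while starts < max_starts {
        starts += 1;
        match run_agent() {
            // A clean exit ends the loop only in --once mode.
            Ok(()) if once => break,
            // In continuous mode a clean exit (for example after the agent
            // runs out of context) simply starts a fresh agent session.
            Ok(()) => continue,
            // The error path restarts, as the review says it already does.
            Err(err) => eprintln!("Warning: agent failed, restarting: {err}"),
        }
    }
    starts
}
```

The key point is that `Ok(())` and `Err(_)` both lead to a restart when `once` is false, so continuous operation no longer depends on how the agent happened to exit.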

Reviewed by Cursor Bugbot for commit 75e7ab7.

Comment thread cli/src/commands/commit.rs
Reject duplicate thesis titles in polyresearch generate
…node

Fix duplicate claim with inconsistent node name (#156)
…agents cannot claim a second thesis, and cover the guard with unit, e2e, and scenario tests.
…the once-cycle guard branch rebases cleanly.
Fix contribute --once after the first thesis cycle
".polyresearch-node.toml",
"PROGRAM.md",
"PREPARE.md",
];

Commit command missing results.tsv in protected files

Medium Severity

The ALWAYS_PROTECTED list in the new commit.rs omits results.tsv, a critical protocol file that tracks the experiment ledger. If an agent accidentally modifies results.tsv in a worktree, polyresearch commit would include the change. Whether this is caught depends entirely on the PROGRAM.md "cannot_modify" globs being set up correctly — a fragile assumption given that bootstrap is agent-driven.
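A sketch of what the suggested change could look like. Only the last three original entries are visible in the fragment above; `.polyresearch/` is taken from the earlier commit message listing the four hardcoded runtime paths, and the matching helper `is_always_protected` is hypothetical.

```rust
/// Hardcoded protected paths, with `results.tsv` added as the review suggests.
const ALWAYS_PROTECTED: [&str; 5] = [
    ".polyresearch/",
    ".polyresearch-node.toml",
    "PROGRAM.md",
    "PREPARE.md",
    "results.tsv",
];

/// Entries ending in `/` protect a whole directory; others match exactly.
fn is_always_protected(path: &str) -> bool {
    ALWAYS_PROTECTED.iter().any(|entry| {
        if entry.ends_with('/') {
            path.starts_with(*entry)
        } else {
            path == *entry
        }
    })
}
```

Hardcoding the entry makes the protection independent of whether the agent-written PROGRAM.md declares the file under its protected globs.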

Reviewed by Cursor Bugbot for commit 6fd5553.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 7 total unresolved issues (including 6 from previous reviews).

Reviewed by Cursor Bugbot for commit 305dc9e.

Comment thread cli/src/commands/claim.rs
&thesis.issue.title,
thesis.issue.body.as_deref().unwrap_or(""),
&prior_attempts,
)?;

Claim missing node config sync to worktree

High Severity

claim creates the thesis worktree and writes thesis context, but unlike resume (which calls sync_node_config_to_worktree), it never copies .polyresearch-node.toml into the new worktree. When the workflow agent later CDs into the worktree and runs polyresearch commit, polyresearch attempt, or polyresearch submit, those commands call read_node_id(&ctx.repo_root) which looks for the config file in the worktree directory — and fails because it doesn't exist.

Reviewed by Cursor Bugbot for commit 305dc9e.
