Replace CLI orchestration loops with agent workflow prompts #118
alanzabihi wants to merge 202 commits into main
Folds all coordination logic into three high-level commands per the v2 spec:
- `bootstrap <url>`: clone/fork, write templates, init node, spawn setup agent
- `lead`: sync ledger, policy-check PRs, decide PRs, generate theses
- `contribute [url]`: auto-submit, hardware-aware parallelism, claim/resume, dispatch workers
New modules: agent.rs (agent runner + recovery), worker.rs (ThesisWorker lifecycle with setup/run/record/cleanup phases). Updated NodeConfig with [agent] section, ProtocolConfig with default_branch, main.rs with deferred setup for bootstrap/contribute.
159 tests passing (98 unit + 61 e2e).
The v2 CLI encodes the full coordination protocol as deterministic behavior in bootstrap, lead, and contribute. Agents no longer need the protocol spec or skill file.
…sha None handling
Root cause 1: Recovery functions (recover_from_logs, run_harness_directly)
returned ExperimentResult with fabricated observations. Now they return
RecoveredMetric (raw data only) and the worker classifies using
MetricDirection from WorkerContext. Log recovery without a baseline is
conservatively classified as no_improvement.
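A minimal sketch of the classification described in root cause 1, assuming a two-variant `MetricDirection`, a `RecoveredMetric` with an optional baseline, and a hypothetical `classify` helper. Only the type names `RecoveredMetric` and `MetricDirection` come from the commit message; the fields, variants, and function are illustrative.

```rust
/// Which direction counts as an improvement (variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MetricDirection {
    HigherIsBetter,
    LowerIsBetter,
}

/// Raw recovered data only; recovery attaches no observation.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RecoveredMetric {
    pub value: f64,
    /// `None` when the logs carried no baseline measurement.
    pub baseline: Option<f64>,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Observation {
    Improved,
    NoImprovement,
}

/// Hypothetical worker-side classifier: the direction is supplied by the
/// caller (for example from a worker context), never guessed by recovery.
pub fn classify(metric: RecoveredMetric, direction: MetricDirection) -> Observation {
    // Without a baseline there is nothing to compare against, so stay conservative.
    let Some(baseline) = metric.baseline else {
        return Observation::NoImprovement;
    };
    let improved = match direction {
        MetricDirection::HigherIsBetter => metric.value > baseline,
        MetricDirection::LowerIsBetter => metric.value < baseline,
    };
    if improved {
        Observation::Improved
    } else {
        Observation::NoImprovement
    }
}
```

Keeping recovery free of observations means a single place decides what "better" means, which is the separation the commit message describes.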
Root cause 2: contribute passed the deferred-setup placeholder AppContext
to duties::check, which read default config values. Now contribute builds
a local_ctx with the real ProtocolConfig and ProgramSpec after loading them.
Standalone: env_sha comparison in both decide.rs and lead.rs used
filter_map to skip None values, treating None and Some("x") as equal.
Fixed to compare Option<String> directly so mixed environments trigger
Disagreement.
166 tests passing (103 unit + 63 e2e).
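The `env_sha` fix can be illustrated with a small sketch. Both function names below are hypothetical and the real code in decide.rs and lead.rs may be shaped differently; the point is only how `filter_map` hides a `None` next to a `Some`.

```rust
/// Buggy shape: `filter_map` drops `None`, so `[None, Some("x")]`
/// collapses to a single value and looks unanimous.
pub fn env_agrees_skipping_none(shas: &[Option<String>]) -> bool {
    let known: Vec<&String> = shas.iter().filter_map(|s| s.as_ref()).collect();
    known.windows(2).all(|pair| pair[0] == pair[1])
}

/// Fixed shape: compare the `Option<String>` values directly, so a
/// missing sha next to a present one counts as a mismatch.
pub fn env_agrees(shas: &[Option<String>]) -> bool {
    shas.windows(2).all(|pair| pair[0] == pair[1])
}
```

With the direct comparison, a mixed set of environments no longer passes as agreement, so the caller can report a disagreement.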
…tional log recovery
contribute <url>: after cloning, re-discover RepoRef and rebuild GitHubClient so API calls use the correct owner/name. Only done when a URL is provided; without a URL the existing ctx is already correct.
commit_editable_surface: reset user-configured protected_globs from PROGRAM.md in addition to the four hardcoded runtime paths. Previously only .polyresearch/, .polyresearch-node.toml, PROGRAM.md, and PREPARE.md were reset, silently allowing commits to user-declared protected paths.
recover_from_logs: sort log files by name and take the metric from the last file instead of always picking the max. This avoids encoding a directional assumption (max is wrong for lower_is_better projects).
169 tests passing (104 unit + 65 e2e).
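The `recover_from_logs` change can be sketched as follows, assuming each log has already been parsed into a `(file_name, metric)` pair. The helper name and input shape are hypothetical; the actual function reads files from disk.

```rust
/// Pick the metric from the log whose file name sorts last, rather than
/// the numerically largest metric. Returns `None` when there are no logs.
pub fn metric_from_latest_log(mut logs: Vec<(String, f64)>) -> Option<f64> {
    // Sorting by name only; the metric values play no part in the choice.
    logs.sort_by(|a, b| a.0.cmp(&b.0));
    logs.last().map(|(_, metric)| *metric)
}
```

For a lower_is_better project with logs `run-001.log` (0.42) and `run-002.log` (0.31), this returns 0.31, whereas taking the max would report the older and worse 0.42.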
…neys
New test infrastructure:
- ScenarioGitHub: stateful mock that mutates in response to API calls so multi-step flows see the effects of prior steps
- mock_agent.sh: deterministic agent controlled by MOCK_AGENT_RESULT env var (improved, no_improvement, crashed, fail)
7 scenario tests covering complete user journeys:
- scenario_bootstrap_fresh: templates + node config created with goal text
- scenario_bootstrap_idempotent: existing PROGRAM.md preserved, missing sections appended
- scenario_contribute_improved: claim + worker dispatch with mock agent
- scenario_contribute_no_improvement: full flow with no_improvement result
- scenario_contribute_agent_failure: agent exit 1 handled gracefully
- scenario_lead_accept_pr: sync + decide accepted + merge + close thesis
- scenario_lead_reject_non_improvement: decide non_improvement + close PR, thesis stays open
Also fixes bootstrap clone_if_needed to not hard-fail on git fetch when no remote exists (best-effort sync).
176 tests passing (104 unit + 65 e2e + 7 scenarios).
…arallelism contract
Root cause: lead.rs reimplemented compute_decision and post-decision actions from decide.rs. The two had already diverged (NonImprovement handling, helper function usage). Extracted execute_decision as a shared pub function in decide.rs, made decide_without_peer_review and decide_with_peer_review pub(crate). lead.rs now calls the shared functions instead of maintaining its own copy. decide.rs also uses execute_decision for its own run().
fork_and_clone: detect whether fork_owner matches the current gh user. Only pass --org when targeting an org account; personal forks use plain gh repo fork without --org.
calculate_parallelism: removed .max(1) so the function returns 0 when available_work is 0, matching the documented contract. The caller already guards the zero case.
181 tests passing (105 unit + 65 e2e + 11 scenarios).
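A sketch of the `calculate_parallelism` contract, assuming the function takes a node capacity and a count of available work items. The real signature is not shown in this PR, so both parameters and the `_clamped` variant are illustrative.

```rust
/// Workers to dispatch: bounded by capacity and by the work available.
/// Returns 0 when there is no work.
pub fn calculate_parallelism(capacity: usize, available_work: usize) -> usize {
    capacity.min(available_work)
}

/// Earlier shape with the clamp: reports one worker even when the
/// queue is empty, which contradicts the documented contract.
pub fn calculate_parallelism_clamped(capacity: usize, available_work: usize) -> usize {
    capacity.min(available_work).max(1)
}
```

Returning 0 lets the caller's existing zero guard decide what to do, instead of dispatching a worker with nothing to claim.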
…uard, and deduplicate node init
ExperimentResult.baseline is now Option<f64> so the log-recovery path no longer fabricates a 0.0 baseline that could mislead the decision system.
find_harness returns a relative path resolved per work_dir, so the baseline measurement uses its own copy of the harness script instead of the candidate's.
Lead loop uses a shared is_pr_decidable helper from decide.rs that includes the maintainer_rejected check.
Node initialization extracted to a single ensure_node_config in commands/mod.rs.
…e variant
Auto-submit in contribute.rs now logs warnings and skips submitted_any when push or PR creation fails instead of silently swallowing errors.
commit_editable_surface uses git diff --cached --quiet to detect staged changes only, avoiding false positives from unstaged protected-file modifications.
execute() returns WorkerOutcome::Failed for crashed and infra_failure results instead of misclassifying them as NoImprovement.
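The `WorkerOutcome` change might look like the mapping below. This is a sketch only: it assumes results arrive as the strings named in these commit messages, and `outcome_for` plus the fallback arm are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WorkerOutcome {
    Improved,
    NoImprovement,
    Failed,
}

/// Map a recorded result string to a worker outcome.
pub fn outcome_for(result: &str) -> WorkerOutcome {
    match result {
        "improved" => WorkerOutcome::Improved,
        "no_improvement" => WorkerOutcome::NoImprovement,
        // A crash or infrastructure failure says nothing about the thesis,
        // so it must not be recorded as a measured non-improvement.
        "crashed" | "infra_failure" => WorkerOutcome::Failed,
        // Assumption for this sketch: unknown results are treated as failures.
        _ => WorkerOutcome::Failed,
    }
}
```

Separating `Failed` from `NoImprovement` keeps infrastructure noise out of the evidence the decision step relies on.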
…f, and validate auto-submit branch
decide_ready_prs loads the ledger once before the loop and skips the
iteration if it is stale in zero-conf mode, matching the guard used by
the decide CLI command. policy_check_open_prs skips PRs that have no
thesis_number instead of silently falling through both action branches.
Auto-submit validates the worktree branch starts with the expected
thesis/{issue}- prefix before pushing.
…ueue depth
Policy check and sync may close PRs or push commits, so re-derive repository state before decide_ready_prs to avoid acting on stale snapshots.
Use saturating_sub for min_queue_depth arithmetic to eliminate any possibility of usize underflow.
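A small sketch of the `saturating_sub` point, assuming the lead computes how many theses to generate as the gap between a minimum queue depth and the number already open. The function and parameter names are hypothetical.

```rust
/// Theses still needed to reach the minimum queue depth.
/// `saturating_sub` clamps at 0 when the queue is already deeper than the
/// minimum, where a plain `-` on `usize` would panic in debug builds and
/// wrap around in release builds.
pub fn theses_to_generate(min_queue_depth: usize, open_theses: usize) -> usize {
    min_queue_depth.saturating_sub(open_theses)
}
```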
CLI v0.5.0: bootstrap, lead, and contribute orchestration commands
Bump cli_version in PROGRAM.md to 0.5.0
Root README now walks through the actual usage flow: bootstrap a project (which creates the coordination files), run the lead, run contributors. Removes references to the deleted POLYRESEARCH.md and skill file. CLI README adds a quick-start section with the three high-level commands and organizes the command summary by role.
Update READMEs for v0.5.0 bootstrap/lead/contribute workflow
Update READMEs for v0.5.0
Tighten README for first-time visitors
…m the repo URL
Both commands were cloning directly into cwd when run outside a git repo, which fails if the directory already exists. Now they behave like git clone and create a child directory named after the repository.
Fix bootstrap and contribute to clone into repo-named subdirectory
- Add --capacity, --api-budget, --request-delay, --agent-command flags to contribute, lead, init, and bootstrap via shared NodeOverrides struct
- For contribute/lead these are pure runtime overrides (no file writes)
- For init/bootstrap they write initial values to .polyresearch-node.toml
- Bootstrap now auto-forks when the user lacks push access to the target repo; add --no-fork to skip the check and clone directly
- Add RepoRef::parse_url for URL-based owner/repo extraction
- Update both READMEs with new flag documentation
Closes #65
- Deduplicate URL stripping between parse_remote and parse_url into a shared strip_github_url function
- Add conflicts_with = "fork" to --no-fork so clap rejects contradictory flags at parse time
Move trim_end_matches('/') to the end of the strip_github_url chain so
it runs after prefix stripping, not before. The previous ordering ate the
slash that is part of the https://github.com/ prefix pattern, causing the
prefix strip to fail on inputs like https://github.com/owner/repo/.
Replace strip_github_url (trim_start_matches, returns &str) with strip_github_prefix (strip_prefix, returns Option). The old helper silently passed through non-GitHub URLs; the new one returns None when no known prefix matches. parse_url now also rejects URLs with extra path segments (e.g. /tree/main) and owner-only URLs.
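The two URL-parsing fixes above can be sketched together. The names `strip_github_prefix`, `parse_url`, and `RepoRef` come from the commit messages, but the prefix list, the `.git` suffix handling, and the free-function shape (the PR describes `RepoRef::parse_url`) are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq)]
pub struct RepoRef {
    pub owner: String,
    pub name: String,
}

/// Assumed set of recognised prefixes; the real list may differ.
const GITHUB_PREFIXES: [&str; 3] = [
    "https://github.com/",
    "http://github.com/",
    "git@github.com:",
];

/// Returns the part after a known GitHub prefix, or `None` when no
/// prefix matches (non-GitHub URLs are rejected, not passed through).
pub fn strip_github_prefix(url: &str) -> Option<&str> {
    GITHUB_PREFIXES
        .iter()
        .find_map(|prefix| url.strip_prefix(*prefix))
}

pub fn parse_url(url: &str) -> Option<RepoRef> {
    let rest = strip_github_prefix(url.trim())?;
    // Trim the trailing slash only after the prefix is gone, so the slash
    // that belongs to "https://github.com/" is never eaten.
    let rest = rest.trim_end_matches('/');
    // Assumption: tolerate a ".git" suffix on clone URLs.
    let rest = rest.strip_suffix(".git").unwrap_or(rest);

    let mut parts = rest.split('/');
    let owner = parts.next().filter(|s| !s.is_empty())?;
    let name = parts.next().filter(|s| !s.is_empty())?;
    // Extra path segments such as /tree/main are rejected.
    if parts.next().is_some() {
        return None;
    }
    Some(RepoRef {
        owner: owner.to_string(),
        name: name.to_string(),
    })
}
```

Under these assumptions, `https://github.com/owner/repo/` parses to `owner`/`repo`, while `https://github.com/owner`, `https://github.com/owner/repo/tree/main`, and any non-GitHub URL all yield `None`.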
Add CLI flags for node config overrides and auto-fork in bootstrap
Simplify README usage examples to one command per section
Move all agent prompts and document templates into separate .md files under cli/prompts/, embedded at compile time via include_str!. Enrich prompt content with autoresearch quality principles: simplicity criterion, history awareness, trust boundary framing, crash judgment.
…claim a thesis under a different name.
…through
Fix lead policy-pass without decide follow-through
…solved
Add regression coverage for resolved thesis claims
Fix secondary rate limit misclassification and shorten retry backoff
…ets a focused retry and a real failure signal.
Fix lead queue refill when generation is skipped
…exhausted work is not proposed again.
… scenario test conflicts.
…tent and resume prep cannot drift.
            eprintln!("Warning: could not verify queue depth after agent run: {err}");
        }
    }
    break;
Lead and contribute loops exit after agent success
High Severity
In continuous mode (once=false), both the lead and contribute outer loops break when the workflow agent exits successfully. The agent prompt says "LOOP FOREVER," but agents will inevitably exit with code 0 due to context window limits. When that happens, the process silently terminates instead of restarting the agent for the next iteration. In contribute.rs, Ok(()) => break exits immediately; in lead.rs, the break at the end of the Ok arm has the same effect once post-checks pass. The Err path correctly restarts, but the Ok path does not, making continuous operation impossible once the agent's context window is exhausted.
Reviewed by Cursor Bugbot for commit 75e7ab7.
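One possible shape for the behaviour the review asks for is sketched below. It is not the PR's code: `run_workflow_loop` is a hypothetical helper, the agent is abstracted as a closure, and `max_runs` is an artificial bound added only so the sketch terminates (the real loops presumably run until interrupted). Treating a clean exit in `--once` mode as the end of the run is also an assumption.

```rust
/// Runs the agent repeatedly and returns how many times it was started.
pub fn run_workflow_loop<F>(once: bool, max_runs: usize, mut run_agent: F) -> usize
where
    F: FnMut() -> Result<(), String>,
{
    let mut runs = 0;
    loop {
        let result = run_agent();
        runs += 1;
        if let Err(err) = &result {
            eprintln!("Warning: workflow agent exited with an error: {err}");
        }
        // Assumption: only a clean exit in --once mode ends the process.
        if once && result.is_ok() {
            break;
        }
        // In continuous mode both Ok and Err fall through to a restart,
        // so an agent that runs out of context is simply started again.
        if runs >= max_runs {
            break;
        }
    }
    runs
}
```

The key difference from the reviewed code is that `Ok(())` in continuous mode no longer maps to `break`.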
Reject duplicate thesis titles in polyresearch generate
… refreshed scenario test conflicts.
…node
Fix duplicate claim with inconsistent node name (#156)
…freshed PR conflicts.
…agents cannot claim a second thesis, and cover the guard with unit, e2e, and scenario tests.
…w task itself aborts before returning.
…the once-cycle guard branch rebases cleanly.
Fix contribute --once after the first thesis cycle
    ".polyresearch-node.toml",
    "PROGRAM.md",
    "PREPARE.md",
];
Commit command missing results.tsv in protected files
Medium Severity
The ALWAYS_PROTECTED list in the new commit.rs omits results.tsv, a critical protocol file that tracks the experiment ledger. If an agent accidentally modifies results.tsv in a worktree, polyresearch commit would include the change. Whether this is caught depends entirely on the PROGRAM.md "cannot_modify" globs being set up correctly — a fragile assumption given that bootstrap is agent-driven.
Reviewed by Cursor Bugbot for commit 6fd5553.
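If the finding is accepted, one way to address it is to add `results.tsv` to the always-protected list. The sketch below assumes the list holds the four runtime paths named in the earlier commit message plus the new entry; `is_always_protected` and its matching rules (directory prefix versus exact root-level file) are hypothetical.

```rust
/// Paths that a commit must never include, independent of PROGRAM.md globs.
/// "results.tsv" is the entry the review suggests adding.
pub const ALWAYS_PROTECTED: [&str; 5] = [
    ".polyresearch/",
    ".polyresearch-node.toml",
    "PROGRAM.md",
    "PREPARE.md",
    "results.tsv",
];

/// True when a repo-relative path is covered by the hardcoded list.
pub fn is_always_protected(path: &str) -> bool {
    ALWAYS_PROTECTED.iter().any(|protected| {
        if protected.ends_with('/') {
            // Directory entry: everything beneath it is protected.
            path.starts_with(*protected)
        } else {
            // File entry: exact match at the repository root.
            path == *protected
        }
    })
}
```

Hardcoding the ledger file removes the dependency on the agent-written `cannot_modify` globs being correct.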
… by unrelated orchestration work.
Fix contributor resume flow for stale claims
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 7 total unresolved issues (including 6 from previous reviews).
Reviewed by Cursor Bugbot for commit 305dc9e.
    &thesis.issue.title,
    thesis.issue.body.as_deref().unwrap_or(""),
    &prior_attempts,
)?;
Claim missing node config sync to worktree
High Severity
claim creates the thesis worktree and writes thesis context, but unlike resume (which calls sync_node_config_to_worktree), it never copies .polyresearch-node.toml into the new worktree. When the workflow agent later CDs into the worktree and runs polyresearch commit, polyresearch attempt, or polyresearch submit, those commands call read_node_id(&ctx.repo_root) which looks for the config file in the worktree directory — and fails because it doesn't exist.
Reviewed by Cursor Bugbot for commit 305dc9e.
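The missing step could be as small as a file copy after the worktree is created. The sketch below reuses the name `sync_node_config_to_worktree` from the review comment, but its signature, return type, and behaviour when the source file is absent are assumptions rather than the project's actual implementation.

```rust
use std::fs;
use std::io;
use std::path::Path;

pub const NODE_CONFIG: &str = ".polyresearch-node.toml";

/// Copies the node config from the main checkout into a thesis worktree
/// so commands run from inside the worktree can read the node id.
/// Returns Ok(false) when the main checkout has no config to copy.
pub fn sync_node_config_to_worktree(repo_root: &Path, worktree: &Path) -> io::Result<bool> {
    let source = repo_root.join(NODE_CONFIG);
    if !source.exists() {
        return Ok(false);
    }
    // fs::copy overwrites an existing destination, which keeps a stale
    // worktree copy in line with the main checkout.
    fs::copy(&source, worktree.join(NODE_CONFIG))?;
    Ok(true)
}
```

Calling the same helper from both `claim` and `resume` would keep the two paths from drifting apart again.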


Summary
- Replace `contribute.rs` (443 lines) and `lead.rs` (367 lines) with agent workflow prompts that call CLI subcommands as shell tools
- Expand `experiment.md` from 9 lines to 43 lines with iteration support, harness-only metric rules, and explicit leave-changes-in-place instructions
- Add `contribute-workflow.md` (75 lines) and `lead-workflow.md` (63 lines) as the agent outer loops
- Add `spawn_workflow_agent` to `agent.rs` for spawning workflow agents
Architecture change
The CLI no longer contains orchestration loops.
`polyresearch contribute` and `polyresearch lead` now:
- … (`polyresearch claim`, `polyresearch submit`, etc.) as tools
All 15+ CLI subcommands stay exactly as-is. The agent composes them into workflows adaptively.
Lead/contribute independence
Lead and contribute are now fully independent agent sessions:
This eliminates the race condition class (bugs 11, 20, 28, 29, 35, 37, 38) by design.
Motivation
See agent-vs-cli-balance.md for the full analysis. In short: agents fail at precision, CLIs fail at adaptation. The CLI keeps the precision-critical protocol primitives. The agent handles the adaptive multi-step workflows where the combinatorial state space caused 42 bugs in the rigid Rust loops.
Test plan
- … (`cargo test --lib`)
- `polyresearch contribute` on a real project
- `polyresearch lead` on a real project
Note
Medium Risk
Large behavioral refactor of the CLI’s core `lead`/`contribute` execution path plus new git/GitHub automation (sync retries, worktree management, staging/commit). Failures could block or mis-sequence protocol actions, so end-to-end manual runs on real repos are important.
Overview
Shifts `polyresearch lead` and `polyresearch contribute` from Rust orchestration loops to prompt-driven workflow agents. Adds new `contribute-workflow.md` and `lead-workflow.md` prompts (plus a rewritten `experiment.md`) and introduces `agent::spawn_workflow_agent` plus prompt helpers to dynamically inject `--once`, sleep, max-parallel, and capacity guidance.
Introduces new protocol primitives to support the agent-driven flow. Adds `polyresearch resume` to recreate/verify a thesis worktree, sync `.polyresearch-node.toml`, and rewrite `.polyresearch/thesis.md`; adds `polyresearch commit` to stage only editable-surface changes and block protected paths before committing; `claim`/`batch-claim` now gate on a new `duties::claim_gate` and seed worktrees with thesis context.
Hardens repo hygiene and coordination. Bootstrap now tracks untracked files created by the setup agent, normalizes PROGRAM/PREPARE line endings, ensures `.gitignore` ignores `.polyresearch-node.toml`, and can force-add agent-created helper files. Sync is rewritten to pull/rebase safely and retry pushes on non-fast-forward races, with logic to discard “sync-only” local commits when needed. Thesis generation rejects duplicate titles via normalization, prune can remove worktrees for resolved/rejected theses, submit refuses PRs with no diff vs default branch, and GitHub handling adds `enable_issues` plus improved rate-limit retry classification/backoff.
Adds a `--once` cycle guard. New `cycle_guard` enforces “exactly one thesis cycle” runs by preventing additional claims after a release/submit marks the guard done.
Reviewed by Cursor Bugbot for commit 305dc9e. Bugbot is set up for automated code reviews on this repo.
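The overview mentions that thesis generation rejects duplicate titles via normalization. A minimal sketch of that idea follows, assuming normalization means lowercasing and collapsing whitespace; the actual rules are not stated in this PR, and both function names are hypothetical.

```rust
use std::collections::HashSet;

/// Assumed normalization: lowercase each word and collapse runs of
/// whitespace to a single space.
pub fn normalize_title(title: &str) -> String {
    title
        .split_whitespace()
        .map(|word| word.to_lowercase())
        .collect::<Vec<_>>()
        .join(" ")
}

/// True when `candidate` matches any existing title after normalization.
pub fn is_duplicate_title(existing: &[&str], candidate: &str) -> bool {
    let seen: HashSet<String> = existing.iter().map(|t| normalize_title(t)).collect();
    seen.contains(&normalize_title(candidate))
}
```

Comparing normalized forms catches titles that differ only in casing or spacing, which a plain string comparison would let through.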