A session-based CLI for generating workflow candidates, auditing them, and driving a windtunnel + release-gate loop with an explicit Forge state machine.
This repo focuses on explicit state routing, shared blackboard (ForgeState), and windtunnel-driven gates (Oracle + regression) for iterative improvement.
- Forge state machine: Explicit Node A-G pipeline with traceable routing and persisted ForgeState.
- Windtunnel contract: Standardized `WindTunnelSpec` and `WindTunnelReport` artifacts.
- Release gates: Must-not-regress, SLO, and quality checks with failure bundles.
- Patch verification loop: Any HARD gate failure triggers patch + re-test, even when `--rounds 1`.
- Failure bundles: Reproducible context + failing tasks + pointers for regression.
- Python 3.11+
- Node.js + npm (only needed if you want the GitHub MCP server)
```
# Windows
python -m venv venv
.\venv\Scripts\activate
pip install -e .

.\venv\Scripts\python.exe -m evo.main init-session --name demo

.\venv\Scripts\python.exe -m evo.main forge-run \
  --session sessions/demo \
  --rounds 1 \
  --population 3 \
  --suite rag_windtunnel_v1 \
  --replications 3 \
  --seed 42 \
  --perturb \
  --budget-sweep "tool_calls=2,4;tokens=80,160" \
  --perturb-sweep "miss_prob=0.0,0.3;noise_docs=0,2" \
  --patch-mode strategy \
  --reset
```

What you get:
- `sessions/<name>/forge_state.json` (shared blackboard)
- `sessions/<name>/state_machine/trace.md` (node routing trace)
- windtunnel evidence + report under `sessions/<name>/evidence/`
- release gate results under `sessions/<name>/gates/`
- recommendation artifacts when gates pass
```
.\venv\Scripts\python.exe -m evo.main init-session --name demo
.\venv\Scripts\python.exe -m evo.main generate --session sessions/demo --n 3
.\venv\Scripts\python.exe -m evo.main audit --session sessions/demo
.\venv\Scripts\python.exe -m evo.main report --session sessions/demo
.\venv\Scripts\python.exe -m evo.main forge-run --session sessions/demo --rounds 2 --population 3
.\venv\Scripts\python.exe -m evo.main iterate --session sessions/demo --rounds 2 --population 3
.\venv\Scripts\python.exe -m evo.main run --session sessions/demo --suite rag_windtunnel_v1 --replications 3
.\venv\Scripts\python.exe -m evo.main metrics --session sessions/demo --suite rag_windtunnel_v1
```

Nodes follow the explicit A-G route:
- A Ingest: Loads requirements into `ForgeState.user_requirements`
- B Generate: Produces candidate workflows and sets `current_workflow`
- C Static Audit: Audits + writes `required_fixes` (deduped by rule_id)
- D Windtunnel: Runs the suite for `current_workflow`, writes report + stats
- E Synthesis: Converts failures into actionable WDR updates + gate decisions
- F Revise/Patch: Applies targeted patches and prepares for re-audit
- G Package: Produces the final recommendation once gates pass
Routing:
- C Reject -> F -> C (re-audit), with a no-progress watchdog that falls back to regeneration
- C Pass -> D -> E -> (F if HARD gate fail) -> C -> ...
- Only a full gate pass triggers G Package
Trace file: `sessions/<name>/state_machine/trace.md`
Schema:
- `evo/windtunnel/spec.py` (WindTunnelSpec)
- `evo/windtunnel/report.py` (WindTunnelReport)

Artifacts:
- `sessions/<name>/evidence/workflow/<candidate_id>/windtunnel/spec_<suite>.json`
- `sessions/<name>/evidence/workflow/<candidate_id>/windtunnel/report_<suite>.json`
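As a rough sketch of what these two contracts carry, the shapes below mirror the CLI flags and gate inputs used in this README. All field names are assumptions for illustration; the real schemas live in `evo/windtunnel/spec.py` and `evo/windtunnel/report.py`:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class WindTunnelSpec:
    """Illustrative input contract: what to run and how hard to stress it."""
    suite: str
    replications: int = 3
    seed: int = 42
    perturb: bool = False
    budget_sweep: dict[str, list[int]] = field(default_factory=dict)

@dataclass
class WindTunnelReport:
    """Illustrative output contract: aggregated results for one candidate."""
    candidate_id: str
    suite: str
    pass_mean: float
    loop_rate: float
    failing_task_ids: list[str] = field(default_factory=list)

# Artifacts round-trip through JSON (spec_<suite>.json / report_<suite>.json).
spec = WindTunnelSpec(suite="rag_windtunnel_v1",
                      budget_sweep={"tool_calls": [2, 4], "tokens": [80, 160]})
spec_json = json.dumps(asdict(spec), indent=2)
```

Standardizing both sides of the contract is what lets gates, failure bundles, and regression replay consume any candidate's run the same way.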
Gate rules are configured in `gate_rules.yaml`:
- Must-not-regress: loop rate / unauthorized tool calls / injection success
- SLO: latency, cost, and timeout thresholds
- Quality: pass mean >= baseline - epsilon
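A hedged sketch of how the three gate families might be evaluated against a report (metric names, thresholds, and the function shape are invented for illustration; the real evaluation lives in `evo/gates.py` driven by `gate_rules.yaml`):

```python
def evaluate_gates(report: dict, baseline: dict,
                   slo: dict, epsilon: float = 0.02) -> dict:
    """Return per-gate verdicts; any HARD failure should trigger patch + re-test."""
    results = {}
    # Must-not-regress: these rates may never exceed the recorded baseline.
    for key in ("loop_rate", "unauthorized_tool_calls", "injection_success"):
        results[key] = report[key] <= baseline[key]
    # SLO: hard ceilings, e.g. latency / cost / timeout counts.
    results["slo"] = all(report[k] <= slo[k] for k in slo)
    # Quality: mean pass rate must stay within epsilon of the baseline.
    results["quality"] = report["pass_mean"] >= baseline["pass_mean"] - epsilon
    results["passed"] = all(results.values())
    return results
```

The epsilon band on quality lets small noise-level dips through while the must-not-regress checks stay strict.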
Gate outputs:
- `sessions/<name>/gates/last_result.json`
- `sessions/<name>/gates/baseline.json`
When a HARD gate fails, a failure bundle is created:
```
sessions/<name>/failure_bundles/<candidate_id>/<timestamp>/
├── context.json
├── failing_tasks.json
└── pointers.json
```
This enables targeted regression and replay against worst-case sweep points.
```
.
├── evo/                        # Core package
│   ├── forge/                  # ForgeState + explicit state machine
│   │   ├── state.py            # ForgeState schema (blackboard)
│   │   └── machine.py          # Node A-G routing + verification loop
│   ├── windtunnel/             # Spec/report contracts
│   │   ├── spec.py             # WindTunnelSpec (inputs)
│   │   └── report.py           # WindTunnelReport (outputs)
│   ├── oracle/                 # Windtunnel simulator + sweeps
│   │   └── rag_mini_runner.py  # Runs suite + aggregates + failure stats
│   ├── gates.py                # Release gate evaluation + baselines
│   ├── metrics.py              # Metrics extraction from evidence
│   ├── audit_engine.py         # Ruleset audit + Oracle gate checks
│   ├── patching.py             # Patch logic for workflows/proposals
│   ├── recommend.py            # Recommendation + gate-aware output
│   ├── generate.py             # Candidate generation
│   ├── iterate.py              # Legacy iterate wrapper
│   ├── core.py                 # Core API used by CLI
│   ├── main.py                 # Typer CLI entry
│   └── models.py               # Pydantic models (WorkflowIR, Metrics, etc.)
├── eval_suites/                # Windtunnel suite definitions (JSON)
├── sessions/                   # Per-run outputs (not committed)
├── gate_rules.yaml             # Release gate thresholds
├── ruleset.yaml                # Proposal ruleset (legacy)
├── ruleset_workflow.yaml       # Workflow ruleset (current)
└── README.md
```
Key runtime outputs (generated under `sessions/<name>/`):

```
sessions/<name>/
├── forge_state.json          # ForgeState blackboard snapshot
├── state_machine/trace.md    # Node routing trace
├── evidence/workflow/<id>/   # Runs, sweeps, windtunnel report/spec
├── gates/                    # Baseline + last gate result
├── failure_bundles/          # Replayable failure bundles
└── recommendation.*          # Final recommendation (when gates pass)
```
- Use the venv Python for all CLI runs in this repo.
- Session artifacts can be large; do not commit `sessions/` or log files.
- The LLM generator requires `google-genai` and a valid `GEMINI_API_KEY`/`GOOGLE_API_KEY`.
Internal MVP.