
Commit a6d22cd

Authored by ankushdesai, aishu-j, and cursoragent
Dev/p chat bot (#938)
* Initial PChatBot Commit
* Modify context files, sanity_check, and compiler_analysis
* Add multiple sanity checks, clean-up
* PChatBot: Major architecture redesign with MCP integration

  This commit introduces a comprehensive redesign of the PChatBot architecture.

  ## New Architecture Components

  ### 1. LLM Backend Abstraction (src/core/llm/)
  - Abstract LLMProvider base class with unified interface
  - Support for multiple backends: Snowflake Cortex, AWS Bedrock, Anthropic Direct
  - Factory pattern with auto-detection from environment variables
  - Configurable model parameters (temperature, max_tokens, etc.)

  ### 2. Service Layer (src/core/services/)
  - GenerationService: UI-agnostic code generation with two-stage machine generation
  - CompilationService: P compiler integration with structured error parsing
  - FixerService: intelligent error fixing with retry logic and human-in-the-loop

  ### 3. Workflow Engine (src/core/workflow/)
  - Step-based workflow execution with retry and skip logic
  - Event-driven architecture for observability
  - Support for pause/resume with human intervention
  - Pre-built workflows: full_generation, compile_and_fix, full_verification

  ### 4. MCP Server Integration (src/ui/mcp/)
  - Full MCP server for Cursor IDE integration
  - Tools: generate_*, compile, check, fix_*, syntax_help, workflows
  - Preview-then-save workflow for code review before saving
  - New fix_buggy_program tool for automatic PChecker error analysis and fixing

  ### 5. Validation Pipeline (src/core/validation/)
  - Input validators for design documents
  - P code validators for syntax and semantic checks
  - Composable validation pipeline

  ### 6. Compilation Utilities (src/core/compilation/)
  - PCompilerErrorParser: structured parsing of P compiler output
  - PErrorFixer: specialized fixes for common P errors
  - CheckerErrorParser: parse PChecker traces for debugging
  - CheckerFixer: auto-fix common runtime errors (null_target, unhandled_event)

  ## Configuration
  - env.template: template for environment configuration
  - cursor-mcp-settings.example.json: example MCP settings for Cursor
  - mcp-config.json: MCP server configuration with env var references
  - .gitignore: exclude secrets and local configuration

  ## Documentation
  - Updated README with new architecture overview
  - DESIGN_DOCUMENT.md with detailed component specifications
  - CLAUDE.md for AI assistant context

* Refactor PChatBot services for improved error handling and validation

  This commit enhances the PChatBot architecture with refined error handling and validation across services. Key updates:
  - Enhanced error handling in GenerationService and CompilationService
  - Improved input validation in the validation pipeline
  - Documentation updates reflecting these changes

  These improvements increase the robustness and reliability of the PChatBot system.

* Improve PChatBot Cursor/MCP architecture, contracts, and CI
* Add workflow persistence diagnostics and resume reliability
* MCP improvements: fixed model defaults, robust code extraction, error parsing, RAG examples
  - Snowflake provider: removed auto-discovery, default to claude-sonnet-4-5
  - Factory: simplified Snowflake config, removed CORTEX_USE_LATEST_MODEL
  - Generation: robust _extract_p_code with 4 fallback strategies (XML, markdown, bare blocks)
  - Compilation: added [Error:][file.p:line:col] parser pattern (Pattern 3)
  - Fixer: resolve relative error paths, parse combined stdout+stderr
  - Compilation: clear error message when PChecker finds no test declarations
  - Workflow factory: improved machine name extraction from numbered component lists
  - Added curated RAG examples (73 files): tutorials + generated Paxos/2PC
  - Added MCP E2E protocol test harness (scripts/mcp_e2e_protocols.py)
  - Added Snowflake model selection tests
* MCP quality improvements: retry logic, env fixes, regression suite with 5 protocols
  - load_dotenv(override=True) in all entrypoints (MCP, Streamlit, CLI)
  - Removed os.chdir() from server.py and cli/app.py
  - Fixed TypeConsistencyChecker: replaced broken check_project with working cross-file check
  - Generation retry: all generation methods retry up to 2x on extraction failure
  - Fixer: cross-file context passed to LLM, spiral detection (3x same error = stop)
  - Error messages: strip ~~ [PTool] ~~ trailer before passing to fixer
  - Post-processor: bare halt; -> raise halt; fix
  - Post-processor: forbidden keyword detection in spec monitors (this/send/new)
  - Post-processor: Timer(this) wiring warning in PTst scenario machines
  - Instructions: machines must handle/ignore/defer all receivable events
  - Instructions: specs cannot use this/send/new/announce/receive
  - Instructions: test files must not re-declare PSrc machines or invent events
  - Instructions: scenario machines must be simple launchers, correct wiring order
  - Regression suite: 5 protocols (Paxos, 2PC, MessageBroker, DistributedLock, Hotel)
  - Regression suite: scoring 0-100, baseline comparison, protocol-level retry
  - Suppress Streamlit warnings in non-Streamlit mode
  - Baseline: 450/500 (90%) — 3 protocols at 100/100
* Architecture improvements: parallel gen, incremental regen, spec/doc validation
  - Parallel machine generation: generate_machines_parallel() with ThreadPoolExecutor; uses a shared context snapshot; wired into generate_complete_project
  - Incremental regeneration: when the fixer detects a spiral (3x same error), rewrites the failing file from scratch with all project files as context
  - Spec validation: validate_spec_events() checks that all events in spec 'observes' clauses exist in the types file; runs in generate_complete_project
  - Design doc validation: validate_design_doc() checks required sections, component extraction, scenario count; blocks generation on invalid docs
  - Added concurrent.futures import for parallel execution
* Remove hardcoded filenames: use actual LLM-returned filenames throughout
  - generate_complete_project: uses types_result.filename, spec_result.filename, test_result.filename instead of hardcoded Enums_Types_Events.p/Safety.p/TestDriver.p
  - Workflow steps: types_events_filename propagated through context dict
  - GenerateMachineStep/GenerateSpecStep: read types filename from context
  - SaveGeneratedFilesStep._collect_all_context: uses dynamic types filename
  - GenerateTypesEventsStep.can_skip: checks for any .p in PSrc, not a hardcoded name
  - Post-processor in generate_complete_project: detects PTst files by path
  - All hardcoded names converted to fallback defaults (or 'X.p') only used when the LLM doesn't provide a filename
* Remove local cache and corpus artifacts
* Ignore local caches and corpus
* Update PChatBot: improved design doc, RAG examples, checker fixers, and generation pipeline
* Rename PChatBot to PeasyAI across the entire codebase
  - Renamed Src/PChatBot/ directory to Src/PeasyAI/
  - Renamed report/PChatBot_Analysis_Report.md to PeasyAI_Analysis_Report.md
  - Renamed evaluate_chatbot.py to evaluate_peasyai.py
  - Updated all PChatBot/pchatbot/p-chatbot/P-ChatBot references in source code
  - Updated MCP server name, UI titles, documentation, and config files
  - Replaced generic 'chatbot' references with 'AI assistant' where appropriate
  - Updated CLAUDE.md with new paths and naming
* Fix failing PeasyAI tests: RAG import path, langchain import, and chunk test
  - Fix sys.path in tests/rag/test_rag_index.py to point to src/rag/
  - Add pytest.importorskip for graceful skip when faiss is not installed
  - Update create_rag_index.py import from langchain.text_splitter to langchain_text_splitters (current package layout)
  - Fix test_create_chunks to use text longer than chunk_size (500 chars) so the splitter actually produces multiple chunks
* Fix CI workflow: update PChatBot references to PeasyAI
  - Update path triggers from Src/PChatBot/** to Src/PeasyAI/**
  - Update working-directory from Src/PChatBot to Src/PeasyAI
  - Update workflow name to PeasyAI Contract Tests

  The directory was renamed but the workflow still referenced the old name, so CI would never trigger and would fail if run manually.
* Package PeasyAI MCP server with ~/.peasyai/settings.json config
  - Add ~/.peasyai/settings.json config system (like ~/.claude/settings.json), replacing .env for LLM provider credentials and settings
  - Add pyproject.toml: pip-installable package with peasyai-mcp CLI entry point
  - Add src/core/config.py: config loader with env var fallback
  - Add src/ui/mcp/entry.py: CLI with init, config, and serve sub-commands
  - Add .peasyai-schema.json for IDE autocomplete in the settings file
  - Update MCP server to load config from ~/.peasyai/settings.json
  - Update env validation tool to check for settings.json
  - Update ResourceLoader to find bundled resources in installed wheels
  - Update README with install steps for Cursor and Claude Code
  - Deprecate env.template in favor of peasyai-mcp init
* Clean up stale/unnecessary files from PeasyAI

  Delete:
  - Config (Amazon Brazil build config, not relevant to open-source)
  - cursor-mcp-settings.example.json, mcp-config.json (superseded by peasyai-mcp CLI)
  - env.template (deprecated in favor of ~/.peasyai/settings.json)
  - analyze-checker-errors.py, analyze-errors.py, compute_metrics.py, visualize-pk-vs-tokens.py (one-off analysis scripts at project root)
  - resources/pipeline.json (old pipeline config, unused)
  - src/resources/p_syntax_rules.txt (unused duplicate)
  - src/rag/ (old faiss-based RAG scripts, replaced by src/core/rag/)

  Move:
  - evaluate_peasyai.py → scripts/evaluate_peasyai.py

  Update .gitignore:
  - Add generated_projects/, .peasyai_workflows.json
  - Remove stale cursor-mcp-settings.json entry
* Improve PeasyAI MCP generation quality and regression coverage

  Major improvements to the code generation pipeline based on regression analysis across 9 protocols (Paxos, 2PC, MessageBroker, DistributedLock, HotelManagement, ClientServer, FailureDetector, EspressoMachine, Raft).

  Generation pipeline:
  - Add p_code_utils.py with brace-balanced extraction, replacing fragile regex for function bodies, state bodies, and LLM response parsing
  - LLM-based machine name extraction from design docs (replaces brittle regex that misidentified "Front Desk" as "Front", "Lock Server" as "Lock", etc.)
  - Auto-inject Common_Timer template when the design doc mentions timers, heartbeats, or timeouts — prevents the LLM from reinventing the Timer machine
  - Pass expected_name to code extraction for reliable filenames
  - Inject spec monitor names into test generation context
  - Enrich RAG facet derivation from already-generated context files

  RAG retrieval:
  - Add timer/heartbeat/appliance/leader-election pattern facets
  - Cross-derive related facets (failure-detector → timer-timeout, raft → broadcast + leader-election + timer-timeout)
  - Fix timer hint to reference the CreateTimer/StartTimer/CancelTimer API

  Ensemble scoring:
  - Add compile-check verification for top-3 candidates (+50 bonus)
  - Penalize illegal var init, redeclared events/types, forbidden keywords in specs
  - Cap defer/ignore scoring to prevent verbose-but-wrong candidates

  PChecker fix loop:
  - Feed trace analysis back into targeted regeneration as checker_feedback
  - Add LLM-based fallback fixer when the specialized fixer fails
  - Add assertion_failure support to PCheckerErrorFixer
  - Rank traces by error category priority across all failing tests
  - Increase re-check schedules from 20 to 50
  - Improve spiral detection with error message normalization
  - Add build_checker_feedback() as a core utility

  Post-processor:
  - Broaden single-field tuple fix to all contexts (new, raise, type annotations, function params), not just send statements
  - Broaden _ensure_test_declarations to detect scenario machines by send/name patterns, with fallback to all machines

  Prompts and design docs:
  - Strengthen spec generation with assert requirement, empty function ban, and a working MutualExclusion example
  - Strengthen test generation checklist (assert SpecName required)
  - Add tuple construction guidance to the machine generation prompt
  - Add 4 new regression protocols with design docs
  - Improve 2PC design doc with explicit constructor signatures, timer module guidance, and state handling instructions

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Add RAG portfolio examples, enhanced corpus indexing, and cleanup
  - Add 6 portfolio RAG examples (BenOr, ChangRoberts, DAO, German, Streamlet, TokenRing) for broader protocol coverage
  - Add p_documentation_reference.txt for comprehensive P language docs
  - Enhance p_corpus.py with faceted indexing, multi-lane retrieval, and richer metadata extraction for RAG examples
  - Update MCP tools (compilation, query, rag_tools, workflows) with improved error handling and API consistency
  - Update workflow p_steps with enhanced step implementations
  - Update regression baseline with latest results
  - Update embeddings with caching improvements
  - Remove stale CMakeLists.txt and PeasyAI_Analysis_Report.md

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Fix str.format() KeyError in instruction templates and improve error handling

  The generate_machine tool was failing silently with an opaque ' ' error because P code examples in instruction templates contained unescaped curly braces that collided with Python's str.format(). Added _safe_format() helpers that fall back to manual substitution when str.format() raises KeyError/ValueError.

  Also improved all exception handlers across the service layer (generation, compilation, fixer) to include the exception type name and full traceback in logs, preventing opaque error messages.

  Includes design doc migration from .txt to .md, post-processor enhancements, validation updates, and a verified BasicPaxos tutorial generated end-to-end via the MCP tools.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Rename MCP tools to use peasy-ai-* prefix for consistent namespacing

  All MCP tool names now follow the peasy-ai-<action> convention (e.g., peasy-ai-compile, peasy-ai-gen-machine, peasy-ai-fix-compile-error) to avoid collisions with other MCP servers and improve discoverability. Updated tool definitions, descriptions, cross-references, docs, and tests.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Add GitHub Release workflow for PeasyAI distribution

  Add a GitHub Actions workflow that builds and attaches PeasyAI wheels to GitHub Releases on peasyai-v* tags, so developers can install via pip without cloning the full P repo. Also adds the LICENSE file referenced by pyproject.toml and updates the README install instructions.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Update PeasyAI README with prerequisites and installation improvements

  Point users to the P installation guide for .NET SDK 8.0, Java, and the P compiler. Add a Quick Start section, troubleshooting table, upgrade instructions, and reorganized development sections.

  Co-authored-by: Cursor <cursoragent@cursor.com>
* Overhaul PeasyAI validation pipeline and improve code generation quality
  - Replace ad-hoc code review with a unified 4-stage validation pipeline:
    - Stage 1: PCodePostProcessor (deterministic regex auto-fixes)
    - Stage 2: structured validator chain (13 validators with auto-fix)
    - Stage 3: LLM wiring review for test files (circular deps, init order)
    - Stage 4: LLM spec correctness review (observes completeness, assertions)
  - Add NamedTupleConstructionValidator for cross-file type checking
  - Add extraneous-semicolon auto-fix in SyntaxValidator
  - Fix ValidationPipeline context merging for preview-time cross-file validation
  - Update Timer template with bounded delays for liveness property support
  - Update FailureDetector design doc: liveness property, hot states, Timer as component
  - Add review_test_wiring and review_spec_correctness LLM review prompts
  - Improve generation prompts: named tuples, circular dependency patterns, helper fns
  - Fix streamlit lazy-import issue for non-UI contexts
  - Remove stale tools (simulator, trace_explorer) and dead code
  - Increase PChecker per-test timeout from 20s to 300s
  - Update CLAUDE.md with pipeline architecture docs and fix stale .env references
  - Add regression test support for wiring_fixes and spec_fixes

  Made-with: Cursor
* Update README install URLs to the PeasyAI v0.2.0 GitHub release

  Made-with: Cursor
* Replace regex-based documentation with LLM-based code documentation review

  The old approach copied text verbatim from design docs into comments, which was redundant. The new approach uses an LLM review step (Stage 5) that reads both the generated code and the design doc, then writes insightful comments explaining invariants, protocol steps, and design rationale.
  - Remove ~500 lines of regex documentation methods from PCodePostProcessor
  - Add GenerationService.review_code_documentation() as the new LLM review step
  - Add review_code_documentation.txt instruction prompt
  - Wire into all 4 MCP generation tools and all workflow steps
  - Update tests and CLAUDE.md pipeline documentation

  Made-with: Cursor
* Clean up dead code, fix var-order detection bug, and expand test coverage to 217 tests
  - Remove ~400 lines of dead/deprecated code across regex_utils, compile_utils, file_utils, string_utils, log_utils, generate_p_code, and pipelines
  - Delete dead modules: interactive.py, DesignDocInputMode.py, pipelining/examples.py
  - Fix broken InteractiveMode reference in app.py (would crash at runtime)
  - Remove debug print statements and unused imports from pipelines.py, pchecker_mode.py
  - Fix var-declaration-order detection bug in VarDeclarationOrderValidator and PCodePostProcessor: the detection loop broke early, missing cases where vars appeared after statements or were interleaved with statements
  - Add test_config.py: 21 tests for settings loading, env var overrides, defaults, malformed config, provider aliases, all 3 provider types
  - Add test_validators_extended.py: 49 tests covering all 7 previously-untested validators (InlineInit, VarDeclOrder, CollectionOps, SpecObservesConsistency, DuplicateDecl, SpecForbiddenKeyword, PayloadField, TestFile) plus 10 post-processor fix categories
  - Add test_error_parsers.py: 29 tests for compiler error parsing, categorization, CompilationResult, checker trace parsing, MachineState, EventInfo
  - Update CI workflow to run unit tests alongside contract tests in parallel jobs
  - Update release workflow to gate on the full test suite (unit + contract)
  - Fix stale test assertion for the Snowflake default model (claude-opus-4-6)

  Made-with: Cursor
* Fix doc review silent failures and raise Snowflake token limit to 20k

  The LLM-based code documentation review (Stage 5) was silently failing for every generation call due to three root causes:
  1. The Snowflake provider capped max_tokens at 8192, causing response truncation before the closing </documented_code> tag
  2. GenerateTypesEventsParams was missing the context_files field, causing an AttributeError swallowed by except Exception
  3. No visibility into failures — callers had no way to distinguish "doc review succeeded with no comments" from "doc review crashed"

  Changes:
  - Raise the Snowflake provider token cap from 8192 to 20000
  - Raise the doc review request from 8192 to 16384 tokens
  - Add retry on truncation (doubles max_tokens, up to the 20k cap)
  - Return a structured Dict with status/code/reason instead of Optional[str]
  - Surface a doc_review_status field in all MCP generation responses
  - Extract a shared _run_doc_review() helper to replace 4 duplicated try/except blocks
  - Add the context_files field to GenerateTypesEventsParams
  - Add 2 new truncation test cases (53 total tests pass)

  Made-with: Cursor

---------

Co-authored-by: Aishwarya Jagarapu <ajagara1@asu.edu>
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent c04f4e8 commit a6d22cd

File tree

77 files changed: +7003 additions, -0 deletions



CLAUDE.md

Lines changed: 490 additions & 0 deletions

Src/PeasyAI/.cursorrules

Whitespace-only changes.
Lines changed: 339 additions & 0 deletions
@@ -0,0 +1,339 @@
# PeasyAI Cursor + MCP Improvement Design

**Status:** Proposed
**Date:** 2026-02-09
**Scope:** `Src/PeasyAI` (MCP server, service layer, Streamlit/CLI integration, testing, developer UX)

---

## 1) Why this document

PeasyAI now has strong building blocks: service-layer architecture, workflow engine, MCP tools/resources, and multi-provider support.
The next step is to improve the end-to-end Cursor experience so the agent is:

- More reliable across environments
- Easier to reason about and debug
- Faster in iterative workflows
- Better at human-in-the-loop interactions

This design focuses on concrete engineering changes that reduce failures and increase day-to-day usability in Cursor.

---

## 2) Current-state observations

### Strengths

- Service layer (`GenerationService`, `CompilationService`, `FixerService`) enables shared logic across interfaces.
- MCP server offers broad tool coverage (generation, compile/check, fix, workflow, RAG, resources).
- Preview-then-save workflow is aligned with Cursor review and user control.
- Workflow abstractions exist for orchestrated generation/verification.

### Gaps

- MCP tool schemas and result shapes are not yet uniformly versioned or validated at the boundary.
- Workflow/session persistence is limited for long-running or interrupted work.
- Limited operational telemetry for production-quality debugging in an IDE context.
- End-to-end MCP tests are light compared with unit tests, especially for failure/human-guidance paths.
- Some docs and examples can drift from actual tool signatures as the API evolves.

---

## 3) Product goals

1. **Cursor-first reliability**
   Minimize tool-call ambiguity and environment failures.

2. **Human-in-the-loop clarity**
   Make guidance requests explicit, structured, and resumable.

3. **Deterministic orchestration**
   Provide reproducible workflow behavior and resumability.

4. **Faster iteration loops**
   Improve compile/fix/check turnaround with better caching and context handling.

5. **Confidence through testing**
   Add robust MCP integration and contract tests.

---

## 4) Design principles

- **Strict contracts at boundaries:** every MCP tool should have stable, explicit input/output contracts.
- **Progressive enhancement:** keep existing tools working while layering better metadata and orchestration.
- **Small, composable modules:** split large server logic into focused registries and shared helpers.
- **Observable by default:** log enough to diagnose issues without dumping sensitive payloads.
- **Fail soft, recover quickly:** meaningful error categories + suggested next actions in every failure path.

---

## 5) Proposed improvements

## A. MCP Contract & API Quality

### A1. Introduce tool response envelope standard (v1)

All tools should return:

- `success`
- `error` (if any)
- `metadata`
  - `tool`
  - `operation_id`
  - `timestamp`
  - `provider`
  - `model`
  - `token_usage`

**Status:** Partially implemented (metadata added).
**Next:** Add contract tests to guarantee this envelope remains stable.

### A2. Add `api_version` and deprecation notices

Add fields:

- `api_version: "1.0"`
- `deprecation_warning` (only when applicable)

This prevents silent breakage in Cursor prompts/workflows when tools evolve.

### A3. Normalize error categories

All tools should include:

- `error_category` (e.g., `environment`, `validation`, `compilation`, `checker`, `llm_provider`, `internal`)
- `retryable` boolean
- `next_actions` list

This gives Cursor/Claude Code better branching behavior during autonomous tool chains.
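The A1–A3 contract above can be sketched as data types. This is an illustrative sketch using stdlib dataclasses (the real implementation would more likely use pydantic models, per the testing section); the class names are invented, and the nesting of provider/model/token_usage under `metadata` follows A1.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional

# Error categories proposed in A3.
ERROR_CATEGORIES = {
    "environment", "validation", "compilation", "checker", "llm_provider", "internal"
}

@dataclass
class ToolError:
    message: str
    error_category: str                      # must be one of ERROR_CATEGORIES
    retryable: bool = False
    next_actions: list = field(default_factory=list)

    def __post_init__(self):
        # Enforce the A3 category convention at the boundary.
        if self.error_category not in ERROR_CATEGORIES:
            raise ValueError(f"unknown error_category: {self.error_category}")

@dataclass
class ToolEnvelope:
    success: bool
    tool: str
    operation_id: str
    api_version: str = "1.0"                 # A2: versioned contract
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    error: Optional[ToolError] = None
    deprecation_warning: Optional[str] = None
    # A1: provider, model, token_usage live under metadata
    metadata: dict = field(default_factory=dict)
```

A failing compile call would then serialize to an envelope with `success=False`, an `error_category` of `"environment"`, and concrete `next_actions` the agent can branch on.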
---

## B. Cursor Interaction Model

### B1. Session-aware operation state

Add session/project correlation fields:

- `session_id` (from caller or generated)
- `project_id` (derived from path)
- `workflow_id` (for workflow-enabled calls)

This simplifies multi-step and multi-project activity in a single Cursor chat.

### B2. Improve human-guidance protocol

For `needs_guidance` responses, standardize:

- `guidance_request.id`
- `guidance_request.context`
- `guidance_request.questions[]`
- `guidance_request.attempted_fixes[]`
- `guidance_request.resume_tool`
- `guidance_request.resume_payload_template`

This avoids ambiguity when the agent asks user questions and resumes later.
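Under the B2 fields above, a paused tool call might return a payload like the following. All concrete values here are invented for illustration; only the field names come from this document.

```python
# Illustrative needs_guidance response following the B2 field names.
guidance_response = {
    "success": False,
    "needs_guidance": True,
    "guidance_request": {
        "id": "guid-42",  # hypothetical request id
        "context": "Fixer hit the same compile error 3 times in Server.p",
        "questions": [
            "Should eTimeout carry a payload, or stay payload-free?",
        ],
        "attempted_fixes": [
            "added eTimeout to the Server event set",
            "regenerated Server.p from scratch",
        ],
        # Tells the agent exactly which tool call resumes the paused work...
        "resume_tool": "peasy-ai-fix-compile-error",
        # ...and what shape the resume payload must take.
        "resume_payload_template": {
            "guidance_id": "guid-42",
            "user_guidance": "<answer here>",
        },
    },
}
```

The `resume_payload_template` echoing the request `id` is what makes the round trip unambiguous: the agent can ask the user, fill in `user_guidance`, and call `resume_tool` without guessing how to reconnect the answer to the paused workflow.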
### B3. Add “preflight” recommendation mode

Expand `validate_environment` to include:

- missing prerequisites
- suggested commands
- provider-specific checks
- write-permission checks for the target project path

This should be the first call in most generation workflows.

---

## C. Workflow Robustness

### C1. Persist workflow state

Persist active/paused workflow state (JSON) under the project temp dir:

- step index
- context snapshot (bounded)
- recent errors
- timestamps

This enables resume after IDE restart or process interruption.
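The C1 persistence scheme can be sketched as below. This is a minimal illustration under assumptions: the function names, the 50 KB context bound, and the keep-last-5-errors policy are all invented; only the snapshot fields (step index, bounded context, recent errors, timestamps) come from the list above.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

def save_workflow_state(state_dir: Path, workflow_id: str, step_index: int,
                        context: dict, recent_errors: list) -> Path:
    """Persist a bounded workflow snapshot so a run can resume after an IDE
    restart. Written atomically (temp file + rename) so a crash mid-write
    never leaves a truncated state file behind."""
    state_dir.mkdir(parents=True, exist_ok=True)
    snapshot = {
        "workflow_id": workflow_id,
        "step_index": step_index,
        # Bound the context snapshot: drop any value over ~50 KB serialized.
        "context": {k: v for k, v in context.items()
                    if len(json.dumps(v, default=str)) < 50_000},
        "recent_errors": recent_errors[-5:],   # keep only the last few
        "saved_at": time.time(),
    }
    path = state_dir / f"{workflow_id}.json"
    fd, tmp = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(snapshot, f)
    os.replace(tmp, path)                      # atomic on POSIX and Windows
    return path

def load_workflow_state(state_dir: Path, workflow_id: str) -> Optional[dict]:
    """Return the saved snapshot, or None if there is nothing to resume."""
    path = state_dir / f"{workflow_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

The atomic rename is the important design choice: resume logic can trust that any state file it finds is complete, rather than having to detect half-written JSON.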
### C2. Idempotent step semantics

Each workflow step should define:

- inputs hash
- side-effect output paths
- safe re-run behavior

This reduces duplicate writes and inconsistent state during retries.
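The inputs-hash idea from C2 can be sketched as follows (function names and the 16-character digest truncation are illustrative assumptions):

```python
import hashlib
import json

def step_inputs_hash(step_name: str, inputs: dict) -> str:
    """Stable hash of a step's inputs. sort_keys makes the hash independent
    of dict insertion order, so identical inputs always collide."""
    canonical = json.dumps({"step": step_name, "inputs": inputs},
                           sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def run_step_idempotently(step_name: str, inputs: dict,
                          completed: dict, run_fn):
    """Safe re-run behavior: if this exact (step, inputs) pair already ran,
    reuse its recorded result instead of repeating side effects."""
    key = step_inputs_hash(step_name, inputs)
    if key in completed:
        return completed[key]
    result = run_fn(inputs)
    completed[key] = result
    return result
```

During a retry, a step whose inputs are unchanged is skipped entirely, which is what prevents duplicate writes to the same output paths.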
### C3. Partial-success strategy

For large generation flows, return:

- `completed_steps`
- `failed_steps`
- `artifacts_generated`
- `artifacts_skipped`

This lets Cursor continue incrementally instead of restarting whole workflows.

---

## D. Performance and Cost

### D1. Prompt/context budgeting

Implement per-tool token budget controls:

- baseline guide snippets by tool
- adaptive truncation for large context files
- cap for included project files by relevance

### D2. Smart caching

Cache with eviction:

- resource file loads (already present)
- RAG query results by `(query, category, top_k)`
- compile outputs keyed by project hash (short TTL)
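One way to implement D2's cache-with-eviction is an LRU ordering combined with a per-entry TTL. The sketch below is illustrative (class name and defaults are assumptions); the key shapes in the comments come from the list above.

```python
import time
from collections import OrderedDict

class TTLCache:
    """Small LRU cache with per-entry TTL. Suitable for RAG query results
    keyed by (query, category, top_k) or compile outputs keyed by a project
    hash with a short TTL."""

    def __init__(self, max_entries: int = 128, ttl_seconds: float = 300.0):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._store: OrderedDict = OrderedDict()   # key -> (value, expiry)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires = item
        if time.monotonic() > expires:
            del self._store[key]        # expired: evict lazily on read
            return None
        self._store.move_to_end(key)    # mark as recently used
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)   # drop least-recently-used
```

TTL alone handles the staleness risk noted in the risks section; pairing the cache key with a content hash of the project files (as D2 suggests for compile outputs) additionally invalidates entries the moment inputs change.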
### D3. Faster verification loops

Use a staged checker strategy:

- quick check (low schedules) during iterative fixing
- full check only when a candidate looks stable

---

## E. Testing Strategy (Cursor-centered)

### E1. MCP contract tests

For each tool:

- validate pydantic input schema behavior
- validate response envelope fields
- validate error category conventions
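An E1-style contract check might look like this. It is a sketch: the helper name is hypothetical and tool invocation is assumed to be stubbed elsewhere, but the field names and category set follow sections A1–A3.

```python
REQUIRED_ENVELOPE_FIELDS = {"success", "tool", "operation_id",
                            "api_version", "metadata"}
VALID_ERROR_CATEGORIES = {"environment", "validation", "compilation",
                          "checker", "llm_provider", "internal"}

def check_envelope(response: dict) -> list:
    """Return a list of contract violations; an empty list means the
    response honors the A1-A3 envelope conventions."""
    problems = [f"missing field: {f}" for f in REQUIRED_ENVELOPE_FIELDS
                if f not in response]
    if not response.get("success", True):
        err = response.get("error") or {}
        if err.get("error_category") not in VALID_ERROR_CATEGORIES:
            problems.append("error_category missing or unknown")
        if not isinstance(err.get("retryable"), bool):
            problems.append("retryable must be a boolean")
        if not isinstance(err.get("next_actions"), list):
            problems.append("next_actions must be a list")
    return problems
```

Run against every tool's stubbed responses in CI, a helper like this catches envelope drift the moment a tool stops emitting a required field, rather than when a Cursor session breaks.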
### E2. Golden-path integration tests

Scenarios:

- `validate_environment` -> generate minimal project -> compile
- compile failure -> `peasy-ai-fix-compile-error` -> compile pass
- checker failure -> `peasy-ai-fix-checker-error` -> checker re-run

Mock the LLM and use deterministic fixtures for CI stability.

### E3. Human-guidance tests

Assert:

- tool pauses with `needs_guidance=true`
- response includes structured questions/template
- resume call applies guidance and continues the workflow

### E4. Multi-provider smoke matrix

Run minimal smoke cases for:

- snowflake config detection
- anthropic_direct detection
- bedrock fallback

No live external calls required in CI; provider clients should be mocked.

---

## F. Documentation and Prompting for Cursor

### F1. Single source of truth for tool docs

Generate MCP tool docs from the actual schema definitions to prevent drift.

### F2. Cursor usage playbooks

Add short runbooks:

- “Generate project from design doc”
- “Fix compile errors interactively”
- “Run verification loop with human guidance”

### F3. Troubleshooting map

Map common failures to:

- likely cause
- diagnostic tool
- expected remediation path

---

## 6) Proposed implementation plan

### Phase 1 (1 week): Contracts + Preflight

- Finalize response envelope + `api_version`
- Add standardized `error_category`, `retryable`, `next_actions`
- Expand and document `validate_environment`
- Add contract tests for the top 10 tools

### Phase 2 (1-2 weeks): Workflow resilience

- Workflow state persistence + resume
- Idempotency metadata for critical steps
- Partial-success artifact reporting
- Guidance request template standardization

### Phase 3 (1 week): Performance + verification loop

- Token/context budgeting controls
- RAG/compile short-term caching
- Quick-check/full-check staged strategy

### Phase 4 (1 week): Docs + playbooks

- Auto-generated tool docs
- Cursor runbooks
- Troubleshooting matrix and FAQ

---

## 7) Success metrics

- **Reliability:** tool-call failure rate in Cursor sessions
- **Recovery:** percent of failures resolved without restarting the full workflow
- **Speed:** median time for generate->compile and fix->recheck loops
- **Guidance quality:** percent of paused runs successfully resumed
- **Dev velocity:** time to add/modify an MCP tool with passing contract tests

---

## 8) Risks and mitigations

- **Risk:** schema changes break existing prompts/workflows
  **Mitigation:** versioned contracts + deprecation warnings.

- **Risk:** workflow persistence stores too much context
  **Mitigation:** bounded snapshots + redaction of large/sensitive payloads.

- **Risk:** over-caching returns stale results
  **Mitigation:** TTL + content-hash invalidation.

- **Risk:** CI flakiness from model/provider calls
  **Mitigation:** deterministic mocks and fixture-driven integration tests.

---

## 9) Immediate next actions

1. Add contract tests for `peasy-ai-validate-env`, `peasy-ai-gen-*`, `peasy-ai-compile`, `peasy-ai-fix-*`, `peasy-ai-run-workflow`.
2. Implement `api_version` + standardized error fields in tool responses.
3. Add workflow persistence for pause/resume.
4. Update the README tool list and examples to match current tool signatures.
