CLI code agent: prompt user to retry on transient API errors#3130
Detect is_error on SDK result messages and thrown transport errors, and offer a bounded user-mediated retry (max 4 attempts) that continues from the existing session. Caps the SDK's internal retry at 1 via CLAUDE_CODE_MAX_RETRIES so each attempt fails fast instead of burning through the default 10 retries. Also swallows the SDK's ProcessTransport cleanup rejection on ESC to prevent the CLI from crashing. STU-1578
// Node.js terminates the process on unhandled rejections.
const SDK_INTERRUPT_CLEANUP_ERRORS = [
	'Query closed',
	'ProcessTransport is not ready for writing',
This is not strictly related to these changes: I found that pressing ESC during the Take screenshot step terminated the process abruptly, so I added this entry to prevent that. The rejection is now handled correctly.
It feels like there should be a better way to catch these SDK errors other than checking strings but it seems there's no other valid option now?
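As a rough illustration of the string check discussed in this thread, here is a minimal sketch. The helper names `isSdkInterruptCleanupError` and `handleUnhandledRejection` are hypothetical, and substring matching is an assumption; the actual handler in `apps/cli/ai/agent.ts` may compare messages differently.

```typescript
// Messages that, per this PR, the SDK rejects with during ESC cleanup.
const SDK_INTERRUPT_CLEANUP_ERRORS = [
	'Query closed',
	'ProcessTransport is not ready for writing',
];

// Hypothetical helper: true when a rejection reason matches a known cleanup message.
// Substring matching is an assumption made for this sketch.
function isSdkInterruptCleanupError( reason: unknown ): boolean {
	const message = reason instanceof Error ? reason.message : String( reason );
	return SDK_INTERRUPT_CLEANUP_ERRORS.some( ( known ) => message.includes( known ) );
}

// Hypothetical handler: swallow known cleanup rejections, rethrow everything else
// so that unrelated failures still surface.
function handleUnhandledRejection( reason: unknown ): void {
	if ( isSdkInterruptCleanupError( reason ) ) {
		return;
	}
	throw reason;
}
```

In Node.js a handler like this would typically be registered with `process.on( 'unhandledRejection', handleUnhandledRejection )`. Keeping the predicate separate makes it easy to swap the string check for an error code or class if the SDK ever exposes one.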
// Fail fast on transient API errors so the user-mediated retry prompt can
// intervene instead of the SDK burning through its default 10 retries.
if ( ! env.CLAUDE_CODE_MAX_RETRIES ) {
	env.CLAUDE_CODE_MAX_RETRIES = '1';
This prevents the SDK from retrying internally. The default is 10 retries, so before this change the SDK retried after each timeout without notifying the client code; with a 4-minute timeout, it would take about 40 minutes to fail. Now it fails fast, the agent handles the error, and the user is told that a hiccup happened on the server.
I wonder if we should retry at least once without the user knowing.
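For reference, the cap-with-override behaviour in the snippet above can be sketched as a small pure function. `withRetryCap` and `ProviderEnv` are hypothetical names used only for illustration; the PR applies the same check inline in the provider env.

```typescript
type ProviderEnv = Record< string, string | undefined >;

// Hypothetical helper: returns a copy of the env with the SDK retry cap applied.
// A value already supplied by the user is left untouched.
function withRetryCap( baseEnv: ProviderEnv ): ProviderEnv {
	const env: ProviderEnv = { ...baseEnv };
	if ( ! env.CLAUDE_CODE_MAX_RETRIES ) {
		env.CLAUDE_CODE_MAX_RETRIES = '1';
	}
	return env;
}
```

With this shape, raising the value back up (for example to get one silent retry, as suggested above) only requires exporting `CLAUDE_CODE_MAX_RETRIES` before launching the CLI.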
📊 Performance Test Results
Comparing 7700476 vs trunk: app-size, site-editor, site-startup (result tables not captured).
Results are median values from multiple test runs. Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)
question: __( 'There was a hiccup on the server. Do you want to continue?' ),
options: [
	{ label: 'Yes', description: __( 'Continue from where you left off' ) },
	{ label: 'No', description: __( 'Stop here' ) },
I wonder if a third option (Other) is needed, where the user can give different instructions.
youknowriad left a comment
This is working well.
Related issues
How AI was used in this PR
AI was used throughout: investigating the Claude Agent SDK's streaming/non-streaming fallback, retry, and timeout defaults; identifying the mis-classification bug where {subtype:"success", is_error:true} result messages were treated as successful turns; and implementing the retry-prompt + SDK retry-cap + ESC crash-fix changes. Every edit was reviewed against the session-file evidence supplied by me, and all changes pass lint, typecheck, and the existing 88 AI tests.

Proposed Changes
- apps/cli/ai/ui.ts, apps/cli/ai/output-adapter.ts: route result messages on message.is_error instead of message.subtype. The SDK can emit {subtype:"success", is_error:true} (e.g. after exhausting retries on a 504), and this was previously mis-classified as success.
- apps/cli/commands/ai/index.ts: on is_error:true or a thrown transport error, the code agent now asks "There was a hiccup on the server. Do you want to continue?". Yes resumes the session with "Continue from where you left off."; No stops.
- apps/cli/ai/providers.ts: set CLAUDE_CODE_MAX_RETRIES=1 in the provider env so each failed attempt errors in ~1-2 min instead of burning the SDK's default 10 retries (~29 min). Respects a user-supplied override.
- apps/cli/ai/agent.ts: the existing unhandledRejection handler that swallowed Query closed now also swallows ProcessTransport is not ready for writing. Both come from the SDK's cleanup path on ESC; without this, pressing ESC during a tool call crashes the CLI.
- apps/cli/ai/output-adapter.ts, apps/cli/ai/ui.ts: HandleMessageResult now carries an interrupted flag so user-initiated ESC does not trigger the retry prompt.

Error recovery flow
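The detection step of this flow, routing on is_error rather than subtype and skipping the prompt after a user-initiated ESC, might look roughly like the sketch below. The `ResultMessage` interface is reduced to the fields mentioned in this PR, and `classifyResult` and `shouldOfferRetry` are hypothetical names, not the functions in `ui.ts` or `output-adapter.ts`.

```typescript
// Assumed minimal shape of an SDK result message (only the fields discussed here).
interface ResultMessage {
	type: 'result';
	subtype: string;
	is_error: boolean;
}

type TurnStatus = 'success' | 'error';

// Route on is_error: the SDK can emit { subtype: "success", is_error: true },
// so subtype alone is not a reliable success signal.
function classifyResult( message: ResultMessage ): TurnStatus {
	return message.is_error ? 'error' : 'success';
}

// Only offer the retry prompt for real errors, never for a user-initiated ESC.
function shouldOfferRetry( status: TurnStatus, interrupted: boolean ): boolean {
	return status === 'error' && ! interrupted;
}
```

The same status value could feed the JSON adapter, so that turn.completed reports "error" whenever is_error is set.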
State machine
flowchart TD
    Start([User sends prompt]) --> Turn[runAgentTurn]
    Turn --> SDKCall[SDK → AI proxy]
    SDKCall --> SDKResult{Response ok?}
    SDKResult -->|Yes| Success([Turn succeeds])
    SDKResult -->|Transient failure<br/>504 / timeout / etc.| SDKRetry[SDK retries once<br/>CLAUDE_CODE_MAX_RETRIES=1]
    SDKRetry --> SDKCall2[SDK → AI proxy]
    SDKCall2 --> SDKResult2{Response ok?}
    SDKResult2 -->|Yes| Success
    SDKResult2 -->|Still failing| Detect{is_error: true<br/>OR<br/>thrown error?}
    Detect -->|No| Success
    Detect -->|Yes| CheckCap{retryAttempt<br/>≥ 4?}
    CheckCap -->|Yes| GiveUp([Show 'try again later'<br/>return to input])
    CheckCap -->|No| Prompt[/Prompt user:<br/>'There was a hiccup<br/>on the server.<br/>Do you want to continue?'/]
    Prompt -->|No| Stop([Return to input])
    Prompt -->|Yes| Resume[Resume session<br/>with same sessionId<br/>retryAttempt++]
    Resume --> Turn

    classDef success fill:#d4edda,stroke:#28a745,color:#000
    classDef error fill:#f8d7da,stroke:#dc3545,color:#000
    classDef prompt fill:#fff3cd,stroke:#ffc107,color:#000
    class Success success
    class GiveUp,Stop error
    class Prompt prompt

Concrete recovery (server-side hiccup that clears on its own)
sequenceDiagram
    participant U as User
    participant CLI as Studio CLI
    participant SDK as Agent SDK
    participant Proxy as AI Proxy
    U->>CLI: Send prompt
    CLI->>SDK: Start turn (attempt 1)
    SDK->>Proxy: POST /v1/messages (stream)
    Proxy--xSDK: 504 Gateway Timeout
    Note over SDK: CLAUDE_CODE_MAX_RETRIES=1<br/>one quick retry
    SDK->>Proxy: POST /v1/messages (retry)
    Proxy--xSDK: 504 again
    SDK-->>CLI: result: is_error=true<br/>(~1–2 min elapsed)
    CLI->>U: 'Hiccup on the server.<br/>Do you want to continue?'
    Note over U,Proxy: User pauses to read.<br/>Server recovers in the meantime.
    U->>CLI: Yes
    CLI->>SDK: Resume session<br/>'Continue from where you left off.'<br/>(attempt 2)
    SDK->>Proxy: POST /v1/messages
    Proxy-->>SDK: 200 stream OK
    SDK-->>CLI: Assistant messages + tool calls
    CLI->>U: Turn completes

Recovery when the server stays down
Each failed attempt ends in ~1–2 min thanks to the SDK retry cap. After four prompted retries (attempts 2 → 5 counted from the user's POV), the CLI stops prompting and shows "The server has not recovered after multiple attempts. Please try again later." Total time elapsed: bounded at roughly 5–10 min instead of the previous ~40 mins (10 SDK retries × streaming + non-streaming paths).
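The bounded, user-mediated retry described above can be sketched as a loop over injected dependencies. This is only an illustration of the state machine: `runWithUserRetry`, `RetryDeps`, and `TurnOutcome` are hypothetical names, and `runTurn` is assumed to resume the existing session (same sessionId) on every call.

```typescript
const MAX_RETRY_ATTEMPTS = 4;

interface TurnOutcome {
	isError: boolean; // is_error on the SDK result
	interrupted: boolean; // user pressed ESC
}

// Hypothetical dependencies, injected so the loop can be exercised without the SDK.
interface RetryDeps {
	runTurn: ( prompt: string ) => Promise< TurnOutcome >;
	askToContinue: () => Promise< boolean >;
	notifyGiveUp: () => void;
}

async function runWithUserRetry(
	prompt: string,
	deps: RetryDeps
): Promise< 'done' | 'stopped' | 'gave-up' > {
	let retryAttempt = 0;
	let nextPrompt = prompt;
	for ( ;; ) {
		let outcome: TurnOutcome;
		try {
			outcome = await deps.runTurn( nextPrompt );
		} catch {
			// A thrown transport error is treated like is_error: true.
			outcome = { isError: true, interrupted: false };
		}
		if ( ! outcome.isError || outcome.interrupted ) {
			return 'done';
		}
		if ( retryAttempt >= MAX_RETRY_ATTEMPTS ) {
			deps.notifyGiveUp(); // e.g. show the "try again later" message
			return 'gave-up';
		}
		if ( ! ( await deps.askToContinue() ) ) {
			return 'stopped';
		}
		retryAttempt++;
		nextPrompt = 'Continue from where you left off.';
	}
}
```

If the server never recovers and the user always answers Yes, this loop runs five turns and shows four prompts before giving up, which matches the "attempts 2 → 5" count above.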
Testing Instructions
Prerequisite:
npm run cli:build (the CLI must be rebuilt to pick up the changes).

Happy path
node apps/cli/dist/cli/main.mjs code

Retry prompt (transient error)
public-api0-sandbox.php: add_filter( 'wpcom_ai_api_proxy_request_timeout', fn() => 1 );
Prompt options: Yes/No.
Yes → turn re-runs as "Continue from where you left off." with session preserved.

Fast-fail retries
ESC during tool call (no crash)
Example tool call: mcp__studio__take_screenshot.
The CLI should not crash with Error: ProcessTransport is not ready for writing.

JSON mode
node apps/cli/dist/cli/main.mjs code --json "ping" with a failing proxy.
turn.completed with status: "error" (not "success"). Previously the adapter reported "success" when the SDK emitted {subtype:"success", is_error:true}.

Pre-merge Checklist