CLI code agent: prompt user to retry on transient API errors by epeicher · Pull Request #3130 · Automattic/studio

epeicher · 2026-04-17T15:58:59Z

Related issues

STU-1578

How AI was used in this PR

AI was used throughout: investigating the Claude Agent SDK's streaming/non-streaming fallback, retry, and timeout defaults; identifying the mis-classification bug where {subtype:"success", is_error:true} result messages were treated as successful turns; and implementing the retry-prompt + SDK retry-cap + ESC crash-fix changes. Every edit was reviewed against the session-file evidence supplied by me, and all changes pass lint, typecheck, and the existing 88 AI tests.

Proposed Changes

Detect transient API failures correctly (apps/cli/ai/ui.ts, apps/cli/ai/output-adapter.ts): route result messages on message.is_error instead of message.subtype. The SDK can emit {subtype:"success", is_error:true} (e.g. after exhausting retries on a 504), and this was previously mis-classified as success.
User-mediated retry prompt (apps/cli/commands/ai/index.ts): on is_error:true or a thrown transport error, the code agent now asks "There was a hiccup on the server. Do you want to continue?". Yes resumes the session with "Continue from where you left off."; No stops.
Bounded retries: capped at 4 user-prompted attempts per turn. After that, the CLI shows "The server has not recovered after multiple attempts. Please try again later." instead of prompting again.
Fail-fast SDK retries (apps/cli/ai/providers.ts): set CLAUDE_CODE_MAX_RETRIES=1 in the provider env so each failed attempt errors in ~1-2 min instead of burning the SDK's default 10 retries (~29 min). Respects a user-supplied override.
Interrupt-cleanup safety (apps/cli/ai/agent.ts): the existing unhandledRejection handler that swallowed "Query closed" now also swallows "ProcessTransport is not ready for writing". Both come from the SDK's cleanup path on ESC; without this, pressing ESC during a tool call crashes the CLI.
Interrupt vs. error distinction (apps/cli/ai/output-adapter.ts, apps/cli/ai/ui.ts): HandleMessageResult now carries an interrupted flag so user-initiated ESC does not trigger the retry prompt.
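To illustrate the classification change, here is a minimal sketch of routing on is_error rather than subtype. The ResultMessage shape, the TurnStatus type, and classifyResult are hypothetical names for illustration only; the real SDK message and HandleMessageResult types carry more fields and may be structured differently.

```typescript
// Hypothetical, trimmed-down shape of an SDK result message (assumption).
interface ResultMessage {
	type: 'result';
	subtype: string; // e.g. "success"
	is_error: boolean;
}

type TurnStatus = 'success' | 'error' | 'interrupted';

// Classify on is_error, never on subtype: the SDK can report
// { subtype: "success", is_error: true } after exhausting its retries.
function classifyResult( message: ResultMessage, interruptedByUser: boolean ): TurnStatus {
	if ( interruptedByUser ) {
		return 'interrupted'; // user pressed ESC, so no retry prompt
	}
	return message.is_error ? 'error' : 'success';
}

const misleading: ResultMessage = { type: 'result', subtype: 'success', is_error: true };
console.log( classifyResult( misleading, false ) ); // "error"
```

Checking the interrupt flag first keeps a user-initiated ESC from being reported as a server failure, whatever the result message says.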

Error recovery flow

State machine

flowchart TD
    Start([User sends prompt]) --> Turn[runAgentTurn]
    Turn --> SDKCall[SDK → AI proxy]
    SDKCall --> SDKResult{Response ok?}
    SDKResult -->|Yes| Success([Turn succeeds])
    SDKResult -->|Transient failure<br/>504 / timeout / etc.| SDKRetry[SDK retries once<br/>CLAUDE_CODE_MAX_RETRIES=1]
    SDKRetry --> SDKCall2[SDK → AI proxy]
    SDKCall2 --> SDKResult2{Response ok?}
    SDKResult2 -->|Yes| Success
    SDKResult2 -->|Still failing| Detect{is_error: true<br/>OR<br/>thrown error?}
    Detect -->|No| Success
    Detect -->|Yes| CheckCap{retryAttempt<br/>≥ 4?}
    CheckCap -->|Yes| GiveUp([Show 'try again later'<br/>return to input])
    CheckCap -->|No| Prompt[/Prompt user:<br/>'There was a hiccup<br/>on the server.<br/>Do you want to continue?'/]
    Prompt -->|No| Stop([Return to input])
    Prompt -->|Yes| Resume[Resume session<br/>with same sessionId<br/>retryAttempt++]
    Resume --> Turn

    classDef success fill:#d4edda,stroke:#28a745,color:#000
    classDef error fill:#f8d7da,stroke:#dc3545,color:#000
    classDef prompt fill:#fff3cd,stroke:#ffc107,color:#000
    class Success success
    class GiveUp,Stop error
    class Prompt prompt
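
The flowchart above can be read as a small loop. The sketch below is one possible shape for it, not the code in apps/cli/commands/ai/index.ts: RetryDeps, runWithRetryPrompt, and the injected callbacks are hypothetical names, and the real implementation may track the session differently.

```typescript
type TurnOutcome = 'success' | 'error' | 'interrupted';

// Hypothetical dependencies, injected so the loop can be exercised in isolation.
interface RetryDeps {
	runTurn: ( prompt: string, sessionId?: string ) => Promise< { outcome: TurnOutcome; sessionId: string } >;
	askToContinue: () => Promise< boolean >;
	notify: ( message: string ) => void;
}

const MAX_PROMPTED_RETRIES = 4;

async function runWithRetryPrompt( prompt: string, deps: RetryDeps ): Promise< TurnOutcome > {
	let retryAttempt = 0;
	let sessionId: string | undefined;
	let nextPrompt = prompt;

	while ( true ) {
		let outcome: TurnOutcome;
		try {
			const result = await deps.runTurn( nextPrompt, sessionId );
			outcome = result.outcome;
			sessionId = result.sessionId;
		} catch {
			outcome = 'error'; // a thrown transport error is treated like is_error: true
		}

		if ( outcome !== 'error' ) {
			return outcome; // success or user interrupt: never prompt
		}
		if ( retryAttempt >= MAX_PROMPTED_RETRIES ) {
			deps.notify( 'The server has not recovered after multiple attempts. Please try again later.' );
			return 'error';
		}
		if ( ! ( await deps.askToContinue() ) ) {
			return 'error'; // user chose No
		}
		retryAttempt++;
		nextPrompt = 'Continue from where you left off.';
	}
}
```

With a turn that always fails and a user who always answers Yes, this loop runs five turns (the initial attempt plus four prompted retries), prompts four times, and then shows the "try again later" message once.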

Concrete recovery (server-side hiccup that clears on its own)

sequenceDiagram
    participant U as User
    participant CLI as Studio CLI
    participant SDK as Agent SDK
    participant Proxy as AI Proxy

    U->>CLI: Send prompt
    CLI->>SDK: Start turn (attempt 1)
    SDK->>Proxy: POST /v1/messages (stream)
    Proxy--xSDK: 504 Gateway Timeout
    Note over SDK: CLAUDE_CODE_MAX_RETRIES=1<br/>one quick retry
    SDK->>Proxy: POST /v1/messages (retry)
    Proxy--xSDK: 504 again
    SDK-->>CLI: result: is_error=true<br/>(~1–2 min elapsed)
    CLI->>U: 'Hiccup on the server.<br/>Do you want to continue?'
    Note over U,Proxy: User pauses to read.<br/>Server recovers in the meantime.
    U->>CLI: Yes
    CLI->>SDK: Resume session<br/>'Continue from where you left off.'<br/>(attempt 2)
    SDK->>Proxy: POST /v1/messages
    Proxy-->>SDK: 200 stream OK
    SDK-->>CLI: Assistant messages + tool calls
    CLI->>U: Turn completes

Recovery when the server stays down

Each failed attempt ends in ~1–2 min thanks to the SDK retry cap. After four prompted retries (attempts 2 through 5 from the user's point of view), the CLI stops prompting and shows "The server has not recovered after multiple attempts. Please try again later." Total elapsed time is bounded at roughly 5–10 min, instead of the previous ~40 min (10 SDK retries × streaming + non-streaming paths).
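
As a quick check of the 5–10 min figure, the bound is the number of attempts times the per-attempt failure time. The helper below is purely illustrative (boundedRecoveryMinutes is a hypothetical name) and ignores the time the user spends reading each prompt.

```typescript
// Worst-case wall-clock bound: one initial attempt plus the prompted retries,
// each failing within the per-attempt window quoted in the description.
function boundedRecoveryMinutes(
	promptedRetries: number,
	minutesPerFailedAttempt: { min: number; max: number }
): { min: number; max: number } {
	const attempts = 1 + promptedRetries;
	return {
		min: attempts * minutesPerFailedAttempt.min,
		max: attempts * minutesPerFailedAttempt.max,
	};
}

const bound = boundedRecoveryMinutes( 4, { min: 1, max: 2 } );
console.log( `${ bound.min }-${ bound.max } min` ); // "5-10 min"
```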

Testing Instructions

Prerequisite: npm run cli:build (the CLI must be rebuilt to pick up the changes).

Happy path

Run node apps/cli/dist/cli/main.mjs code.
Ask the agent anything simple (e.g. "Create a site called Test"). Turn should complete normally with no retry prompt.

Retry prompt (transient error)

Sandbox the public-api.
Introduce a temporary failure on the AI proxy, e.g. set a low cURL timeout so streaming fails. You can add the following to your 0-sandbox.php: add_filter( 'wpcom_ai_api_proxy_request_timeout', fn() => 1 );
Run the code agent and send any prompt. You can check the logs in c66b35b8fcfd51720a0e77cff6a88441-logstash.
Expect: turn ends with an error banner, then the prompt "There was a hiccup on the server. Do you want to continue?" with Yes / No.
Choose Yes → turn re-runs as "Continue from where you left off." with session preserved.
Keep failing → after the 4th retry attempt the CLI shows "The server has not recovered after multiple attempts. Please try again later." and does not prompt again.

Fast-fail retries

With the proxy intentionally failing, confirm the retry prompt appears in ~4-6 min (previously ~40 min due to SDK default of 10 retries).

ESC during tool call (no crash)

Start a long-running tool call (e.g. ask the agent to take a screenshot, or anything that triggers mcp__studio__take_screenshot).
Press ESC mid-tool.
Expect: "Interrupted / Ran for Xs before interruption" shown, no retry prompt, CLI returns to the input prompt. Previously this could crash with Error: ProcessTransport is not ready for writing.

JSON mode

Run node apps/cli/dist/cli/main.mjs code --json "ping" with a failing proxy.
Expect turn.completed with status: "error" (not "success"). Previously the adapter reported "success" when the SDK emitted {subtype:"success", is_error:true}.

Pre-merge Checklist

Have you checked for TypeScript, React or other console errors? (lint + typecheck + 88 AI tests pass)

Detect is_error on SDK result messages and thrown transport errors, and offer a bounded user-mediated retry (max 4 attempts) that continues from the existing session. Caps the SDK's internal retry at 1 via CLAUDE_CODE_MAX_RETRIES so each attempt fails fast instead of burning through the default 10 retries. Also swallows the SDK's ProcessTransport cleanup rejection on ESC to prevent the CLI from crashing. STU-1578

epeicher · 2026-04-17T16:00:36Z

 // Node.js terminates the process on unhandled rejections.
+const SDK_INTERRUPT_CLEANUP_ERRORS = [
+	'Query closed',
+	'ProcessTransport is not ready for writing',


This is not strictly related to these changes, but I have identified that if the user presses ESC while in the Take screenshot step, the process terminates abruptly, so I added this to prevent that scenario. Now, it is correctly handled.

It feels like there should be a better way to catch these SDK errors other than checking strings but it seems there's no other valid option now?
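
For context, a string-based filter along these lines could look like the sketch below. Only the two message strings come from the diff; isSdkInterruptCleanupError, the substring match, and the commented-out registration are assumptions, and the actual handler in apps/cli/ai/agent.ts may differ.

```typescript
// The two messages shown in the diff; whether the real list has more entries is unknown.
const SDK_INTERRUPT_CLEANUP_ERRORS = [
	'Query closed',
	'ProcessTransport is not ready for writing',
];

// Hypothetical helper: true when a rejection reason matches a known SDK cleanup message.
function isSdkInterruptCleanupError( reason: unknown ): boolean {
	const message = reason instanceof Error ? reason.message : String( reason );
	return SDK_INTERRUPT_CLEANUP_ERRORS.some( ( known ) => message.includes( known ) );
}

// Registration sketch for Node.js (assumption, left as a comment):
// process.on( 'unhandledRejection', ( reason ) => {
// 	if ( isSdkInterruptCleanupError( reason ) ) {
// 		return; // swallow SDK cleanup noise triggered by ESC
// 	}
// 	throw reason; // anything else should still surface
// } );

console.log( isSdkInterruptCleanupError( new Error( 'Query closed' ) ) ); // true
console.log( isSdkInterruptCleanupError( new Error( 'ECONNRESET' ) ) ); // false
```

Keeping the strings in one list at least makes the brittle part easy to find if the SDK changes its messages or starts exposing typed errors.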

epeicher · 2026-04-17T16:03:21Z

+	// Fail fast on transient API errors so the user-mediated retry prompt can
+	// intervene instead of the SDK burning through its default 10 retries.
+	if ( ! env.CLAUDE_CODE_MAX_RETRIES ) {
+		env.CLAUDE_CODE_MAX_RETRIES = '1';


This prevents the SDK from retrying internally. The default is 10, so before this change the SDK retried after each timeout without notifying the client code; with a 4-minute timeout, it would take about 40 minutes to fail. Now it fails fast and the agent handles the error: the user is notified that a hiccup on the server has happened.

I wonder if we should retry at least once without the user knowing.
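
The "default unless overridden" behavior in the diff above can be sketched as a pure function. withFailFastRetries and the Env type are hypothetical; the diff mutates the provider env in place rather than copying it.

```typescript
type Env = Record< string, string | undefined >;

// Returns a copy of env with the SDK retry cap set to '1',
// unless the caller already supplied a non-empty value.
function withFailFastRetries( env: Env ): Env {
	const next: Env = { ...env };
	if ( ! next.CLAUDE_CODE_MAX_RETRIES ) {
		next.CLAUDE_CODE_MAX_RETRIES = '1';
	}
	return next;
}

console.log( withFailFastRetries( {} ).CLAUDE_CODE_MAX_RETRIES ); // "1"
console.log( withFailFastRetries( { CLAUDE_CODE_MAX_RETRIES: '5' } ).CLAUDE_CODE_MAX_RETRIES ); // "5"
```

Because the check is a truthiness test, an empty string is treated the same as an unset variable and is replaced by the default.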

wpmobilebot · 2026-04-17T16:19:46Z

📊 Performance Test Results

Comparing 7700476 vs trunk

app-size

Metric	trunk	`7700476`	Diff	Change
App Size (Mac)	1283.06 MB	1283.06 MB	+0.00 MB	⚪ 0.0%

site-editor

Metric	trunk	`7700476`	Diff	Change
load	1924 ms	1914 ms	-10 ms	⚪ 0.0%

site-startup

Metric	trunk	`7700476`	Diff	Change
siteCreation	8113 ms	8098 ms	-15 ms	⚪ 0.0%
siteStartup	4175 ms	4296 ms	+121 ms	🔴 2.9%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

youknowriad · 2026-04-18T11:40:28Z

+						question: __( 'There was a hiccup on the server. Do you want to continue?' ),
+						options: [
+							{ label: 'Yes', description: __( 'Continue from where you left off' ) },
+							{ label: 'No', description: __( 'Stop here' ) },


I wonder if a third option ("Other") is needed, where the user can give different instructions.

youknowriad

This is working well.

epeicher self-assigned this Apr 17, 2026

epeicher requested review from a team and youknowriad April 17, 2026 16:14

youknowriad reviewed Apr 18, 2026

youknowriad approved these changes Apr 18, 2026

epeicher wants to merge 1 commit into trunk from stu-1578-retry-prompt-on-transient-api-errors

3 participants
