CLI code agent: prompt user to retry on transient API errors #3130

Open
epeicher wants to merge 1 commit into trunk from stu-1578-retry-prompt-on-transient-api-errors
Conversation

@epeicher
Contributor

@epeicher epeicher commented Apr 17, 2026

Related issues

How AI was used in this PR

AI was used throughout: investigating the Claude Agent SDK's streaming/non-streaming fallback, retry, and timeout defaults; identifying the mis-classification bug where {subtype:"success", is_error:true} result messages were treated as successful turns; and implementing the retry-prompt + SDK retry-cap + ESC crash-fix changes. Every edit was reviewed against the session-file evidence supplied by me, and all changes pass lint, typecheck, and the existing 88 AI tests.

Proposed Changes

  • Detect transient API failures correctly (apps/cli/ai/ui.ts, apps/cli/ai/output-adapter.ts): route result messages on message.is_error instead of message.subtype. The SDK can emit {subtype:"success", is_error:true} (e.g. after exhausting retries on a 504), and this was previously mis-classified as success.
  • User-mediated retry prompt (apps/cli/commands/ai/index.ts): on is_error:true or a thrown transport error, the code agent now asks "There was a hiccup on the server. Do you want to continue?". Yes resumes the session with "Continue from where you left off."; No stops.
  • Bounded retries: capped at 4 user-prompted attempts per turn. After that, the CLI shows "The server has not recovered after multiple attempts. Please try again later." instead of prompting again.
  • Fail-fast SDK retries (apps/cli/ai/providers.ts): set CLAUDE_CODE_MAX_RETRIES=1 in the provider env so each failed attempt errors in ~1-2 min instead of burning the SDK's default 10 retries (~29 min). Respects a user-supplied override.
  • Interrupt-cleanup safety (apps/cli/ai/agent.ts): the existing unhandledRejection handler, which already swallowed "Query closed", now also swallows "ProcessTransport is not ready for writing". Both come from the SDK's cleanup path on ESC; without this, pressing ESC during a tool call crashes the CLI.
  • Interrupt vs. error distinction (apps/cli/ai/output-adapter.ts, apps/cli/ai/ui.ts): HandleMessageResult now carries an interrupted flag so user-initiated ESC does not trigger the retry prompt.
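
To make the first and last bullets concrete, here is a minimal sketch of routing on is_error rather than subtype while keeping interrupts separate. The ResultMessage shape is a simplified assumption (the real SDK type carries more fields), and TurnOutcome and classifyResult are hypothetical names, not the ones used in apps/cli/ai/output-adapter.ts.

```typescript
// Simplified, assumed shape of an SDK result message (the real type has more fields).
interface ResultMessage {
	type: 'result';
	subtype: string;
	is_error: boolean;
}

// Hypothetical outcome type, loosely modeled on the HandleMessageResult described above.
type TurnOutcome = 'success' | 'error' | 'interrupted';

// Hypothetical helper: decide how a finished turn should be treated.
function classifyResult( message: ResultMessage, interruptedByUser: boolean ): TurnOutcome {
	// A user-initiated ESC must not trigger the retry prompt.
	if ( interruptedByUser ) {
		return 'interrupted';
	}
	// Route on is_error, not subtype: the SDK can emit
	// { subtype: 'success', is_error: true } after exhausting its retries.
	return message.is_error ? 'error' : 'success';
}
```

With this routing, { subtype: 'success', is_error: true } is classified as an error, which is the case the old subtype check missed.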

Error recovery flow

State machine

flowchart TD
    Start([User sends prompt]) --> Turn[runAgentTurn]
    Turn --> SDKCall[SDK → AI proxy]
    SDKCall --> SDKResult{Response ok?}
    SDKResult -->|Yes| Success([Turn succeeds])
    SDKResult -->|Transient failure<br/>504 / timeout / etc.| SDKRetry[SDK retries once<br/>CLAUDE_CODE_MAX_RETRIES=1]
    SDKRetry --> SDKCall2[SDK → AI proxy]
    SDKCall2 --> SDKResult2{Response ok?}
    SDKResult2 -->|Yes| Success
    SDKResult2 -->|Still failing| Detect{is_error: true<br/>OR<br/>thrown error?}
    Detect -->|No| Success
    Detect -->|Yes| CheckCap{retryAttempt<br/>≥ 4?}
    CheckCap -->|Yes| GiveUp([Show 'try again later'<br/>return to input])
    CheckCap -->|No| Prompt[/Prompt user:<br/>'There was a hiccup<br/>on the server.<br/>Do you want to continue?'/]
    Prompt -->|No| Stop([Return to input])
    Prompt -->|Yes| Resume[Resume session<br/>with same sessionId<br/>retryAttempt++]
    Resume --> Turn

    classDef success fill:#d4edda,stroke:#28a745,color:#000
    classDef error fill:#f8d7da,stroke:#dc3545,color:#000
    classDef prompt fill:#fff3cd,stroke:#ffc107,color:#000
    class Success success
    class GiveUp,Stop error
    class Prompt prompt

Concrete recovery (server-side hiccup that clears on its own)

sequenceDiagram
    participant U as User
    participant CLI as Studio CLI
    participant SDK as Agent SDK
    participant Proxy as AI Proxy

    U->>CLI: Send prompt
    CLI->>SDK: Start turn (attempt 1)
    SDK->>Proxy: POST /v1/messages (stream)
    Proxy--xSDK: 504 Gateway Timeout
    Note over SDK: CLAUDE_CODE_MAX_RETRIES=1<br/>one quick retry
    SDK->>Proxy: POST /v1/messages (retry)
    Proxy--xSDK: 504 again
    SDK-->>CLI: result: is_error=true<br/>(~1–2 min elapsed)
    CLI->>U: 'Hiccup on the server.<br/>Do you want to continue?'
    Note over U,Proxy: User pauses to read.<br/>Server recovers in the meantime.
    U->>CLI: Yes
    CLI->>SDK: Resume session<br/>'Continue from where you left off.'<br/>(attempt 2)
    SDK->>Proxy: POST /v1/messages
    Proxy-->>SDK: 200 stream OK
    SDK-->>CLI: Assistant messages + tool calls
    CLI->>U: Turn completes

Recovery when the server stays down

Each failed attempt ends in ~1–2 min thanks to the SDK retry cap. After four prompted retries (attempts 2 through 5 from the user's point of view), the CLI stops prompting and shows "The server has not recovered after multiple attempts. Please try again later." Total elapsed time is bounded at roughly 5–10 min instead of the previous ~40 min (10 SDK retries × streaming + non-streaming paths).
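
The bounded loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in apps/cli/commands/ai/index.ts: runWithRetryPrompt, RetryDeps, and its callbacks are hypothetical names, and runTurn stands in for a turn that reports 'error' on is_error:true (thrown transport errors are folded into the same outcome).

```typescript
const MAX_RETRY_ATTEMPTS = 4; // cap stated in this PR

type TurnOutcome = 'success' | 'error';
type RetryResult = 'success' | 'stopped' | 'gave-up';

// Hypothetical dependencies, injected so the loop can be exercised without the SDK.
interface RetryDeps {
	runTurn: ( prompt: string ) => Promise< TurnOutcome >;
	askToContinue: () => Promise< boolean >;
	notify: ( message: string ) => void;
}

async function runWithRetryPrompt( prompt: string, deps: RetryDeps ): Promise< RetryResult > {
	let retryAttempt = 0;
	let nextPrompt = prompt;
	while ( true ) {
		// Treat a thrown transport error the same as an is_error result.
		const outcome = await deps.runTurn( nextPrompt ).catch( (): TurnOutcome => 'error' );
		if ( outcome === 'success' ) {
			return 'success';
		}
		if ( retryAttempt >= MAX_RETRY_ATTEMPTS ) {
			deps.notify( 'The server has not recovered after multiple attempts. Please try again later.' );
			return 'gave-up';
		}
		if ( ! ( await deps.askToContinue() ) ) {
			return 'stopped';
		}
		retryAttempt++;
		nextPrompt = 'Continue from where you left off.';
	}
}
```

If every attempt fails and the user always answers Yes, this sketch runs the turn five times (the original attempt plus four prompted retries) and prompts four times before giving up.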

Testing Instructions

Prerequisite: npm run cli:build (the CLI must be rebuilt to pick up the changes).

Happy path

  1. Run node apps/cli/dist/cli/main.mjs code.
  2. Ask the agent anything simple (e.g. "Create a site called Test"). Turn should complete normally with no retry prompt.

Retry prompt (transient error)

  1. Sandbox the public-api.
  2. Introduce a temporary failure on the AI proxy. For example, set a low cURL timeout so streaming fails by adding the following to your 0-sandbox.php: add_filter( 'wpcom_ai_api_proxy_request_timeout', fn() => 1 );
  3. Run the code agent and send any prompt. You can check the logs in c66b35b8fcfd51720a0e77cff6a88441-logstash.
  4. Expect: turn ends with an error banner, then the prompt "There was a hiccup on the server. Do you want to continue?" with Yes / No.
  5. Choose Yes → turn re-runs as "Continue from where you left off." with session preserved.
  6. Keep failing → after the 4th retry attempt the CLI shows "The server has not recovered after multiple attempts. Please try again later." and does not prompt again.

Fast-fail retries

  1. With the proxy intentionally failing, confirm the retry prompt appears in ~4-6 min (previously ~40 min due to SDK default of 10 retries).
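
For reference, the fail-fast default can be pictured as below. withFailFastRetries is a hypothetical wrapper used only for illustration; the PR applies the same check inline in apps/cli/ai/providers.ts. It sets CLAUDE_CODE_MAX_RETRIES to '1' only when the user has not supplied a value.

```typescript
type Env = Record< string, string | undefined >;

// Hypothetical helper: returns a copy of the provider env with the retry cap applied.
function withFailFastRetries( env: Env ): Env {
	const result: Env = { ...env };
	// Respect a user-supplied override; only fill in the default when unset or empty.
	if ( ! result.CLAUDE_CODE_MAX_RETRIES ) {
		result.CLAUDE_CODE_MAX_RETRIES = '1';
	}
	return result;
}
```

To test with a different cap, export CLAUDE_CODE_MAX_RETRIES in the shell before launching the CLI; the override is left untouched.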

ESC during tool call (no crash)

  1. Start a long-running tool call (e.g. ask the agent to take a screenshot, or anything that triggers mcp__studio__take_screenshot).
  2. Press ESC mid-tool.
  3. Expect: "Interrupted / Ran for Xs before interruption" shown, no retry prompt, CLI returns to the input prompt. Previously this could crash with Error: ProcessTransport is not ready for writing.
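
The crash fix amounts to filtering known SDK cleanup rejections before they reach Node's default handling. A minimal sketch follows, assuming substring matching on the error message; the actual matching logic in apps/cli/ai/agent.ts may differ, and isSdkInterruptCleanupError and handleUnhandledRejection are hypothetical names.

```typescript
// Messages the SDK's cleanup path can reject with when the user presses ESC.
const SDK_INTERRUPT_CLEANUP_ERRORS = [
	'Query closed',
	'ProcessTransport is not ready for writing',
];

// Hypothetical predicate: true when a rejection reason matches a known cleanup error.
function isSdkInterruptCleanupError( reason: unknown ): boolean {
	const message = reason instanceof Error ? reason.message : String( reason );
	return SDK_INTERRUPT_CLEANUP_ERRORS.some( ( known ) => message.includes( known ) );
}

// Hypothetical handler: swallow cleanup rejections, rethrow everything else.
function handleUnhandledRejection( reason: unknown ): void {
	if ( isSdkInterruptCleanupError( reason ) ) {
		return;
	}
	throw reason;
}

// Assumed wiring in the CLI entry point:
// process.on( 'unhandledRejection', handleUnhandledRejection );
```

Matching on message strings is brittle if the SDK rewords its errors, which is the concern raised in review; the sketch keeps the list in one place so it is easy to update.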

JSON mode

  1. node apps/cli/dist/cli/main.mjs code --json "ping" with a failing proxy.
  2. Expect turn.completed with status: "error" (not "success"). Previously the adapter reported "success" when the SDK emitted {subtype:"success", is_error:true}.

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors? (lint + typecheck + 88 AI tests pass)

Detect is_error on SDK result messages and thrown transport errors, and
offer a bounded user-mediated retry (max 4 attempts) that continues from
the existing session. Caps the SDK's internal retry at 1 via
CLAUDE_CODE_MAX_RETRIES so each attempt fails fast instead of burning
through the default 10 retries. Also swallows the SDK's ProcessTransport
cleanup rejection on ESC to prevent the CLI from crashing.

STU-1578
Comment thread apps/cli/ai/agent.ts
// Node.js terminates the process on unhandled rejections.
const SDK_INTERRUPT_CLEANUP_ERRORS = [
'Query closed',
'ProcessTransport is not ready for writing',
Contributor Author

@epeicher epeicher Apr 17, 2026

This is not strictly related to these changes, but I have identified that if the user presses ESC while in the Take screenshot step, the process terminates abruptly, so I added this to prevent that scenario. Now, it is correctly handled.

Contributor

It feels like there should be a better way to catch these SDK errors other than checking strings but it seems there's no other valid option now?

Comment thread apps/cli/ai/providers.ts
// Fail fast on transient API errors so the user-mediated retry prompt can
// intervene instead of the SDK burning through its default 10 retries.
if ( ! env.CLAUDE_CODE_MAX_RETRIES ) {
env.CLAUDE_CODE_MAX_RETRIES = '1';
Contributor Author

This prevents the SDK from retrying internally. The default is 10, so before this change the SDK retried after a timeout without notifying the client code; with a 4 min timeout, it would take 40 minutes to fail. Now it fails immediately, the agent handles the error, and the user is notified that a hiccup on the server has happened.

Contributor

I wonder if we should retry at least once without the user knowing.

@epeicher epeicher self-assigned this Apr 17, 2026
@epeicher epeicher requested review from a team and youknowriad April 17, 2026 16:14
@wpmobilebot
Collaborator

📊 Performance Test Results

Comparing 7700476 vs trunk

app-size

| Metric | trunk | 7700476 | Diff | Change |
| --- | --- | --- | --- | --- |
| App Size (Mac) | 1283.06 MB | 1283.06 MB | +0.00 MB | ⚪ 0.0% |

site-editor

| Metric | trunk | 7700476 | Diff | Change |
| --- | --- | --- | --- | --- |
| load | 1924 ms | 1914 ms | 10 ms | ⚪ 0.0% |

site-startup

| Metric | trunk | 7700476 | Diff | Change |
| --- | --- | --- | --- | --- |
| siteCreation | 8113 ms | 8098 ms | 15 ms | ⚪ 0.0% |
| siteStartup | 4175 ms | 4296 ms | +121 ms | 🔴 2.9% |

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

question: __( 'There was a hiccup on the server. Do you want to continue?' ),
options: [
{ label: 'Yes', description: __( 'Continue from where you left off' ) },
{ label: 'No', description: __( 'Stop here' ) },
Contributor

I wonder if a third option (Other) is needed, where the user can give different instructions.

Contributor

@youknowriad youknowriad left a comment

This is working well.
