fix: extend retry predicate to cover transient server errors and connection failures #2186

Open
VANDRANKI wants to merge 1 commit into huggingface:main from VANDRANKI:fix/transient-error-retry

@VANDRANKI
What does this PR do?

Closes #2165

ApiModel already has retry logic via Retrying, but its predicate, is_rate_limit_error, only matches 429 / rate-limit signals. Transient server errors (500, 502, 503, 504) and connection failures (reset, timeout) are not retried today, so the agent run fails immediately even when a retry would likely succeed.

This PR adds is_transient_error, a broader predicate that covers:

  • Rate limits (429, "rate limit", "too many requests")
  • Server-side transient errors (500, 502, 503, 504)
  • Connection-level failures (reset, refused, timeout)

Non-retryable errors (400 bad request, 401 unauthorized, 404 not found) are not matched and still fail immediately.
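
For readers who want a concrete picture of the classification described above, here is a minimal sketch of such a predicate. It is not the code from this PR: the name is_transient_error_sketch, the marker strings, and the assumption that exceptions expose a status_code attribute are all illustrative choices.

```python
import re

# Hypothetical constants: the PR description names the categories,
# not the exact codes-and-strings tables used in the implementation.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
RETRYABLE_MESSAGE_MARKERS = (
    "rate limit",
    "too many requests",
    "connection reset",
    "connection refused",
    "timed out",
    "timeout",
)


def is_transient_error_sketch(exc: BaseException) -> bool:
    """Return True if `exc` looks like a failure worth retrying."""
    # Built-in connection-level failures. ConnectionResetError and
    # ConnectionRefusedError are subclasses of ConnectionError.
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return True

    # Assumption: the client exception exposes the HTTP status as
    # `status_code`. When present, the status alone decides.
    status = getattr(exc, "status_code", None)
    if isinstance(status, int):
        return status in RETRYABLE_STATUS_CODES

    # Otherwise fall back to the error message.
    message = str(exc).lower()
    if any(marker in message for marker in RETRYABLE_MESSAGE_MARKERS):
        return True

    # Last resort: look for a standalone three-digit status code in the text.
    codes = {int(match) for match in re.findall(r"\b(\d{3})\b", message)}
    return bool(codes & RETRYABLE_STATUS_CODES)
```

Matching on message text is inherently fuzzy, so a real implementation may prefer typed exceptions or status attributes wherever the client library provides them.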

ApiModel now uses is_transient_error as its retry predicate. The old is_rate_limit_error is kept for backwards compatibility.
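
To illustrate how a retry predicate plugs into a retry loop, the sketch below uses a small hand-written helper with exponential backoff. It is a stand-in, not the Retrying class that ApiModel uses; call_with_retry and its parameters (should_retry, max_attempts, base_delay, sleep) are hypothetical names chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    should_retry: Callable[[BaseException], bool],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying only when `should_retry(exc)` returns True."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on a non-retryable error.
            if attempt == max_attempts or not should_retry(exc):
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))

    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_request() -> str:
        # Fails twice with a connection reset, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionResetError("Connection reset by peer")
        return "ok"

    result = call_with_retry(
        flaky_request,
        should_retry=lambda exc: isinstance(exc, ConnectionError),
        base_delay=0.01,
    )
    print(result, attempts["count"])
```

Swapping the predicate is the only change needed to widen or narrow what gets retried, which is why the PR can keep is_rate_limit_error available while ApiModel moves to the broader check.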

Tests

Added TestIsTransientError in tests/test_models.py with 11 cases: rate limit variants, each server error code, connection reset, timeout, and three non-retryable codes that must not match.

Development

Successfully merging this pull request may close these issues.

No built-in retry/backoff for transient model API errors in MultiStepAgent