feat(retry): full-jitter backoff + Retry-After honoring + max_retries cap (fixes #5165) #5186

Open: leavedrop wants to merge 4 commits into Aider-AI:main from leavedrop:feat/5165-retry-jitter-retry-after

Conversation

@leavedrop

The issue author asked: "Would maintainers be open to a small PR for this retry policy improvement? I'm happy to contribute a focused PR if this direction makes sense." This is that PR.

Both simple_send_with_retries (aider/models.py) and the streaming send loop (aider/coders/base_coder.py) used identical deterministic retry_delay *= 2 backoff with no jitter, no Retry-After handling, and no explicit attempt cap. This PR fixes all four points raised in #5165, via a small shared helper module.

What changed — 4 improvements

  • New aider/retry_utils.py (117 lines): compute_retry_sleep(attempt, base_delay, cap, retry_after=None) implements the "Full Jitter" recipe, where the sleep for attempt N (1-based) is drawn from uniform(0, min(base_delay * 2**(N-1), cap)). parse_retry_after(exception) walks common SDK header attribute paths plus one __cause__ hop and accepts both the integer-seconds and HTTP-date forms per RFC 7231 §7.1.3. MAX_RETRIES = 8 is the module-level attempt cap.

  • aider/models.py (+12/-7): simple_send_with_retries now uses compute_retry_sleep, parse_retry_after, and MAX_RETRIES. The previous logic stopped retrying only when retry_delay > RETRY_TIMEOUT (60s), which allowed unbounded total wait if the doubling hit the ceiling early; now there's an explicit attempt counter.

  • aider/coders/base_coder.py (+14/-7) — same change applied to the streaming send loop. The two retry blocks were near-identical; both now share the helper.

  • Retry log line upgraded (both call sites) — Retrying in 12.3s (attempt 3/8, RateLimitError)... instead of just Retrying in 12.3 seconds.... Includes the actual jittered sleep (not the cap), attempt counter, and exception class so operators can tell which provider error is recurring.
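
The backoff rule described above can be sketched roughly as follows. This is a minimal illustration written from the PR description, not the contents of aider/retry_utils.py. The exponent clamp of 62 and the treatment of attempt as 1-based are assumptions made for the sketch.

```python
import random


def compute_retry_sleep(attempt, base_delay, cap, retry_after=None):
    """Return the number of seconds to sleep before retry `attempt` (assumed 1-based)."""
    # A positive Retry-After from the provider replaces the jitter draw,
    # but is clamped to cap * 2 so a hostile or buggy header cannot stall a session.
    if retry_after is not None and retry_after > 0:
        return min(float(retry_after), cap * 2)

    # Clamp the exponent (62 is an arbitrary choice for this sketch) so that
    # base_delay * 2 ** n cannot overflow a float for very large attempt numbers.
    exponent = min(max(attempt - 1, 0), 62)
    bound = min(base_delay * (2 ** exponent), cap)

    # Full jitter: draw uniformly from [0, bound] instead of sleeping for bound itself.
    return random.uniform(0, bound)
```

With base_delay=0.125 and cap=60, attempt 5 draws from [0, 2.0] and every attempt from 10 onward draws from [0, 60]. Drawing from the whole interval, rather than adding a small random offset to a fixed delay, is what spreads out clients that failed at the same moment.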

Tests

tests/basic/test_retry_utils.py — 16 unit tests across three classes:

  • TestComputeRetrySleep (6): bound stays in [0, cap] across 1000 samples; bound grows exponentially with attempt until clamped at cap; clamping holds when 2**attempt overflows; Retry-After value overrides jitter exactly; Retry-After capped at cap*2 when hostile; zero/negative Retry-After falls through to jitter.
  • TestParseRetryAfter (9): integer-seconds from .response.headers; integer-seconds from top-level .headers; case-insensitive lookup; HTTP-date form returns seconds-until-date; past HTTP-date returns None; missing header returns None; no response/headers returns None; unparseable value returns None; walks __cause__ chain one hop.
  • TestMaxRetriesConstant (1): smoke check on the default.
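
The parsing behavior those TestParseRetryAfter cases describe could look something like the sketch below. It is an assumption-laden illustration rather than the PR's code: the helper _header_value and the now parameter (added so the HTTP-date branch can be exercised deterministically) are hypothetical, and the headers objects are assumed to be dict-like.

```python
import email.utils
from datetime import datetime, timezone


def _header_value(obj):
    """Hypothetical helper: find a Retry-After value on one exception object."""
    candidates = [
        getattr(getattr(obj, "response", None), "headers", None),
        getattr(obj, "headers", None),
        getattr(obj, "response_headers", None),
    ]
    for headers in candidates:
        if not headers:
            continue
        try:
            items = headers.items()
        except AttributeError:
            continue  # not dict-like, skip this attribute path
        for name, value in items:
            if str(name).lower() == "retry-after":  # case-insensitive match
                return value
    return None


def parse_retry_after(exception, now=None):
    """Return seconds to wait as a float, or None if absent or unparseable."""
    # Check the exception itself, then follow __cause__ exactly one hop.
    for obj in (exception, getattr(exception, "__cause__", None)):
        if obj is None:
            continue
        raw = _header_value(obj)
        if raw is None:
            continue
        text = str(raw).strip()
        # Form 1: delay in integer seconds.
        if text.isascii() and text.isdigit():
            return float(text)
        # Form 2: HTTP-date, converted to seconds from now.
        try:
            when = email.utils.parsedate_to_datetime(text)
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        current = now or datetime.now(timezone.utc)
        delta = (when - current).total_seconds()
        return delta if delta > 0 else None  # past dates never yield a negative sleep
    return None
```

Returning None for every failure mode lets the caller pass the result straight through as retry_after and fall back to jitter without a separate error path.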

Validation

$ python -m py_compile aider/retry_utils.py aider/models.py aider/coders/base_coder.py
$ python -m pytest tests/basic/test_retry_utils.py -v
============================= 16 passed in 0.47s ==============================

Backward compatibility

  • No CLI flag changes, no config changes — MAX_RETRIES and base delay (0.125s) are module-level constants matching the prior implicit defaults.
  • RETRY_TIMEOUT = 60 constant retained as the jitter cap; only its role changed (was the only ceiling; now the per-sleep cap with MAX_RETRIES as the second ceiling).
  • When the provider does not send Retry-After, behavior matches the old loop except for jitter randomization and the MAX_RETRIES attempt cap. Which errors are retryable is unchanged (still gated by ex_info.retry).
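
To show how the two ceilings interact, here is a rough sketch of a retry loop using the constants named above. The function send_with_retries and its send, is_retryable, sleep, and log parameters are hypothetical stand-ins for the real call sites. Retry-After handling is left out for brevity, and counting MAX_RETRIES as the number of retries (rather than total calls) is an assumption.

```python
import random
import time

RETRY_TIMEOUT = 60  # per-sleep cap, per the PR description
MAX_RETRIES = 8     # attempt cap, per the PR description
BASE_DELAY = 0.125  # base delay, per the PR description


def send_with_retries(send, is_retryable, sleep=time.sleep, log=print):
    """Call send() until it succeeds, hits a non-retryable error, or runs out of retries."""
    attempt = 0
    while True:
        try:
            return send()
        except Exception as err:
            attempt += 1
            # Second ceiling: give up once the attempt counter passes MAX_RETRIES.
            if not is_retryable(err) or attempt > MAX_RETRIES:
                raise
            # First ceiling: no single sleep can exceed RETRY_TIMEOUT.
            bound = min(BASE_DELAY * 2 ** (attempt - 1), RETRY_TIMEOUT)
            delay = random.uniform(0, bound)
            log(
                f"Retrying in {delay:.1f}s "
                f"(attempt {attempt}/{MAX_RETRIES}, {type(err).__name__})..."
            )
            sleep(delay)
```

Injecting sleep and log keeps the loop testable without real waiting. With these constants the worst-case total wait is bounded by the sum of the per-attempt bounds, which is under 32 seconds for eight retries.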

Fixes #5165

@CLAassistant

CLAassistant commented May 25, 2026

CLA assistant check
All committers have signed the CLA.

leavedrop added 4 commits May 25, 2026 13:02
New aider/retry_utils.py exposes:

- compute_retry_sleep(attempt, base_delay, cap, retry_after=None)
  Full-jitter backoff: sleep is uniform in [0, min(base * 2**(N-1), cap)].
  If retry_after is supplied (parsed from a provider response header),
  it overrides the jitter, capped at cap*2 for safety.

- parse_retry_after(exception)
  Walks common SDK attribute paths (.response.headers, .headers,
  .response_headers, plus one __cause__ hop) to find a Retry-After
  header. Accepts both integer-seconds and HTTP-date forms (per
  RFC 7231 7.1.3). Returns None when absent or unparseable.

- MAX_RETRIES = 8
  Module-level cap on retry attempts so a persistent provider error
  cannot trap a session in an indefinite retry loop.

Refs Aider-AI#5165
…d_with_retries

Replaces deterministic 'retry_delay *= 2' exponential backoff with the
full-jitter helper. Now honors Retry-After when the provider response
carries one, and stops after MAX_RETRIES attempts regardless of
cumulative delay (previously the only ceiling was a per-sleep RETRY_TIMEOUT
that allowed unbounded total wait when the cap was hit early).

Retry log line now includes attempt count and the exception class so
operators can tell which provider error is recurring:

  Retrying in 12.3s (attempt 3/8, RateLimitError)...

Refs Aider-AI#5165
…e send loop

Same change as simple_send_with_retries: replace deterministic
exponential backoff with full-jitter + Retry-After honoring, and
enforce MAX_RETRIES on attempt count. Retry message format upgraded to
include attempt counter and exception class.

Refs Aider-AI#5165
…parsing

16 unit tests across three classes:

TestComputeRetrySleep (6 cases)
- bound stays within [0, cap] across 1000 samples
- bound grows with attempt count until clamped at cap
- bound clamps at cap when 2**attempt would overflow
- Retry-After value overrides the jitter draw exactly
- Retry-After value capped at cap*2 when hostile/buggy provider
- zero or negative Retry-After is ignored, falls through to jitter

TestParseRetryAfter (9 cases)
- integer-seconds form from response.headers
- integer-seconds form from a top-level .headers attribute
- case-insensitive header name lookup
- HTTP-date form returns seconds-until-date
- past HTTP-date returns None (no negative sleep)
- missing header returns None
- exception with no response/headers returns None
- unparseable header value returns None
- walks __cause__ chain one hop when outer exception lacks headers

TestMaxRetriesConstant (1 case)
- smoke test on the MAX_RETRIES default

Refs Aider-AI#5165
leavedrop force-pushed the feat/5165-retry-jitter-retry-after branch from 34d2ce7 to 3742952 on May 25, 2026 05:03
Development

Successfully merging this pull request may close these issues.

LLM retry backoff should use jitter and honor Retry-After

2 participants