feat(retry): full-jitter backoff + Retry-After honoring + max_retries cap (fixes #5165)#5186
Open
leavedrop wants to merge 4 commits into
Open
feat(retry): full-jitter backoff + Retry-After honoring + max_retries cap (fixes #5165)#5186leavedrop wants to merge 4 commits into
leavedrop wants to merge 4 commits into
Conversation
New aider/retry_utils.py exposes: - compute_retry_sleep(attempt, base_delay, cap, retry_after=None) Full-jitter backoff: sleep is uniform in [0, min(base * 2**(N-1), cap)]. If retry_after is supplied (parsed from a provider response header), it overrides the jitter, capped at cap*2 for safety. - parse_retry_after(exception) Walks common SDK attribute paths (.response.headers, .headers, .response_headers, plus one __cause__ hop) to find a Retry-After header. Accepts both integer-seconds and HTTP-date forms (per RFC 7231 7.1.3). Returns None when absent or unparseable. - MAX_RETRIES = 8 Module-level cap on retry attempts so a persistent provider error cannot trap a session in an indefinite retry loop. Refs Aider-AI#5165
…d_with_retries Replaces deterministic 'retry_delay *= 2' exponential backoff with the full-jitter helper. Now honors Retry-After when the provider response carries one, and stops after MAX_RETRIES attempts regardless of cumulative delay (previously the only ceiling was a per-sleep RETRY_TIMEOUT that allowed unbounded total wait when the cap was hit early). Retry log line now includes attempt count and the exception class so operators can tell which provider error is recurring: Retrying in 12.3s (attempt 3/8, RateLimitError)... Refs Aider-AI#5165
…e send loop Same change as simple_send_with_retries: replace deterministic exponential backoff with full-jitter + Retry-After honoring, and enforce MAX_RETRIES on attempt count. Retry message format upgraded to include attempt counter and exception class. Refs Aider-AI#5165
…parsing 16 unit tests across three classes: TestComputeRetrySleep (6 cases) - bound stays within [0, cap] across 1000 samples - bound grows with attempt count until clamped at cap - bound clamps at cap when 2**attempt would overflow - Retry-After value overrides the jitter draw exactly - Retry-After value capped at cap*2 when hostile/buggy provider - zero or negative Retry-After is ignored, falls through to jitter TestParseRetryAfter (9 cases) - integer-seconds form from response.headers - integer-seconds form from a top-level .headers attribute - case-insensitive header name lookup - HTTP-date form returns seconds-until-date - past HTTP-date returns None (no negative sleep) - missing header returns None - exception with no response/headers returns None - unparseable header value returns None - walks __cause__ chain one hop when outer exception lacks headers TestMaxRetriesConstant (1 case) - smoke test on the MAX_RETRIES default Refs Aider-AI#5165
34d2ce7 to
3742952
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The issue author asked: "Would maintainers be open to a small PR for this retry policy improvement? I'm happy to contribute a focused PR if this direction makes sense." This is that PR.
Both
simple_send_with_retries(aider/models.py) and the streaming send loop (aider/coders/base_coder.py) used identical deterministicretry_delay *= 2backoff with no jitter, noRetry-Afterhandling, and no explicit attempt cap. This PR fixes all four points raised in #5165, via a small shared helper module.What changed — 4 improvements
New
aider/retry_utils.py(117 lines) —compute_retry_sleep(attempt, base_delay, cap, retry_after=None)implements the "Full Jitter" recipe (sleep isuniform(0, min(base * 2**(N-1), cap))).parse_retry_after(exception)walks common SDK header attribute paths plus one__cause__hop, accepting both integer-seconds and HTTP-date forms per RFC 7231 §7.1.3.MAX_RETRIES = 8module-level cap.aider/models.py(+12/-7) —simple_send_with_retriesnow usescompute_retry_sleep+parse_retry_after+MAX_RETRIES. Previous logic stopped retrying only whenretry_delay > RETRY_TIMEOUT(60s), which allowed unbounded total wait if the doubling hit the ceiling early; now there's an explicit attempt counter.aider/coders/base_coder.py(+14/-7) — same change applied to the streaming send loop. The two retry blocks were near-identical; both now share the helper.Retry log line upgraded (both call sites) —
Retrying in 12.3s (attempt 3/8, RateLimitError)...instead of justRetrying in 12.3 seconds.... Includes the actual jittered sleep (not the cap), attempt counter, and exception class so operators can tell which provider error is recurring.Tests
tests/basic/test_retry_utils.py— 16 unit tests across three classes:TestComputeRetrySleep(6): bound stays in[0, cap]across 1000 samples; bound grows exponentially with attempt until clamped at cap; clamping holds when2**attemptoverflows;Retry-Aftervalue overrides jitter exactly;Retry-Aftercapped atcap*2when hostile; zero/negativeRetry-Afterfalls through to jitter.TestParseRetryAfter(9): integer-seconds from.response.headers; integer-seconds from top-level.headers; case-insensitive lookup; HTTP-date form returns seconds-until-date; past HTTP-date returnsNone; missing header returnsNone; no response/headers returnsNone; unparseable value returnsNone; walks__cause__chain one hop.TestMaxRetriesConstant(1): smoke check on the default.Validation
Backward compatibility
MAX_RETRIESand base delay (0.125s) are module-level constants matching the prior implicit defaults.RETRY_TIMEOUT = 60constant retained as the jitter cap; only its role changed (was the only ceiling; now the per-sleep cap withMAX_RETRIESas the second ceiling).Retry-After, behavior is identical to the old loop except for jitter randomization — no change to which errors are retryable (still gated byex_info.retry).Fixes #5165