🐛 fix(quota): add centralized quota exhaustion detection to prevent doomed requests #51

MasuRii · 2025-12-21T13:05:50Z

📝 Summary

Fix quota exhaustion detection to ensure the proxy accurately respects API quota limits and avoids making doomed requests that would result in StreamedAPIError.

✨ Changes

🔧 Centralized Quota Check: Added _is_quota_exhausted(key, model) method to UsageManager for consistent quota exhaustion detection across all code paths
🔄 Updated Key Acquisition: Modified get_available_credentials_for_model and both priority/non-priority paths in acquire_key to skip exhausted credentials
⏰ Enhanced Quota Sync: Updated update_quota_baseline to accept optional quota_reset_ts from API responses and auto-set model cooldowns when quota is depleted
🔗 Quota Group Propagation: Ensured cooldowns and reset timestamps are propagated across all models in a quota group (e.g., Gemini 3 Pro variants)
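
For illustration, here is a minimal sketch of what a centralized check along these lines might look like. The nested dict layout (`usage[key]["models"][model]`) and the standalone-function form are assumptions made for the example. Only the field names `request_count` and `quota_max_requests` and the comparison itself come from the discussion in this thread.

```python
from typing import Any, Dict


def is_quota_exhausted(usage: Dict[str, Any], key: str, model: str) -> bool:
    """Return True when the locally tracked request count has reached the known max.

    Hypothetical standalone version of the PR's `_is_quota_exhausted(key, model)`;
    the storage layout used here is assumed, not taken from usage_manager.py.
    """
    model_data = usage.get(key, {}).get("models", {}).get(model, {})
    max_requests = model_data.get("quota_max_requests")
    request_count = model_data.get("request_count", 0)
    # An unknown or zero max means "no quota information", treated as not exhausted.
    return bool(max_requests and request_count >= max_requests)


# Hypothetical credentials and model name, for illustration only.
usage = {
    "key-a": {"models": {"model-x": {"request_count": 50, "quota_max_requests": 50}}},
    "key-b": {"models": {"model-x": {"request_count": 10, "quota_max_requests": 50}}},
    "key-c": {"models": {"model-x": {"request_count": 7}}},
}
available = [k for k in usage if not is_quota_exhausted(usage, k, "model-x")]
```

In this sketch `key-a` is filtered out, while `key-c` stays available because no max is known for it.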

📁 Files Changed

File	Change Type	Impact
`src/rotator_library/usage_manager.py`	🔧 Modified	Core logic for quota exhaustion detection
`src/rotator_library/providers/utilities/antigravity_quota_tracker.py`	🔧 Modified	Pass reset timestamp to quota baseline update

🧪 Testing

Manual Testing: ~8 hours in live environment with no issues
All existing functionality remains intact

📋 Result

The proxy will now accurately respect quota limits fetched from the API and avoid making requests that would result in StreamedAPIError. Previously, credentials with 100% consumed quota could still be selected if no explicit cooldown was set yet.

Important

Centralized quota exhaustion detection in UsageManager to prevent using exhausted credentials and enhance quota synchronization.

Centralized Quota Check:
- Added _is_quota_exhausted(key, model) in UsageManager for consistent quota exhaustion detection.
Key Acquisition:
- Updated get_available_credentials_for_model and acquire_key to skip exhausted credentials.
Quota Synchronization:
- Enhanced update_quota_baseline to accept quota_reset_ts and auto-set cooldowns.
Quota Group Propagation:
- Propagated cooldowns and reset timestamps across quota groups.
Files Changed:
- usage_manager.py: Core logic for quota exhaustion detection.
- antigravity_quota_tracker.py: Pass reset timestamp to quota baseline update.

This description was generated automatically for c361cdb and updates as commits are pushed.

…oomed requests

Add a `_is_quota_exhausted` method to UsageManager that checks if quota for a credential/model combination is fully consumed. This method is now called in all key acquisition paths to ensure exhausted keys are skipped.

Changes:
- Add `_is_quota_exhausted(key, model)` method for centralized quota check
- Update `get_available_credentials_for_model` to skip exhausted keys
- Update `acquire_key` priority path to skip exhausted keys
- Update `acquire_key` non-priority path to skip exhausted keys
- Enhance `update_quota_baseline` to accept optional `quota_reset_ts`
- Auto-set model cooldowns when quota is 0% and reset time is known
- Propagate cooldowns across all models in a quota group

This ensures the proxy accurately respects API quota limits and avoids making requests that would result in StreamedAPIError.

Tested: ~8 hours in live environment with no issues.

mirrobot-agent · 2025-12-21T13:06:26Z

Starting my review of the quota exhaustion detection fix. I'll be examining the new _is_quota_exhausted method and how it integrates across the credential acquisition paths. Back shortly with findings.

mirrobot-agent

Overall Assessment

This PR implements a well-structured fix for quota exhaustion detection. The centralized _is_quota_exhausted method is cleanly integrated across all three credential acquisition paths (get_available_credentials_for_model, priority path in acquire_key, and non-priority path in acquire_key).

I initially questioned whether the quota check needed to account for quota groups, but on closer inspection, the design is correct: update_quota_baseline already syncs both request_count and quota_max_requests across all models in a quota group at write time, so the per-model check naturally works for grouped quotas as well.

Architectural Feedback

The approach of adding quota_reset_ts passthrough from the API response and auto-setting cooldowns when quota is depleted is sensible. This creates a dual-layer protection: immediate quota checks prevent selection of exhausted credentials, while cooldowns provide time-based recovery.

Key Suggestions

Documentation: The 0.001 threshold for triggering cooldowns could use a brief comment explaining its purpose.
DRY: The cooldown-setting logic for grouped models is duplicated and could be extracted into a helper.

Both are minor maintainability improvements and non-blocking.

Nitpicks and Minor Points

None beyond the inline comments.

Questions for the Author

Is the 0.001 threshold intentionally chosen as a precision tolerance (accounting for floating-point issues), or does it represent a "practically zero" quota state?

This review was generated by an AI assistant.

mirrobot-agent · 2025-12-21T13:08:28Z

src/rotator_library/usage_manager.py

+            if quota_reset_ts:
+                model_data["quota_reset_ts"] = quota_reset_ts
+                # Also set as model cooldown if it's in the future and quota is low
+                if quota_reset_ts > now_ts and remaining_fraction <= 0.001:


The threshold 0.001 (0.1% remaining) triggers cooldowns when quota is nearly exhausted. Consider adding a brief inline comment explaining this value (e.g., # 0.1% threshold - effectively zero quota) to document the design intent for future maintainers.

mirrobot-agent · 2025-12-21T13:08:28Z

src/rotator_library/usage_manager.py

+                        if quota_reset_ts:
+                            other_model_data["quota_reset_ts"] = quota_reset_ts
+                            # Also set as model cooldown if it's in the future and quota is low
+                            if quota_reset_ts > now_ts and remaining_fraction <= 0.001:
+                                model_cooldowns = key_data.setdefault(
+                                    "model_cooldowns", {}
+                                )
+                                model_cooldowns[grouped_model] = quota_reset_ts


This cooldown-setting logic for grouped models duplicates lines 2148-2154. Consider extracting a helper like _set_cooldown_if_exhausted(key_data, model, quota_reset_ts, now_ts, remaining_fraction) to centralize this logic and reduce maintenance burden.
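
A rough sketch of how such a helper could look is shown below. It is not the PR's actual code: the `key_data["models"]` layout, the constant name, and the boolean return value are assumptions made for the example, and the model names are placeholders.

```python
from typing import Any, Dict

# 0.1% remaining: treated as "practically zero" quota (value taken from the diff above).
EXHAUSTED_THRESHOLD = 0.001


def set_cooldown_if_exhausted(
    key_data: Dict[str, Any],
    model: str,
    quota_reset_ts: float,
    now_ts: float,
    remaining_fraction: float,
) -> bool:
    """Record the reset timestamp and, if quota is depleted, a model cooldown.

    Returns True when a cooldown was set. The per-model storage under
    key_data["models"] is an assumed layout for this sketch.
    """
    if not quota_reset_ts:
        return False
    key_data.setdefault("models", {}).setdefault(model, {})["quota_reset_ts"] = quota_reset_ts
    # Only a reset time in the future is useful as a cooldown deadline.
    if quota_reset_ts > now_ts and remaining_fraction <= EXHAUSTED_THRESHOLD:
        key_data.setdefault("model_cooldowns", {})[model] = quota_reset_ts
        return True
    return False


# The same call then serves the primary model and every model in its quota group.
key_data: Dict[str, Any] = {}
for grouped_model in ["model-a", "model-b"]:
    set_cooldown_if_exhausted(
        key_data, grouped_model, quota_reset_ts=1600.0, now_ts=1000.0, remaining_fraction=0.0
    )
```

With one helper, the threshold and the future-timestamp condition live in a single place instead of two copies.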

Address PR Mirrowel#51 review suggestions from mirrobot-agent:
- Extract duplicate cooldown logic into `_set_quota_reset_and_cooldown()` helper method, eliminating code duplication in quota sync flow
- Add documentation explaining the 0.001 (0.1%) threshold meaning: quota is "practically zero" when below this level, not a float precision guard

Refs: Mirrowel#51

Mirrowel · 2025-12-21T13:21:09Z

How would it know we used up 100%? Only real way of knowing is making a request and getting an error, or polling the API to get the current quota.
Any other way has inaccuracies.

Also, same reason the error/api reset timers are used as definitive - they are most accurate.

MasuRii · 2025-12-21T13:35:08Z

> How would it know we used up 100%? Only real way of knowing is making a request and getting an error, or polling the API to get the current quota. Any other way has inaccuracies.
>
> Also, same reason the error/api reset timers are used as definitive - they are most accurate.

We're still using the same polling of the API to get the current quota. The only problem is that the usage manager does not check the key's quota and keeps seeing the key as available, so it will use the key even though it's already exhausted. So there are no changes to the polling.

Mirrowel · 2026-01-08T07:52:48Z

@mirrobot-agent check this PR against latest dev branch and see if this is needed, as the logic should be robust already.

mirrobot-agent · 2026-01-08T07:53:33Z

@Mirrowel, understood! I'll analyze this PR against the latest dev branch to determine if these quota exhaustion detection changes are still necessary.

I'll compare:

The current quota handling logic in the dev branch
The changes proposed in this PR
Any overlapping or superseding implementations

Back shortly with my findings.

mirrobot-agent · 2026-01-08T07:54:51Z

@Mirrowel, I've completed my analysis comparing this PR against the latest dev branch.

Summary

This PR addresses a real gap in the current dev branch logic and is still needed.

Analysis

What the dev branch already has:

✅ update_quota_baseline accepts reset_timestamp parameter
✅ Cooldowns are set when quota is exhausted (via remaining_fraction <= 0.0 check)
✅ Quota group propagation syncs cooldowns across all grouped models
✅ get_available_credentials_for_model and acquire_key skip credentials on cooldown

What this PR adds that's NOT in dev:

❌ _is_quota_exhausted method - The check request_count >= max_requests does not exist in the dev branch
❌ Direct quota exhaustion filtering in the three credential acquisition paths

The Gap (Why This PR Is Still Needed)

The dev branch relies entirely on cooldowns to prevent using exhausted credentials. However, there's a timing vulnerability:

Scenario: A credential's quota becomes exhausted locally (request_count reaches max_requests) but:

update_quota_baseline hasn't been called yet, OR
The API didn't return a valid reset_timestamp, OR
The polling interval hasn't triggered yet

In this window, the credential has no cooldown set even though its quota is exhausted. The current dev branch will still select this credential, leading to failed requests.

The PR's _is_quota_exhausted method closes this gap by checking the raw request_count >= max_requests as a defensive filter that doesn't depend on cooldowns being set.
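
The window described above can be sketched as follows. This is a simplified model, not the project's code: the `select` function, the `key_data` layout, and the key and model names are all hypothetical.

```python
from typing import Any, Dict, List


def on_cooldown(key_data: Dict[str, Any], model: str, now_ts: float) -> bool:
    """Time-based layer: a cooldown deadline in the future blocks the credential."""
    return key_data.get("model_cooldowns", {}).get(model, 0) > now_ts


def locally_exhausted(key_data: Dict[str, Any], model: str) -> bool:
    """Count-based layer: the raw request_count >= max_requests comparison."""
    m = key_data.get("models", {}).get(model, {})
    max_requests = m.get("quota_max_requests")
    return bool(max_requests and m.get("request_count", 0) >= max_requests)


def select(usage: Dict[str, Any], model: str, now_ts: float, check_quota: bool) -> List[str]:
    """Return the credentials that would be considered for the next request."""
    picked = []
    for key, key_data in usage.items():
        if on_cooldown(key_data, model, now_ts):
            continue
        if check_quota and locally_exhausted(key_data, model):
            continue
        picked.append(key)
    return picked


usage = {
    # Count has reached the max, but no cooldown has been written yet.
    "key-a": {"models": {"model-x": {"request_count": 50, "quota_max_requests": 50}}},
    # Already on cooldown until t=1600.
    "key-b": {"model_cooldowns": {"model-x": 1600.0}},
    # Plenty of quota left.
    "key-c": {"models": {"model-x": {"request_count": 3, "quota_max_requests": 50}}},
}
cooldown_only = select(usage, "model-x", now_ts=1000.0, check_quota=False)
with_quota_check = select(usage, "model-x", now_ts=1000.0, check_quota=True)
```

Here the cooldown-only path still offers `key-a`, while the extra filter skips it.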

Note on the `0.001` Threshold

The PR uses a 0.001 threshold (0.1%) in _set_quota_reset_and_cooldown for triggering cooldowns. Based on the author's explanation and the code context, this is an intentional "practically zero" threshold, not a floating-point precision guard. This is reasonable for handling edge cases where quota might report tiny fractions remaining.

Recommendation

Merge this PR. The _is_quota_exhausted check provides a valuable defense-in-depth layer that prevents wasted requests when cooldown timing doesn't align with actual quota state.

This analysis was generated by an AI assistant.

Mirrowel · 2026-01-08T07:57:25Z

@mirrobot-agent Google's backend is not precise, so we can go over the max quota value, especially since we use approximate logic here. Additionally, relying on hitting the error to set the cooldown means that if quotas increase, users can go over the old max without their quota being artificially limited to it.

Re-evaluate with new info.

mirrobot-agent · 2026-01-08T07:58:07Z

@Mirrowel, thank you for the additional context. I'll re-evaluate this PR with your points about:

Google's backend imprecision (can go over max quota)
The approximate logic used in the system
The benefit of error-based cooldowns: they allow users to benefit from quota increases without being artificially limited to old max values

Re-analyzing now with this in mind.

mirrobot-agent · 2026-01-08T07:59:19Z

@Mirrowel, I've completed my re-evaluation with your additional context. You're right—this PR should NOT be merged as-is.

Revised Analysis

The Problem with `_is_quota_exhausted`

The core issue is the check at line 45:

return bool(max_requests and request_count >= max_requests)

This creates two problems with your design philosophy:

1. Google's Backend Imprecision
Since Google's backend is not precise and the system uses approximate logic, the local request_count may hit max_requests while Google's backend still allows more requests. The _is_quota_exhausted check would prematurely block the credential, denying users requests that would actually succeed.

2. Quota Increases Become Invisible
The current dev branch philosophy is error-driven: if a request fails with quota exhaustion, a cooldown is set until the API-provided reset time. This has an important benefit—if Google increases quotas (e.g., from 50 to 100), users can immediately benefit because:

No local check blocks them at the old limit
They keep making requests until the API actually returns an error
The error response provides the accurate new reset time

With the _is_quota_exhausted check, users would be artificially capped at the old max_requests until either:

A quota baseline refresh happens to update max_requests, or
The cooldown period ends

This fundamentally contradicts the "trust the API, not local estimates" approach.
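
A toy simulation makes the difference concrete. It assumes a stale local max of 50 and a real backend limit of 100, matching the example figures above. The function and its parameters are hypothetical, and the real backend is described in this thread as imprecise rather than a hard counter.

```python
def send_until_blocked(real_limit: int, local_max: int, use_local_check: bool) -> int:
    """Count successful requests before the credential stops being used."""
    sent = 0
    while True:
        if use_local_check and sent >= local_max:
            # Blocked locally at the stale max; the backend was never asked.
            return sent
        if sent >= real_limit:
            # Simulated quota error from the API; an error-driven design
            # would set the cooldown from the response at this point.
            return sent
        sent += 1


capped = send_until_blocked(real_limit=100, local_max=50, use_local_check=True)
error_driven = send_until_blocked(real_limit=100, local_max=50, use_local_check=False)
```

In this model the local check stops at 50 successful requests, while the error-driven path reaches 100 at the cost of one failed request.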

What the Current Dev Branch Already Does Right

The dev branch at update_quota_baseline (line 2240) already handles exhaustion:

is_exhausted = remaining_fraction <= 0.0
if is_exhausted and valid_reset_ts:
    model_cooldowns[model] = reset_timestamp

This is the correct approach because:

It only blocks credentials when the API confirms quota is exhausted (remaining_fraction <= 0.0)
It uses the API-provided reset timestamp—the most authoritative source
It allows over-max requests that Google's backend might accept

What About the Other Changes in This PR?

Change	Status	Notes
`_is_quota_exhausted` filtering	❌ Should NOT merge	Contradicts error-driven design
`_set_quota_reset_and_cooldown` helper	⚠️ Nice refactor	Code cleanup, but uses 0.001 threshold vs 0.0 in dev
`quota_reset_ts` passthrough	✅ Already in dev	Dev already accepts `reset_timestamp` param

The 0.001 threshold change is a subtle difference—dev uses remaining_fraction <= 0.0 (strict zero) while the PR uses <= 0.001 (0.1% tolerance). The strict zero is more aligned with "trust the API" since it only triggers when the API explicitly says quota is exhausted.
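
The effect of the two thresholds can be compared directly. In this sketch `should_cooldown` is a hypothetical stand-in for both conditions, and treating a "valid" reset timestamp as one that lies in the future is an assumption.

```python
def should_cooldown(
    remaining_fraction: float, reset_ts: float, now_ts: float, threshold: float
) -> bool:
    """Set a cooldown only for a future reset time and quota at or below the threshold."""
    return reset_ts > now_ts and remaining_fraction <= threshold


now_ts, reset_ts = 1000.0, 1600.0
# Maps remaining_fraction -> (strict 0.0 threshold as in dev, 0.001 threshold as in the PR).
results = {
    frac: (
        should_cooldown(frac, reset_ts, now_ts, threshold=0.0),
        should_cooldown(frac, reset_ts, now_ts, threshold=0.001),
    )
    for frac in (0.0, 0.0005, 0.01)
}
```

The two rules only disagree in the narrow band between 0 and 0.1% remaining, such as the 0.0005 case.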

Recommendation

Close this PR without merging. The core feature (_is_quota_exhausted filtering in credential acquisition) conflicts with the error-driven design philosophy. The refactoring portions are minor improvements but introduce a threshold change that loosens the exhaustion criteria.

This analysis was generated by an AI assistant.

Mirrowel · 2026-01-08T08:02:09Z

As the bot said - closing. Already mostly implemented.

mirrobot-agent bot previously approved these changes Dec 21, 2025

MasuRii dismissed mirrobot-agent[bot]’s stale review via 91c3a34 December 24, 2025 22:09

Merge branch 'dev' of https://github.com/Mirrowel/LLM-API-Key-Proxy i…

bb9b29a

…nto dev

MasuRii requested a review from Mirrowel as a code owner December 25, 2025 05:02

Mirrowel closed this Jan 8, 2026
