Open
Labels
type/enhancement: The issue or PR belongs to an enhancement.
Description
Enhancement Task
Improve the observability of Resource Control related modules on both the client-side Controller (client/resource_group/controller) and the server-side Resource Manager (pkg/mcs/resourcemanager/server). The current metrics and logging cover aggregate consumption well but lack visibility into allocation decisions, per-client behavior, and internal token state — all of which are critical for debugging resource contention and fairness in multi-tenant deployments.
Motivation
- Operators cannot easily diagnose why a specific resource group or client is being throttled.
- Developers lack insight into how the demand-aware slot allocation algorithm distributes fill rates across clients.
- Several important internal states (limiter token balance, per-slot fill rate, RU demand samples, token depletion rate) are only logged at debug level or not exposed at all.
Scope
1. Client-Side Controller Improvements
| Area | Current State | Proposed Improvement |
|---|---|---|
| Token consumption per request | TokenConsumedHistogram only observed in trace mode | Always observe token consumption; break down by RU type (read/write/SQL) |
| Token depletion rate | Only "low token notify" counter exists | Add metrics for token consumption rate vs. refill rate to detect imbalance before throttling |
| Limiter state visibility | Only active/tombstone status gauge | Expose current token balance, effective fill rate, and burst limit per resource group |
| RU calculation breakdown | Consumption sent to server as aggregate | Add per-component (KV CPU, read/write bytes, SQL CPU) consumption metrics on the client side |
| Request wait time | Only failed-request wait time histogram | Add wait time histogram for successful requests to understand latency impact of rate limiting |
2. Server-Side Resource Manager Improvements
| Area | Current State | Proposed Improvement |
|---|---|---|
| Per-slot fill rate allocation | Calculated in memory, only debug-level logging | Add metrics exposing allocated fill rate per client/slot within a resource group |
| Per-slot token grants | "assign slot tokens" debug log only | Add histogram for token quantities granted per request to each client |
| RU demand sampling | Tracked internally in ruTracker.sample(), not exposed | Add metrics for sampled RU/sec per client to show which clients drive allocation decisions |
| Slot lifecycle | Logged at debug level only | Add gauge for active slot count per resource group and counter for slot creation/deletion |
| Token loan/trickle time | Calculated in assignSlotTokens(), only debug-level logging | Add metrics for loan amount and trickle duration to understand allocation smoothness |
| Service limit impact | Override metrics exist, but no cause tracking | Add metrics distinguishing throttling caused by service limit vs. configured fill rate |
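
As a rough illustration of the "per-slot fill rate" and "slot lifecycle" rows above, the following sketch shows the bookkeeping such metrics would need: an active-slot count per resource group, creation/deletion counters, and the last allocated fill rate per client. The `slotStats` type and its methods are hypothetical and use only the standard library; they do not reflect the current Resource Manager code, where these values would more likely be reported through the server's existing metrics package.

```go
package main

import (
	"fmt"
	"sync"
)

// slotStats is a hypothetical in-memory stand-in for the proposed
// server-side metrics: gauges for per-slot fill rate and active slot
// count, and counters for slot creation and deletion.
type slotStats struct {
	mu      sync.Mutex
	active  map[string]map[uint64]float64 // group -> client ID -> fill rate (RU/s)
	created map[string]uint64             // group -> slots created
	deleted map[string]uint64             // group -> slots deleted
}

func newSlotStats() *slotStats {
	return &slotStats{
		active:  make(map[string]map[uint64]float64),
		created: make(map[string]uint64),
		deleted: make(map[string]uint64),
	}
}

// setFillRate records the fill rate allocated to a client's slot,
// counting a creation the first time the slot is seen.
func (s *slotStats) setFillRate(group string, clientID uint64, fillRate float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	slots, ok := s.active[group]
	if !ok {
		slots = make(map[uint64]float64)
		s.active[group] = slots
	}
	if _, exists := slots[clientID]; !exists {
		s.created[group]++
	}
	slots[clientID] = fillRate
}

// removeSlot drops a client's slot and counts a deletion if it existed.
func (s *slotStats) removeSlot(group string, clientID uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if slots, ok := s.active[group]; ok {
		if _, exists := slots[clientID]; exists {
			delete(slots, clientID)
			s.deleted[group]++
		}
	}
}

// activeSlots returns the number of live slots in a resource group.
func (s *slotStats) activeSlots(group string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.active[group])
}

func main() {
	stats := newSlotStats()
	stats.setFillRate("rg1", 1, 100)
	stats.setFillRate("rg1", 2, 300)
	stats.removeSlot("rg1", 2)
	fmt.Printf("active=%d created=%d deleted=%d\n",
		stats.activeSlots("rg1"), stats.created["rg1"], stats.deleted["rg1"])
}
```

One design consideration: a per-client label can grow metric cardinality quickly in large deployments, so deleting a client's series when its slot is removed (as `removeSlot` does for the map entry here) is worth carrying over to any real metric implementation.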
Non-Goals
- Changing the allocation algorithm itself.
- Adding new gRPC APIs or modifying the token bucket protocol.
Related Code
- client/resource_group/controller/: Controller, group cost controller, limiter, metrics
- pkg/mcs/resourcemanager/server/: Manager, token buckets (slot allocation), keyspace manager, service limiter, metrics