resource control: improve observability for Controller and Server modules

## Enhancement Task

Improve the observability of the Resource Control modules on both the client-side Controller (`client/resource_group/controller`) and the server-side Resource Manager (`pkg/mcs/resourcemanager/server`). The current metrics and logging cover aggregate consumption well but give little visibility into allocation decisions, per-client behavior, and internal token state, all of which are critical for debugging resource contention and fairness in multi-tenant deployments.

### Motivation

- **Operators** cannot easily diagnose why a specific resource group or client is being throttled.
- **Developers** lack insight into how the demand-aware slot allocation algorithm distributes fill rates across clients.
- Several important internal states (limiter token balance, per-slot fill rate, RU demand samples, token depletion rate) are only logged at debug level or not exposed at all.

### Scope

#### 1. Client-Side Controller Improvements

| Area | Current State | Proposed Improvement |
|---|---|---|
| **Token consumption per request** | `TokenConsumedHistogram` only observed in trace mode | Always observe token consumption; break down by RU type (read/write/SQL) |
| **Token depletion rate** | Only "low token notify" counter exists | Add metrics for token consumption rate vs. refill rate to detect imbalance before throttling |
| **Limiter state visibility** | Only active/tombstone status gauge | Expose current token balance, effective fill rate, and burst limit per resource group |
| **RU calculation breakdown** | Consumption sent to server as aggregate | Add per-component (KV CPU, read/write bytes, SQL CPU) consumption metrics on the client side |
| **Request wait time** | Only failed-request wait time histogram | Add wait time histogram for successful requests to understand latency impact of rate limiting |

#### 2. Server-Side Resource Manager Improvements

| Area | Current State | Proposed Improvement |
|---|---|---|
| **Per-slot fill rate allocation** | Calculated in memory, only debug-level logging | Add metrics exposing allocated fill rate per client/slot within a resource group |
| **Per-slot token grants** | `"assign slot tokens"` debug log only | Add histogram for token quantities granted per request to each client |
| **RU demand sampling** | Tracked internally in `ruTracker.sample()`, not exposed | Add metrics for sampled RU/sec per client to show which clients drive allocation decisions |
| **Slot lifecycle** | Logged at debug level only | Add gauge for active slot count per resource group and counter for slot creation/deletion |
| **Token loan/trickle time** | Calculated in `assignSlotTokens()`, only debug-level logging | Add metrics for loan amount and trickle duration to understand allocation smoothness |
| **Service limit impact** | Override metrics exist, but no cause tracking | Add metrics distinguishing throttling caused by service limit vs. configured fill rate |
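
The sketch below illustrates what the per-slot fill rate and throttling-cause metrics in the table above could record. It uses a simplified proportional split as a stand-in and does not reproduce the Resource Manager's actual demand-aware algorithm. `allocateFillRates` and `throttleCause` are hypothetical helpers, the label values `"service_limit"` and `"fill_rate"` are assumptions, and a service limit of zero is assumed to mean "no limit set".

```go
package main

// allocateFillRates splits a resource group's fill rate across client
// slots in proportion to each slot's sampled RU/sec demand. When no slot
// reports demand, the rate is split evenly. This proportional rule is an
// illustrative assumption, not the server's real allocation logic.
func allocateFillRates(groupFillRate float64, demand map[uint64]float64) map[uint64]float64 {
	alloc := make(map[uint64]float64, len(demand))
	if len(demand) == 0 {
		return alloc
	}
	var total float64
	for _, d := range demand {
		total += d
	}
	for id, d := range demand {
		if total > 0 {
			alloc[id] = groupFillRate * d / total
		} else {
			alloc[id] = groupFillRate / float64(len(demand))
		}
	}
	return alloc
}

// throttleCause returns a label value naming which bound is the binding
// one: the service limit when it is set (greater than zero) and lower than
// the configured fill rate, otherwise the configured fill rate.
func throttleCause(configuredFillRate, serviceLimit float64) string {
	if serviceLimit > 0 && serviceLimit < configuredFillRate {
		return "service_limit"
	}
	return "fill_rate"
}
```

With a group fill rate of 1000 RU/s and sampled demands of 300 and 100 RU/s, this split yields 750 and 250 RU/s. Recording both the sampled demand and the resulting allocation per slot would show which clients drive the allocation, and a counter labeled with the `throttleCause` value would separate service-limit throttling from fill-rate throttling.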

### Non-Goals

- Changing the allocation algorithm itself.
- Adding new gRPC APIs or modifying the token bucket protocol.

### Related Code

- `client/resource_group/controller/` — Controller, group cost controller, limiter, metrics
- `pkg/mcs/resourcemanager/server/` — Manager, token buckets (slot allocation), keyspace manager, service limiter, metrics


resource control: improve observability for Controller and Server modules #10488
