resource control: improve observability for Controller and Server modules #10488

@JmPotato

Enhancement Task

Improve the observability of Resource Control related modules on both the client-side Controller (client/resource_group/controller) and the server-side Resource Manager (pkg/mcs/resourcemanager/server). The current metrics and logging cover aggregate consumption well but lack visibility into allocation decisions, per-client behavior, and internal token state — all of which are critical for debugging resource contention and fairness in multi-tenant deployments.

Motivation

  • Operators cannot easily diagnose why a specific resource group or client is being throttled.
  • Developers lack insight into how the demand-aware slot allocation algorithm distributes fill rates across clients.
  • Several important internal states (limiter token balance, per-slot fill rate, RU demand samples, token depletion rate) are only logged at debug level or not exposed at all.

Scope

1. Client-Side Controller Improvements

| Area | Current State | Proposed Improvement |
| --- | --- | --- |
| Token consumption per request | `TokenConsumedHistogram` only observed in trace mode | Always observe token consumption; break down by RU type (read/write/SQL) |
| Token depletion rate | Only "low token notify" counter exists | Add metrics for token consumption rate vs. refill rate to detect imbalance before throttling |
| Limiter state visibility | Only active/tombstone status gauge | Expose current token balance, effective fill rate, and burst limit per resource group |
| RU calculation breakdown | Consumption sent to server as aggregate | Add per-component (KV CPU, read/write bytes, SQL CPU) consumption metrics on the client side |
| Request wait time | Only failed-request wait time histogram | Add wait time histogram for successful requests to understand latency impact of rate limiting |

2. Server-Side Resource Manager Improvements

| Area | Current State | Proposed Improvement |
| --- | --- | --- |
| Per-slot fill rate allocation | Calculated in memory, only debug-level logging | Add metrics exposing allocated fill rate per client/slot within a resource group |
| Per-slot token grants | "assign slot tokens" debug log only | Add histogram for token quantities granted per request to each client |
| RU demand sampling | Tracked internally in `ruTracker.sample()`, not exposed | Add metrics for sampled RU/sec per client to show which clients drive allocation decisions |
| Slot lifecycle | Logged at debug level only | Add gauge for active slot count per resource group and counter for slot creation/deletion |
| Token loan/trickle time | Calculated in `assignSlotTokens()`, only debug-level logging | Add metrics for loan amount and trickle duration to understand allocation smoothness |
| Service limit impact | Override metrics exist, but no cause tracking | Add metrics distinguishing throttling caused by service limit vs. configured fill rate |
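As an illustration of the "service limit impact" row, the sketch below shows one way throttling could be attributed to a cause label. It assumes that the effective fill rate is the smaller of the configured fill rate and the service limit, and that a service limit of zero or less means "no limit"; both are assumptions made for this example, not a description of the current server logic. `throttleCause`, `effectiveFillRate`, `classifyThrottle`, and `causeCounter` are hypothetical names, and the map stands in for a labelled counter metric.

```go
package main

import "fmt"

// throttleCause is a hypothetical label value for a throttling counter.
type throttleCause string

const (
	causeNone         throttleCause = "none"
	causeFillRate     throttleCause = "configured_fill_rate"
	causeServiceLimit throttleCause = "service_limit"
)

// effectiveFillRate returns the rate that can be granted and which bound
// produced it. Assumption: serviceLimit <= 0 means no service limit is set.
func effectiveFillRate(configured, serviceLimit float64) (float64, throttleCause) {
	if serviceLimit > 0 && serviceLimit < configured {
		return serviceLimit, causeServiceLimit
	}
	return configured, causeFillRate
}

// classifyThrottle reports why a demand (in RU/s) would be throttled,
// or causeNone when the demand fits within the effective fill rate.
func classifyThrottle(demand, configured, serviceLimit float64) throttleCause {
	rate, bound := effectiveFillRate(configured, serviceLimit)
	if demand <= rate {
		return causeNone
	}
	return bound
}

// causeCounter stands in for a counter metric labelled by cause.
type causeCounter map[throttleCause]int

func main() {
	counts := causeCounter{}
	samples := []struct{ demand, configured, serviceLimit float64 }{
		{500, 1000, 0},    // fits within the configured fill rate
		{1500, 1000, 0},   // exceeds the configured fill rate
		{900, 1000, 800},  // exceeds the tighter service limit
	}
	for _, s := range samples {
		counts[classifyThrottle(s.demand, s.configured, s.serviceLimit)]++
	}
	for _, c := range []throttleCause{causeNone, causeFillRate, causeServiceLimit} {
		fmt.Printf("%s=%d\n", c, counts[c])
	}
}
```

Keeping the cause as a label on a single counter, instead of adding one metric per cause, would keep dashboards simple while still separating the two throttling sources.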

Non-Goals

  • Changing the allocation algorithm itself.
  • Adding new gRPC APIs or modifying the token bucket protocol.

Related Code

  • client/resource_group/controller/ — Controller, group cost controller, limiter, metrics
  • pkg/mcs/resourcemanager/server/ — Manager, token buckets (slot allocation), keyspace manager, service limiter, metrics

Labels: type/enhancement