Skip to content

keyspace, scheduling: watch keyspace meta for suspect range updates#10506

Open
bufferflies wants to merge 2 commits intotikv:masterfrom
bufferflies:decouple_keyspace_utils
Open

keyspace, scheduling: watch keyspace meta for suspect range updates#10506
bufferflies wants to merge 2 commits intotikv:masterfrom
bufferflies:decouple_keyspace_utils

Conversation

@bufferflies
Copy link
Copy Markdown
Contributor

@bufferflies bufferflies commented Mar 27, 2026

What problem does this PR solve?

Issue Number: ref #10123

What is changed and how does it work?

add keyspace range parsing and cached keyspace metadata helpers so scheduling
can detect regions that span multiple keyspaces and calculate split points

watch keyspace metadata changes in scheduling, refresh the keyspace cache from
etcd events, and add suspect key ranges for affected raw and txn bounds

Check List

Tests

  • Unit test
  • Integration test

Side effects

  • Increased code complexity

Release note

None.

Summary by CodeRabbit

  • New Features
    • In-memory keyspace cache; real-time watching of keyspace metadata; automatic detection of regions spanning multiple keyspaces and computation of keyspace split points for smarter region management.
  • Bug Fixes
    • Reduced retained references in key-range structures to avoid memory retention.
  • Tests
    • Expanded unit and integration tests covering keyspace extraction, split/key detection, cache behavior, and metadata watch flow.

Signed-off-by: tongjian <[email protected]>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 27, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Mar 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign overvenus for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

Walkthrough

Introduces keyspace-aware utilities and an in-memory keyspace cache, a watcher that monitors etcd for keyspace meta changes and publishes suspect key ranges to the checker controller, and integrates the cache and watcher into the scheduling cluster and tests.

Changes

Cohort / File(s) Summary
Keyspace utilities & cache
pkg/keyspace/util.go, pkg/keyspace/util_test.go
Added KeyType, ExtractKeyspaceID, Checker interface, RegionSpansMultipleKeyspaces, GetKeyspaceSplitKeys, Cache (btree-backed) and NewKeyspaceMeta. Extensive tests for parsing, boundary detection, split key computation, and cache behavior.
Keyspace metadata watcher
pkg/mcs/scheduling/server/keyspace_meta/watcher.go, pkg/mcs/scheduling/server/keyspace_meta/watch_test.go
New Watcher that watches etcd keyspace meta prefix, parses meta keys, validates/updates keyspace cache, computes txn/raw region bounds, accumulates suspect ranges and forwards them to the checker controller; supports graceful close.
Scheduling cluster & server integration
pkg/mcs/scheduling/server/cluster.go, pkg/mcs/scheduling/server/server.go
Cluster: added keyspaceCache field and GetKeyspaceCache() accessor. Server: added keyspaceWatcher field; initializes keyspace_meta.Watcher during startup and closes it on stop.
Key range utilities
pkg/utils/keyutil/keyrange.go
Added (*KeyRanges).Clean() to clear internal accumulator slice between batches.
Rule watcher
pkg/mcs/scheduling/server/rule/watcher.go
Added failpoint skipRuleCommit in postEventsFn to optionally skip committing rule patches and subsequent suspect-range additions.
Integration tests
tests/integrations/mcs/scheduling/meta_test.go
Added TestKeyspaceMetaWatch to exercise end-to-end flow: create keyspace, start cluster, and assert watcher produces expected suspect key ranges; minor setup change disabling keyspace split wait.

Sequence Diagram(s)

sequenceDiagram
    participant Etcd as Etcd (keyspace meta)
    participant Watcher as Keyspace Watcher
    participant Cache as Keyspace Cache
    participant Checker as Checker Controller
    participant Server as Scheduling Server

    Server->>Watcher: NewWatcher(ctx, etcdClient, checkerController, keyspaceCache)
    Watcher->>Etcd: Watch(keypath.KeyspaceMetaPrefix())
    Etcd-->>Watcher: put/delete event (meta key)
    Watcher->>Watcher: extractKeyspaceIDFromMetaKey()
    Watcher->>Cache: Save/DeleteKeyspace(id, meta)
    Watcher->>Watcher: MakeRegionBound(id, mode) -> bounds
    Watcher->>Checker: AddSuspectKeyRange(start, end)
    Checker-->>Server: Suspect ranges queued/available
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

lgtm, approved, ok-to-test

Suggested reviewers

  • rleungx
  • JmPotato

Poem

🐰
I nibble bytes where keyspaces hide,
Watching bounds where regions glide,
Cache tucked warm beneath my paw,
Suspect ranges hop — I watch them all! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'keyspace, scheduling: watch keyspace meta for suspect range updates' clearly summarizes the main changes, following the repository's format and accurately reflecting the primary modifications across affected packages.
Description check ✅ Passed The description includes all essential sections: problem statement with issue reference, detailed explanation of changes via commit message, comprehensive checklist with relevant items selected, and release note; matches the required template structure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 27, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
pkg/mcs/scheduling/server/keyspace_meta/watcher.go (1)

17-36: Reorder the etcd imports into the third-party block.

go.etcd.io/... is currently grouped after the github.com/tikv/pd/... imports, so gci/goimports will keep this file dirty.

As per coding guidelines, **/*.go: Use gci/goimports ordering: stdlib | third-party | pingcap | tikv/pd | blank

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 17 - 36, The
import ordering violates gci/goimports: move the etcd imports
(go.etcd.io/etcd/api/v3/mvccpb and go.etcd.io/etcd/client/v3 as clientv3) into
the third-party block so imports follow stdlib | third-party |
github.com/pingcap | github.com/tikv/pd | blank; update the import block around
the existing imports in watcher.go so the two go.etcd.io entries appear with
other third-party imports before the github.com/pingcap and github.com/tikv/pd
groups.
tests/integrations/mcs/scheduling/meta_test.go (1)

159-165: Verify both raw and txn suspect ranges.

The watcher appends two ranges per keyspace change. This assertion returns on the first matching pair, so the test still passes if one of the two append paths regresses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/scheduling/meta_test.go` around lines 159 - 165, The
test currently exits as soon as PopOneSuspectKeyRange returns a matching pair,
which can hide regressions because the watcher should append both txn and raw
ranges; update the test inside testutil.Eventually to drain/pop suspect ranges
until none remain (calling
GetCoordinator().GetCheckerController().PopOneSuspectKeyRange repeatedly),
record whether you observed the txn pair (bytes.Equal(kr[0], bound.TxnLeftBound)
&& bytes.Equal(kr[1], bound.TxnRightBound)) and the raw pair (bytes.Equal(kr[0],
bound.RawLeftBound) && bytes.Equal(kr[1], bound.RawRightBound)), and only return
true when both flags are seen; use the same
GetCoordinator().GetCheckerController().PopOneSuspectKeyRange(), kr variable and
bound constants to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcs/scheduling/server/cluster.go`:
- Around line 221-223: NewCluster never initializes the Cluster.keyspaceCache
field, so GetKeyspaceCache() returns nil and startCluster passes a nil cache
into keyspace_meta.NewWatcher; initialize keyspaceCache inside NewCluster (e.g.,
create a new keyspace.Cache instance and assign it to c.keyspaceCache) before
returning the Cluster so GetKeyspaceCache() returns a valid cache used by
startCluster and keyspace_meta.NewWatcher to populate metadata and enable
keyspace-aware split checks.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go`:
- Around line 124-130: The code calls
rw.checkerController.AddSuspectKeyRange(...) but the checker controller API is
AddSuspectKeyRanges(start, end []byte); update the postEventsFn in watcher.go to
build two slices (e.g., starts, ends) by iterating suspectKeyRanges.Ranges() and
appending kr.StartKey and kr.EndKey, then call
rw.checkerController.AddSuspectKeyRanges(starts, ends) (keeping the nil check
for rw.checkerController). This ensures you use the correct method name and
parameter shapes (AddSuspectKeyRanges and suspectKeyRanges.Ranges()) so the code
compiles against the controller.

In `@pkg/mcs/scheduling/server/server.go`:
- Around line 537-541: The keyspace watcher created with
keyspace_meta.NewWatcher using s.Context() (symbol: keyspaceWatcher) is not
closed during primary shutdown; update stopWatcher() to explicitly call
keyspaceWatcher.Close() (or Stop/Cancel method provided by the watcher) and/or
construct the watcher with the primary callback context instead of s.Context()
so it is canceled on primary exit; locate the creation site
(keyspace_meta.NewWatcher) and the stopWatcher() function and ensure the watcher
is closed and any errors ignored/handled when the primary steps down to prevent
goroutine leaks and duplicate updates.

---

Nitpick comments:
In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go`:
- Around line 17-36: The import ordering violates gci/goimports: move the etcd
imports (go.etcd.io/etcd/api/v3/mvccpb and go.etcd.io/etcd/client/v3 as
clientv3) into the third-party block so imports follow stdlib | third-party |
github.com/pingcap | github.com/tikv/pd | blank; update the import block around
the existing imports in watcher.go so the two go.etcd.io entries appear with
other third-party imports before the github.com/pingcap and github.com/tikv/pd
groups.

In `@tests/integrations/mcs/scheduling/meta_test.go`:
- Around line 159-165: The test currently exits as soon as PopOneSuspectKeyRange
returns a matching pair, which can hide regressions because the watcher should
append both txn and raw ranges; update the test inside testutil.Eventually to
drain/pop suspect ranges until none remain (calling
GetCoordinator().GetCheckerController().PopOneSuspectKeyRange repeatedly),
record whether you observed the txn pair (bytes.Equal(kr[0], bound.TxnLeftBound)
&& bytes.Equal(kr[1], bound.TxnRightBound)) and the raw pair (bytes.Equal(kr[0],
bound.RawLeftBound) && bytes.Equal(kr[1], bound.RawRightBound)), and only return
true when both flags are seen; use the same
GetCoordinator().GetCheckerController().PopOneSuspectKeyRange(), kr variable and
bound constants to locate the code to change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e5f0673a-3a4f-4ceb-84a2-98eaf6beeb61

📥 Commits

Reviewing files that changed from the base of the PR and between 99eb5b5 and 93b8c72.

📒 Files selected for processing (8)
  • pkg/keyspace/util.go
  • pkg/keyspace/util_test.go
  • pkg/mcs/scheduling/server/cluster.go
  • pkg/mcs/scheduling/server/keyspace_meta/watch_test.go
  • pkg/mcs/scheduling/server/keyspace_meta/watcher.go
  • pkg/mcs/scheduling/server/server.go
  • pkg/utils/keyutil/keyrange.go
  • tests/integrations/mcs/scheduling/meta_test.go

Comment on lines +221 to +223
// GetKeyspaceCache returns the keyspace cache.
func (c *Cluster) GetKeyspaceCache() *keyspace.Cache {
return c.keyspaceCache
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Initialize keyspaceCache before returning it.

NewCluster never assigns this field, so GetKeyspaceCache() currently returns nil. startCluster then passes that nil into keyspace_meta.NewWatcher, which means watch events never populate the cache and the new keyspace-aware split checks never see any metadata.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/cluster.go` around lines 221 - 223, NewCluster
never initializes the Cluster.keyspaceCache field, so GetKeyspaceCache() returns
nil and startCluster passes a nil cache into keyspace_meta.NewWatcher;
initialize keyspaceCache inside NewCluster (e.g., create a new keyspace.Cache
instance and assign it to c.keyspaceCache) before returning the Cluster so
GetKeyspaceCache() returns a valid cache used by startCluster and
keyspace_meta.NewWatcher to populate metadata and enable keyspace-aware split
checks.

Comment on lines +124 to +130
postEventsFn := func([]*clientv3.Event) error {
if rw.checkerController == nil {
return errors.New("checkerController is nil, cannot add suspect key ranges")
}
for _, kr := range suspectKeyRanges.Ranges() {
rw.checkerController.AddSuspectKeyRange(kr.StartKey, kr.EndKey)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Use the checker controller's actual API name.

pkg/schedule/checker/checker_controller.go exposes AddSuspectKeyRanges(start, end []byte); this singular call doesn't match that method and won't build against the current controller.

🛠️ Proposed fix
-			rw.checkerController.AddSuspectKeyRange(kr.StartKey, kr.EndKey)
+			rw.checkerController.AddSuspectKeyRanges(kr.StartKey, kr.EndKey)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
postEventsFn := func([]*clientv3.Event) error {
if rw.checkerController == nil {
return errors.New("checkerController is nil, cannot add suspect key ranges")
}
for _, kr := range suspectKeyRanges.Ranges() {
rw.checkerController.AddSuspectKeyRange(kr.StartKey, kr.EndKey)
}
postEventsFn := func([]*clientv3.Event) error {
if rw.checkerController == nil {
return errors.New("checkerController is nil, cannot add suspect key ranges")
}
for _, kr := range suspectKeyRanges.Ranges() {
rw.checkerController.AddSuspectKeyRanges(kr.StartKey, kr.EndKey)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 124 - 130,
The code calls rw.checkerController.AddSuspectKeyRange(...) but the checker
controller API is AddSuspectKeyRanges(start, end []byte); update the
postEventsFn in watcher.go to build two slices (e.g., starts, ends) by iterating
suspectKeyRanges.Ranges() and appending kr.StartKey and kr.EndKey, then call
rw.checkerController.AddSuspectKeyRanges(starts, ends) (keeping the nil check
for rw.checkerController). This ensures you use the correct method name and
parameter shapes (AddSuspectKeyRanges and suspectKeyRanges.Ranges()) so the code
compiles against the controller.

Signed-off-by: bufferflies <[email protected]>
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
pkg/mcs/scheduling/server/keyspace_meta/watcher.go (3)

89-108: Consider using log.Debug for routine watch events.

log.Info on every PUT event (line 90) could generate significant log volume in clusters with frequent keyspace metadata changes. Operational logs at Info level are typically reserved for less frequent, notable events.

♻️ Suggested adjustment
 	putFn := func(kv *mvccpb.KeyValue) error {
-		log.Info("update keyspace meta", zap.String("key", string(kv.Key)), zap.String("value", string(kv.Value)))
+		log.Debug("update keyspace meta", zap.String("key", string(kv.Key)), zap.String("value", string(kv.Value)))

The same consideration applies to deleteFn on line 111.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 89 - 108,
The watcher currently logs every PUT (in putFn) and DELETE (in deleteFn) event
with log.Info which can flood logs; change these to log.Debug (or conditional
logging) for routine watch events and keep Info for notable occurrences. Update
the calls to log.Info in putFn and deleteFn to use log.Debug (or guard with
log.Enabled(zap.DebugLevel) if supported) so routine keyspace meta updates use
debug level while preserving existing context fields (e.g., key and value) and
keeping any error logs at Info/Error levels.

30-36: Import ordering: move third-party import to correct group.

"go.etcd.io/etcd/api/v3/mvccpb" on line 35 is a third-party package but is placed in the tikv/pd import group. Per project conventions, it should be grouped with other third-party imports.

♻️ Suggested fix
 import (
 	"context"
 	"errors"
 	"fmt"
 	"strconv"
 	"strings"
 	"sync"
 
 	clientv3 "go.etcd.io/etcd/client/v3"
+	"go.etcd.io/etcd/api/v3/mvccpb"
 	"go.uber.org/zap"
 
 	"github.com/pingcap/log"
 
 	"github.com/tikv/pd/pkg/keyspace"
 	"github.com/tikv/pd/pkg/schedule/checker"
 	"github.com/tikv/pd/pkg/utils/etcdutil"
 	"github.com/tikv/pd/pkg/utils/keypath"
 	"github.com/tikv/pd/pkg/utils/keyutil"
-	"go.etcd.io/etcd/api/v3/mvccpb"
 )

As per coding guidelines: "Use gci/goimports ordering: stdlib | third-party | pingcap | tikv/pd | blank".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 30 - 36, The
import grouping is incorrect: the third-party package
"go.etcd.io/etcd/api/v3/mvccpb" is currently inside the tikv/pd group; move that
import into the third-party import group (between stdlib and pingcap groups) in
the import block of watcher.go so it follows the project's gci/goimports order;
ensure the import name mvccpb remains unchanged and the remaining tikv/pd
imports (keyspace, checker, etcdutil, keypath, keyutil) stay together.

60-81: Consider validating checkerController at construction time.

If checkerController is nil, postEventsFn returns an error on every event batch, which per etcdutil.LoopWatcher behavior (relevant snippet 4) gets logged but doesn't stop the watcher. This results in repeated error logs without surfacing the misconfiguration clearly.

Either validate non-nil at construction:

 func NewWatcher(
 	ctx context.Context,
 	etcdClient *clientv3.Client,
 	checkerController *checker.Controller,
 	keyspaceCache *keyspace.Cache) (*Watcher, error) {
+	if checkerController == nil {
+		return nil, errors.New("checkerController cannot be nil")
+	}
 	ctx, cancel := context.WithCancel(ctx)

Or handle nil gracefully in postEventsFn without returning an error (just skip adding ranges with a debug log).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 60 - 81,
NewWatcher must validate that checkerController is non-nil to avoid postEventsFn
returning errors on every batch; update NewWatcher to check checkerController
and return a descriptive error if nil (referencing NewWatcher and
checkerController), or alternatively modify postEventsFn to treat a nil
checkerController as a no-op (log debug and skip adding ranges) and not return
an error so etcdutil.LoopWatcher doesn't repeatedly log failures—prefer
validating in NewWatcher to fail-fast.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go`:
- Around line 89-108: The watcher currently logs every PUT (in putFn) and DELETE
(in deleteFn) event with log.Info which can flood logs; change these to
log.Debug (or conditional logging) for routine watch events and keep Info for
notable occurrences. Update the calls to log.Info in putFn and deleteFn to use
log.Debug (or guard with log.Enabled(zap.DebugLevel) if supported) so routine
keyspace meta updates use debug level while preserving existing context fields
(e.g., key and value) and keeping any error logs at Info/Error levels.
- Around line 30-36: The import grouping is incorrect: the third-party package
"go.etcd.io/etcd/api/v3/mvccpb" is currently inside the tikv/pd group; move that
import into the third-party import group (between stdlib and pingcap groups) in
the import block of watcher.go so it follows the project's gci/goimports order;
ensure the import name mvccpb remains unchanged and the remaining tikv/pd
imports (keyspace, checker, etcdutil, keypath, keyutil) stay together.
- Around line 60-81: NewWatcher must validate that checkerController is non-nil
to avoid postEventsFn returning errors on every batch; update NewWatcher to
check checkerController and return a descriptive error if nil (referencing
NewWatcher and checkerController), or alternatively modify postEventsFn to treat
a nil checkerController as a no-op (log debug and skip adding ranges) and not
return an error so etcdutil.LoopWatcher doesn't repeatedly log failures—prefer
validating in NewWatcher to fail-fast.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 24b83b47-b99e-494a-8e3e-9dfc08469666

📥 Commits

Reviewing files that changed from the base of the PR and between 93b8c72 and 0e0c8d1.

📒 Files selected for processing (4)
  • pkg/mcs/scheduling/server/keyspace_meta/watcher.go
  • pkg/mcs/scheduling/server/rule/watcher.go
  • pkg/mcs/scheduling/server/server.go
  • tests/integrations/mcs/scheduling/meta_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integrations/mcs/scheduling/meta_test.go
  • pkg/mcs/scheduling/server/server.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant