keyspace, scheduling: watch keyspace meta for suspect range updates#10506
keyspace, scheduling: watch keyspace meta for suspect range updates#10506bufferflies wants to merge 2 commits intotikv:masterfrom
Conversation
Signed-off-by: tongjian <[email protected]>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughIntroduces keyspace-aware utilities and an in-memory keyspace cache, a watcher that monitors etcd for keyspace meta changes and publishes suspect key ranges to the checker controller, and integrates the cache and watcher into the scheduling cluster and tests. Changes
Sequence Diagram(s)sequenceDiagram
participant Etcd as Etcd (keyspace meta)
participant Watcher as Keyspace Watcher
participant Cache as Keyspace Cache
participant Checker as Checker Controller
participant Server as Scheduling Server
Server->>Watcher: NewWatcher(ctx, etcdClient, checkerController, keyspaceCache)
Watcher->>Etcd: Watch(keypath.KeyspaceMetaPrefix())
Etcd-->>Watcher: put/delete event (meta key)
Watcher->>Watcher: extractKeyspaceIDFromMetaKey()
Watcher->>Cache: Save/DeleteKeyspace(id, meta)
Watcher->>Watcher: MakeRegionBound(id, mode) -> bounds
Watcher->>Checker: AddSuspectKeyRange(start, end)
Checker-->>Server: Suspect ranges queued/available
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/mcs/scheduling/server/keyspace_meta/watcher.go (1)
17-36: Reorder the etcd imports into the third-party block.
go.etcd.io/...is currently grouped after thegithub.com/tikv/pd/...imports, sogci/goimportswill keep this file dirty.As per coding guidelines,
**/*.go: Use gci/goimports ordering: stdlib | third-party | pingcap | tikv/pd | blank🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 17 - 36, The import ordering violates gci/goimports: move the etcd imports (go.etcd.io/etcd/api/v3/mvccpb and go.etcd.io/etcd/client/v3 as clientv3) into the third-party block so imports follow stdlib | third-party | github.com/pingcap | github.com/tikv/pd | blank; update the import block around the existing imports in watcher.go so the two go.etcd.io entries appear with other third-party imports before the github.com/pingcap and github.com/tikv/pd groups.tests/integrations/mcs/scheduling/meta_test.go (1)
159-165: Verify both raw and txn suspect ranges.The watcher appends two ranges per keyspace change. This assertion returns on the first matching pair, so the test still passes if one of the two append paths regresses.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integrations/mcs/scheduling/meta_test.go` around lines 159 - 165, The test currently exits as soon as PopOneSuspectKeyRange returns a matching pair, which can hide regressions because the watcher should append both txn and raw ranges; update the test inside testutil.Eventually to drain/pop suspect ranges until none remain (calling GetCoordinator().GetCheckerController().PopOneSuspectKeyRange repeatedly), record whether you observed the txn pair (bytes.Equal(kr[0], bound.TxnLeftBound) && bytes.Equal(kr[1], bound.TxnRightBound)) and the raw pair (bytes.Equal(kr[0], bound.RawLeftBound) && bytes.Equal(kr[1], bound.RawRightBound)), and only return true when both flags are seen; use the same GetCoordinator().GetCheckerController().PopOneSuspectKeyRange(), kr variable and bound constants to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/mcs/scheduling/server/cluster.go`:
- Around line 221-223: NewCluster never initializes the Cluster.keyspaceCache
field, so GetKeyspaceCache() returns nil and startCluster passes a nil cache
into keyspace_meta.NewWatcher; initialize keyspaceCache inside NewCluster (e.g.,
create a new keyspace.Cache instance and assign it to c.keyspaceCache) before
returning the Cluster so GetKeyspaceCache() returns a valid cache used by
startCluster and keyspace_meta.NewWatcher to populate metadata and enable
keyspace-aware split checks.
In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go`:
- Around line 124-130: The code calls
rw.checkerController.AddSuspectKeyRange(...) but the checker controller API is
AddSuspectKeyRanges(start, end []byte); update the postEventsFn in watcher.go to
build two slices (e.g., starts, ends) by iterating suspectKeyRanges.Ranges() and
appending kr.StartKey and kr.EndKey, then call
rw.checkerController.AddSuspectKeyRanges(starts, ends) (keeping the nil check
for rw.checkerController). This ensures you use the correct method name and
parameter shapes (AddSuspectKeyRanges and suspectKeyRanges.Ranges()) so the code
compiles against the controller.
In `@pkg/mcs/scheduling/server/server.go`:
- Around line 537-541: The keyspace watcher created with
keyspace_meta.NewWatcher using s.Context() (symbol: keyspaceWatcher) is not
closed during primary shutdown; update stopWatcher() to explicitly call
keyspaceWatcher.Close() (or Stop/Cancel method provided by the watcher) and/or
construct the watcher with the primary callback context instead of s.Context()
so it is canceled on primary exit; locate the creation site
(keyspace_meta.NewWatcher) and the stopWatcher() function and ensure the watcher
is closed and any errors ignored/handled when the primary steps down to prevent
goroutine leaks and duplicate updates.
---
Nitpick comments:
In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go`:
- Around line 17-36: The import ordering violates gci/goimports: move the etcd
imports (go.etcd.io/etcd/api/v3/mvccpb and go.etcd.io/etcd/client/v3 as
clientv3) into the third-party block so imports follow stdlib | third-party |
github.com/pingcap | github.com/tikv/pd | blank; update the import block around
the existing imports in watcher.go so the two go.etcd.io entries appear with
other third-party imports before the github.com/pingcap and github.com/tikv/pd
groups.
In `@tests/integrations/mcs/scheduling/meta_test.go`:
- Around line 159-165: The test currently exits as soon as PopOneSuspectKeyRange
returns a matching pair, which can hide regressions because the watcher should
append both txn and raw ranges; update the test inside testutil.Eventually to
drain/pop suspect ranges until none remain (calling
GetCoordinator().GetCheckerController().PopOneSuspectKeyRange repeatedly),
record whether you observed the txn pair (bytes.Equal(kr[0], bound.TxnLeftBound)
&& bytes.Equal(kr[1], bound.TxnRightBound)) and the raw pair (bytes.Equal(kr[0],
bound.RawLeftBound) && bytes.Equal(kr[1], bound.RawRightBound)), and only return
true when both flags are seen; use the same
GetCoordinator().GetCheckerController().PopOneSuspectKeyRange(), kr variable and
bound constants to locate the code to change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e5f0673a-3a4f-4ceb-84a2-98eaf6beeb61
📒 Files selected for processing (8)
pkg/keyspace/util.gopkg/keyspace/util_test.gopkg/mcs/scheduling/server/cluster.gopkg/mcs/scheduling/server/keyspace_meta/watch_test.gopkg/mcs/scheduling/server/keyspace_meta/watcher.gopkg/mcs/scheduling/server/server.gopkg/utils/keyutil/keyrange.gotests/integrations/mcs/scheduling/meta_test.go
| // GetKeyspaceCache returns the keyspace cache. | ||
| func (c *Cluster) GetKeyspaceCache() *keyspace.Cache { | ||
| return c.keyspaceCache |
There was a problem hiding this comment.
Initialize keyspaceCache before returning it.
NewCluster never assigns this field, so GetKeyspaceCache() currently returns nil. startCluster then passes that nil into keyspace_meta.NewWatcher, which means watch events never populate the cache and the new keyspace-aware split checks never see any metadata.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/mcs/scheduling/server/cluster.go` around lines 221 - 223, NewCluster
never initializes the Cluster.keyspaceCache field, so GetKeyspaceCache() returns
nil and startCluster passes a nil cache into keyspace_meta.NewWatcher;
initialize keyspaceCache inside NewCluster (e.g., create a new keyspace.Cache
instance and assign it to c.keyspaceCache) before returning the Cluster so
GetKeyspaceCache() returns a valid cache used by startCluster and
keyspace_meta.NewWatcher to populate metadata and enable keyspace-aware split
checks.
| postEventsFn := func([]*clientv3.Event) error { | ||
| if rw.checkerController == nil { | ||
| return errors.New("checkerController is nil, cannot add suspect key ranges") | ||
| } | ||
| for _, kr := range suspectKeyRanges.Ranges() { | ||
| rw.checkerController.AddSuspectKeyRange(kr.StartKey, kr.EndKey) | ||
| } |
There was a problem hiding this comment.
Use the checker controller's actual API name.
pkg/schedule/checker/checker_controller.go exposes AddSuspectKeyRanges(start, end []byte); this singular call doesn't match that method and won't build against the current controller.
🛠️ Proposed fix
- rw.checkerController.AddSuspectKeyRange(kr.StartKey, kr.EndKey)
+ rw.checkerController.AddSuspectKeyRanges(kr.StartKey, kr.EndKey)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| postEventsFn := func([]*clientv3.Event) error { | |
| if rw.checkerController == nil { | |
| return errors.New("checkerController is nil, cannot add suspect key ranges") | |
| } | |
| for _, kr := range suspectKeyRanges.Ranges() { | |
| rw.checkerController.AddSuspectKeyRange(kr.StartKey, kr.EndKey) | |
| } | |
| postEventsFn := func([]*clientv3.Event) error { | |
| if rw.checkerController == nil { | |
| return errors.New("checkerController is nil, cannot add suspect key ranges") | |
| } | |
| for _, kr := range suspectKeyRanges.Ranges() { | |
| rw.checkerController.AddSuspectKeyRanges(kr.StartKey, kr.EndKey) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 124 - 130,
The code calls rw.checkerController.AddSuspectKeyRange(...) but the checker
controller API is AddSuspectKeyRanges(start, end []byte); update the
postEventsFn in watcher.go to build two slices (e.g., starts, ends) by iterating
suspectKeyRanges.Ranges() and appending kr.StartKey and kr.EndKey, then call
rw.checkerController.AddSuspectKeyRanges(starts, ends) (keeping the nil check
for rw.checkerController). This ensures you use the correct method name and
parameter shapes (AddSuspectKeyRanges and suspectKeyRanges.Ranges()) so the code
compiles against the controller.
Signed-off-by: bufferflies <[email protected]>
There was a problem hiding this comment.
🧹 Nitpick comments (3)
pkg/mcs/scheduling/server/keyspace_meta/watcher.go (3)
89-108: Consider usinglog.Debugfor routine watch events.
log.Infoon every PUT event (line 90) could generate significant log volume in clusters with frequent keyspace metadata changes. Operational logs at Info level are typically reserved for less frequent, notable events.♻️ Suggested adjustment
putFn := func(kv *mvccpb.KeyValue) error { - log.Info("update keyspace meta", zap.String("key", string(kv.Key)), zap.String("value", string(kv.Value))) + log.Debug("update keyspace meta", zap.String("key", string(kv.Key)), zap.String("value", string(kv.Value)))The same consideration applies to
deleteFnon line 111.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 89 - 108, The watcher currently logs every PUT (in putFn) and DELETE (in deleteFn) event with log.Info which can flood logs; change these to log.Debug (or conditional logging) for routine watch events and keep Info for notable occurrences. Update the calls to log.Info in putFn and deleteFn to use log.Debug (or guard with log.Enabled(zap.DebugLevel) if supported) so routine keyspace meta updates use debug level while preserving existing context fields (e.g., key and value) and keeping any error logs at Info/Error levels.
30-36: Import ordering: move third-party import to correct group.
"go.etcd.io/etcd/api/v3/mvccpb"on line 35 is a third-party package but is placed in the tikv/pd import group. Per project conventions, it should be grouped with other third-party imports.♻️ Suggested fix
import ( "context" "errors" "fmt" "strconv" "strings" "sync" clientv3 "go.etcd.io/etcd/client/v3" + "go.etcd.io/etcd/api/v3/mvccpb" "go.uber.org/zap" "github.com/pingcap/log" "github.com/tikv/pd/pkg/keyspace" "github.com/tikv/pd/pkg/schedule/checker" "github.com/tikv/pd/pkg/utils/etcdutil" "github.com/tikv/pd/pkg/utils/keypath" "github.com/tikv/pd/pkg/utils/keyutil" - "go.etcd.io/etcd/api/v3/mvccpb" )As per coding guidelines: "Use gci/goimports ordering: stdlib | third-party | pingcap | tikv/pd | blank".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 30 - 36, The import grouping is incorrect: the third-party package "go.etcd.io/etcd/api/v3/mvccpb" is currently inside the tikv/pd group; move that import into the third-party import group (between stdlib and pingcap groups) in the import block of watcher.go so it follows the project's gci/goimports order; ensure the import name mvccpb remains unchanged and the remaining tikv/pd imports (keyspace, checker, etcdutil, keypath, keyutil) stay together.
60-81: Consider validatingcheckerControllerat construction time.If
checkerControllerisnil,postEventsFnreturns an error on every event batch, which peretcdutil.LoopWatcherbehavior (relevant snippet 4) gets logged but doesn't stop the watcher. This results in repeated error logs without surfacing the misconfiguration clearly.Either validate non-nil at construction:
func NewWatcher( ctx context.Context, etcdClient *clientv3.Client, checkerController *checker.Controller, keyspaceCache *keyspace.Cache) (*Watcher, error) { + if checkerController == nil { + return nil, errors.New("checkerController cannot be nil") + } ctx, cancel := context.WithCancel(ctx)Or handle nil gracefully in
postEventsFnwithout returning an error (just skip adding ranges with a debug log).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 60 - 81, NewWatcher must validate that checkerController is non-nil to avoid postEventsFn returning errors on every batch; update NewWatcher to check checkerController and return a descriptive error if nil (referencing NewWatcher and checkerController), or alternatively modify postEventsFn to treat a nil checkerController as a no-op (log debug and skip adding ranges) and not return an error so etcdutil.LoopWatcher doesn't repeatedly log failures—prefer validating in NewWatcher to fail-fast.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go`:
- Around line 89-108: The watcher currently logs every PUT (in putFn) and DELETE
(in deleteFn) event with log.Info which can flood logs; change these to
log.Debug (or conditional logging) for routine watch events and keep Info for
notable occurrences. Update the calls to log.Info in putFn and deleteFn to use
log.Debug (or guard with log.Enabled(zap.DebugLevel) if supported) so routine
keyspace meta updates use debug level while preserving existing context fields
(e.g., key and value) and keeping any error logs at Info/Error levels.
- Around line 30-36: The import grouping is incorrect: the third-party package
"go.etcd.io/etcd/api/v3/mvccpb" is currently inside the tikv/pd group; move that
import into the third-party import group (between stdlib and pingcap groups) in
the import block of watcher.go so it follows the project's gci/goimports order;
ensure the import name mvccpb remains unchanged and the remaining tikv/pd
imports (keyspace, checker, etcdutil, keypath, keyutil) stay together.
- Around line 60-81: NewWatcher must validate that checkerController is non-nil
to avoid postEventsFn returning errors on every batch; update NewWatcher to
check checkerController and return a descriptive error if nil (referencing
NewWatcher and checkerController), or alternatively modify postEventsFn to treat
a nil checkerController as a no-op (log debug and skip adding ranges) and not
return an error so etcdutil.LoopWatcher doesn't repeatedly log failures—prefer
validating in NewWatcher to fail-fast.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 24b83b47-b99e-494a-8e3e-9dfc08469666
📒 Files selected for processing (4)
pkg/mcs/scheduling/server/keyspace_meta/watcher.gopkg/mcs/scheduling/server/rule/watcher.gopkg/mcs/scheduling/server/server.gotests/integrations/mcs/scheduling/meta_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integrations/mcs/scheduling/meta_test.go
- pkg/mcs/scheduling/server/server.go
What problem does this PR solve?
Issue Number: ref #10123
What is changed and how does it work?
Check List
Tests
Side effects
Release note
Summary by CodeRabbit