fix: retry on REPLICATE_VIOLATION for global cluster region switch #3285
Welcome @huanghaoyuanhhy! It looks like this is your first PR to milvus-io/pymilvus 🎉
/assign @bigsheeper
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #3285 +/- ##
==========================================
+ Coverage 76.68% 76.81% +0.13%
==========================================
Files 63 63
Lines 13235 13244 +9
==========================================
+ Hits 10149 10174 +25
+ Misses 3086 3070 -16
When a Global Cluster switches its primary region, write operations to the old primary fail with STREAMING_CODE_REPLICATE_VIOLATION. Previously this error was not handled in the retry decorator, causing writes to fail for up to 5 minutes until the background topology refresher ran. Add _handle_global_routing_error() to detect REPLICATE_VIOLATION and trigger an immediate topology refresh with retry, enabling automatic recovery in seconds instead of minutes. Signed-off-by: huanghaoyuanhhy <[email protected]>
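The recovery path described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the pymilvus implementation: `FakeMilvusError`, `FakeHandler`, `retry_on_routing_error`, and the region names are hypothetical stand-ins, and the topology refresh is simulated by flipping an attribute. The sketch only shows the control flow: a write fails with a replicate violation, the handler refreshes topology immediately, and the decorator retries the call.

```python
import functools
import time


class FakeMilvusError(Exception):
    """Hypothetical stand-in for the exception raised on a failed write."""


class FakeHandler:
    """Hypothetical handler that remembers which region it treats as primary."""

    def __init__(self):
        self.primary = "region-a"  # stale view: the old primary
        self.refresh_count = 0

    def refresh_topology(self):
        # Simulated refresh: learn the new primary right away instead of
        # waiting for a periodic background refresh.
        self.refresh_count += 1
        self.primary = "region-b"

    def handle_global_routing_error(self, exc):
        # Return True only if the error is a replicate violation,
        # in which case a refresh has been triggered.
        if "REPLICATE_VIOLATION" in str(exc):
            self.refresh_topology()
            return True
        return False


def retry_on_routing_error(max_retries=3, backoff=0.0):
    """Retry a handler call after a routing error triggered a refresh."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(handler, *args, **kwargs):
            attempt = 0
            while True:
                try:
                    return func(handler, *args, **kwargs)
                except FakeMilvusError as exc:
                    attempt += 1
                    # Re-raise when retries are exhausted or the error is
                    # not one that a topology refresh can fix.
                    if attempt > max_retries or not handler.handle_global_routing_error(exc):
                        raise
                    time.sleep(backoff)

        return wrapper

    return decorator


ACTUAL_PRIMARY = "region-b"  # the region that accepts writes after the switch


@retry_on_routing_error(max_retries=3)
def insert(handler, row):
    if handler.primary != ACTUAL_PRIMARY:
        raise FakeMilvusError("STREAMING_CODE_REPLICATE_VIOLATION")
    return f"inserted {row} via {handler.primary}"
```

Calling `insert(FakeHandler(), 1)` fails once against the stale primary, triggers a single refresh, and succeeds on the retry. Errors that do not mention `REPLICATE_VIOLATION` propagate unchanged.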
e543d13 to a7844f7
/lgtm
Signed-off-by: huanghaoyuanhhy <[email protected]>
Fixed the review comment and lint issues. Verified locally with
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: huanghaoyuanhhy, XuanYang-cn
…3285)

## Summary

- When a Global Cluster switches its primary region, write operations to the old primary (now secondary) fail with `STREAMING_CODE_REPLICATE_VIOLATION`
- Previously this `MilvusException` was not handled in the retry decorator, so writes failed for up to 5 minutes until the background `TopologyRefresher` (300s interval) detected the change
- Add `_handle_global_routing_error()` in `GrpcHandler` to detect `REPLICATE_VIOLATION` and trigger immediate topology refresh
- Hook it into the `retry_on_rpc_failure` decorator's `MilvusException` branch (both sync and async) so the operation retries automatically after refresh

## Test plan

- [x] Deploy a Global Cluster on Zilliz Cloud (UAT)
- [x] Run continuous insert + search loop
- [x] Switch primary region in console
- [x] **Before fix**: INSERT fails with `REPLICATE_VIOLATION` for ~5 minutes until background refresh
- [x] **After fix**: INSERT auto-recovers in ~10 seconds (topology refresh + retry backoff)
- [x] Unit tests for `_handle_global_routing_error` (4 new tests, all passing)

---------

Signed-off-by: huanghaoyuanhhy <[email protected]>
(cherry picked from commit 4d607f3)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
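The summary notes that the hook applies to both the sync and async paths. As a rough illustration of the async side only, here is a sketch under the same caveats: `SketchError`, `SketchAsyncHandler`, and `async_retry_on_routing_error` are hypothetical names, not pymilvus APIs, and the refresh is simulated.

```python
import asyncio
import functools


class SketchError(Exception):
    """Hypothetical stand-in for the exception raised on a failed write."""


class SketchAsyncHandler:
    """Hypothetical async handler with a simulated topology refresh."""

    def __init__(self):
        self.primary = "old-primary"
        self.refreshes = 0

    async def refresh_topology(self):
        self.refreshes += 1
        self.primary = "new-primary"

    async def handle_global_routing_error(self, exc):
        # Only replicate violations trigger a refresh and allow a retry.
        if "REPLICATE_VIOLATION" not in str(exc):
            return False
        await self.refresh_topology()
        return True


def async_retry_on_routing_error(max_retries=3, backoff=0.0):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(handler, *args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await func(handler, *args, **kwargs)
                except SketchError as exc:
                    last_attempt = attempt == max_retries
                    if last_attempt or not await handler.handle_global_routing_error(exc):
                        raise
                    # Non-blocking backoff before the next attempt.
                    await asyncio.sleep(backoff)

        return wrapper

    return decorator


@async_retry_on_routing_error()
async def insert_row(handler, row):
    if handler.primary != "new-primary":
        raise SketchError("STREAMING_CODE_REPLICATE_VIOLATION")
    return (row, handler.primary)
```

The only structural differences from a sync version are that the refresh and the backoff are awaited, so other coroutines can keep running while the retry waits.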
✅ Backport Created (cc @bigsheeper @XuanYang-cn)
…egion switch (#3285) (#3298)

Backport of #3285 to `2.6`.

Signed-off-by: huanghaoyuanhhy <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: huanghaoyuanhhy <[email protected]>