fix: retry on REPLICATE_VIOLATION for global cluster region switch by huanghaoyuanhhy · Pull Request #3285 · milvus-io/pymilvus

huanghaoyuanhhy · 2026-02-17T05:11:17Z

Summary

- When a Global Cluster switches its primary region, write operations to the old primary (now secondary) fail with `STREAMING_CODE_REPLICATE_VIOLATION`.
- Previously this `MilvusException` was not handled in the retry decorator, so writes failed for up to 5 minutes until the background `TopologyRefresher` (300s interval) detected the change.
- Add `_handle_global_routing_error()` in `GrpcHandler` to detect `REPLICATE_VIOLATION` and trigger immediate topology refresh.
- Hook it into the `retry_on_rpc_failure` decorator's `MilvusException` branch (both sync and async) so the operation retries automatically after refresh.
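
The flow described above can be illustrated with a small, self-contained sketch. It does not use pymilvus: `MilvusException`, `FakeHandler`, `refresh_topology`, and `retry_on_routing_error` are simplified stand-ins written for this example, and matching on the error message is an assumption about how the violation might be detected, not the PR's actual code.

```python
import functools
import time


class MilvusException(Exception):
    """Simplified stand-in for the client's exception type (hypothetical)."""

    def __init__(self, message: str = ""):
        super().__init__(message)
        self.message = message


def retry_on_routing_error(max_retries: int = 3, backoff: float = 0.01):
    """Retry a method when the handler reports a recoverable routing error."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            attempt = 0
            while True:
                try:
                    return func(self, *args, **kwargs)
                except MilvusException as exc:
                    attempt += 1
                    # Re-raise when retries are exhausted or the error is
                    # not a routing error the handler knows how to fix.
                    if attempt > max_retries or not self._handle_global_routing_error(exc):
                        raise
                    time.sleep(backoff * attempt)  # simple linear backoff

        return wrapper

    return decorator


class FakeHandler:
    """Hypothetical handler whose cached view of the primary region is stale."""

    def __init__(self):
        self.primary = "region-b"         # actual primary after the switch
        self.cached_primary = "region-a"  # stale view held by the client
        self.refresh_count = 0

    def refresh_topology(self):
        # Assumed refresh hook: re-read which region is currently primary.
        self.cached_primary = self.primary
        self.refresh_count += 1

    def _handle_global_routing_error(self, exc: MilvusException) -> bool:
        # Assumption: the violation is recognisable from the error message.
        if "REPLICATE_VIOLATION" in exc.message:
            self.refresh_topology()
            return True
        return False

    @retry_on_routing_error()
    def insert(self, row: str) -> str:
        if self.cached_primary != self.primary:
            raise MilvusException("STREAMING_CODE_REPLICATE_VIOLATION: not primary")
        return f"inserted {row} via {self.cached_primary}"
```

In this sketch the first `insert` call fails once, the handler refreshes its cached topology, and the retry succeeds without waiting for any periodic background refresh.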

Test plan

- [x] Deploy a Global Cluster on Zilliz Cloud (UAT)
- [x] Run continuous insert + search loop
- [x] Switch primary region in console
- [x] **Before fix**: INSERT fails with `REPLICATE_VIOLATION` for ~5 minutes until background refresh
- [x] **After fix**: INSERT auto-recovers in ~10 seconds (topology refresh + retry backoff)
- [x] Unit tests for `_handle_global_routing_error` (4 new tests, all passing)
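
As a rough idea of what unit tests for such a handler could look like, here is a minimal sketch using `unittest.mock`. The function `handle_global_routing_error` is a hypothetical stand-in written for this example; the PR's four actual tests and the real method's signature are not shown in this thread, so the cases below are assumptions.

```python
from unittest.mock import Mock


def handle_global_routing_error(exc: Exception, refresh_topology) -> bool:
    """Hypothetical stand-in: refresh and return True only for replicate violations."""
    if "REPLICATE_VIOLATION" in str(exc):
        refresh_topology()
        return True
    return False


def test_violation_triggers_refresh():
    refresh = Mock()
    exc = RuntimeError("STREAMING_CODE_REPLICATE_VIOLATION")
    assert handle_global_routing_error(exc, refresh)
    refresh.assert_called_once()


def test_other_errors_are_ignored():
    refresh = Mock()
    exc = RuntimeError("collection not found")
    assert not handle_global_routing_error(exc, refresh)
    refresh.assert_not_called()
```

Both functions follow pytest naming conventions, so they could be collected by `pytest` or called directly.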

sre-ci-robot · 2026-02-17T05:11:27Z

Welcome @huanghaoyuanhhy! It looks like this is your first PR to milvus-io/pymilvus 🎉

XuanYang-cn · 2026-03-02T09:58:11Z

/assign @bigsheeper

codecov · 2026-03-02T09:59:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.81%. Comparing base (2277a72) to head (fd6323b).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3285      +/-   ##
==========================================
+ Coverage   76.68%   76.81%   +0.13%     
==========================================
  Files          63       63              
  Lines       13235    13244       +9     
==========================================
+ Hits        10149    10174      +25     
+ Misses       3086     3070      -16

pymilvus/client/grpc_handler.py

When a Global Cluster switches its primary region, write operations to the old primary fail with STREAMING_CODE_REPLICATE_VIOLATION. Previously this error was not handled in the retry decorator, causing writes to fail for up to 5 minutes until the background topology refresher ran.

Add _handle_global_routing_error() to detect REPLICATE_VIOLATION and trigger an immediate topology refresh with retry, enabling automatic recovery in seconds instead of minutes.

Signed-off-by: huanghaoyuanhhy <[email protected]>

bigsheeper · 2026-03-03T06:15:10Z

/lgtm

huanghaoyuanhhy · 2026-03-03T07:11:27Z

Fixed the review comment and lint issues. Verified locally with make lint - all checks passed.

XuanYang-cn

/lgtm

sre-ci-robot · 2026-03-03T07:55:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: huanghaoyuanhhy, XuanYang-cn

Needs approval from an approver in each of these files:

~~OWNERS~~ [XuanYang-cn]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…3285)

## Summary

- When a Global Cluster switches its primary region, write operations to the old primary (now secondary) fail with `STREAMING_CODE_REPLICATE_VIOLATION`
- Previously this `MilvusException` was not handled in the retry decorator, so writes failed for up to 5 minutes until the background `TopologyRefresher` (300s interval) detected the change
- Add `_handle_global_routing_error()` in `GrpcHandler` to detect `REPLICATE_VIOLATION` and trigger immediate topology refresh
- Hook it into the `retry_on_rpc_failure` decorator's `MilvusException` branch (both sync and async) so the operation retries automatically after refresh

## Test plan

- [x] Deploy a Global Cluster on Zilliz Cloud (UAT)
- [x] Run continuous insert + search loop
- [x] Switch primary region in console
- [x] **Before fix**: INSERT fails with `REPLICATE_VIOLATION` for ~5 minutes until background refresh
- [x] **After fix**: INSERT auto-recovers in ~10 seconds (topology refresh + retry backoff)
- [x] Unit tests for `_handle_global_routing_error` (4 new tests, all passing)

---------

Signed-off-by: huanghaoyuanhhy <[email protected]>
(cherry picked from commit 4d607f3)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

pymilvus-bot · 2026-03-03T07:57:31Z

✅ Backport Created
Hi @huanghaoyuanhhy, Backport PR for 2.6 has been created: #3298

(cc @bigsheeper @XuanYang-cn)

…egion switch (#3285) (#3298)

Backport of #3285 to `2.6`.

Signed-off-by: huanghaoyuanhhy <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: huanghaoyuanhhy <[email protected]>

sre-ci-robot requested review from czs007 and wangting0128 February 17, 2026 05:11

sre-ci-robot added the size/L label Feb 17, 2026

mergify bot added the dco-passed label Feb 17, 2026

sre-ci-robot assigned bigsheeper Mar 2, 2026

bigsheeper reviewed Mar 2, 2026

Review comment on pymilvus/client/grpc_handler.py (outdated, resolved)

huanghaoyuanhhy force-pushed the fix/global-cluster-replicate-violation-retry branch from e543d13 to a7844f7 Compare March 3, 2026 04:01

sre-ci-robot added the lgtm label Mar 3, 2026

fix: address review comments and lint issues

fd6323b

Signed-off-by: huanghaoyuanhhy <[email protected]>

sre-ci-robot removed the lgtm label Mar 3, 2026

mergify bot added the ci-passed label Mar 3, 2026

XuanYang-cn approved these changes Mar 3, 2026

sre-ci-robot assigned XuanYang-cn Mar 3, 2026

sre-ci-robot added the lgtm label Mar 3, 2026

sre-ci-robot added the approved label Mar 3, 2026

XuanYang-cn added the backport-to-2.6 label Mar 3, 2026

sre-ci-robot merged commit 4d607f3 into milvus-io:master Mar 3, 2026
13 checks passed

pymilvus-bot mentioned this pull request Mar 3, 2026

[Backport 2.6] fix: retry on REPLICATE_VIOLATION for global cluster region switch (#3285) #3298

Merged

pymilvus-bot added the backported label Mar 3, 2026

sre-ci-robot merged 2 commits into milvus-io:master from huanghaoyuanhhy:fix/global-cluster-replicate-violation-retry
