[Backport 2.6] fix: retry on REPLICATE_VIOLATION for global cluster region switch (#3285) #3298

Merged
sre-ci-robot merged 1 commit into 2.6 from backport-3285-to-2.6-1772524647-22292
Mar 3, 2026

@pymilvus-bot
Collaborator

Backport of #3285 to 2.6.

…3285)

## Summary
- When a Global Cluster switches its primary region, write operations to
the old primary (now secondary) fail with
`STREAMING_CODE_REPLICATE_VIOLATION`
- Previously this `MilvusException` was not handled in the retry
decorator, so writes failed for up to 5 minutes until the background
`TopologyRefresher` (300s interval) detected the change
- Add `_handle_global_routing_error()` in `GrpcHandler` to detect
`REPLICATE_VIOLATION` and trigger immediate topology refresh
- Hook it into the `retry_on_rpc_failure` decorator's `MilvusException`
branch (both sync and async) so the operation retries automatically
after refresh
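
The retry flow described in the summary can be sketched as follows. This is a minimal, self-contained illustration and not the actual pymilvus code: `MilvusException` here is a local stand-in class, and `SketchHandler`, `refresh_topology`, `retry_on_routing_error`, the region names, and the substring check on the message are all hypothetical. Only the method name `_handle_global_routing_error` and the error name come from this PR; the real detection logic may well inspect an error code rather than the message text.

```python
import functools
import time


class MilvusException(Exception):
    """Local stand-in for the client exception; only `message` is modeled."""

    def __init__(self, message: str = ""):
        super().__init__(message)
        self.message = message


class SketchHandler:
    """Hypothetical stand-in for a handler that caches the primary region."""

    def __init__(self):
        self.primary = "region-a"  # stale view: region-a is no longer primary
        self.refresh_count = 0

    def refresh_topology(self):
        # Assumption: a refresh re-reads the topology and learns the new primary.
        self.refresh_count += 1
        self.primary = "region-b"

    def _handle_global_routing_error(self, exc: MilvusException) -> bool:
        # Refresh immediately and report True only for the routing error.
        if "REPLICATE_VIOLATION" in exc.message:
            self.refresh_topology()
            return True
        return False


def retry_on_routing_error(max_retries: int = 3, backoff: float = 0.0):
    """Retry a handler method when the handler says the error is recoverable."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            attempt = 0
            while True:
                try:
                    return func(self, *args, **kwargs)
                except MilvusException as exc:
                    attempt += 1
                    recoverable = self._handle_global_routing_error(exc)
                    if attempt > max_retries or not recoverable:
                        raise
                    time.sleep(backoff * attempt)  # linear backoff before retry

        return wrapper

    return decorator


class SketchClient(SketchHandler):
    @retry_on_routing_error(max_retries=3)
    def insert(self, row):
        # Simulate the server rejecting writes sent to the old primary.
        if self.primary != "region-b":
            raise MilvusException("STREAMING_CODE_REPLICATE_VIOLATION")
        return f"inserted {row} via {self.primary}"
```

In this sketch the first `insert` call fails once, triggers a single refresh, and succeeds on the retry; any other `MilvusException` is re-raised unchanged.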

## Test plan
- [x] Deploy a Global Cluster on Zilliz Cloud (UAT)
- [x] Run continuous insert + search loop
- [x] Switch primary region in console
- [x] **Before fix**: INSERT fails with `REPLICATE_VIOLATION` for ~5
minutes until background refresh
- [x] **After fix**: INSERT auto-recovers in ~10 seconds (topology
refresh + retry backoff)
- [x] Unit tests for `_handle_global_routing_error` (4 new tests, all
passing)
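
The gap between the two recovery times above follows from simple arithmetic, sketched here under stated assumptions. The function names, the tick schedule (ticks at multiples of the interval), and the idea that a refresh repairs routing instantly are all illustrative; only the 300s interval is taken from the summary. The refresh latency and backoff passed to `outage_with_fix` are inputs to be measured, not values reported by this PR.

```python
def outage_without_fix(switch_at: float, refresh_interval: float = 300.0) -> float:
    """Seconds of failed writes when only a periodic refresh repairs routing.

    Assumes refresh ticks at t = 0, interval, 2 * interval, ... and that the
    first tick strictly after the switch restores correct routing.
    """
    ticks_elapsed = int(switch_at // refresh_interval)
    next_tick = (ticks_elapsed + 1) * refresh_interval
    return next_tick - switch_at


def outage_with_fix(refresh_latency: float, retry_backoff: float) -> float:
    """Seconds of failed writes when the first error triggers refresh + retry."""
    return refresh_latency + retry_backoff
```

With a 300s interval, a switch that lands right after a tick waits nearly the full interval, which matches the "up to 5 minutes" worst case; with the fix, the delay no longer depends on the interval at all.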

---------

Signed-off-by: huanghaoyuanhhy <haoyuan.huang@zilliz.com>
(cherry picked from commit 4d607f3)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@codecov

codecov bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.42%. Comparing base (bb87447) to head (339df22).
⚠️ Report is 1 commit behind head on 2.6.

Additional details and impacted files
@@            Coverage Diff             @@
##              2.6    #3298      +/-   ##
==========================================
+ Coverage   76.39%   76.42%   +0.03%     
==========================================
  Files          63       63              
  Lines       12946    12954       +8     
==========================================
+ Hits         9890     9900      +10     
+ Misses       3056     3054       -2     

@mergify mergify bot added the ci-passed label Mar 3, 2026
Contributor

@XuanYang-cn XuanYang-cn left a comment

/lgtm

@sre-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pymilvus-bot, XuanYang-cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot merged commit c1a4b59 into 2.6 Mar 3, 2026
13 checks passed
@XuanYang-cn XuanYang-cn deleted the backport-3285-to-2.6-1772524647-22292 branch March 3, 2026 11:18