
fix: prevent crash when KnnSearch and Tune run concurrently #1764

Open
Copilot wants to merge 3 commits into main from copilot/fix-crash-knnsearch-tune

Conversation

Contributor

Copilot AI commented Mar 27, 2026

  • Analyze merge conflicts in src/algorithm/hgraph.cpp
  • Resolve conflict 1: case 4 reorder logic — keep our new_use_reorder local variable approach (defers mutation until global lock)
  • Resolve conflict 2: global_mutex_ lock scope — keep our lock, incorporate main's param->use_reorder update inside it
  • Build and verify compilation (algorithm target builds successfully)
  • Commit proper merge commit with two parents


The crash occurred because Tune() modified basic_flatten_codes_,
high_precise_codes_, and use_reorder_ under add_mutex_ while search
methods (KnnSearch, SearchWithRequest) held a shared lock on
global_mutex_. These are different mutexes, so the writes were not
synchronized with the readers, and a search could dereference codes
that Tune() had already replaced.

Changes:
- In Tune(), defer member variable mutations using local variables,
  build new flatten codes first, then atomically swap all shared state
  under exclusive global_mutex_ lock
- In RangeSearch(), add missing shared_lock(global_mutex_) for
  consistency with KnnSearch and SearchWithRequest

Agent-Logs-Url: https://github.com/antgroup/vsag/sessions/ff0a358d-0694-4cb2-a40e-daf1830bd6f4

Co-authored-by: wxyucs <12595343+wxyucs@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix crash when KnnSearch and Tune concurrently" to "fix: prevent crash when KnnSearch and Tune run concurrently" on Mar 27, 2026
Copilot AI requested a review from wxyucs March 27, 2026 07:04
@wxyucs wxyucs marked this pull request as ready for review March 27, 2026 08:16
@wxyucs wxyucs requested a review from LHT129 as a code owner March 27, 2026 08:16
Copilot AI review requested due to automatic review settings March 27, 2026 08:16
Contributor

Copilot AI left a comment


Pull request overview

Fixes a concurrency crash in VSAG’s HGraph index when Tune() runs concurrently with search operations by ensuring shared search state is swapped under the same mutex that protects readers.

Changes:

  • Update HGraph::Tune() to rebuild flatten codes off-lock and then atomically swap basic_flatten_codes_, high_precise_codes_, and use_reorder_ under an exclusive global_mutex_.
  • Add the missing shared_lock(global_mutex_) in HGraph::RangeSearch() to synchronize with Tune().

Comment on lines +620 to +633
// Acquire exclusive global lock to atomically swap flatten codes,
// preventing concurrent searches from accessing partially updated state.
{
    std::scoped_lock<std::shared_mutex> wlock(this->global_mutex_);
    basic_flatten_codes_ = new_basic;
    if (drop_precise_codes) {
        high_precise_codes_.reset();
    } else {
        high_precise_codes_ = new_precise;
    }
    use_reorder_ = new_use_reorder;

    // set status
    if (disable_future_tuning) {
        this->index_feature_list_->SetFeature(IndexFeature::SUPPORT_TUNE, false);
        this->raw_vector_.reset();
        has_raw_vector_ = false;
        create_new_raw_vector_ = false;
        check_and_init_raw_vector(param->raw_vector_param, common_param, false);
        init_resize_bit_and_reorder();

Copilot AI Mar 27, 2026


Tune now swaps basic_flatten_codes_/high_precise_codes_ under global_mutex_, but other public read paths (e.g., CalcDistanceById/CalDistanceById) access these members without taking global_mutex_. That means concurrent Tune() can still race with those APIs and potentially trigger the same use-after-free. Consider taking a std::shared_lock on global_mutex_ in those distance-by-id methods (and any other readers of these pointers), or otherwise ensuring the swapped state is read atomically everywhere.

Comment on lines +616 to +629
auto new_basic = tune_and_rebuild(is_tune_base_code, basic_flatten_codes_, new_basic_code);
auto new_precise =
    tune_and_rebuild(is_tune_precise_code, high_precise_codes_, new_precise_code);

check_and_init_raw_vector(param->raw_vector_param, common_param, false);
init_resize_bit_and_reorder();
// Acquire exclusive global lock to atomically swap flatten codes,
// preventing concurrent searches from accessing partially updated state.
{
    std::scoped_lock<std::shared_mutex> wlock(this->global_mutex_);
    basic_flatten_codes_ = new_basic;
    if (drop_precise_codes) {
        high_precise_codes_.reset();
    } else {
        high_precise_codes_ = new_precise;
    }

Copilot AI Mar 27, 2026


This PR fixes a specific concurrent crash scenario (Tune() vs searches) but doesn't add a regression test that runs Tune() concurrently with KnnSearch()/RangeSearch() to ensure the race stays fixed. There are already concurrent tests for read/write in tests/test_hgraph.cpp, so adding a focused Tune-vs-search stress/regression test would help prevent future reintroductions.

Merged upstream's need_enable_reorder pattern into our concurrent-safe
Tune() implementation. The key resolution keeps our global_mutex_ lock
to atomically swap flatten codes while incorporating main's deferred
use_reorder_ update (param->use_reorder = new_use_reorder) inside
the lock scope.

Co-authored-by: wxyucs <12595343+wxyucs@users.noreply.github.com>

Labels

kind/bug (Something isn't working), size/M, version/1.0
