Skip to content

Comments

performance(detector): reduce lock contention in waitingObkects matching logic#7175

Open
zhzhuang-zju wants to merge 1 commit intokarmada-io:masterfrom
zhzhuang-zju:waitinglock-segment
Open

performance(detector): reduce lock contention in waitingObkects matching logic#7175
zhzhuang-zju wants to merge 1 commit intokarmada-io:masterfrom
zhzhuang-zju:waitinglock-segment

Conversation

@zhzhuang-zju
Copy link
Contributor

@zhzhuang-zju zhzhuang-zju commented Feb 6, 2026

What type of PR is this?
/kind feature

What this PR does / why we need it:
waitingObjects is designed to track resource templates that haven't been propagated due to a lack of a matching (Cluster)PropagationPolicy. The current implementation has the following issues:

  1. Coarse-grained Locking: Any read/write operation locks the entire waitingObjects, leading to severe lock contention in large-scale clusters.
  2. Inefficient Matching: Each policy match requires a full traversal of all waiting objects, resulting in $O(N)$ complexity and high CPU usage.

This PR introduces the following optimizations:

  1. Two-level Indexing: Implements a GVK -> Namespace -> Object structure. This significantly reduces invalid traversals during matching.
  2. Fine-grained Locking: Uses separate locks for different GVKs to minimize cross-resource lock contention.
  3. Optimized Lookup: Employs sync.Map to index from GVK to namespacedObjects. Since GVKs are relatively stable, this achieves near lock-free access for GVK lookups.

Which issue(s) this PR fixes:

Part of #7139

Special notes for your reviewer:

Test Environment:
The environment contains approximately 60k-70k workloads distributed across different namespaces, with 7k-8k per single namespace. Workloads are dispatched to member clusters via policies.
In addition, there are thousands of dependent resources such as PVCs and 20k-30k ConfigMaps. These do not have separately bound policies but are dispatched to member clusters along with the workloads through the dependent propagation feature.
Now, restart karmada-controller-manager.
image
For the analysis of this figure, you can refer to: #7139 (comment)

After optimization:
c1c9f524af0283a579aa775b0a5dc6b

As seen from the images:

  1. The reconciliation of existing resources was reduced from 11 minutes to 5 minutes;
  2. The duration of a single matching operation dropped from 100ms to 0.9ms.

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: Enhanced the concurrency and scalability of the `Detector` by refactoring `waitingObjects` to minimize lock contention during policy matching against waiting resources, leading to more responsive processing in high-load scenarios.

Signed-off-by: zhzhuang-zju <m17799853869@163.com>
@karmada-bot karmada-bot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 6, 2026
@karmada-bot karmada-bot requested a review from Poor12 February 6, 2026 03:12
@karmada-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chaunceyjiang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 6, 2026
@gemini-code-assist
Copy link

Summary of Changes

Hello @zhzhuang-zju, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a performance optimization within the ResourceDetector component. The primary goal is to reduce lock contention in the logic that matches waitingObjects with propagation policies. By transitioning from a single global lock to a more distributed and granular locking mechanism using sync.Map and a new namespacedObject structure, the changes aim to improve the detector's efficiency and scalability, especially under high load conditions where many objects are awaiting policy propagation.

Highlights

  • Refactored Waiting Objects Data Structure: The waitingObjects field in ResourceDetector has been changed from a map[keys.ClusterWideKey]struct{} protected by a single sync.RWMutex to a sync.Map.
  • Introduced Hierarchical Locking: A new namespacedObject struct was introduced, which itself contains a sync.RWMutex and a nested map structure (map[string]map[keys.ClusterWideKey]struct{}). This allows for more granular locking, reducing contention by locking only specific GVKs and namespaces rather than the entire waitingObjects map.
  • Optimized Matching Logic: The isWaiting, AddWaiting, RemoveWaiting, and GetMatching methods have been updated to utilize the new sync.Map and namespacedObject structure, improving concurrency and performance.
  • Changed Return Type for GetMatching: The GetMatching function now returns a sets.Set[keys.ClusterWideKey] instead of a []keys.ClusterWideKey, which can offer better performance for set operations and uniqueness guarantees.
Changelog
  • pkg/detector/detector.go
    • Added import for k8s.io/apimachinery/pkg/util/sets.
    • Replaced the waitingObjects map and its associated waitingLock with a sync.Map.
    • Defined a new namespacedObject struct to manage objects by namespace within each GVK entry in the sync.Map, including its own sync.RWMutex.
    • Refactored isWaiting, AddWaiting, RemoveWaiting, and GetMatching functions to use the new sync.Map and namespacedObject for concurrent access.
    • Updated GetMatching to return sets.Set[keys.ClusterWideKey] and introduced a helper match function for processing resource selectors.
    • Modified HandlePropagationPolicyCreationOrUpdate and HandleClusterPropagationPolicyCreationOrUpdate to adapt to the sets.Set return type from GetMatching.
Activity
  • The pull request is currently marked as 'Work In Progress' (WIP).
  • No specific review comments or discussions have been recorded yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the waitingObjects data structure to use a sync.Map with more granular locking, aiming to reduce lock contention. However, the current implementation introduces a cross-namespace resource interference vulnerability and a Denial of Service risk. Specifically, the GetMatching function incorrectly handles empty namespaces in resource selectors, leading to unintended matching across namespaces for namespaced policies. Additionally, holding a read lock during potentially slow API server lookups can block other controller operations, contributing to the Denial of Service risk.

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 65 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.05%. Comparing base (4403fa4) to head (22cde24).

Files with missing lines Patch % Lines
pkg/detector/detector.go 0.00% 65 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7175      +/-   ##
==========================================
- Coverage   42.08%   42.05%   -0.04%     
==========================================
  Files         871      871              
  Lines       53301    53341      +40     
==========================================
- Hits        22432    22431       -1     
- Misses      29184    29224      +40     
- Partials     1685     1686       +1     
Flag Coverage Δ
unittests 42.05% <0.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@zhzhuang-zju
Copy link
Contributor Author

/retitle performance(detector): reduce lock contention in waitingObkects matching logic

@karmada-bot karmada-bot changed the title WIP: performance(detector): reduce lock contention in waitingObkects matching logic performance(detector): reduce lock contention in waitingObkects matching logic Feb 9, 2026
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants