fix: recompute FullyApplied condition inside retry loop to prevent stale status on concurrent rescheduling #7203
Ady0333 wants to merge 1 commit into karmada-io:master
Conversation
Commit message: fix: recompute FullyApplied condition inside retry loop to prevent stale status on concurrent rescheduling
Signed-off-by: Ady0333 <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED.
@XiShanYongYe-Chang, the PR is ready. Please review it when you have time. Happy to make any changes if needed!
Pull request overview
This PR fixes a race condition in status updates where the FullyApplied condition could be incorrectly set to True during concurrent rescheduling operations. The issue occurred because the condition was computed using a stale cluster list before UpdateStatus() internally re-fetched the latest binding spec.
Changes:
- Move the `generateFullyAppliedCondition()` call inside the mutation closure in `AggregateResourceBindingWorkStatus` so it evaluates against the freshly re-fetched `binding.Spec`
- Apply the identical fix to `AggregateClusterResourceBindingWorkStatus` for consistency
- Update comments to clarify why the condition is recomputed inside the closure
Code Review
This pull request correctly fixes a race condition where the FullyApplied condition on resource bindings could become stale due to concurrent updates. Moving the condition generation inside the retry.RetryOnConflict loop is the right approach. I've left suggestions to further improve the robustness of the status update logic by also re-calculating the aggregated statuses within the same retry loop, which would prevent other transient status inconsistencies.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #7203 +/- ##
==========================================
- Coverage 42.03% 42.03% -0.01%
==========================================
Files 874 874
Lines 53454 53454
==========================================
- Hits 22471 22468 -3
- Misses 29294 29296 +2
- Partials 1689 1690 +1
XiShanYongYe-Chang left a comment:
Hi @Ady0333
Since the same element in the queue will be merged, and concurrent processing will not be performed on the same binding, there should be no such problem in this logic.
Hello @XiShanYongYe-Chang, I see your point: if the informer cache is already updated when the controller processes the event, the controller should not reach this path with a missing resource. In that case, the NotFound scenario may indeed not occur in practice. Please let me know if you think this PR should be closed, or if there is another situation where handling NotFound would still be useful.
I think we can close it.
Fix stale FullyApplied condition during concurrent rescheduling
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a status inconsistency where the `FullyApplied` condition could be incorrectly set to `True` during concurrent rescheduling.

In `AggregateResourceBindingWorkStatus` and `AggregateClusterResourceBindingWorkStatus`, the `generateFullyAppliedCondition()` result was previously computed outside the `retry.RetryOnConflict` → `UpdateStatus()` mutation closure.

Since `UpdateStatus()` internally re-fetches the latest binding before applying the mutation, a concurrent scheduler update to `spec.clusters` could cause the condition to be evaluated using a stale cluster list, while the status update was applied to a newer binding.

This could result in `FullyApplied=True` being written even though newly scheduled clusters had not finished applying.
The fix moves the condition generation inside the mutation closure so it always evaluates against the freshly re-fetched `binding.Spec`, ensuring correctness under concurrent updates.

Special notes for your reviewer:
Release-note:
NONE