WIP: fix(scheduler): statement handling and allocation priority for pipelined jobs#5045
WIP: fix(scheduler): statement handling and allocation priority for pipelined jobs#5045hajnalmt wants to merge 7 commits intovolcano-sh:masterfrom
Conversation
The statement logic persists the pipelined job state now. This update will help allocate to prefer pipelined tasks properly. Cache has a Pipeline interface to update the nominatedNodeName field. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Fix the three-way statement handling in allocateResources: - Commit when JobReady (full allocation) - Commit when JobPipelined (persist pipeline state for next cycle) - Discard when neither (roll back failed allocation attempts) Previously, when a job had pipelined tasks but was not ready, the statement was neither committed nor discarded, causing phantom pipelined state in the session without persisting to cache. Also adds rollback handling for failed Pipeline operations. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Summary of ChangesHello @hajnalmt, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves critical issues in Volcano's scheduler concerning the lifecycle and resource management of pipelined jobs. Previously, pipelined tasks suffered from a lack of persistent state, leading to phantom resource reservations and the inability to leverage Kubernetes' Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request significantly improves the scheduler's handling of pipelined jobs by fixing issues like phantom resource reservations, lost pipeline decisions, and lack of priority. It introduces a two-pass allocation strategy, enhances statement handling for committing/discarding allocations, and adds Cache.Pipeline to persist NominatedNodeName. However, two security vulnerabilities were identified: a high-severity Denial of Service in the cache due to a global lock during a network call, and a medium-severity logic error in the allocate action that skips pipelined jobs without hard topology constraints, effectively denying them service. Additionally, one bug in the implementation needs to be addressed.
Split resource allocation into separate passes for normal and pipelined jobs in the scheduler. We generate two contexts and improved clarity of job classification. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
bf43a76 to
1c79f41
Compare
There was a problem hiding this comment.
Pull request overview
This PR fixes critical issues in the allocate scheduler action related to pipelined job handling. It implements proper statement commit/discard logic for pipelined jobs, persists pipeline decisions to the cache (including setting NominatedNodeName on pods), and introduces a two-pass allocation strategy where pipelined jobs are processed before pending jobs to reduce scheduling churn.
Changes:
- Implements
Cache.Pipeline()to persist pipeline state by settingNominatedNodeNameon pods and updating task status toPipelinedin the cache - Fixes statement handling in allocate action to use three-way logic: commit on
JobReady, commit onJobPipelined, and discard otherwise (previously pipelined jobs were neither committed nor discarded) - Splits allocation into two passes: pipelined jobs first, then pending jobs, preventing pipelined tasks from competing with fresh pending jobs
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/scheduler/framework/statement.go | Implements pipeline() to call cache.Pipeline(), splits UnPipeline()/unpipeline() for proper rollback, adds error handling in Commit() for pipeline failures |
| pkg/scheduler/cache/interface.go | Adds Pipeline(task, nodeName) method to Cache interface |
| pkg/scheduler/cache/cache.go | Implements Pipeline() method that sets NominatedNodeName, updates task status to Pipelined, and adds task to node with rollback on failure |
| pkg/scheduler/actions/allocate/allocate.go | Splits buildAllocateContext() to create separate contexts for pipelined and pending jobs, adds addJobToContext() helper, implements three-way statement handling (Ready/Pipelined/Neither), processes pipelined jobs first, and includes both Pipelined and Pending tasks in worksheets with priority ordering |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Move the UpdatePodStatus call for NominatedNodeName out of the critical section and into an async goroutine, matching the pattern used by Evict(). Internal cache state mutations (task status, node resource accounting) remain synchronous under the lock. On API server update failure, resyncTask is called to reconcile. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Document the three bugs in pipelined task handling (empty pipeline() commit function, missing commit/discard for pipelined jobs, no pipelined-first allocation) and the proposed fixes (Cache.Pipeline(), three-way statement handling, dual allocateContext). Includes kube-scheduler NominatedNodeName analysis for context, Volcano session vs cache architecture explanation with TransactionContext lifecycle, and dual update safety analysis. All current-state code references link to the v1.14.0 tag. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Remove the EvictionOccurred field from TransactionContext and its parameter from Statement.Pipeline(). With Cache.Pipeline() now unconditionally setting NominatedNodeName at commit time, this eviction-only gate in TaskSchedulingReason() is no longer needed. All pipelined tasks now consistently get nominated regardless of how they were pipelined. Add a guard at the top of Statement.Pipeline() to skip re-pipelining when a task is already Pipelined on the same node. This prevents duplicate Pipeline events on podgroups every scheduling session, caused by organizeJobWorksheet() including both Pipelined and Pending tasks. The guard is in Statement.Pipeline() rather than individual actions so all callers are protected uniformly. Update the design document with sections for both changes, revised NominatedNodeName safety analysis, and corrected Affected Code table. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Extend allocatedPodInCache() guard to cover Pipelined status, preventing informer UpdatePod events from reverting pipelined tasks back to Pending via the deletePod+addPod reconstruction path. Remove nominatedNodeName from TaskSchedulingReason() and taskUnschedulable() since Cache.Pipeline() is the sole authoritative setter of pod.Status.NominatedNodeName. This eliminates the redundant UpdatePodStatus call that triggered a secondary informer race. Add regression tests for both fixes. Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
What type of PR is this?
/kind bug
/kind rfe
What this PR does / why we need it:
Fixes statement (transaction) handling in the
allocatescheduler action so that pipelined jobs are correctly committed or rolled back, implements cache persistence for pipeline decisions (includingNominatedNodeName), and prioritizes pipelined jobs in the allocation loop so they reclaim their nominated nodes before new jobs compete for the same resources.Problem:
The allocate action only commits statements when a job reaches
JobReadystatus. When a job isJobPipelined(some tasks pipelined, butminAvailablenot met), the statement is silently dropped — neither committed nor discarded. This causes:Statement.pipeline()was a no-op (empty function body), so even on commit, pipeline state was never persisted to cache.NominatedNodeNamewas never set on pods, meaning Kubernetes preemption signaling is completely non-functional for Volcano-scheduled pods.ReadyorPipelinedstatus have their statements silently dropped, leaking session-level resource reservations.Fix (3 commits):
Commit 1:
statement: new pipeline API for statement handlingCache.Pipeline()to persist pipeline decisions: setsNominatedNodeNameon the pod viaStatusUpdater, updates task status toPipelined, and adds the task to the target node with proper rollback on failure.Pipeline(task, nodeName)to theCacheinterface.Statement.pipeline()(was empty) to callcache.Pipeline()on commit.UnPipeline()into publicUnPipeline()and privateunpipeline()(just rolls back), so thatDiscard()can roll back properly later.organizeJobWorksheet()to include bothPipelinedandPendingtasks, with pipelined tasks given higher priority so they're re-allocated to the same nodes first.Commit 2:
scheduler: fix statement handling for pipelined jobs in allocateif stmt != nil && ssn.JobReady(job)with three-way logic in both hard-topology and non-hard-topology scheduling paths:JobReady→stmt.Commit()(full allocation, mark dirty, record decision)JobPipelined→stmt.Commit()(persist pipeline state for next cycle)stmt.Discard()(roll back all session changes)stmt.Pipeline()fails during individual task allocation inallocateResourcesForTask().Commit 3:
refactor: allocate pipelined jobs before pending jobsbuildAllocateContext()to produce two separateallocateContextinstances: one for pipelined jobs and one for normal (pending) jobs.Execute(), so previously-pipelined jobs reclaim their nominated nodes before new pending jobs enter the allocation loop.addJobToContext()helper to eliminate duplicated job-to-queue insertion logic.This ordering ensures that pipelined jobs — which already have
NominatedNodeNameset from a previous cycle — are given priority to re-acquire their target nodes, reducing scheduling churn and preventing cyclic eviction scenarios.Which issue(s) this PR fixes:
Fixes: #4947
Fixes: #5044
Related: #4936 (this PR supersedes the approach in that PR by addressing the root cause — missing cache persistence and allocation priority — rather than only adjusting eviction ordering)
Special notes for your reviewer:
allocateForJob()path (hard topology / subjob policy) already checksssn.JobReady(job) || ssn.JobPipelined(job)when saving backup solutions, but the outerallocateResources()only committed onJobReady. This inconsistency meant successful pipeline solutions were computed but never committed.Discard()path now correctly uses the privateunpipeline()(without clearinglastOps) to avoid double-delete in operation tracking, matching the existingunallocate/UnAllocatesplit pattern.Pipeline()implementation follows the same pattern as the existingEvict()method: lock, find job/task/node, update status, update node, emit event.NominatedNodeNameanywhere, so theallocateaction's existing branch that checks for it (if task.NominatedNodeName != "") was dead code. With this PR, that branch becomes reachable.I am fixing the scheduler action first, we can copy the code to the agent-scheduler action too, additionally if the code goes through the review process, I can add UTs too. Shall I write a design doc too?
Does this PR introduce a user-facing change?