Skip to content

WIP: fix(scheduler): statement handling and allocation priority for pipelined jobs#5045

Open
hajnalmt wants to merge 7 commits intovolcano-sh:masterfrom
hajnalmt:fix/pipelined-statement-handling
Open

WIP: fix(scheduler): statement handling and allocation priority for pipelined jobs#5045
hajnalmt wants to merge 7 commits intovolcano-sh:masterfrom
hajnalmt:fix/pipelined-statement-handling

Conversation

@hajnalmt
Copy link
Contributor

@hajnalmt hajnalmt commented Feb 13, 2026

What type of PR is this?

/kind bug
/kind rfe

What this PR does / why we need it:

Fixes statement (transaction) handling in the allocate scheduler action so that pipelined jobs are correctly committed or rolled back, implements cache persistence for pipeline decisions (including NominatedNodeName), and prioritizes pipelined jobs in the allocation loop so they reclaim their nominated nodes before new jobs compete for the same resources.

Problem:

The allocate action only commits statements when a job reaches JobReady status. When a job is JobPipelined (some tasks pipelined, but minAvailable not met), the statement is silently dropped — neither committed nor discarded. This causes:

  1. Phantom resource reservations — Pipelined tasks reserve resources on nodes in the session, but since the statement is never committed or discarded, these reservations persist in the session without being reflected in cache.
  2. Lost pipeline decisionsStatement.pipeline() was a no-op (empty function body), so even on commit, pipeline state was never persisted to cache. NominatedNodeName was never set on pods, meaning Kubernetes preemption signaling is completely non-functional for Volcano-scheduled pods.
  3. No rollback path — Jobs that fail to reach either Ready or Pipelined status have their statements silently dropped, leaking session-level resource reservations.
  4. No allocation priority for pipelined jobs — All jobs (pipelined and pending) were processed in a single pass with the same priority queue. Previously-pipelined jobs competed with fresh pending jobs for resources, causing unnecessary re-evaluation and potential displacement of already-pipelined tasks. This also contributes to cyclic eviction scenarios where lower-priority pipelined tasks evict other lower-priority tasks (see Cyclic eviction issue in the capacity plugin where pods could be repeatedly evicted and rescheduled in a loop, causing scheduling instability. #4947, fix(capacity): ensure allocated resources remain above deserved after evction #4936).

Fix (3 commits):

Commit 1: statement: new pipeline API for statement handling

  • Implement Cache.Pipeline() to persist pipeline decisions: sets NominatedNodeName on the pod via StatusUpdater, updates task status to Pipelined, and adds the task to the target node with proper rollback on failure.
  • Add Pipeline(task, nodeName) to the Cache interface.
  • Implement Statement.pipeline() (was empty) to call cache.Pipeline() on commit.
  • Split UnPipeline() into public UnPipeline() and private unpipeline() (just rolls back), so that Discard() can roll back properly later.
  • Modify organizeJobWorksheet() to include both Pipelined and Pending tasks, with pipelined tasks given higher priority so they're re-allocated to the same nodes first.

Commit 2: scheduler: fix statement handling for pipelined jobs in allocate

  • Replace the two-way if stmt != nil && ssn.JobReady(job) with three-way logic in both hard-topology and non-hard-topology scheduling paths:
    • JobReadystmt.Commit() (full allocation, mark dirty, record decision)
    • JobPipelinedstmt.Commit() (persist pipeline state for next cycle)
    • Neither → stmt.Discard() (roll back all session changes)
  • Add rollback handling when stmt.Pipeline() fails during individual task allocation in allocateResourcesForTask().

Commit 3: refactor: allocate pipelined jobs before pending jobs

  • Split buildAllocateContext() to produce two separate allocateContext instances: one for pipelined jobs and one for normal (pending) jobs.
  • Process the pipelined context first in Execute(), so previously-pipelined jobs reclaim their nominated nodes before new pending jobs enter the allocation loop.
  • Extract addJobToContext() helper to eliminate duplicated job-to-queue insertion logic.

This ordering ensures that pipelined jobs — which already have NominatedNodeName set from a previous cycle — are given priority to re-acquire their target nodes, reducing scheduling churn and preventing cyclic eviction scenarios.

Which issue(s) this PR fixes:

Fixes: #4947
Fixes: #5044
Related: #4936 (this PR supersedes the approach in that PR by addressing the root cause — missing cache persistence and allocation priority — rather than only adjusting eviction ordering)

Special notes for your reviewer:

  • The allocateForJob() path (hard topology / subjob policy) already checks ssn.JobReady(job) || ssn.JobPipelined(job) when saving backup solutions, but the outer allocateResources() only committed on JobReady. This inconsistency meant successful pipeline solutions were computed but never committed.
  • The Discard() path now correctly uses the private unpipeline() (without clearing lastOps) to avoid double-delete in operation tracking, matching the existing unallocate/UnAllocate split pattern.
  • The cache Pipeline() implementation follows the same pattern as the existing Evict() method: lock, find job/task/node, update status, update node, emit event.
  • The two-pass allocation (pipelined first, then pending) is the key to preventing cyclic eviction: by the time pending jobs are evaluated, pipelined jobs have already secured their nodes, so lower-priority evictions cannot displace them.
  • The current codebase never sets NominatedNodeName anywhere, so the allocate action's existing branch that checks for it (if task.NominatedNodeName != "") was dead code. With this PR, that branch becomes reachable.

I am fixing the scheduler action first, we can copy the code to the agent-scheduler action too, additionally if the code goes through the review process, I can add UTs too. Shall I write a design doc too?

Does this PR introduce a user-facing change?

Fix pipelined job statement handling in allocate action: statements for pipelined jobs are now correctly committed to persist NominatedNodeName and pipeline state, failed allocations are properly discarded to prevent phantom resource reservations, and pipelined jobs are allocated before pending jobs to prevent scheduling churn and cyclic eviction.

The statement logic persists the pipelined job state now.
This update will help allocate to prefer pipelined tasks properly.
Cache has a Pipeline interface to update the nominatedNodeName field.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Fix the three-way statement handling in allocateResources:
- Commit when JobReady (full allocation)
- Commit when JobPipelined (persist pipeline state for next cycle)
- Discard when neither (roll back failed allocation attempts)

Previously, when a job had pipelined tasks but was not ready,
the statement was neither committed nor discarded, causing
phantom pipelined state in the session without persisting to cache.

Also adds rollback handling for failed Pipeline operations.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Copilot AI review requested due to automatic review settings February 13, 2026 11:13
@volcano-sh-bot volcano-sh-bot added kind/bug Categorizes issue or PR as related to a bug. kind/RFE Categorizes issue or PR as related to design. labels Feb 13, 2026
@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 13, 2026
@gemini-code-assist
Copy link

Summary of Changes

Hello @hajnalmt, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical issues in Volcano's scheduler concerning the lifecycle and resource management of pipelined jobs. Previously, pipelined tasks suffered from a lack of persistent state, leading to phantom resource reservations and the inability to leverage Kubernetes' NominatedNodeName for preemption signaling. The changes introduce a robust mechanism to correctly commit or discard statements for pipelined jobs, ensuring their decisions are properly cached and NominatedNodeName is set. Furthermore, it prioritizes the allocation of already-pipelined jobs, allowing them to secure their resources before new jobs are considered, which significantly improves scheduling stability and prevents inefficient resource re-evaluation and cyclic eviction scenarios.

Highlights

  • Pipelined Job Cache Persistence: Implemented a new Cache.Pipeline() API to persist pipeline decisions, which sets NominatedNodeName on pods and updates task status to Pipelined, ensuring pipeline state is not lost.
  • Refined Statement Handling: Refactored statement handling in the allocate action to correctly commit pipelined job states, discard failed allocations, and commit fully ready jobs, addressing issues of phantom resource reservations and missing rollbacks.
  • Prioritized Allocation for Pipelined Jobs: Introduced a two-pass allocation strategy where previously pipelined jobs are processed and allowed to reclaim their nominated nodes before new pending jobs compete for resources, reducing scheduling churn and cyclic evictions.
  • Task Priority in Worksheets: Modified organizeJobWorksheet() to include both Pipelined and Pending tasks, with Pipelined tasks now receiving higher priority in the allocation queue.
  • Improved Rollback Mechanism: Split the UnPipeline() method into public and private functions, and updated Discard() to use the private rollback method, preventing double-clearing of operation tracking.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • pkg/scheduler/actions/allocate/allocate.go
    • Modified Execute to process pipelined jobs before pending jobs by calling allocateResources twice with separate contexts.
    • Refactored buildAllocateContext to return two distinct allocateContext instances for pipelined and normal jobs, and extracted job-to-context insertion logic into addJobToContext.
    • Updated organizeJobWorksheet to include Pipelined tasks and assign them higher priority than Pending tasks in the priority queue.
    • Adjusted allocateResources to apply three-way logic for statement handling: commit for JobReady, commit for JobPipelined, and discard otherwise.
    • Added rollback handling (stmt.UnPipeline) in allocateResourcesForTask if the stmt.Pipeline operation fails.
  • pkg/scheduler/cache/cache.go
    • Implemented a new Pipeline method to persist pipelined task decisions, which involves setting NominatedNodeName on the pod, updating the task status to Pipelined, and adding the task to the target node.
  • pkg/scheduler/cache/interface.go
    • Added the Pipeline method signature to the Cache interface.
  • pkg/scheduler/framework/statement.go
    • Implemented the internal pipeline method to call ssn.cache.Pipeline for persisting pipeline decisions.
    • Refactored UnPipeline into a public UnPipeline and a private unpipeline method to manage rollback logic more precisely.
    • Updated the Discard method to utilize the private unpipeline method for rolling back pipelined tasks.
    • Enhanced the Commit method's Pipeline case with error handling and a rollback mechanism using unpipeline if the pipeline operation fails.
Activity
  • The pull request is structured into three distinct commits, each addressing a specific part of the overall solution.
  • The author, hajnalmt, has indicated plans to potentially port the changes to the agent-scheduler action and add unit tests after the review process is complete.
  • The author also raised a question regarding the necessity of writing a design document for these changes.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@hajnalmt hajnalmt changed the title https://github.com/volcano-sh/volcano/issues/5044 fix(scheduler): fix statement handling and allocation priority for pipelined jobs Feb 13, 2026
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly improves the scheduler's handling of pipelined jobs by fixing issues like phantom resource reservations, lost pipeline decisions, and lack of priority. It introduces a two-pass allocation strategy, enhances statement handling for committing/discarding allocations, and adds Cache.Pipeline to persist NominatedNodeName. However, two security vulnerabilities were identified: a high-severity Denial of Service in the cache due to a global lock during a network call, and a medium-severity logic error in the allocate action that skips pipelined jobs without hard topology constraints, effectively denying them service. Additionally, one bug in the implementation needs to be addressed.

Split resource allocation into separate passes for normal
and pipelined jobs in the scheduler. We generate two contexts and
improved clarity of job classification.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
@hajnalmt hajnalmt force-pushed the fix/pipelined-statement-handling branch from bf43a76 to 1c79f41 Compare February 13, 2026 11:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes critical issues in the allocate scheduler action related to pipelined job handling. It implements proper statement commit/discard logic for pipelined jobs, persists pipeline decisions to the cache (including setting NominatedNodeName on pods), and introduces a two-pass allocation strategy where pipelined jobs are processed before pending jobs to reduce scheduling churn.

Changes:

  • Implements Cache.Pipeline() to persist pipeline state by setting NominatedNodeName on pods and updating task status to Pipelined in the cache
  • Fixes statement handling in allocate action to use three-way logic: commit on JobReady, commit on JobPipelined, and discard otherwise (previously pipelined jobs were neither committed nor discarded)
  • Splits allocation into two passes: pipelined jobs first, then pending jobs, preventing pipelined tasks from competing with fresh pending jobs

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
pkg/scheduler/framework/statement.go Implements pipeline() to call cache.Pipeline(), splits UnPipeline()/unpipeline() for proper rollback, adds error handling in Commit() for pipeline failures
pkg/scheduler/cache/interface.go Adds Pipeline(task, nodeName) method to Cache interface
pkg/scheduler/cache/cache.go Implements Pipeline() method that sets NominatedNodeName, updates task status to Pipelined, and adds task to node with rollback on failure
pkg/scheduler/actions/allocate/allocate.go Splits buildAllocateContext() to create separate contexts for pipelined and pending jobs, adds addJobToContext() helper, implements three-way statement handling (Ready/Pipelined/Neither), processes pipelined jobs first, and includes both Pipelined and Pending tasks in worksheets with priority ordering

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hajnalmt hajnalmt changed the title fix(scheduler): fix statement handling and allocation priority for pipelined jobs fix(scheduler): statement handling and allocation priority for pipelined jobs Feb 13, 2026
@hajnalmt hajnalmt changed the title fix(scheduler): statement handling and allocation priority for pipelined jobs WIP: fix(scheduler): statement handling and allocation priority for pipelined jobs Feb 14, 2026
@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 14, 2026
Move the UpdatePodStatus call for NominatedNodeName out of the
critical section and into an async goroutine, matching the pattern
used by Evict(). Internal cache state mutations (task status, node
resource accounting) remain synchronous under the lock. On API
server update failure, resyncTask is called to reconcile.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Document the three bugs in pipelined task handling (empty pipeline()
commit function, missing commit/discard for pipelined jobs, no
pipelined-first allocation) and the proposed fixes (Cache.Pipeline(),
three-way statement handling, dual allocateContext).

Includes kube-scheduler NominatedNodeName analysis for context,
Volcano session vs cache architecture explanation with
TransactionContext lifecycle, and dual update safety analysis.

All current-state code references link to the v1.14.0 tag.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wpeng102 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 17, 2026
Remove the EvictionOccurred field from TransactionContext and its
parameter from Statement.Pipeline(). With Cache.Pipeline() now
unconditionally setting NominatedNodeName at commit time, this
eviction-only gate in TaskSchedulingReason() is no longer needed.
All pipelined tasks now consistently get nominated regardless of
how they were pipelined.

Add a guard at the top of Statement.Pipeline() to skip re-pipelining
when a task is already Pipelined on the same node. This prevents
duplicate Pipeline events on podgroups every scheduling session,
caused by organizeJobWorksheet() including both Pipelined and
Pending tasks. The guard is in Statement.Pipeline() rather than
individual actions so all callers are protected uniformly.

Update the design document with sections for both changes, revised
NominatedNodeName safety analysis, and corrected Affected Code table.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Extend allocatedPodInCache() guard to cover Pipelined status, preventing
informer UpdatePod events from reverting pipelined tasks back to Pending
via the deletePod+addPod reconstruction path.

Remove nominatedNodeName from TaskSchedulingReason() and
taskUnschedulable() since Cache.Pipeline() is the sole authoritative
setter of pod.Status.NominatedNodeName. This eliminates the redundant
UpdatePodStatus call that triggered a secondary informer race.

Add regression tests for both fixes.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. kind/RFE Categorizes issue or PR as related to design. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

2 participants