
feat: add opt-in scheduling gates for queue capacity control#5033

Draft
devzizu wants to merge 22 commits into volcano-sh:master from devzizu:ft-proposal-4710

Conversation

@devzizu

@devzizu devzizu commented Feb 8, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds opt-in scheduling gates for queue capacity control to prevent cluster autoscalers from triggering unnecessary scale-ups when pods are waiting for queue admission rather than cluster capacity.

When a pod opts in via the scheduling.volcano.sh/queue-allocation-gate: "true" annotation, Volcano adds a scheduling gate that delays the Unschedulable condition until the queue has capacity.
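The opt-in step can be illustrated with a minimal sketch of the mutation the webhook might produce. It uses only the standard library and hypothetical names (`gatePatch`, `patchOp`, `gate`), not the PR's actual `patchSchedulingGates` code, and it assumes the gate name equals the annotation key, as the PR's `QueueAllocationGateKey` constant suggests.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Key used for both the opt-in annotation and the gate name (assumption
// based on the PR description, not copied from the PR's code).
const queueAllocationGate = "scheduling.volcano.sh/queue-allocation-gate"

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// gate mirrors the shape of an entry in spec.schedulingGates.
type gate struct {
	Name string `json:"name"`
}

// gatePatch returns the operations that add the Volcano gate, or nil when
// the pod has not opted in or already carries the gate.
func gatePatch(annotations map[string]string, existing []gate) []patchOp {
	if annotations[queueAllocationGate] != "true" {
		return nil
	}
	for _, g := range existing {
		if g.Name == queueAllocationGate {
			return nil
		}
	}
	if len(existing) == 0 {
		// The array may be absent, so create it together with the gate.
		return []patchOp{{Op: "add", Path: "/spec/schedulingGates", Value: []gate{{Name: queueAllocationGate}}}}
	}
	// "-" appends to an existing array.
	return []patchOp{{Op: "add", Path: "/spec/schedulingGates/-", Value: gate{Name: queueAllocationGate}}}
}

func main() {
	ops := gatePatch(map[string]string{queueAllocationGate: "true"}, nil)
	out, err := json.Marshal(ops)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Pods without the annotation yield no patch, which is what keeps the feature backward compatible.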

The design is further described in #4727.

Which issue(s) this PR fixes:

Fixes #4710

Special notes for your reviewer:

  • This feature is opt-in via pod annotation to ensure backward compatibility.
  • A simple E2E test was added to validate capacity reservation behavior, but further test cases may be required.

Does this PR introduce a user-facing change?

Add opt-in scheduling gates for queue capacity control. Pods can now use the scheduling.volcano.sh/queue-allocation-gate: "true" annotation to prevent false autoscaler scale-ups when waiting for queue admission.

@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 8, 2026
@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 8, 2026
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign shinytang6 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 8, 2026
@gemini-code-assist

Summary of Changes

Hello @devzizu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where Volcano's scheduler could cause unnecessary cluster autoscaler scale-ups by marking pods as 'Unschedulable' even when they were only waiting for queue capacity. By leveraging Kubernetes scheduling gates, it introduces an opt-in mechanism to delay the 'Unschedulable' condition until a pod is truly blocked by a lack of cluster resources, while also ensuring proper queue capacity accounting for pods that have passed initial admission but are awaiting node allocation.

Highlights

  • Opt-in Scheduling Gates for Queue Capacity Control: Introduces an opt-in mechanism using the scheduling.volcano.sh/queue-allocation-gate: "true" annotation. Pods with this annotation will have a Volcano-specific scheduling gate added by the MutatingAdmissionWebhook at creation time.
  • Asynchronous Gate Removal: The scheduler now asynchronously removes the Volcano scheduling gate from pods that fail predicate checks or prioritization (i.e., cannot find a suitable node). This signals to cluster autoscalers that a scale-up is needed, preventing false scale-ups when pods are merely waiting for queue admission.
  • Queue Capacity Reservation for Ungated Pods: Implements a new mechanism in the capacity plugin to reserve queue capacity for pods that have had their Volcano scheduling gate removed but are still pending. This prevents other pods from consuming the capacity intended for the ungated pod, addressing a potential race condition.
  • Gate Removal on Successful Bind: For pods that successfully pass all allocation checks and are bound to a node, the Volcano scheduling gate is removed during the binding operation.
  • Enhanced Scheduler Logic: Updates the scheduler's allocate action to manage the lifecycle of these scheduling gates, including skipping externally gated pods, queuing gate removals, and integrating with the new capacity reservation logic.


Changelog
  • docs/design/scheduling-gates-queue-admission.md
    • Added a new design document detailing the motivation, proposal, and high-level implementation of gate-controlled scheduling for cluster autoscaler compatibility.
  • pkg/scheduler/actions/allocate/allocate.go
    • Introduced asynchronous workers for scheduling gate removal to avoid blocking the main scheduler loop.
    • Modified allocateResourcesForTasks to enqueue gate removal for pods that fail pre-predicate, predicate, or prioritization steps.
    • Updated organizeJobWorksheet to only skip tasks with external scheduling gates, allowing Volcano-managed gates to be processed.
    • Added Initialize and UnInitialize methods to manage the lifecycle of the async gate removal workers.
  • pkg/scheduler/api/helpers.go
    • Added HasOnlyVolcanoSchedulingGate helper function to identify pods with only the Volcano-specific scheduling gate.
    • Added HasQueueAllocationGateAnnotation helper function to check for the opt-in annotation.
  • pkg/scheduler/api/job_info.go
    • Added RemoveGateDuringBind boolean field to TaskInfo to signal if a gate should be removed during the bind operation.
    • Modified GetSchGatedPodResources to exclude pods with only the Volcano scheduling gate from resource deduction, ensuring they are counted in in-queue resources.
  • pkg/scheduler/api/types.go
    • Defined CleanupReservationsFn type for functions that clean up reserved resources before a statement commit.
  • pkg/scheduler/cache/cache.go
    • Updated the Bind method to remove the Volcano scheduling gate from a pod if task.RemoveGateDuringBind is true before performing the standard bind operation.
    • Adjusted RecordJobStatusEvent to skip updating the PodScheduled condition if taskInfo.SchGated is true, respecting the scheduling gate's role.
  • pkg/scheduler/cache/util.go
    • Added RemoveVolcanoSchGate function to programmatically remove the Volcano scheduling gate from a pod using a JSON patch.
    • Added AddVolcanoSchGate function to programmatically add the Volcano scheduling gate to a pod using a JSON patch.
  • pkg/scheduler/framework/session.go
    • Added cleanupReservationsFns map to store registered cleanup functions.
  • pkg/scheduler/framework/session_plugins.go
    • Added AddCleanupReservationsFn to allow plugins to register cleanup logic.
    • Implemented CleanupReservations method to execute all registered cleanup functions before a statement commit.
  • pkg/scheduler/framework/statement.go
    • Added Name() and Task() methods to the operation struct for easier access to operation details.
  • pkg/scheduler/plugins/capacity/capacity.go
    • Introduced queueGateReservedTasks cache to track pods that have passed capacity checks but are not yet allocated, reserving their queue capacity.
    • Added addTaskToReservedCache, removeTaskFromReservedCache, and buildQueueReservedTasksCache methods to manage this reservation cache.
    • Implemented queueAllocatableWithReserved to include reserved resources in capacity checks.
    • Integrated the reservation cleanup mechanism by registering a CleanupReservationsFn in OnSessionOpen.
  • pkg/webhooks/admission/pods/mutate/mutate_pod.go
    • Added patchSchedulingGates function to inject the scheduling.volcano.sh/queue-allocation-gate into a pod's spec.schedulingGates if the opt-in annotation is present.
  • staging/src/volcano.sh/apis/pkg/apis/scheduling/v1beta1/labels.go
    • Defined QueueAllocationGateKey constant for the new scheduling gate annotation and name.
  • test/e2e/schedulingaction/allocate.go
    • Added a new E2E test case to validate the queue capacity reservation behavior, demonstrating that an unschedulable pod with a removed gate reserves capacity and blocks other pods.
  • test/e2e/util/job.go
    • Added Annotations field to TaskSpec to support setting annotations on pods created in E2E tests.
  • test/e2e/util/pod.go
    • Added PodHasOnlyVolcanoSchedulingGate and HasSchedulingGatedCondition helper functions for E2E tests to check pod scheduling gate status.
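Several changelog entries depend on distinguishing a pod that carries only the Volcano gate from one that also carries gates owned by other controllers. A minimal sketch of that distinction follows; the function names and the plain `[]string` representation are illustrative and do not reproduce the PR's `HasOnlyVolcanoSchedulingGate` signature.

```go
package main

import "fmt"

// Gate name assumed to match the opt-in annotation key.
const volcanoGate = "scheduling.volcano.sh/queue-allocation-gate"

// hasOnlyVolcanoGate reports whether the Volcano gate is the single gate
// left on the pod, i.e. the scheduler alone decides when it is released.
func hasOnlyVolcanoGate(gates []string) bool {
	return len(gates) == 1 && gates[0] == volcanoGate
}

// hasExternalGate reports whether any gate is owned by someone else; such
// pods should be skipped because Volcano must not release them.
func hasExternalGate(gates []string) bool {
	for _, g := range gates {
		if g != volcanoGate {
			return true
		}
	}
	return false
}

func main() {
	mixed := []string{volcanoGate, "example.com/other-gate"}
	fmt.Println(hasOnlyVolcanoGate(mixed), hasExternalGate(mixed))
}
```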
Activity
  • The pull request introduces a new feature to address a known issue (Cluster Autoscaler node scale-up for Pods that exceed Queue's capability #4710) regarding cluster autoscaler compatibility.
  • A detailed design document (docs/design/scheduling-gates-queue-admission.md) was added to explain the motivation and implementation of this feature.
  • A simple E2E test was included to validate the new capacity reservation behavior.

@devzizu devzizu marked this pull request as ready for review February 8, 2026 14:31
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 8, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed feature to add opt-in scheduling gates for queue capacity control, improving compatibility with cluster autoscalers. While the overall design is comprehensive and includes a valuable E2E test, two significant security-related issues were identified: a data race in the asynchronous gate removal workers that could lead to a scheduler panic (Denial of Service), and a race condition in the JSON Patch logic for removing scheduling gates that could result in the unauthorized removal of security-related gates (Broken Access Control). Both of these critical security issues must be addressed before merging. Additionally, there are minor suggestions for a grammatical correction in the design document and replacing context.TODO() with context.Background() in background operations for improved context management.
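One common way to make an index-based JSON Patch removal safe against concurrent changes is to put an RFC 6902 `test` operation in front of the `remove`, so the whole patch is rejected if the gate at that index is no longer the expected one. The sketch below shows only how such a patch is built. `removeGatePatch` is a hypothetical helper, and the source does not say whether the PR's eventual fix takes this form.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Gate name assumed to match the opt-in annotation key.
const volcanoGate = "scheduling.volcano.sh/queue-allocation-gate"

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// removeGatePatch builds a patch that removes the Volcano gate at the index
// observed in gates. The leading "test" op fails the patch atomically if
// another writer has shifted or replaced the entry in the meantime.
// The boolean is false when the pod has no Volcano gate.
func removeGatePatch(gates []string) ([]byte, bool, error) {
	for i, name := range gates {
		if name != volcanoGate {
			continue
		}
		path := fmt.Sprintf("/spec/schedulingGates/%d", i)
		ops := []patchOp{
			{Op: "test", Path: path + "/name", Value: volcanoGate},
			{Op: "remove", Path: path},
		}
		b, err := json.Marshal(ops)
		return b, true, err
	}
	return nil, false, nil
}

func main() {
	patch, ok, err := removeGatePatch([]string{"example.com/other-gate", volcanoGate})
	if err != nil {
		panic(err)
	}
	fmt.Println(ok, string(patch))
}
```

If the `test` fails, the caller can re-read the pod and rebuild the patch rather than risk removing a gate owned by another controller.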

@JesseStutler
Member

I still have some opinions on the proposal, so please take a look at #4727, @devzizu. Thanks! We may need to discuss it further at the community meeting.

Contributor

Copilot AI left a comment


Pull request overview

This PR implements opt-in scheduling gates for queue capacity control to prevent cluster autoscalers from triggering unnecessary scale-ups when pods are waiting for queue admission rather than cluster capacity. The solution addresses issue #4710 by using Kubernetes scheduling gates to delay the Unschedulable condition until pods pass queue capacity checks.

Changes:

  • Added opt-in annotation scheduling.volcano.sh/queue-allocation-gate: "true" for pods to participate in gate-based queue admission control
  • Implemented asynchronous gate removal workers in the allocate action to remove gates when pods pass capacity checks
  • Enhanced capacity plugin with reserved task tracking to prevent race conditions where ungated-but-unscheduled pods could allow queue over-allocation
  • Added comprehensive E2E tests and design documentation
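The reserved-task tracking can be sketched as a capacity check that counts other tasks' reservations as if they were already allocated. The types and names below (`resources`, `queueState`, `allocatableWithReserved`) are hypothetical simplifications; the PR's `queueAllocatableWithReserved` works on Volcano's own resource types and may differ in detail.

```go
package main

import "fmt"

// resources maps a resource name (e.g. "cpu" in millicores) to a quantity.
type resources map[string]int64

// queueState is a simplified stand-in for a queue's accounting data.
type queueState struct {
	capability resources            // upper bound for the queue
	allocated  resources            // resources of tasks already placed
	reserved   map[string]resources // ungated but still pending tasks, by task ID
}

// allocatableWithReserved reports whether req fits into the queue once
// reservations held by other tasks are counted as used. A task's own
// reservation is ignored so it is not charged twice.
func (q *queueState) allocatableWithReserved(taskID string, req resources) bool {
	for name, want := range req {
		used := q.allocated[name]
		for id, r := range q.reserved {
			if id == taskID {
				continue
			}
			used += r[name]
		}
		if used+want > q.capability[name] {
			return false
		}
	}
	return true
}

func main() {
	q := &queueState{
		capability: resources{"cpu": 4000},
		allocated:  resources{"cpu": 1000},
		reserved:   map[string]resources{"task-a": {"cpu": 2000}},
	}
	// task-b is blocked by task-a's reservation; task-a itself still fits.
	fmt.Println(q.allocatableWithReserved("task-b", resources{"cpu": 2000}))
	fmt.Println(q.allocatableWithReserved("task-a", resources{"cpu": 2000}))
}
```

Without the reservation, task-b would pass the check (1000 + 2000 ≤ 4000) and the queue could end up over-allocated once task-a is finally placed.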

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| staging/src/volcano.sh/apis/pkg/apis/scheduling/v1beta1/labels.go | Added QueueAllocationGateKey constant for the opt-in annotation |
| pkg/scheduler/api/helpers.go | Added helper functions to check for Volcano scheduling gates and annotations |
| pkg/webhooks/admission/pods/mutate/mutate_pod.go | Added webhook logic to inject scheduling gates for opted-in pods |
| pkg/scheduler/actions/allocate/allocate.go | Implemented async worker infrastructure and gate removal logic in allocate action |
| pkg/scheduler/plugins/capacity/capacity.go | Added reserved task cache to account for ungated pods in capacity checks |
| pkg/scheduler/framework/statement.go | Added Name() and Task() methods to operation struct for cleanup callbacks |
| pkg/scheduler/framework/session.go | Added cleanupReservationsFns map to session |
| pkg/scheduler/framework/session_plugins.go | Added AddCleanupReservationsFn and CleanupReservations methods |
| pkg/scheduler/api/types.go | Added CleanupReservationsFn function type |
| pkg/scheduler/api/job_info.go | Modified GetSchGatedPodResources to exclude Volcano-only gated pods |
| pkg/scheduler/cache/util.go | Added RemoveVolcanoSchGate utility function for gate removal |
| pkg/scheduler/cache/cache.go | Added synchronous gate removal before bind as safety guarantee |
| test/e2e/schedulingaction/allocate.go | Added comprehensive E2E test for capacity reservation behavior |
| test/e2e/util/pod.go | Added helper functions for checking gate status in tests |
| test/e2e/util/job.go | Added Annotations field to TaskSpec for test configuration |
| pkg/scheduler/util/test_utils.go | Updated test utilities to handle SubGroupPolicy without NetworkTopology |
| pkg/scheduler/actions/allocate/allocate_test.go | Added unit test for SubGroupPolicy without NetworkTopology |
| docs/design/scheduling-gates-queue-admission.md | Added comprehensive design document explaining the feature |
| docs/user-guide/version-compatibility-archive.md | Added version compatibility archive document |
| README.md | Updated Kubernetes compatibility table |


@devzizu devzizu marked this pull request as draft February 15, 2026 12:33
@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 15, 2026
Contributor

@hajnalmt hajnalmt left a comment


This implementation is more than solid, thank you @devzizu! When Jesse comes back, we can schedule a meeting.

Please squash the commits and let's keep the design doc in the separate PR!

gomega.Expect(err).NotTo(gomega.HaveOccurred())
})

ginkgo.It("Unschedulable pod with removed gate reserves queue capacity and blocks other pods", func() {
Contributor

@hajnalmt hajnalmt Feb 18, 2026


This is a really good test! It's great that it covers webhook injection, gate removal, capacity reservation, and reservation cleanup in one go.

We should maybe also add a test case verifying that we don't accidentally remove other scheduling gates: a pod with multiple scheduling gates, one of which is the Volcano gate (with the annotation), should not be scheduled.

Author


Added a new test case, can you please see if it fits the purpose? One part I'm not entirely happy with is the check that both gates remain and the pod stays Pending. Right now I'm using gomega.Consistently over 10 seconds (polling every second) to assert that the pod still has both gates and is Pending. It works, but it feels a bit arbitrary (fixed window and interval) and doesn't give a clear, deterministic guarantee that the scheduler has run and decided to skip the task.

Contributor

@hajnalmt hajnalmt left a comment


/ok-to-test

This is quite well written. I don't see a reason not to test it. Let's see.

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Feb 18, 2026
commit f4db07d
Merge: 1000067 01823d1
Author: Volcano Bot <[email protected]>
Date:   Fri Feb 13 11:26:57 2026 +0800

    Merge pull request volcano-sh#5039 from volcano-sh/copilot/update-kubernetes-compatibility-map

    Reorganize Kubernetes compatibility matrix: show 5 latest versions with newest first

commit 01823d1
Author: Jesse Stutler <[email protected]>
Date:   Fri Feb 13 10:20:07 2026 +0800

    Add version compatibility archive and update README with reference link

commit 1000067
Merge: 6ba0e29 56c6901
Author: Volcano Bot <[email protected]>
Date:   Thu Feb 12 16:49:57 2026 +0800

    Merge pull request volcano-sh#5038 from JesseStutler/fix_4871

    Fixed issue where jobs with subgroups but not hard networkTopology.mode could not be scheduled.

commit f22bea4
Author: copilot-swe-agent[bot] <[email protected]>
Date:   Thu Feb 12 08:42:26 2026 +0000

    Reorganize Kubernetes compatibility table: show only 5 latest versions with newest on top and left

    Co-authored-by: JesseStutler <[email protected]>

commit 56c6901
Author: JesseStutler <[email protected]>
Date:   Thu Feb 12 16:05:35 2026 +0800

    Fixed issue where jobs with subgroups but not hard networkTopology.mode could not be scheduled.

    Signed-off-by: zhengchenyu <[email protected]>
    Signed-off-by: JesseStutler <[email protected]>

commit 9125f3e
Author: copilot-swe-agent[bot] <[email protected]>
Date:   Thu Feb 12 06:15:55 2026 +0000

    Add Kubernetes compatibility for Volcano v1.13 and v1.14

    Co-authored-by: JesseStutler <[email protected]>

commit f4e0f66
Author: copilot-swe-agent[bot] <[email protected]>
Date:   Thu Feb 12 06:14:39 2026 +0000

    Initial plan

Signed-off-by: devzizu <[email protected]>
@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 18, 2026

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

Cluster Autoscaler node scale-up for Pods that exceed Queue's capability

4 participants
