
Add scheduling options and strategies in CA #9205

Open
walidghallab wants to merge 2 commits into kubernetes:master from walidghallab:scheduler

Conversation

@walidghallab
Contributor

@walidghallab walidghallab commented Feb 10, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add scheduling options and strategies in CA.

To be more exact:

  • An iterator to control the order of nodes to go through.
  • Prefer the smallest number of iterations in case of multiple matches.
  • Add a way of exiting early if no more nodes should be checked.
  • Add some predefined strategies ready to use for simplicity.

Why do we need it? Currently, RunFiltersUntilPassingNode has only one scheduling strategy, which is to always start from the last node it tried. While this works in many cases, there are some cases where we don't want this behavior.

Adding this option is a way to:

  • Provide some version of node scoring.
  • Provide a simple way to avoid spreading pods across nodes without affecting efficiency.
  • Customize the order of nodes that scheduling goes through.
  • Due to parallelism, prefer the match with the earliest iteration (to decrease unnecessary spreading and indeterminism).

The only alternative that exists right now is to run CheckPredicates N times, but this is very slow because:

  • It runs prefilters N times (instead of just once), and prefilters are much slower than filters.
  • It doesn't use parallelism.
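The cost argument can be illustrated with a toy model (the constants and function names are made up; only the shape of the comparison matters):

```go
package main

import "fmt"

// Toy cost model: prefilters are much slower than filters.
const prefilterCost, filterCost = 100, 1

// runOnce models running prefilters once, then filtering each of n nodes.
func runOnce(n int) int { return prefilterCost + n*filterCost }

// checkPredicatesN models calling CheckPredicates once per node, paying
// the prefilter cost every time (and without parallelism).
func checkPredicatesN(n int) int { return n * (prefilterCost + filterCost) }

func main() {
	fmt.Println(runOnce(50), checkPredicatesN(50)) // 150 5050
}
```

With these stand-in costs, the per-node approach pays the dominant prefilter cost N times over.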

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/needs-area labels Feb 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: walidghallab
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 10, 2026
@k8s-ci-robot
Contributor

Hi @walidghallab. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed do-not-merge/needs-area labels Feb 10, 2026
@walidghallab
Contributor Author

/cc x13n

@k8s-ci-robot k8s-ci-robot requested a review from x13n February 10, 2026 19:04
@walidghallab
Contributor Author

/uncc elmiko

@k8s-ci-robot k8s-ci-robot removed the request for review from elmiko February 10, 2026 19:04
@walidghallab
Contributor Author

/uncc aleksandra-malinowska

@elmiko
Contributor

elmiko commented Feb 10, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2026
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2026
@walidghallab walidghallab force-pushed the scheduler branch 3 times, most recently from 27ab375 to 526fd85 Compare February 18, 2026 22:56
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 18, 2026
@walidghallab walidghallab changed the title from Add scheduling strategies in CA to Add scheduling options in CA Feb 18, 2026
@walidghallab walidghallab changed the title from Add scheduling options in CA to Add scheduling options and strategies in CA Feb 18, 2026
@walidghallab walidghallab force-pushed the scheduler branch 2 times, most recently from 3901633 to 8dd10e5 Compare February 19, 2026 21:13
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 19, 2026
@walidghallab walidghallab force-pushed the scheduler branch 9 times, most recently from a831585 to a4606c0 Compare February 20, 2026 12:45
To be more exact:
- An iterator to control the order of nodes to go through.
- Prefer the smallest number of iterations in case of
  multiple matches.
- Add a way of exiting early if no more nodes should be checked.
// NodeOrderMapping defines the order in which nodes are iterated during scheduling simulation.
type NodeOrderMapping interface {
	// Init initializes the mapping with the list of nodes and the index of the last successful match.
	Init(collection []*framework.NodeInfo, lastMatch int)
Member


I wonder if we could avoid passing lastMatch in the argument here, to avoid leaking abstractions. I understand the current implementation starts from last match + 1, but if we changed the semantics to last accessed + 1 instead, you could encapsulate the last index logic entirely in lastIndexOrderMapping rather than splitting the responsibility between it and RunFiltersUntilPassingNode. The point of last match + 1 is to prevent starting from the same node every time, and that property is maintained if you change the semantics to last accessed + 1.

Contributor Author

@walidghallab walidghallab Feb 20, 2026


The thing is, I want to allow having 2 strategies:

  • One that starts from lastMatch + 1 (current).
  • One that starts from lastMatch (this can be useful in the ASN bug we found, for example). This is highly efficient while limiting spreading.

So the current way is more flexible and allows both.

Member


Makes sense. I wonder, though, whether in this case there should be a separate MarkLastMatch(int) method in the interface. Then you could drop the lastIndex field from the plugin runner - no need for it to be stored there.

Contributor Author

@walidghallab walidghallab Feb 20, 2026


I see what you mean. But in that case lastIndex won't be communicated between 2 different NodeOrderings, and some NodeOrderings will need to be global (e.g. one that starts from the last node and one that starts from last node + 1).

Maybe that's okay, but I wanted to have flexibility. Callers who don't need it (e.g. priorityNodeOrderMapping) can simply ignore it in Init calls.

Member


Right, with the current code structure it is possible to sometimes pass one ordering and sometimes another. And you want to pass the information about the last match between these orderings? Actually, the whole last index idea is quite fragile, because you have to ensure the list of nodes returned from the snapshot is identical between calls. Do you have a use case for running the predicates twice for the same pod, on the same snapshot, but with a different ordering?

Contributor Author

@walidghallab walidghallab Feb 20, 2026


you have to ensure the list of nodes returned from snapshot is identical between the calls

True. The detailed answer is:

  • In the basic snapshot, the order always changes (because it is based on a map without a cache, AFAICS).
  • In the delta snapshot (which I assume cloud providers are using), it depends:
    -- The part that comes from before Fork will always be at the beginning of nodeInfoList and in the same order (due to the cache in the parent snapshot).
    -- The whole list will be the same unless add/remove/modify operations are done (due to the cache in the current snapshot).

Do you have a use case for running the predicates twice for the same pod, on the same snapshot, but with different ordering?

No. Because once a pod is matched, I assume it will be cached for the rest of the CA loop.

Actually the whole last index idea is quite fragile

I agree.

Okay, I have made the changes:

  • Removed lastIndex from Init.
  • Added a MarkMatch method to the interface.
  • Added a global DefaultNodeOrdering.
  • Renamed Init to Reset, because Init indicates that it is only done once, but in reality it is called every time a new pod is being scheduled.

I made them in a separate commit so you can compare the diff.
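The revised shape could look roughly like this. This is a hypothetical simplified sketch: the real interface operates on []*framework.NodeInfo and the exact signatures may differ.

```go
package main

import "fmt"

// nodeOrdering mirrors the revised interface shape: Reset is called every
// time a new pod is scheduled, MarkMatch records where a match happened,
// and the ordering itself decides how to use that information.
type nodeOrdering interface {
	Reset(numNodes int)
	MarkMatch(index int)
	Next() (index int, ok bool)
}

// lastIndexOrdering encapsulates the "start after the last match" logic
// entirely, so the plugin runner no longer needs its own lastIndex field.
type lastIndexOrdering struct {
	n, lastMatch, served int
}

func (o *lastIndexOrdering) Reset(numNodes int) {
	o.n = numNodes
	o.served = 0
}

func (o *lastIndexOrdering) MarkMatch(index int) { o.lastMatch = index }

func (o *lastIndexOrdering) Next() (int, bool) {
	if o.served == o.n {
		return 0, false // no more nodes should be checked
	}
	idx := (o.lastMatch + 1 + o.served) % o.n
	o.served++
	return idx, true
}

func main() {
	var o nodeOrdering = &lastIndexOrdering{lastMatch: -1}
	o.Reset(3)
	first, _ := o.Next() // first pod starts at index 0
	o.MarkMatch(first)
	o.Reset(3) // next pod starts right after the previous match
	second, _ := o.Next()
	fmt.Println(first, second)
}
```

Because the last-match state lives inside the ordering, the caller only ever invokes Reset, Next, and MarkMatch.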

Contributor Author


Note: I made defaultNodeOrdering a field in the pluginRunner (not a global one) to avoid race conditions (e.g. when 2 snapshots are created and running in parallel).

  • This way we have the ability to modify the default one in the future for a specific cloud provider (or even for just a specific instance of clustersnapshot) without affecting the others (e.g. by adding SetDefaultNodeOrdering, passing it in the constructor, and/or having it in AutoscalingOptions).

To be more detailed, the changes are:
- Removed lastIndex from Init.
- Added MarkMatch method to the interface.
- Stored defaultNodeOrdering (didn't have a global one, to avoid race
  conditions between different clustersnapshots).
- Renamed Init to Reset, because Init indicates that it is only done once but in reality it is called every time a new pod is being scheduled.

Labels

area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants