
GREP-375 add scheduler backend framework #372

Open
kangclzjc wants to merge 32 commits into ai-dynamo:main from kangclzjc:grep_scheduler_backend

Conversation

Contributor

@kangclzjc kangclzjc commented Jan 27, 2026

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Add a scheduler backend framework to support multiple scheduler backends.

Which issue(s) this PR fixes:

Fixes #275
Fixes #375

Special notes for your reviewer:

Does this PR introduce an API change?


Additional documentation e.g., enhancement proposals, usage docs, etc.:


Signed-off-by: kangclzjc <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
@kangclzjc kangclzjc marked this pull request as ready for review January 27, 2026 12:45
@kangclzjc kangclzjc changed the title GREP add scheduler backend framework GREP-375 add scheduler backend framework Jan 28, 2026
Signed-off-by: kangclzjc <[email protected]>

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kangclzjc and others added 8 commits February 3, 2026 15:43
remove useless words

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>

### Non-Goals

* **Extract PodGang Reconciler**: Moving the PodGang reconciliation logic from the PodCliqueSet reconciler into an independent reconciler is out of scope. The current reconciliation architecture will be maintained.
Collaborator

Actually this is also an implementation detail that perhaps should be removed as a non-goal from the GREP.

Contributor Author

Exactly, this needs to be removed.


#### Story 3: Scheduler Migration Path

As a cluster administrator, I want to migrate from one scheduler to another (e.g., from a custom scheduler to KAI or vice versa) without significant disruption. The Scheduler Backend Framework should provide a clear migration path where I can update the OperatorConfiguration, restart Grove, and have new workloads use the new scheduler while existing workloads continue running.
Collaborator

Is it Scheduler migration or Workload migration across clusters using different schedulers?

Contributor Author
@kangclzjc kangclzjc Feb 4, 2026

Yes, it is a little confusing. Let me change this. It should cover both: the cluster using a different scheduler, and then how workloads should use the new scheduler.

Contributor
@Ronkahn21 Ronkahn21 left a comment

Halfway through. I will complete the review tomorrow.


The current tight coupling between Grove and specific scheduler implementations creates several challenges:

* **High Integration Cost**: Adding support for a new scheduler requires extensive modifications across Grove's codebase, touching multiple components and requiring deep knowledge of both Grove and the target scheduler's internals.
Contributor

Architectural Rigidity: Rather than adapting to various scheduler interfaces, Grove acts as a passive producer of PodGang resources. Integration is only possible if the target scheduler is modified to recognize and handle Grove's custom primitives. This shift of responsibility makes it difficult to support third-party or "off-the-shelf" schedulers without custom development on their end.


### Limitations/Risks & Mitigations
Contributor

One limitation that is missing: different schedulers have different capabilities. What is the mitigation for that?

Collaborator
@unmarshall unmarshall Feb 4, 2026

That is a valid point. PodGang exposes a uniform capability set across different schedulers via its API. There are now 3 possibilities:

  1. The scheduler offers all the capabilities for which PodGang provides configuration.
  2. The scheduler does not provide, say, TAS, but TAS constraints are configured in the PCS and flow to the PodGang.
  3. The scheduler provides additional capabilities for which any additional configuration (if required) is missing in the PodGang resource.

(1) is a perfect match and is therefore not an issue.
(2) The PCS status should indicate via conditions that TAS is not supported by the selected scheduler. TAS is just one example; it could be anything else in the future.
(3) While integrating the scheduler, we can check whether the additional scheduler capabilities require additional configuration. This will potentially be an API change, probably at the PCS + PodGang level, and can even be done in phases. It cannot be provided automatically out of the box.
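
For (2), a minimal sketch of how the PCS status could surface this as a condition, assuming an illustrative condition type and reason (neither is defined in the GREP):

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markTASUnsupported records on the PodCliqueSet (PCS) status that the
// selected scheduler backend does not support topology-aware scheduling.
// Condition type and reason are illustrative, not part of the GREP.
func markTASUnsupported(conditions *[]metav1.Condition, scheduler string) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "TopologyAwareSchedulingSupported",
		Status:  metav1.ConditionFalse,
		Reason:  "SchedulerCapabilityMissing",
		Message: "scheduler " + scheduler + " does not support TAS; TAS constraints will not be enforced",
	})
}
```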

Contributor

Agree on both the (1) and (3) solutions.
I am not sure I fully agree on indicating it via status: since the user's request is invalid for the chosen scheduler, maybe we should fail the submission when an unsupported feature is used for the given scheduler backend.
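
To make that concrete, a sketch of such an admission-time check (the function and error text are hypothetical, not proposed in the GREP):

```go
import "fmt"

// validateCapabilities sketches the fail-fast alternative: reject the
// workload at submission time if it requests a capability the selected
// scheduler backend does not support. Names and error text are illustrative.
func validateCapabilities(requestsTAS, backendSupportsTAS bool, scheduler string) error {
	if requestsTAS && !backendSupportsTAS {
		return fmt.Errorf("topology-aware scheduling is not supported by scheduler backend %q", scheduler)
	}
	return nil
}
```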

WDYT @sanjaychatterjee @kangclzjc @unmarshall

kangclzjc and others added 13 commits February 4, 2026 13:09
Add a missing asterisk

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
add symbol

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
remove phase1 in limitation

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>

For detailed lifecycle flow, see [PodGang Lifecycle Changes](#podgang-lifecycle-changes).

### Backend Interface Definition
Contributor

The interface currently omits the relationship between ClusterTopology and secondary resources. How do you envision the navigational link from the main topology to other specific Topology CRDs?

Contributor Author

Yes, this is a good point. Per my understanding, for each scheduler backend we should first define the mapping during backend initialization, and then we have several hooks: a PreparePod hook to modify topology labels in the spec, and a SyncPodGang hook to translate the Topology into the scheduler-specific Topology in other CRDs.
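
A rough sketch of what those hooks could look like as a Go interface, with names taken from the comment above and signatures assumed for illustration (the actual interface is the one defined in the GREP):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// SchedulerBackend is a rough sketch of the hook points described above.
type SchedulerBackend interface {
	// Init sets up the backend, including the mapping from the main
	// ClusterTopology to scheduler-specific topology CRDs.
	Init(ctx context.Context) error

	// PreparePod mutates a pod before creation, e.g. rewriting topology
	// labels in the spec to labels the target scheduler understands.
	PreparePod(ctx context.Context, pod *corev1.Pod) error

	// SyncPodGang translates a PodGang into the scheduler-specific
	// resources; the PodGang type is elided here for brevity.
	SyncPodGang(ctx context.Context, podGang interface{}) error
}
```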

Contributor

Today we don't have a controller for ClusterTopology; we would add it as part of multi-cluster topology support, so it might need an extension point of its own.

// SchedulerName is the name of the scheduler backend with which this instance of Grove operator will run.
// Valid values: "kai-scheduler" or "default-scheduler"
// +required
// +kubebuilder:validation:Enum=kai-scheduler;default-scheduler
Contributor

Also, should we support going from the default scheduler with no Workload API to one with the Workload API?

Contributor Author

Yes, I believe so. When we support the default scheduler, we may need to detect the cluster version, or whether it has the Workload API or not.
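
A minimal sketch of such a detection via the discovery API, assuming an illustrative group/version for the Workload API (the real one depends on where upstream registers it):

```go
import (
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/discovery"
)

// workloadAPIAvailable reports whether the cluster serves the Workload API.
// The group/version and kind below are illustrative placeholders.
func workloadAPIAvailable(dc discovery.DiscoveryInterface) (bool, error) {
	const gv = "scheduling.k8s.io/v1alpha1" // illustrative
	resources, err := dc.ServerResourcesForGroupVersion(gv)
	if errors.IsNotFound(err) {
		return false, nil // group/version not served: fall back to the gateless flow
	}
	if err != nil {
		return false, err
	}
	for _, r := range resources.APIResources {
		if r.Kind == "Workload" {
			return true, nil
		}
	}
	return false, nil
}
```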


#### New Flow (With Framework):
1. **Create PodGang early** with PodGroups having empty PodReferences and `Initialized=False`
2. **Create Pods** (with scheduling gates to block scheduling)
Contributor

Could we do this without using the scheduling gate? At large scale it would be expensive to modify every pod's spec to remove the scheduling gate.

Contributor Author

I agree with you. If we could refine this scheduling gate, that would be a good enhancement. Maybe we should raise this question and discuss it in another GREP?

Contributor

Maybe a different question: what would happen if we did not use the scheduling gate at all (besides what we do today)?
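
For context on the cost being discussed, a minimal sketch of the gate-and-ungate flow, assuming an illustrative gate name (the GREP defines the actual gate and where it is removed):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// groveGate is an illustrative gate name, not the one defined in the GREP.
const groveGate = "grove.io/podgang-pending"

// newGatedPod returns a pod the scheduler will not consider until the gate
// is removed.
func newGatedPod(name, ns string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
		Spec: corev1.PodSpec{
			SchedulingGates: []corev1.PodSchedulingGate{{Name: groveGate}},
			Containers:      []corev1.Container{{Name: "main", Image: "app:latest"}},
		},
	}
}

// ungatePod removes all scheduling gates; at scale this is one PATCH per pod,
// which is exactly the cost the comment above is concerned about.
func ungatePod(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	patch := []byte(`{"spec":{"schedulingGates":null}}`)
	_, err := cs.CoreV1().Pods(ns).Patch(ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```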


We introduce Initialized as a new PodGang status condition to signal that:
- All expected pods have been created
- PodGang.Spec.PodGroups[].PodReferences have been populated
Contributor

Future note: we need to talk about the existing behavior of the PodReferences field.

Contributor Author

Agree
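
For illustration, a sketch of how the Initialized condition could be constructed using standard metav1.Condition fields (the reason string is an assumption, not taken from the GREP):

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// initializedCondition sketches the proposed condition: it is set once all
// expected pods exist and every PodGang.Spec.PodGroups[].PodReferences list
// is populated. The reason string is illustrative.
func initializedCondition() metav1.Condition {
	return metav1.Condition{
		Type:               "Initialized",
		Status:             metav1.ConditionTrue,
		Reason:             "PodReferencesPopulated",
		Message:            "all expected pods created and PodReferences populated",
		LastTransitionTime: metav1.Now(),
	}
}
```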

Move scheduler string to struct

Co-authored-by: Ron Kahn <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>

Development

Successfully merging this pull request may close these issues:

* GREP: add scheduler backend framework
* Add Native Support for Kubernetes Workload API to Enable Gang Scheduling
