
GREP-375 add scheduler backend framework #372

Open
kangclzjc wants to merge 32 commits into ai-dynamo:main from kangclzjc:grep_scheduler_backend

Conversation

Contributor

@kangclzjc kangclzjc commented Jan 27, 2026

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Add a scheduler backend framework to support multiple scheduler backends.

Which issue(s) this PR fixes:

Fixes #275
Fixes #375

Special notes for your reviewer:

Does this PR introduce an API change?


Additional documentation e.g., enhancement proposals, usage docs, etc.:


Signed-off-by: kangclzjc <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
@kangclzjc kangclzjc marked this pull request as ready for review January 27, 2026 12:45
@kangclzjc kangclzjc changed the title GREP add scheduler backend framework GREP-375 add scheduler backend framework Jan 28, 2026
Signed-off-by: kangclzjc <[email protected]>

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

kangclzjc and others added 8 commits February 3, 2026 15:43
remove useless words

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>

### Non-Goals

* **Extract PodGang Reconciler**: Moving the PodGang reconciliation logic from the PodCliqueSet reconciler into an independent reconciler is out of scope. The current reconciliation architecture will be maintained.
Collaborator

Actually this is also an implementation detail that perhaps should be removed as a non-goal from the GREP.

Contributor Author

Exactly, this needs to be removed.


#### Story 3: Scheduler Migration Path

As a cluster administrator, I want to migrate from one scheduler to another (e.g., from a custom scheduler to KAI or vice versa) without significant disruption. The Scheduler Backend Framework should provide a clear migration path where I can update the OperatorConfiguration, restart Grove, and have new workloads use the new scheduler while existing workloads continue running.
Collaborator

Is it Scheduler migration or Workload migration across clusters using different schedulers?

Contributor Author
@kangclzjc kangclzjc Feb 4, 2026

Yes, it is a little confusing. Let me change this. It should cover both: the cluster using a different scheduler, and then how workloads should use the new scheduler.

Contributor
@Ronkahn21 Ronkahn21 left a comment

Halfway through. I will complete the review tomorrow.


The current tight coupling between Grove and specific scheduler implementations creates several challenges:

* **High Integration Cost**: Adding support for a new scheduler requires extensive modifications across Grove's codebase, touching multiple components and requiring deep knowledge of both Grove and the target scheduler's internals.
Contributor

Architectural Rigidity: Rather than adapting to various scheduler interfaces, Grove acts as a passive producer of PodGang resources. Integration is only possible if the target scheduler is modified to recognize and handle Grove's custom primitives. This shift of responsibility makes it difficult to support third-party or "off-the-shelf" schedulers without custom development on their end.


### Limitations/Risks & Mitigations
Contributor

One limitation that is missing: different schedulers have different capabilities. What is the mitigation for that?

Collaborator
@unmarshall unmarshall Feb 4, 2026

That is a valid point. PodGang exposes a uniform capability set across different schedulers via its API. There are now 3 possibilities:

  1. The scheduler offers all the capabilities for which PodGang provides configuration.
  2. The scheduler does not provide, say, TAS, but TAS constraints are configured in the PCS and flow to the PodGang.
  3. The scheduler provides additional capabilities for which any additional configuration (if required) is missing in the PodGang resource.

(1) is a perfect match and is therefore not an issue.
(2) The PCS status should indicate via conditions that TAS is not supported by the selected scheduler. TAS is just one example; it could be anything else in the future.
(3) While integrating the scheduler, we can check whether the additional scheduler capabilities require additional configuration. This will potentially be an API change, probably at the PCS + PodGang level, and can even be done in phases. It cannot be provided automatically out of the box.
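
For (2), a minimal sketch of how the PCS status could surface this as a condition, assuming an illustrative condition type and reason (neither is defined in the GREP):

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markTASUnsupported records on the PodCliqueSet (PCS) status that the
// selected scheduler backend does not support topology-aware scheduling.
// Condition type and reason are illustrative, not part of the GREP.
func markTASUnsupported(conditions *[]metav1.Condition, scheduler string) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "TopologyAwareSchedulingSupported",
		Status:  metav1.ConditionFalse,
		Reason:  "SchedulerCapabilityMissing",
		Message: "scheduler " + scheduler + " does not support TAS; TAS constraints will not be enforced",
	})
}
```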

Contributor

Agree on both the (1) and (3) solutions.
I am not sure I fully agree on indicating it via status: since the user's request is invalid for the chosen scheduler, maybe we should fail the submission when an unsupported feature is used for the given scheduler backend.
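
To make that concrete, a sketch of such an admission-time check (the function and error text are hypothetical, not proposed in the GREP):

```go
import "fmt"

// validateCapabilities sketches the fail-fast alternative: reject the
// workload at submission time if it requests a capability the selected
// scheduler backend does not support. Names and error text are illustrative.
func validateCapabilities(requestsTAS, backendSupportsTAS bool, scheduler string) error {
	if requestsTAS && !backendSupportsTAS {
		return fmt.Errorf("topology-aware scheduling is not supported by scheduler backend %q", scheduler)
	}
	return nil
}
```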

WDYT @sanjaychatterjee @kangclzjc @unmarshall

kangclzjc and others added 13 commits February 4, 2026 13:09
Add a missing asterisk

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
add symbol

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
remove phase1 in limitation

Co-authored-by: Madhav Bhargava <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>

For detailed lifecycle flow, see [PodGang Lifecycle Changes](#podgang-lifecycle-changes).

### Backend Interface Definition
Contributor

The interface currently omits the relationship between ClusterTopology and secondary resources. How do you envision the navigational link from the main topology to other specific Topology CRDs?

Contributor Author

Yes, this is a good point. Per my understanding, for each scheduler backend we should first define the mapping during backend initialization, and then we have several hooks: a PreparePod hook to modify topology labels in the spec, and a SyncPodGang hook to translate the Topology into the scheduler-specific Topology in other CRDs.
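
A rough sketch of what those hooks could look like as a Go interface, with names taken from the comment above and signatures assumed for illustration (the actual interface is the one defined in the GREP):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// SchedulerBackend is a rough sketch of the hook points described above.
type SchedulerBackend interface {
	// Init sets up the backend, including the mapping from the main
	// ClusterTopology to scheduler-specific topology CRDs.
	Init(ctx context.Context) error

	// PreparePod mutates a pod before creation, e.g. rewriting topology
	// labels in the spec to labels the target scheduler understands.
	PreparePod(ctx context.Context, pod *corev1.Pod) error

	// SyncPodGang translates a PodGang into the scheduler-specific
	// resources; the PodGang type is elided here for brevity.
	SyncPodGang(ctx context.Context, podGang interface{}) error
}
```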

Contributor

Today we don't have a controller for ClusterTopology; we would add it as part of multi-cluster topology support, so it might need an extension point of its own.

// SchedulerName is the name of the scheduler backend with which this instance of Grove operator will run.
// Valid values: "kai-scheduler" or "default-scheduler"
// +required
// +kubebuilder:validation:Enum=kai-scheduler;default-scheduler
Contributor

Also, should we support going from the default scheduler with no Workload API to one with the Workload API?

Contributor Author

Yes, I believe so. When we support the default scheduler, we may need to detect the cluster version, or whether it has the Workload API or not.
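
A minimal sketch of such a detection via the discovery API, assuming an illustrative group/version for the Workload API (the real one depends on where upstream registers it):

```go
import (
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/discovery"
)

// workloadAPIAvailable reports whether the cluster serves the Workload API.
// The group/version and kind below are illustrative placeholders.
func workloadAPIAvailable(dc discovery.DiscoveryInterface) (bool, error) {
	const gv = "scheduling.k8s.io/v1alpha1" // illustrative
	resources, err := dc.ServerResourcesForGroupVersion(gv)
	if errors.IsNotFound(err) {
		return false, nil // group/version not served: fall back to the gateless flow
	}
	if err != nil {
		return false, err
	}
	for _, r := range resources.APIResources {
		if r.Kind == "Workload" {
			return true, nil
		}
	}
	return false, nil
}
```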


#### New Flow (With Framework):
1. **Create PodGang early** with PodGroups having empty PodReferences and `Initialized=False`
2. **Create Pods** (with scheduling gates to block scheduling)
Contributor

Could we do this without using the scheduling gate? At large scale it would be expensive to modify every pod's spec to remove the scheduling gate.

Contributor Author

I agree with you. If we could refine this scheduling gate, that would be a good enhancement. Maybe we should raise this question and discuss it in another GREP?

Contributor

Maybe a different question: what would happen if we did not use the scheduling gate at all (besides what we do today)?
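
For context on the cost being discussed, a minimal sketch of the gate-and-ungate flow, assuming an illustrative gate name (the GREP defines the actual gate and where it is removed):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// groveGate is an illustrative gate name, not the one defined in the GREP.
const groveGate = "grove.io/podgang-pending"

// newGatedPod returns a pod the scheduler will not consider until the gate
// is removed.
func newGatedPod(name, ns string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: ns},
		Spec: corev1.PodSpec{
			SchedulingGates: []corev1.PodSchedulingGate{{Name: groveGate}},
			Containers:      []corev1.Container{{Name: "main", Image: "app:latest"}},
		},
	}
}

// ungatePod removes all scheduling gates; at scale this is one PATCH per pod,
// which is exactly the cost the comment above is concerned about.
func ungatePod(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	patch := []byte(`{"spec":{"schedulingGates":null}}`)
	_, err := cs.CoreV1().Pods(ns).Patch(ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```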


We introduce Initialized as a new PodGang status condition to signal that:
- All expected pods have been created
- PodGang.Spec.PodGroups[].PodReferences have been populated
Contributor

Future note: we need to talk about the existing behavior of the PodReferences field.

Contributor Author

Agree
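
For illustration, a sketch of how the Initialized condition could be constructed using standard metav1.Condition fields (the reason string is an assumption, not taken from the GREP):

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// initializedCondition sketches the proposed condition: it is set once all
// expected pods exist and every PodGang.Spec.PodGroups[].PodReferences list
// is populated. The reason string is illustrative.
func initializedCondition() metav1.Condition {
	return metav1.Condition{
		Type:               "Initialized",
		Status:             metav1.ConditionTrue,
		Reason:             "PodReferencesPopulated",
		Message:            "all expected pods created and PodReferences populated",
		LastTransitionTime: metav1.Now(),
	}
}
```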

Move scheduler string to struct

Co-authored-by: Ron Kahn <[email protected]>
Signed-off-by: Kang Zhang <[email protected]>

Development

Successfully merging this pull request may close these issues:

* GREP: add scheduler backend framework
* Add Native Support for Kubernetes Workload API to Enable Gang Scheduling
