
Add Scheduler Backend framework #293

Open
kangclzjc wants to merge 8 commits into ai-dynamo:main from kangclzjc:scheduler_backend

Conversation

@kangclzjc
Contributor

@kangclzjc kangclzjc commented Dec 22, 2025

What type of PR is this?

To support different schedulers as backends, we modify Grove and introduce a scheduler backend interface.

What this PR does / why we need it:

In the current PodGang component's sync flow we do the following:

  • Get the list of PodGangs that are expected to be created for the PCS.
  • Check which ones are pending creation. For each pending PodGang:
    • Check that all Pods have been created for the PodGang.
    • Check that all Pods carry the required PodGang label, which adds a back reference to the PodGang.
  • If all of the above checks pass, create the PodGang resource.

So the PodGang is created only after its Pods. However, this is a problem for the upcoming Workload API support and the kube-scheduler backend.

We don't want to break the current PodGang flow. This PR introduces a scheduler backend framework so that Workload management can be delegated to a scheduler backend inside Grove. For other schedulers, the backend in Grove may manage a different CR derived from the PodGang (just as KAI creates PodGroups today; in the future, we plan to move that management from the KAI scheduler into the Grove scheduler backend).

To create a Workload object, a PodGang resource needs to be created first. But today the PodGang resource cannot be created before the Pods have been created and carry a back reference to it. The issue is that only after the Workload object is created will kube-scheduler run its scoring/filtering plugins to reserve node capacity for the workload's PodGroups, and the Pods need a reference to the Workload object in their spec.

So, to accommodate the Workload API, the flow in the PodGang component needs to change as follows:

  • Create the PodGang with its PodGroups (with empty PodReferences, since no Pods exist at this point) and the Initialized condition set to False.
  • Creation of the PodGang will trigger the creation of the Workload object in the scheduler backend reconciler, which will use the kube-scheduler backend.

    This is out of scope for this PR and should be included in the next PR, which will specifically handle the Workload API and kube-scheduler.

  • Once all Pod references are populated, set the Initialized condition to True.
  • Pods should not lift their scheduling gate until the PodGang's Initialized condition is True; this is done in the PCLQ reconciler (see the sketch below).
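
For orientation only, here is a minimal sketch of that gate check, assuming the PodGang exposes Initialized as a standard metav1.Condition; the gate name, helper names, and condition lookup are illustrative, not the PR's actual implementation.

package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// canLiftSchedulingGate is an illustrative check: Pods may only drop their
// scheduling gate once the PodGang reports Initialized=True, i.e. all pod
// references have been populated.
func canLiftSchedulingGate(podGangConditions []metav1.Condition) bool {
	return meta.IsStatusConditionTrue(podGangConditions, "Initialized")
}

// removeSchedulingGate drops a (hypothetical) grove scheduling gate from the
// Pod spec; the gate name is an assumption for illustration only.
func removeSchedulingGate(pod *corev1.Pod, gateName string) {
	kept := pod.Spec.SchedulingGates[:0]
	for _, gate := range pod.Spec.SchedulingGates {
		if gate.Name != gateName {
			kept = append(kept, gate)
		}
	}
	pod.Spec.SchedulingGates = kept
}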

Which issue(s) this PR fixes:

Fixes #275

Special notes for your reviewer:

Does this PR introduce an API change?

Yes. We will introduce a new API, SchedulerBackend:

type SchedulerBackend interface {
	// Name is a unique name of the scheduler backend.
	Name() string

	// Init provides a hook to initialize/setup one-time scheduler resources,
	// called at the startup of grove operator.
	Init() error

	// SyncPodGang synchronizes (creates/updates) scheduler specific resources for a PodGang
	// reacting to a creation or update of a PodGang resource.
	SyncPodGang(ctx context.Context, podGang *groveschedulerv1alpha1.PodGang) error

	// OnPodGangDelete cleans up scheduler specific resources for the given PodGang.
	OnPodGangDelete(ctx context.Context, podGang *groveschedulerv1alpha1.PodGang) error

	// PreparePod adds scheduler backend specific configuration to the given Pod object
	// prior to its creation. This includes setting schedulerName, scheduling gates,
	// annotations, etc.
	PreparePod(pod *corev1.Pod)
}
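
For illustration only, a minimal backend that satisfies this interface could look like the sketch below; the import path for the PodGang API and the backend name are assumptions, not part of this PR.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"

	// Assumed import path for the PodGang API; adjust to the real module path.
	groveschedulerv1alpha1 "github.com/ai-dynamo/grove/scheduler/api/core/v1alpha1"
)

// noopBackend is an illustrative SchedulerBackend that performs no
// scheduler-specific work; a real backend would manage its own CRs here.
type noopBackend struct{}

func (b *noopBackend) Name() string { return "noop-scheduler" }

func (b *noopBackend) Init() error { return nil }

func (b *noopBackend) SyncPodGang(_ context.Context, _ *groveschedulerv1alpha1.PodGang) error {
	// Create or update scheduler-specific resources for the PodGang here.
	return nil
}

func (b *noopBackend) OnPodGangDelete(_ context.Context, _ *groveschedulerv1alpha1.PodGang) error {
	// Garbage-collect scheduler-specific resources here.
	return nil
}

func (b *noopBackend) PreparePod(pod *corev1.Pod) {
	// Set scheduling gates, labels, annotations, etc. on the Pod here.
}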

Additional documentation e.g., enhancement proposals, usage docs, etc.:


@gflarity
Contributor

gflarity commented Jan 2, 2026

Not sure I quite understand the goals of this; it's already possible to support different schedulers via the pod specs (though KAI is the only gang scheduler currently working). I'd suggest kicking off work like this with a GitHub issue with plenty of detail, and a Discord discussion as well.

@kangclzjc kangclzjc changed the title Add Scheduler Backend with KAI as default Add Scheduler Backend framework Jan 5, 2026
@kangclzjc
Contributor Author

kangclzjc commented Jan 5, 2026

Not sure I quite understand the goals of this; it's already possible to support different schedulers via the pod specs (though KAI is the only gang scheduler currently working). I'd suggest kicking off work like this with a GitHub issue with plenty of detail, and a Discord discussion as well.

Sure, let me create a new issue to introduce this. Some background: this is a real request from one of our customers. We have several schedulers that want to integrate with Grove, and it would be great to have a unified scheduler backend so we can support other schedulers easily, especially since we need to support the Kubernetes 1.34 Workload API. Once we have this backend framework, we can easily add support for new schedulers such as the default kube-scheduler or Koordinator. This PR only introduces the scheduler backend framework. For the KAI scheduler backend, I won't change the current workflow, meaning KAI will still handle the PodGang and create PodGroups/Pods.

@kangclzjc kangclzjc force-pushed the scheduler_backend branch 2 times, most recently from 418038f to 206f953 Compare January 8, 2026 01:17
@kangclzjc kangclzjc marked this pull request as ready for review January 8, 2026 03:33
Contributor

@Ronkahn21 Ronkahn21 left a comment

Overall looks great! A few architectural points to consider:

  • Controller Responsibility: I don’t think the pcs-controller should be updating the PodGang status. Ideally, it should only handle the creation, leaving the podGang-controller to manage its own status.

  • Scaling & Performance: We should discuss the PodGang pod reference fields. Adding this to the pcs-controller increases its complexity. For better scalability, it might be better to let the PodGroup own the pod status before we move toward creating the backend API.

Since the API changes are currently out of scope, we can sync on this later. Amazing job overall, thanks!

Comment on lines 307 to 330
// Also verify that all PodGroups have enough podReferences to meet minReplicas
for _, pg := range podGang.Spec.PodGroups {
	if int32(len(pg.PodReferences)) < pg.MinReplicas {
		return false
	}
}

Contributor

Can you explain why this is a requirement for filtering the events? I am afraid we will be missing event handling.

Contributor Author

Can you explain why this is a requirement for filtering the events? I am afraid we will be missing event handling.

For reconcilers that are registered to react to PodGang create/update events, we should now look at the PodGangInitialized condition and only allow events to be enqueued for the reconciler when it is set to True. I added this because I only want to handle events once we have a stable status (enough replicas). If PodGangInitialized is False, it means the PodGang hasn't been completely created yet (no PodReferences), and in that case I think we don't need to handle the event.
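
A sketch of what such a predicate could look like, assuming the PodGang stores its conditions as []metav1.Condition under Status.Conditions; the import path and field name are assumptions.

package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	// Assumed import path for the PodGang API.
	groveschedulerv1alpha1 "github.com/ai-dynamo/grove/scheduler/api/core/v1alpha1"
)

// podGangInitializedPredicate only lets create/update events through once the
// PodGang reports Initialized=True, i.e. all pod references are populated.
func podGangInitializedPredicate() predicate.Funcs {
	isInitialized := func(obj client.Object) bool {
		podGang, ok := obj.(*groveschedulerv1alpha1.PodGang)
		if !ok {
			return false
		}
		// Status.Conditions as []metav1.Condition is an assumption here.
		return meta.IsStatusConditionTrue(podGang.Status.Conditions, "Initialized")
	}
	return predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool { return isInitialized(e.Object) },
		UpdateFunc: func(e event.UpdateEvent) bool { return isInitialized(e.ObjectNew) },
	}
}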

Contributor Author

Yes, based on GREP, we don't need to compare replicas, since PodGangInitialized = true means we already have enough pod references in the PodGang.

Comment on lines +522 to +606
sort.Slice(podReferences, func(i, j int) bool {
	return podReferences[i].Name < podReferences[j].Name
})
Contributor

We might need to change the API field to ignore order, to reduce the cases where we sort a big Pod list each time.

Contributor Author

@kangclzjc kangclzjc Jan 9, 2026

We might need to change the API field to ignore order, to reduce the cases where we sort a big Pod list each time.

That's a good idea, since we have customers with over 500 Pods. But in that case we won't have an order for the Pods; I'm not sure whether that's acceptable.

Contributor Author

And if you do not sort it then there will be unnecessary updates to the PodGangs.
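
To make the trade-off concrete, here is a small sketch with a simplified stand-in type for the pod reference: with a deterministic order, an unchanged set of references compares equal and no update is issued (assuming the stored list was written in the same sorted order).

package example

import (
	"reflect"
	"sort"
)

// podRef is a simplified stand-in for the PodGang pod reference type.
type podRef struct {
	Namespace string
	Name      string
}

// needsUpdate sorts the desired references so that comparing them with the
// stored ones is order-independent; without a stable order, semantically
// identical lists would differ and trigger needless PodGang updates.
func needsUpdate(existing, desired []podRef) bool {
	sort.Slice(desired, func(i, j int) bool { return desired[i].Name < desired[j].Name })
	return !reflect.DeepEqual(existing, desired)
}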


const (
	// BackendName is the name of the KAI backend
	BackendName = "kai"
Contributor

KAI

Suggested change
BackendName = "kai"
BackendName = "KAI-Scheduler"

Contributor Author

Got it

Contributor Author

I know "KAI" reads better; the only thing is that in the Pod spec we also use kai-scheduler/default-scheduler, so I unified all of this to kai-scheduler/default-scheduler, even in values.yaml.

@kangclzjc
Contributor Author

Overall looks great! A few architectural points to consider:

  • Controller Responsibility: I don’t think the pcs-controller should be updating the PodGang status. Ideally, it should only handle the creation, leaving the podGang-controller to manage its own status.
  • Scaling & Performance: We should discuss the PodGang pod reference fields. Adding this to the pcs-controller increases its complexity. For better scalability, it might be better to let the PodGroup own the pod status before we move toward creating the backend API.

Since the API changes are currently out of scope, we can sync on this later. Amazing job overall, thanks!

  1. We don't have a PodGang controller currently, so do you mean adding a new podGang-controller?
  2. Actually, if we use the default kube-scheduler there won't be a PodGroup, so we'd better use the pcs-controller to fill the pod reference fields.

@unmarshall unmarshall added the kind/api-change, kind/enhancement, component/scheduler, component/operator, and size/XL labels Jan 17, 2026
@kangclzjc kangclzjc force-pushed the scheduler_backend branch 3 times, most recently from aa4ca3b to b0b609c Compare January 29, 2026 03:12
Collaborator

@unmarshall unmarshall left a comment

1/n reviews

@unmarshall
Collaborator

@kangclzjc please rebase your PR so that it becomes easier to review.

Collaborator

@unmarshall unmarshall left a comment

2/n review comments

return err
}
if err := registerWebhooksWithMgr(mgr, operatorCfg.Authorizer, operatorCfg.TopologyAwareScheduling, operatorCfg.Network); err != nil {
if err := registerWebhooksWithMgr(mgr, operatorCfg.Authorizer, operatorCfg.TopologyAwareScheduling, operatorCfg.Network, string(operatorCfg.SchedulerName)); err != nil {
Collaborator

Following your own argument in the earlier comment, can you replace the arguments to registerWebhooksWithMgr with just two, namely mgr and operatorCfg?

Contributor Author

Yes, and since there are other calls in this function (several NewHandler constructors) that take multiple parameters, do you think we should modify them together, like the one below?

pcsValidatingWebhook := pcsvalidation.NewHandler(mgr, operatorCfg.TopologyAwareScheduling, operatorCfg.Network, string(operatorCfg.SchedulerName))
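
For reference, a hypothetical shape of the grouped signature, using simplified stand-in types; the real fields would come from configv1alpha1.OperatorConfiguration.

package example

import (
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// OperatorConfiguration is a simplified stand-in for configv1alpha1.OperatorConfiguration.
type OperatorConfiguration struct {
	SchedulerName string
	// TopologyAwareScheduling, Network, and other fields omitted for brevity.
}

// Handler is a simplified stand-in for the webhook handler.
type Handler struct {
	mgr           manager.Manager
	schedulerName string
}

// NewHandler shows the proposed two-parameter shape: everything the handler
// needs is carried by the operator configuration.
func NewHandler(mgr manager.Manager, operatorCfg OperatorConfiguration) *Handler {
	return &Handler{
		mgr:           mgr,
		schedulerName: operatorCfg.SchedulerName,
	}
}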

@kangclzjc kangclzjc force-pushed the scheduler_backend branch 2 times, most recently from 5855c30 to 847e4dc Compare February 1, 2026 05:57
@kangclzjc kangclzjc force-pushed the scheduler_backend branch 2 times, most recently from 75dfe3b to bbe66b0 Compare February 2, 2026 04:48
@copy-pr-bot

copy-pr-bot bot commented Feb 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


// PreparePod adds KAI scheduler-specific configuration to the Pod
// This includes: labels, annotations, etc.
func (b *Backend) PreparePod(_ *corev1.Pod) {

I think we should add schedulerName: kai-scheduler to the pod spec here.

Contributor Author

Every scheduler backend needs schedulerName to be set, so I unified this in the buildResource function, because I think it should be common to every backend and every Pod.

	pod.Spec.SchedulerName = schedulerbackend.Get().Name()

	// Use backend to prepare Pod spec based on scheduler requirements
	// This adds labels, annotations, etc.
	if err = schedulerbackend.PreparePod(pod); err != nil {
		return groveerr.WrapError(err,
			errCodeBuildPodResource,
			component.OperationSync,
			"failed to prepare pod spec with scheduler backend",
		)
	}


// NewHandler creates a new handler for PodCliqueSet Webhook.
func NewHandler(mgr manager.Manager, tasConfig configv1alpha1.TopologyAwareSchedulingConfiguration, networkConfig configv1alpha1.NetworkAcceleration) *Handler {
func NewHandler(mgr manager.Manager, tasConfig configv1alpha1.TopologyAwareSchedulingConfiguration, networkConfig configv1alpha1.NetworkAcceleration, schedulerName string) *Handler {
Contributor Author

@unmarshall do you think we should also group these into just two parameters, since they may all come from one parent config?

    concurrentSyncs: 3
  podCliqueScalingGroup:
    concurrentSyncs: 2
schedulerName: kai-scheduler
Collaborator

Ideally we should not be creating multiple configuration test data files. Instead just creating one is sufficient. Use go templates to create a test-config-template.yaml and in tests replace the template parameters with different values. But maybe we do this as a separate PR since you have not really introduced any new testdata file as part of this PR but only added schedulerName to an existing one.

		operatorConfig.SchedulerName,
	); err != nil {
		logger.Error(err, "failed to initialize scheduler backend")
		handleErrorAndExit(err, cli.ExitErrInitializeManager)
Collaborator

Can you introduce a new code ErrInitializeSchedulerBackend and use it here?

func RegisterControllers(mgr ctrl.Manager, controllerConfig configv1alpha1.ControllerConfiguration, topologyAwareSchedulingConfig configv1alpha1.TopologyAwareSchedulingConfiguration, networkConfig configv1alpha1.NetworkAcceleration) error {
func RegisterControllers(mgr ctrl.Manager, config configv1alpha1.OperatorConfiguration) error {
	controllerConfig := config.Controllers
	topologyAwareSchedulingConfig := config.TopologyAwareScheduling
Collaborator

Is there a need to define a new variable topologyAwareSchedulingConfig?
You could have directly used it:

	pcsReconciler := podcliqueset.NewReconciler(mgr, controllerConfig.PodCliqueSet, config.TopologyAwareScheduling, networkConfig)

You also do not need to define networkConfig. It is not bringing any additional value.

if err != nil {
	return fmt.Errorf("failed to create backend reconciler: %w", err)
}
if err := backendReconciler.RegisterWithManager(mgr); err != nil {
Collaborator

No need to short-assign err here, since the err variable is already in function scope via backendReconciler, err := backendcontroller.NewReconciler(mgr).
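
In other words, the registration check can reuse the err that is already in scope, roughly as in the sketch below (the second error message is illustrative).

backendReconciler, err := backendcontroller.NewReconciler(mgr)
if err != nil {
	return fmt.Errorf("failed to create backend reconciler: %w", err)
}
if err = backendReconciler.RegisterWithManager(mgr); err != nil {
	return fmt.Errorf("failed to register backend reconciler with manager: %w", err)
}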

},
PodCliqueScalingGroup: configv1alpha1.PodCliqueScalingGroupControllerConfiguration{
ConcurrentSyncs: ptr.To(1),
operatorConfig := configv1alpha1.OperatorConfiguration{
Collaborator

there seems to be no difference between successful registration and registration with higher concurrency - what are we effectively testing with these 2 variations? Not clear.

// limitations under the License.
// */

package kai
Collaborator

Only introduce the test when you have the implementation.

Contributor Author

Got it


package kube

import (
Collaborator

Just having the KAI placeholder implementation is enough. You can perhaps remove the kube package altogether; both of these have empty method implementations, so there is no point in having more than one.

Contributor Author

If I remove the kube package here, then once we merge this PR the main branch can't support the default kube-scheduler unless we add an implementation in another PR.

// Initialize creates the global backend instance based on schedulerName
// This should be called once during operator startup
// Supported scheduler names: "kai-scheduler", "default-scheduler"
func Initialize(client client.Client, scheme *runtime.Scheme, eventRecorder record.EventRecorder, schedulerName configv1alpha1.SchedulerName) error {
Collaborator

I am a bit confused about this function and this file.
We have a SchedulerBackend.Init method which was supposed to do this, right? The question is why is this here then?


// Validate that the scheduler name matches the one Grove was configured with
if len(uniqueSchedulerNames) > 0 && v.schedulerName != "" {
	userSchedulerName := uniqueSchedulerNames[0]
Collaborator

can you rename userSchedulerName to pcsSchedulerName

	PodGangConditionTypeReady PodGangConditionType = "Ready"
	// PodGangConditionTypeInitialized indicates that all Pods have been created and PodGang has been populated with pod references.
	// This condition is set to True after all pods are created, signaling that scheduling gates can be removed.
	PodGangConditionTypeInitialized PodGangConditionType = "Initialized"
Collaborator

We defined conditions in api/common/constants.go, and there we have constants for both the conditions and the condition reasons. Maybe we make similar changes for PodGang as well?
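
Along the lines of that pattern, an illustrative sketch of condition and reason constants for PodGang; the names are assumptions, not from the PR.

package example

const (
	// ConditionTypePodGangInitialized indicates that all pod references have been populated.
	ConditionTypePodGangInitialized = "Initialized"

	// ConditionReasonPodReferencesPopulated is a possible reason for Initialized=True.
	ConditionReasonPodReferencesPopulated = "PodReferencesPopulated"
	// ConditionReasonPodsPending is a possible reason for Initialized=False.
	ConditionReasonPodsPending = "PodsPending"
)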



Development

Successfully merging this pull request may close these issues.

Add Native Support for Kubernetes Workload API to Enable Gang Scheduling

5 participants