Conversation
Not sure I quite understand the goals of this; it's already possible to support different schedulers via the pod specs (though KAI is the only gang scheduler currently working). I'd suggest kicking off work like this with a GitHub issue with plenty of detail, and a Discord discussion as well.
Sure, let me create a new issue to introduce this, and I can add some background here. This is a real request from one of our customers: we have several schedulers that want to integrate with Grove, and it would be great to have a unified scheduler backend so that we can support other schedulers easily. We need to support multiple schedulers as backends, and in particular the Kubernetes 1.34 Workload API. Once we have this backend framework we can easily add support for new schedulers such as the default kube-scheduler and Koordinator. This PR only introduces the scheduler backend framework. For the KAI scheduler backend, I won't change the current workflow, which means KAI will still handle the PodGang and create PodGroups/pods.
Ronkahn21 left a comment:
Overall looks great! A few architectural points to consider:
- Controller Responsibility: I don't think the pcs-controller should be updating the PodGang status. Ideally, it should only handle the creation, leaving the podGang-controller to manage its own status.
- Scaling & Performance: We should discuss the PodGang pod reference fields. Adding this to the pcs-controller increases its complexity. For better scalability, it might be better to let the PodGroup own the pod status before we move toward creating the backend API.
Since the API changes are currently out of scope, we can sync on this later. Amazing job overall, thanks!
    // Also verify that all PodGroups have enough podReferences to meet minReplicas
    for _, pg := range podGang.Spec.PodGroups {
        if int32(len(pg.PodReferences)) < pg.MinReplicas {
            return false
        }
    }
Can you explain why it is a requirement for filtering the events? I am afraid we will be missing event handling.
For reconcilers that are registered to react to PodGang create/update events we should now consider looking at the PodGangInitialized condition with its value set to True to allow enqueueing of events for the reconciler. I added it because I only want to handle events when we have a stable (enough replicas) status. If PodGangInitialized is False, it means the PodGang hasn't been completely created yet (it has no PodReferences), and in that case I think we don't need to handle the event.
Yes, based on the GREP we don't need to compare replicas, since PodGangInitialized = True means we already have enough pod references in the PodGang.
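To make the filtering concrete, here is a minimal sketch of such an event filter using controller-runtime predicates. It assumes an "Initialized" condition type and takes a conditions accessor as a callback so that Grove's API types are not guessed at; this is illustrative, not the actual Grove implementation.

```go
package podgangfilter

import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// podGangInitializedPredicate only lets create/update events through once the
// PodGang carries the (assumed) Initialized=True condition, so reconcilers
// skip PodGangs whose pod references are not fully populated yet.
func podGangInitializedPredicate(conditionsOf func(client.Object) []metav1.Condition) predicate.Funcs {
	initialized := func(obj client.Object) bool {
		return apimeta.IsStatusConditionTrue(conditionsOf(obj), "Initialized")
	}
	return predicate.Funcs{
		CreateFunc:  func(e event.CreateEvent) bool { return initialized(e.Object) },
		UpdateFunc:  func(e event.UpdateEvent) bool { return initialized(e.ObjectNew) },
		DeleteFunc:  func(event.DeleteEvent) bool { return true },
		GenericFunc: func(event.GenericEvent) bool { return false },
	}
}
```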
    sort.Slice(podReferences, func(i, j int) bool {
        return podReferences[i].Name < podReferences[j].Name
    })
We might need to change the API field to ignore ordering, to avoid sorting a big pod list on every sync.
That's a good idea, since we have customers with over 500 pods. But in that case we won't have a defined order for the pods; I'm not sure whether that is acceptable.
And if you do not sort it then there will be unnecessary updates to the PodGangs.
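One way to get both properties is to compare the reference lists order-insensitively and only sort when an update actually has to be written. The sketch below assumes references can be compared by name and is illustrative only, not Grove's code:

```go
package podrefs

import "sort"

// referencesEqualIgnoringOrder reports whether existing and desired contain
// exactly the same names, regardless of order, so a pure ordering difference
// never triggers a PodGang update.
func referencesEqualIgnoringOrder(existing, desired []string) bool {
	if len(existing) != len(desired) {
		return false
	}
	counts := make(map[string]int, len(existing))
	for _, name := range existing {
		counts[name]++
	}
	for _, name := range desired {
		if counts[name] == 0 {
			return false
		}
		counts[name]--
	}
	return true
}

// normalizeForUpdate sorts the desired references so that the object written
// back to the API server stays deterministic; it is only called when
// referencesEqualIgnoringOrder has reported a real change.
func normalizeForUpdate(desired []string) []string {
	sorted := append([]string(nil), desired...)
	sort.Strings(sorted)
	return sorted
}
```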
    const (
        // BackendName is the name of the KAI backend
        BackendName = "kai"
KAI

Suggested change:
-    BackendName = "kai"
+    BackendName = "KAI-Scheduler"
I know "KAI" might read better; just one thing: in the Pod spec we also use kai-scheduler/default-scheduler, so I unified everything to kai-scheduler/default-scheduler, even in values.yaml.
@kangclzjc please rebase your PR so that it becomes easier to review.
          return err
      }
-     if err := registerWebhooksWithMgr(mgr, operatorCfg.Authorizer, operatorCfg.TopologyAwareScheduling, operatorCfg.Network); err != nil {
+     if err := registerWebhooksWithMgr(mgr, operatorCfg.Authorizer, operatorCfg.TopologyAwareScheduling, operatorCfg.Network, string(operatorCfg.SchedulerName)); err != nil {
By your own argument in the comment, can you replace the arguments to registerWebhooksWithMgr with just two, namely mgr and operatorCfg?
Yes. And since there are other functions in this file (several NewHandler calls) that also take multiple parameters, do you think we should modify them together as well? For example:

    pcsValidatingWebhook := pcsvalidation.NewHandler(mgr, operatorCfg.TopologyAwareScheduling, operatorCfg.Network, string(operatorCfg.SchedulerName))
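For reference, a rough sketch of the two-parameter shape being discussed, using placeholder types (OperatorConfig, Handler) rather than Grove's real configv1alpha1/webhook types:

```go
package webhooks

import "sigs.k8s.io/controller-runtime/pkg/manager"

// OperatorConfig stands in for the operator configuration; the fields are
// placeholders for whatever the handlers actually need.
type OperatorConfig struct {
	TopologyAwareScheduling bool
	NetworkAcceleration     bool
	SchedulerName           string
}

// Handler stands in for a validating/defaulting webhook handler.
type Handler struct {
	mgr manager.Manager
	cfg OperatorConfig
}

// NewHandler takes the manager plus the whole config instead of an
// ever-growing list of individual fields.
func NewHandler(mgr manager.Manager, cfg OperatorConfig) *Handler {
	return &Handler{mgr: mgr, cfg: cfg}
}

// registerWebhooksWithMgr shrinks to two parameters; each handler pulls the
// fields it needs out of cfg.
func registerWebhooksWithMgr(mgr manager.Manager, cfg OperatorConfig) error {
	_ = NewHandler(mgr, cfg)
	return nil
}
```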
    // PreparePod adds KAI scheduler-specific configuration to the Pod
    // This includes: labels, annotations, etc.
    func (b *Backend) PreparePod(_ *corev1.Pod) {
I think we should add schedulerName: kai-scheduler to the pod spec here.
Every scheduler backend needs to set schedulerName, so I unified this in the buildResource function, because I think it should be common to every backend and every pod:

    pod.Spec.SchedulerName = schedulerbackend.Get().Name()
    // Use backend to prepare Pod spec based on scheduler requirements
    // This adds labels, annotations, etc.
    if err = schedulerbackend.PreparePod(pod); err != nil {
        return groveerr.WrapError(err,
            errCodeBuildPodResource,
            component.OperationSync,
            "failed to prepare pod spec with scheduler backend",
        )
    }
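As a rough illustration of the backend abstraction this relies on (not Grove's actual interface), the common step could look like this, with Name and PreparePod supplied per scheduler:

```go
package schedulerbackend

import corev1 "k8s.io/api/core/v1"

// Backend hides scheduler-specific behaviour behind a common interface.
type Backend interface {
	// Name returns the scheduler name to put into pod.Spec.SchedulerName.
	Name() string
	// PreparePod adds scheduler-specific labels, annotations, etc. to the Pod.
	PreparePod(pod *corev1.Pod) error
}

// kaiBackend is an illustrative placeholder for the KAI implementation.
type kaiBackend struct{}

func (kaiBackend) Name() string { return "kai-scheduler" }

func (kaiBackend) PreparePod(pod *corev1.Pod) error {
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	// KAI-specific metadata would go here.
	return nil
}

// preparePod is the common step applied for every backend and every pod, as
// described in the comment above.
func preparePod(b Backend, pod *corev1.Pod) error {
	pod.Spec.SchedulerName = b.Name()
	return b.PreparePod(pod)
}
```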
      // NewHandler creates a new handler for PodCliqueSet Webhook.
-     func NewHandler(mgr manager.Manager, tasConfig configv1alpha1.TopologyAwareSchedulingConfiguration, networkConfig configv1alpha1.NetworkAcceleration) *Handler {
+     func NewHandler(mgr manager.Manager, tasConfig configv1alpha1.TopologyAwareSchedulingConfiguration, networkConfig configv1alpha1.NetworkAcceleration, schedulerName string) *Handler {
@unmarshall do you think we should also group these into just two parameters, since they all come from one parent config?
        concurrentSyncs: 3
      podCliqueScalingGroup:
        concurrentSyncs: 2
    schedulerName: kai-scheduler
Ideally we should not be creating multiple configuration test data files. Instead just creating one is sufficient. Use go templates to create a test-config-template.yaml and in tests replace the template parameters with different values. But maybe we do this as a separate PR since you have not really introduced any new testdata file as part of this PR but only added schedulerName to an existing one.
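A small sketch of the templating idea, assuming a hypothetical testdata/test-config-template.yaml with a {{ .SchedulerName }} placeholder; this is not existing Grove test code:

```go
package config_test

import (
	"bytes"
	"os"
	"testing"
	"text/template"
)

// renderTestConfig renders the shared config template with a per-test
// scheduler name, so a single testdata file covers all variations.
func renderTestConfig(t *testing.T, schedulerName string) []byte {
	t.Helper()
	raw, err := os.ReadFile("testdata/test-config-template.yaml")
	if err != nil {
		t.Fatalf("reading template: %v", err)
	}
	tmpl, err := template.New("config").Parse(string(raw))
	if err != nil {
		t.Fatalf("parsing template: %v", err)
	}
	var out bytes.Buffer
	if err := tmpl.Execute(&out, struct{ SchedulerName string }{SchedulerName: schedulerName}); err != nil {
		t.Fatalf("rendering template: %v", err)
	}
	return out.Bytes()
}
```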
        operatorConfig.SchedulerName,
    ); err != nil {
        logger.Error(err, "failed to initialize scheduler backend")
        handleErrorAndExit(err, cli.ExitErrInitializeManager)
Can you introduce a new code ErrInitializeSchedulerBackend and use it here?
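A tiny sketch of what that could look like, assuming exit codes live in the cli package as constants; the existing names and concrete values here are placeholders:

```go
package cli

// Exit codes returned by the operator binary. Only the new
// ExitErrInitializeSchedulerBackend follows the review suggestion; the
// values are illustrative.
const (
	ExitSuccess                       = 0
	ExitErrInitializeManager          = 2
	ExitErrInitializeSchedulerBackend = 3
)
```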
-     func RegisterControllers(mgr ctrl.Manager, controllerConfig configv1alpha1.ControllerConfiguration, topologyAwareSchedulingConfig configv1alpha1.TopologyAwareSchedulingConfiguration, networkConfig configv1alpha1.NetworkAcceleration) error {
+     func RegisterControllers(mgr ctrl.Manager, config configv1alpha1.OperatorConfiguration) error {
+         controllerConfig := config.Controllers
+         topologyAwareSchedulingConfig := config.TopologyAwareScheduling
Is there a need to define a new variable topologyAwareSchedulingConfig?
You could have directly used it:
    pcsReconciler := podcliqueset.NewReconciler(mgr, controllerConfig.PodCliqueSet, config.TopologyAwareScheduling, networkConfig)

You also do not need to define networkConfig. It is not bringing any additional value.
    if err != nil {
        return fmt.Errorf("failed to create backend reconciler: %w", err)
    }
    if err := backendReconciler.RegisterWithManager(mgr); err != nil {
No need to short-assign err here, since the err variable is already brought into function scope via backendReconciler, err := backendcontroller.NewReconciler(mgr).
        },
        PodCliqueScalingGroup: configv1alpha1.PodCliqueScalingGroupControllerConfiguration{
            ConcurrentSyncs: ptr.To(1),
    operatorConfig := configv1alpha1.OperatorConfiguration{
there seems to be no difference between successful registration and registration with higher concurrency - what are we effectively testing with these 2 variations? Not clear.
    // limitations under the License.
    // */

    package kai
Only introduce the test when you have the implementation.
    package kube

    import (
Just having the KAI placeholder implementation is enough. You can perhaps remove the kube package altogether. Both of these have empty implementations of the methods, so there is no point in having more than one.
If I remove the kube package here, then once we merge this PR the main branch can't support the default kube-scheduler unless we add an implementation in another PR.
    // Initialize creates the global backend instance based on schedulerName
    // This should be called once during operator startup
    // Supported scheduler names: "kai-scheduler", "default-scheduler"
    func Initialize(client client.Client, scheme *runtime.Scheme, eventRecorder record.EventRecorder, schedulerName configv1alpha1.SchedulerName) error {
I am a bit confused about this function and this file.
We have a SchedulerBackend.Init method which was supposed to do this, right? The question is why is this here then?
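For readers following along, the function under discussion essentially acts as a backend factory keyed by scheduler name. A simplified, purely illustrative sketch of that role (not Grove's implementation):

```go
package backendregistry

import "fmt"

// Backend is the per-scheduler abstraction discussed earlier in this thread.
type Backend interface {
	Name() string
}

type kaiBackend struct{}

func (kaiBackend) Name() string { return "kai-scheduler" }

type kubeBackend struct{}

func (kubeBackend) Name() string { return "default-scheduler" }

var current Backend

// Initialize selects the global backend once at operator startup based on
// the configured scheduler name.
func Initialize(schedulerName string) error {
	switch schedulerName {
	case "kai-scheduler":
		current = kaiBackend{}
	case "default-scheduler":
		current = kubeBackend{}
	default:
		return fmt.Errorf("unsupported scheduler backend %q", schedulerName)
	}
	return nil
}

// Get returns the backend selected by Initialize.
func Get() Backend { return current }
```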
    // Validate that the scheduler name matches the one Grove was configured with
    if len(uniqueSchedulerNames) > 0 && v.schedulerName != "" {
        userSchedulerName := uniqueSchedulerNames[0]
can you rename userSchedulerName to pcsSchedulerName
    PodGangConditionTypeReady PodGangConditionType = "Ready"
    // PodGangConditionTypeInitialized indicates that all Pods have been created and PodGang has been populated with pod references.
    // This condition is set to True after all pods are created, signaling that scheduling gates can be removed.
    PodGangConditionTypeInitialized PodGangConditionType = "Initialized"
We defined conditions in api/common/constants.go, and there we have defined constants for both condition types and condition reasons. Maybe we do similar changes for PodGang as well?
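A hedged sketch of what mirroring that pattern could look like for PodGang; the constant names and reason values below are illustrative, not the ones actually defined in api/common/constants.go:

```go
package common

// Condition types and reasons for PodGang, following the same pattern as the
// other constants in this package (names here are only illustrative).
const (
	// ConditionTypePodGangInitialized indicates that all Pods have been
	// created and the PodGang has been populated with pod references.
	ConditionTypePodGangInitialized = "Initialized"
	// ConditionReasonAllPodReferencesPopulated is a possible reason set when
	// initialization completes.
	ConditionReasonAllPodReferencesPopulated = "AllPodReferencesPopulated"
)
```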
What type of PR is this?
In order to support different schedulers as backends, we modify Grove and introduce a scheduler backend interface.
What this PR does / why we need it:
In the current PodGang component's sync flow, the Pods are created first and then the PodGang resource. So you can see the PodGang will be created after the Pods. However, there is a problem with the upcoming Workload API support and the kube-scheduler backend.

We don't want to break the current PodGang workflow. We introduce this scheduler backend framework to leave the Workload management work to the scheduler backend in Grove. For other schedulers, the scheduler backend in Grove may manage a different CR based on the PodGang (just like KAI, which creates PodGroups; in the future, we will move this management from the KAI scheduler to the Grove scheduler backend).

To create a Workload object, you will need to create the PodGang resource. The PodGang resource cannot be created before the Pods have been created and have a back reference to the PodGang. The issue is that only after the Workload object is created will the kube-scheduler run its scoring/filtering plugins to reserve node capacity to schedule this workload's PodGroups. The Pods need to have a reference to the Workload object in their spec.

So to accommodate the Workload API, the flow needs to be changed as below in the PodGang component:
- Create the PodGang with PodGroups (having empty PodReferences, as none will exist at this point) and the Initialized condition set to False.
- The PodGang will trigger the creation of the Workload object in the scheduler backend reconciler, which will use the kube scheduler backend.
- The PodGang has its Initialized condition set to True once all Pods have been created and their references populated - done in the PCLQ reconciler.

Which issue(s) this PR fixes:
Fixes #275
Special notes for your reviewer:
Does this PR introduce an API change?
Yes. We will introduce a new API: SchedulerBackend.

Additional documentation e.g., enhancement proposals, usage docs, etc.: