GREP-375 add scheduler backend framework #372

kangclzjc wants to merge 32 commits into ai-dynamo:main
Conversation
remove useless words Co-authored-by: Madhav Bhargava <[email protected]> Signed-off-by: Kang Zhang <[email protected]>
> ### Non-Goals
>
> * **Extract PodGang Reconciler**: Moving the PodGang reconciliation logic from the PodCliqueSet reconciler into an independent reconciler is out of scope. The current reconciliation architecture will be maintained.
Actually this is also an implementation detail that perhaps should be removed as a non-goal from the GREP.
Exactly, this needs to be removed.
> #### Story 3: Scheduler Migration Path
>
> As a cluster administrator, I want to migrate from one scheduler to another (e.g., from a custom scheduler to KAI or vice versa) without significant disruption. The Scheduler Backend Framework should provide a clear migration path where I can update the OperatorConfiguration, restart Grove, and have new workloads use the new scheduler while existing workloads continue running.
Is it Scheduler migration or Workload migration across clusters using different schedulers?
Yes, that is a little confusing. Let me change this. It should cover both: a cluster switching to a different scheduler, and how workloads then adopt the new scheduler.
Ronkahn21 left a comment:

Halfway through. I will complete the review tomorrow.
> The current tight coupling between Grove and specific scheduler implementations creates several challenges:
>
> * **High Integration Cost**: Adding support for a new scheduler requires extensive modifications across Grove's codebase, touching multiple components and requiring deep knowledge of both Grove and the target scheduler's internals.
Architectural Rigidity: Rather than adapting to various scheduler interfaces, Grove acts as a passive producer of PodGang resources. Integration is only possible if the target scheduler is modified to recognize and handle Grove's custom primitives. This shift of responsibility makes it difficult to support third-party or "off-the-shelf" schedulers without custom development on their end.
> ### Limitations/Risks & Mitigations
One limitation that is missing: different schedulers have different capabilities. What is the mitigation for that?
That is a valid point. PodGang exposes a uniform capability set across different schedulers via its API. There are three possibilities:
- The scheduler offers all the capabilities for which PodGang provides configuration.
- The scheduler does not provide, let's say, TAS, but TAS constraints are configured in the PCS and flow to the PodGang.
- The scheduler provides additional capabilities for which any additional configuration (if required) is missing in the PodGang resource.

(1) is a perfect match and is therefore not an issue.
(2) The PCS status should indicate via conditions that TAS is not supported by the selected scheduler. TAS is just one example and could be anything else in the future.
(3) We can check, while integrating the scheduler, whether its additional capabilities require additional configuration. This would potentially be an API change, probably at the PCS + PodGang level, and can even be done in phases. It cannot be provided automatically out of the box.
Agree on solutions (1) and (3).
I am not sure I fully agree on the status indication for (2). Since the user's request is invalid for the chosen scheduler,
maybe we should fail the submission when an unsupported feature is used for the given scheduler backend.
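To make the fail-on-submission variant concrete, here is a minimal sketch. The `Capability` type, its constants, and the `ValidatePCS` helper are all illustrative assumptions, not part of the GREP:

```go
package backend

import (
	"fmt"
	"strings"
)

// Capability names a feature a scheduler backend may or may not support.
// The type and the constants below are hypothetical, not the final API.
type Capability string

const (
	CapabilityTopologyAware Capability = "TopologyAwareScheduling"
	CapabilityGang          Capability = "GangScheduling"
)

// ValidatePCS fails submission when the PodCliqueSet requests capabilities
// the selected backend does not support, instead of surfacing the mismatch
// later via a status condition.
func ValidatePCS(requested []Capability, supported map[Capability]bool) error {
	var missing []string
	for _, c := range requested {
		if !supported[c] {
			missing = append(missing, string(c))
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("selected scheduler backend does not support: %s", strings.Join(missing, ", "))
	}
	return nil
}
```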
Add a missing asterisk Co-authored-by: Madhav Bhargava <[email protected]> Signed-off-by: Kang Zhang <[email protected]>
add symbol Co-authored-by: Madhav Bhargava <[email protected]> Signed-off-by: Kang Zhang <[email protected]>
remove phase1 in limitation Co-authored-by: Madhav Bhargava <[email protected]> Signed-off-by: Kang Zhang <[email protected]>
> For detailed lifecycle flow, see [PodGang Lifecycle Changes](#podgang-lifecycle-changes).
>
> ### Backend Interface Definition
The interface currently omits the relationship between ClusterTopology and secondary resources. How do you envision the navigational link from the main topology to other specific Topology CRDs?
Yes, this is a good point. Per my understanding, for each scheduler backend we should first define the mapping during backend initialization, and then we have several hooks: a PreparePod hook to modify topology labels in the pod spec, and a SyncPodGang hook to translate the generic Topology into the scheduler-specific Topology CRDs.
Today we don't have a controller for ClusterTopology; we would add one as part of multi-cluster topology support, so it might need an extension point of its own.
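For reference, a minimal sketch of what those hooks could look like as a Go interface. The method names come from this thread; the interface shape and the `PodGang` placeholder type are assumptions, not the final API:

```go
package backend

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// PodGang stands in for Grove's PodGang resource in this sketch.
type PodGang struct{}

// SchedulerBackend is one possible per-scheduler extension point.
type SchedulerBackend interface {
	// Name returns the backend identifier, e.g. "kai-scheduler".
	Name() string
	// PreparePod mutates a pod before creation, e.g. rewriting topology
	// labels in the spec for the target scheduler.
	PreparePod(ctx context.Context, pod *corev1.Pod) error
	// SyncPodGang translates the generic PodGang, including topology
	// constraints, into the scheduler-specific CRDs.
	SyncPodGang(ctx context.Context, pg *PodGang) error
}
```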
> // SchedulerName is the name of the scheduler backend with which this instance of Grove operator will run.
> // Valid values: "kai-scheduler" or "default-scheduler"
> // +required
> // +kubebuilder:validation:Enum=kai-scheduler;default-scheduler
Also, should we support moving from the default scheduler without the Workload API to one with the Workload API?
Yes, I believe so. When we support the default scheduler, we may need to detect the Kubernetes version, or whether the Workload API is available or not.
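For illustration, a minimal sketch of the quoted field in context; the surrounding `SchedulerConfig` struct name, the package, and the json tag are assumptions (only the doc comments and markers are quoted from the GREP):

```go
package v1alpha1

// SchedulerConfig selects the scheduler backend for a Grove operator
// instance. The struct name is illustrative.
type SchedulerConfig struct {
	// SchedulerName is the name of the scheduler backend with which this
	// instance of Grove operator will run.
	// Valid values: "kai-scheduler" or "default-scheduler"
	// +required
	// +kubebuilder:validation:Enum=kai-scheduler;default-scheduler
	SchedulerName string `json:"schedulerName"`
}
```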
> #### New Flow (With Framework):
> 1. **Create PodGang early** with PodGroups having empty PodReferences and `Initialized=False`
> 2. **Create Pods** (with scheduling gates to block scheduling)
Could we do this without using the scheduling gate? At large scale it would be expensive to modify every pod's spec to remove the scheduling gate.
I agree with you. If we could refine this scheduling gate, that would be a good enhancement. Maybe we should raise this question and discuss it in another GREP?
Maybe a different question: what would happen if we did not use the scheduling gate at all (beyond what we do today)?
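To show what is being gated and where the per-pod removal cost comes from, a minimal sketch; the gate name is a hypothetical placeholder, not one the GREP defines:

```go
package flow

import corev1 "k8s.io/api/core/v1"

// podGangGate is a hypothetical gate name for illustration only.
const podGangGate = "grove.io/podgang-pending"

// gatePod blocks scheduling at pod creation time (step 2 of the new flow).
func gatePod(pod *corev1.Pod) {
	pod.Spec.SchedulingGates = append(pod.Spec.SchedulingGates,
		corev1.PodSchedulingGate{Name: podGangGate})
}

// ungatePod removes the gate once the PodGang is initialized. At scale this
// means one spec update per pod, which is the cost raised above.
func ungatePod(pod *corev1.Pod) {
	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != podGangGate {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
}
```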
> We introduce Initialized as new PodGang Status Condition to signal that:
> - All expected pods have been created
> - PodGang.Spec.PodGroups[].PodReferences have been populated
Future note: we need to talk about the existence of the PodReferences field.
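For completeness, a minimal sketch of setting the proposed condition, assuming PodGang status exposes a standard `[]metav1.Condition` slice; the reason string is illustrative:

```go
package podgang

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markInitialized flips the Initialized condition to True once all expected
// pods exist and PodGroups[].PodReferences are populated.
func markInitialized(conditions *[]metav1.Condition, generation int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Initialized",
		Status:             metav1.ConditionTrue,
		Reason:             "AllPodsCreated", // illustrative reason
		Message:            "all expected pods created and PodReferences populated",
		ObservedGeneration: generation,
	})
}
```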
Move scheduler string to struct Co-authored-by: Ron Kahn <[email protected]> Signed-off-by: Kang Zhang <[email protected]>
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Add scheduler backend framework to support multiple scheduler backends
Which issue(s) this PR fixes:
Fixes #275
Fixes #375
Special notes for your reviewer:
Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: