Add koord-scheduler backend with gang scheduling support #537

@shinedays

What you would like to be added?

  • Add koord-scheduler as a supported Grove scheduler backend so that Grove workloads can use Koordinator for gang scheduling.
  • When this backend is selected, Grove PodGangs drive Koordinator's gang scheduling mechanism, and PodCliqueSet configurations incompatible with Koordinator (e.g. MNNVL) are rejected at admission.
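
The admission behavior requested above could look roughly like the following sketch. It is only an illustration under stated assumptions: `PodCliqueSetConfig`, its `MNNVLEnabled` field, `ValidateForBackend`, and `ErrUnsupported` are hypothetical names invented here, not Grove's actual API types or webhook code. The backend names are taken from this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// SchedulerBackend names a scheduler backend. The first two values are the
// backends this issue says Grove supports today; the third is the proposal.
type SchedulerBackend string

const (
	BackendKai     SchedulerBackend = "kai-scheduler"
	BackendDefault SchedulerBackend = "default-scheduler"
	BackendKoord   SchedulerBackend = "koord-scheduler"
)

// PodCliqueSetConfig is a hypothetical, heavily simplified stand-in for the
// PodCliqueSet settings an admission check would inspect. It is NOT the real
// Grove type.
type PodCliqueSetConfig struct {
	Name         string
	MNNVLEnabled bool // assumption: MNNVL is modeled as a simple flag
}

// ErrUnsupported marks a configuration the selected backend cannot honor.
var ErrUnsupported = errors.New("configuration not supported by selected scheduler backend")

// ValidateForBackend models only the Koordinator-specific check described in
// this issue: MNNVL is rejected when koord-scheduler is selected. Checks that
// the other backends may perform are out of scope for this sketch.
func ValidateForBackend(b SchedulerBackend, cfg PodCliqueSetConfig) error {
	switch b {
	case BackendKai, BackendDefault:
		return nil
	case BackendKoord:
		if cfg.MNNVLEnabled {
			return fmt.Errorf("PodCliqueSet %q: MNNVL with backend %q: %w",
				cfg.Name, b, ErrUnsupported)
		}
		return nil
	default:
		return fmt.Errorf("unknown scheduler backend %q", b)
	}
}

func main() {
	cfg := PodCliqueSetConfig{Name: "example", MNNVLEnabled: true}
	// With koord-scheduler selected, the MNNVL configuration is rejected.
	fmt.Println(ValidateForBackend(BackendKoord, cfg))
}
```

Wrapping a sentinel error with `%w` lets callers distinguish "incompatible with this backend" from other failures via `errors.Is`, which may be useful if more incompatibilities are added later.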

Why is this needed?

Grove currently supports two scheduler backends: kai-scheduler and default-scheduler. Koordinator is a CNCF-hosted, widely deployed AI/ML-enhanced scheduler that is commonly found in production GPU clusters alongside Grove. Users in these environments currently have no supported way to use Grove with Koordinator for gang scheduling.
Adding a koord-scheduler backend gives Grove users a third scheduling option and enables end-to-end gang scheduling (all-or-nothing pod placement) on clusters where Koordinator is already the scheduler of record.
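
For readers unfamiliar with the term, the all-or-nothing semantics of gang scheduling can be illustrated with a toy sketch. This is not Koordinator's algorithm or Grove's PodGang logic: `Node`, `Pod`, and `PlaceGang` are hypothetical names, and the first-fit strategy is a simplification that can report failure even when a smarter placement exists.

```go
package main

import "fmt"

// Node and Pod are hypothetical, minimal types used only for this illustration.
type Node struct {
	Name     string
	FreeGPUs int
}

type Pod struct {
	Name string
	GPUs int
}

// PlaceGang attempts to place every pod of a gang using first-fit.
// It returns the full pod-to-node assignment and true only if ALL pods fit;
// otherwise it returns nil and false, and no pod is considered placed.
// The input slices are never modified.
func PlaceGang(nodes []Node, pods []Pod) (map[string]string, bool) {
	// Track remaining capacity on a private copy so a failed attempt
	// leaves no partial state behind.
	free := make(map[string]int, len(nodes))
	for _, n := range nodes {
		free[n.Name] = n.FreeGPUs
	}

	placement := make(map[string]string, len(pods))
	for _, p := range pods {
		placed := false
		// Iterate the slice (not the map) for a deterministic node order.
		for _, n := range nodes {
			if free[n.Name] >= p.GPUs {
				free[n.Name] -= p.GPUs
				placement[p.Name] = n.Name
				placed = true
				break
			}
		}
		if !placed {
			// One pod does not fit, so the whole gang is rejected.
			return nil, false
		}
	}
	return placement, true
}

func main() {
	nodes := []Node{{Name: "node-a", FreeGPUs: 2}, {Name: "node-b", FreeGPUs: 1}}
	gang := []Pod{{Name: "worker-0", GPUs: 2}, {Name: "worker-1", GPUs: 2}}

	placement, ok := PlaceGang(nodes, gang)
	fmt.Println(placement, ok)
}
```

In the usage above the second pod cannot fit anywhere, so the first pod is not placed either, which is the behavior that distinguishes gang scheduling from pod-by-pod scheduling.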

Metadata

Labels

enhancement: New feature or request
