KAI Scheduler is a Kubernetes scheduler optimized for GPU resource allocation in AI/ML workloads, built on kube-batch with a modular plugin architecture.
make build # Build all services (Docker-based)
make build-go SERVICE_NAME=scheduler # Build single servicemake lint # Run all linters (fmt, vet, golangci-lint)
make fmt-go # Format Go code
make vet-go # Run go vet- Test files MUST ALWAYS be in the same directory as code
- Test files names MUST ALWAYS end in
_test.go. Example:resolver_test.go
make test # Run all tests (unit + helm chart tests)
# Run a single test file
ginkgo -v ./pkg/scheduler/actions/allocate
# Run a specific test function
ginkgo -v --focus "TestHandleAllocation" ./pkg/scheduler/actions/allocate
# Run tests with Ginkgo (for integration tests)
ginkgo -v --focus "test name pattern" ./pkg/binder/controllers/integration_tests
# Run tests with envtest (requires setup-envtest)
make envtest
KUBEBUILDER_ASSETS="$(bin/setup-envtest use 1.34.0 -p path --bin-dir bin)" go test ./pkg/... -timeout 30mE2E tests run against a real Kubernetes using kind cluster and are located in test/e2e/suites/.
# Run locally with Kind (recommended for development)
./hack/run-e2e-kind.sh # Full e2e suite
./hack/run-e2e-kind.sh --preserve-cluster # Keep cluster after tests
./hack/run-e2e-kind.sh --local-images-build # Build images locally
# Run specific test suites (requires cluster with KAI installed)
ginkgo -r --randomize-all ./test/e2e/suites/allocate
ginkgo -r --randomize-all --focus "quota" ./test/e2e/suites
# Run with verbose output and trace
ginkgo -r --randomize-all --trace -vv ./test/e2e/suites/preemptmake generate # Generate DeepCopy methods
make manifests # Generate CRDs and RBAC
make clients # Generate client code
make generate-mocks # Generate mock implementations
make validate # Verify generated code is up to date. Also format and vet codebasescheduler- Core GPU-aware batch scheduler with plugins, actions, and cachebinder- Pod binding execution with GPU sharing supportoperator- Lifecycle management of all KAI componentspodgrouper- PodGroup creation from workloads (Kubeflow, Spark, Ray, etc.)admission- Validating/mutating webhooks for KAI resourcesqueuecontroller- Queue resource management and status updatespodgroupcontroller- PodGroup lifecycle and status managementresourcereservation- GPU resource reservation for pending podsnodescaleadjuster- Node scaling integration for autoscalers
apis- Custom resource definitions (Queue, PodGroup, BindRequest)common- Shared utilities and constants
fairshare-simulator- Simulate fairshare scheduling decisionstime-based-fairshare-simulator- Time-based scheduling simulationsnapshot-tool- Cluster state snapshot utilitiesscalingpod- Helper for scaling pod operations
Organize imports in three groups separated by blank lines:
import (
// 1. Standard library
"context"
"fmt"
// 2. External dependencies
v1 "k8s.io/api/core/v1"
"sigs.k8s.io/controller-runtime/pkg/client"
// 3. Internal packages
"github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/api"
)- Files: snake_case (
bind_request_controller.go,node_info.go) - Types: PascalCase (
SchedulerCache,NodeInfo) - Interfaces:
-ersuffix orInterface(Plugin,SchedulerLogger) - Functions: PascalCase exported, camelCase unexported
- Boolean functions:
is/has/shouldprefix (IsTaskAllocatable) - Constants: PascalCase exported, camelCase unexported
log.InfraLogger.V(6).Infof("Task <%s/%s> allocatable on node <%s>", ...) // V(2-3) operational, V(5-6) debug
logger := log.FromContext(ctx) // Controller logging
logger.Info("Binding pod", "namespace", pod.Namespace, "name", pod.Name)- Apache 2.0 + NVIDIA copyright headers on all files
- GoDoc-style for exported functions/types
- kubebuilder RBAC markers:
// +kubebuilder:rbac:groups=core,resources=pods,verbs=get - Avoid obvious comments; explain "why" not "what"
- Context as first parameter:
func Foo(ctx context.Context, ...) - Pointer receivers for methods that modify state
- Interfaces in
interface.gofiles - Constructor functions return interface types:
func New(...) Cache - Use
deferfor cleanup and state restoration
Enabled linters (see build/lint/.golangci.yaml):
gofmtunusedgoconsterrcheckgovet
Test files (*_test.go) have relaxed rules for goconst, errcheck, govet.
PR titles must follow semantic format: <type>(<scope>): <description>
Types: feat, fix, docs, style, refactor, perf, test, build, ci, chore, revert
Scopes (optional): scheduler, binder, podgrouper, admission, operator, queue-controller, pod-group-controller, resource-reservation, chart, api, node-scale-adjuster, ci, release, docs, deps
When opening a PR, use the template in .github/pull_request_template.md for the PR description
- You must update
CHANGELOG.mdfor PRs tomainor version branches (v*.*) for behavior changes: ones that add functionality, fix bugs, change APIs, or introduce significant performance improvements. Not needed for refactors, documentations, tests, and CI changes. - Add
skip-changelogordependencieslabel to skip this check
PRs trigger: make validate → make test → make build → E2E tests
- Docs-only changes (
.mdfiles,docs/) skip build/test - E2E tests run with Ginkgo:
ginkgo -r --randomize-all ./test/e2e/suites
- Use
git mvwhen moving files to preserve history - DO NOT add obvious comments that duplicate the code.
- Write comments ONLY when ABSOLUTELY nessessary
- Keep Comments short, explicit and concise
- DO NOT use
I,weor any other pronoun
- When performing a major change, run
make validateafter changes
-
Documentation can be found in
docsfolder -
Design documents for major features are in
docs/developer/designs/: -
Usage Examples are in
examples