- AI agents: Automate repository tasks with minimal context
- Contributors: Humans using AI assistants or working directly
- Maintainers: Ensure assistants follow project conventions and CI rules
AI agents should:
- Make atomic, minimal, and reversible changes.
- Prefer local analysis (e.g. `make generate`, `make fmt`, `make test`, `make test-python`) before proposing commits.
- NEVER modify configuration, CI/CD, or release automation unless explicitly requested.
- Scan the generated code for vulnerabilities and dependency upgrades.
- Avoid non-deterministic code or random seeds without fixtures.
- Use `AGENTS.md` and `Makefile` as the source of truth for development commands.
Agents must NOT:
- Bypass tests or linters
- Introduce dependencies without updating `go.mod` (Go), `pyproject.toml` (Python), or `Cargo.toml` (Rust)
- Generate or commit large autogenerated files
- Modify CRD schemas or API versions without explicit instruction
Before writing code, agents should:
- Read existing test cases and docstrings for pattern alignment
- Match import patterns from neighboring files
- Preserve existing logging and error-handling conventions
- Understand the plugin architecture and extension framework before modifying runtime code
- Review CRD schemas in `pkg/apis/` before changing API structures
- Call out any breaking changes introduced and follow the deprecation policy
For additional context see the Kubeflow Trainer docs.
kubeflow/trainer/
├── .github/                         # GitHub Actions for CI/CD
├── charts/                          # Helm charts for deployment
├── cmd/                             # Command-line applications and binaries
│   ├── trainer-controller-manager/  # Main Trainer controller (Go)
│   ├── initializers/                # Dataset/model initializers (Python)
│   │   ├── dataset/
│   │   └── model/
│   ├── runtimes/                    # Builtin ML training runtimes
│   │   ├── deepspeed/               # DeepSpeed runtime
│   │   └── mlx/                     # MLX runtime
│   ├── trainers/                    # Builtin trainers for LLM fine-tuning
│   │   └── torchtune/               # TorchTune fine-tuning trainer
│   └── data_cache/                  # Distributed data cache service (Rust)
├── docs/                            # Documentation and proposals
├── examples/                        # Examples with TrainJobs
├── hack/                            # Scripts to manage CI/CD and installation
├── manifests/                       # Kustomize manifests for deployment
├── pkg/                             # Core library packages (Go)
│   ├── apis/                        # Kubernetes CRD API definitions
│   │   ├── trainer/v1alpha1/        # TrainJob, TrainingRuntime, and ClusterTrainingRuntime APIs
│   │   └── config/v1alpha1/         # Trainer config APIs
│   ├── config/                      # Trainer config logic
│   ├── controller/                  # Trainer Kubernetes controllers logic
│   ├── runtime/                     # Trainer Extension Framework
│   │   ├── core/                    # Core runtime implementation
│   │   ├── framework/               # Implementation for the framework
│   │   │   ├── plugins/             # Implementation for the builtin plugins
│   │   │   │   ├── torch/           # PyTorch plugin
│   │   │   │   ├── mpi/             # MPI plugin
│   │   │   │   ├── jobset/          # JobSet plugin
│   │   │   │   └── ...
│   │   │   └── interface.go         # Framework interface definitions
│   │   └── runtime.go               # Implementation of the Info object, which carries information through the plugin chain
│   ├── webhooks/                    # Kubernetes validation/mutation webhooks for Trainer
│   ├── data_cache/                  # Distributed in-memory cache (Rust)
│   ├── initializers/                # Dataset and model initializers (Python)
│   └── util/                        # Utility functions (Go)
├── test/                            # Integration and E2E tests
│   ├── integration/                 # Ginkgo integration tests
│   └── e2e/                         # End-to-end tests
- Go: primary language for controller, APIs, plugins
- Python: dataset and model initializer
- Rust: data cache
- Build: `make` (orchestration), `go build`, `cargo`, `docker`
- Lint/format: `golangci-lint`, `gofmt` (Go), `ruff` (Python), `cargo fmt` (Rust)
- Tests: `go test`, `ginkgo` (integration), `pytest` (Python), `cargo test` (Rust)
- Code generation: `controller-gen`, `openapi-gen`
- Pre-commit: Config provided and enforced in CI
Use any available container runtime to build an image. For example:
docker build . -f cmd/trainer-controller-manager/Dockerfile -t trainer-controller-manager:test
docker build . -f cmd/runtimes/deepspeed/Dockerfile -t deepspeed-runtime:test

make test # Go unit tests
make test-integration # Go integration tests
make test-python # Python unit tests
make test-python-integration # Python integration tests
make test-rust # Rust unit tests
make test-e2e # End-to-end tests (requires Kind cluster)
# Targeted tests
go test ./pkg/controller/... # Run all controller tests
go test -v -run TestTrainJobController ./pkg/controller/ # Run specific test function

make fmt # Format Go code
make vet # Vet the Go code
make golangci-lint # Verify the Go code

Code generation (always run after modifying the APIs):
make generate # Generate the required files

Pre-commit:
pre-commit install # Install hooks
pre-commit run --all-files # Run all hooks manually

Preferred commands: Use `make` targets to ensure consistency with CI
Before making changes:
- Read existing code patterns, comments, and tests for alignment
- Check the Core Development Principles below
- Run the quick start commands to validate and test your changes
Commit/PR hygiene:
- Follow Conventional Commits in titles and messages.
- See `check-pr-title.yaml` for PR title conventions.
- Include rationale ("why") in commit messages/PR descriptions
- Do not push secrets or change git config
- Scope discipline: only modify files relevant to the task; keep diffs minimal
Always preserve API compatibility for released versions. APIs are in alpha and evolving.
API Stability Rules:
- CRD schemas (`pkg/apis/trainer/v1alpha1`): Changes require careful review
  - Adding fields: Use `+optional` marker and provide defaults
  - ALWAYS use the CEL validation whenever applicable
  - Removing/renaming fields: Requires API version bump and migration plan
  - Changing field types: Breaking change, requires deprecation period
- Go public APIs: Exported types, functions, interfaces
  - Check if exported (capitalized names)
  - Look for usage in examples, tests, and documentation
  - Use deprecation comments for gradual removal
- Plugin interfaces (`pkg/runtime/framework/interface.go`): Breaking changes affect all plugins
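To make the rule on renaming fields concrete, the sketch below decodes a manifest written against the released field name into two hypothetical struct variants. It uses plain `encoding/json` rather than the Kubernetes API machinery, so it only approximates what a real client or controller would see; `specCompatible`, `specRenamed`, and `decode` are illustrative names, not Trainer types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// specCompatible keeps the released field and adds a new optional one.
type specCompatible struct {
	Suspend    *bool `json:"suspend,omitempty"`
	NewFeature *bool `json:"newFeature,omitempty"`
}

// specRenamed renames the field, which also changes its JSON key.
type specRenamed struct {
	Paused *bool `json:"paused,omitempty"`
}

// decode unmarshals a stored manifest into both hypothetical variants.
func decode(stored []byte) (specCompatible, specRenamed, error) {
	var compatible specCompatible
	var renamed specRenamed
	if err := json.Unmarshal(stored, &compatible); err != nil {
		return compatible, renamed, err
	}
	err := json.Unmarshal(stored, &renamed)
	return compatible, renamed, err
}

func main() {
	// A manifest written against the released schema.
	compatible, renamed, err := decode([]byte(`{"suspend": true}`))
	if err != nil {
		panic(err)
	}

	// true: the existing value is preserved.
	fmt.Println(compatible.Suspend != nil && *compatible.Suspend)
	// true: the new optional field simply stays unset.
	fmt.Println(compatible.NewFeature == nil)
	// true: after the rename, the user's setting is silently dropped.
	fmt.Println(renamed.Paused == nil)
}
```

The additive variant reads old manifests unchanged, while the renamed variant loses data without raising an error, which is why a rename calls for a version bump and a migration plan.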
❌ Bad - Breaking Change:
// Changed field name in CRD without migration
type TrainJobSpec struct {
	// Changed from `Suspend` to `Paused`
	Paused *bool `json:"paused,omitempty"`
}

✅ Good - Backward Compatible:
type TrainJobSpec struct {
	// Suspend pauses job execution without deleting resources.
	// Useful for debugging or resource optimization.
	// +optional
	Suspend *bool `json:"suspend,omitempty"`
	// NewFeature enables experimental capability
	// +optional
	NewFeature *bool `json:"newFeature,omitempty"`
}

ALWAYS follow the existing patterns in the codebase.
❌ Bad:
func p(u, d interface{}) interface{} {
	return u
}

✅ Good:
// ReconcileTrainJob reconciles a TrainJob object
func (r *TrainJobReconciler) ReconcileTrainJob(ctx context.Context, trainJob *trainv1alpha1.TrainJob) error {
	log := ctrl.LoggerFrom(ctx)
	log.V(1).Info("Reconciling TrainJob", "name", trainJob.Name, "namespace", trainJob.Namespace)
	// Implementation...
	return nil
}

Go Style Requirements:
- Follow Kubernetes code conventions, Effective Go, and Kubernetes API best practices.
- Use structured logging with `ctrl.LoggerFrom(ctx)` (Zap-based)
- Error handling: Always check errors, use `fmt.Errorf` for wrapping
- Naming: `camelCase` for unexported, `PascalCase` for exported
- Package names: Short, lowercase, no underscores
❌ Bad - Missing provider pattern:
class CustomModel:  # Not inheriting from ModelProvider(ABC)
    def download(self):
        pass

✅ Good - Following provider pattern:
class HuggingFace(utils.ModelProvider):
    """HuggingFace model initializer."""

    def load_config(self) -> None:
        config_dict = utils.get_config_from_env(types.HuggingFaceModelInitializer)
        self.config = types.HuggingFaceModelInitializer(**config_dict)

    def download_model(self) -> None:
        """Download model from HuggingFace Hub."""
        # Implementation...

Python Style Requirements:
- Line length 100, Python 3.11+, double quotes, space indentation
- Imports: isort via ruff; prefer absolute imports
- Naming: `snake_case` for functions/vars, `PascalCase` for classes, `UPPER_SNAKE_CASE` for constants
- Use descriptive variable names; break up complex functions (>20 lines)
- Use logging module (not print statements) for output
- Follow Cargo conventions and rustfmt defaults
/// Distributed cache server implementation
pub struct CacheServer {
    config: ServerConfig,
    state: Arc<RwLock<CacheState>>,
}

impl CacheServer {
    /// Create new cache server instance
    pub fn new(config: ServerConfig) -> Result<Self> {
        Ok(Self {
            config,
            state: Arc::new(RwLock::new(CacheState::default())),
        })
    }
}

- Every new feature or bugfix MUST be covered by tests
- Every new test MUST follow the existing test structure
- Unit tests should live in the same folder as the source code
- Integration tests should go in the `test/integration/` directory
- File names must have the `*_test.go` suffix
- Use a map (dictionary) to define test cases
- Every new function must have a corresponding test function prefixed with `Test`
  - Example: `func RunEnforceMLPolicyPlugins()` -> `func TestRunEnforceMLPolicyPlugins()`
- Integration tests use the Ginkgo framework
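The Go testing rules above can be sketched as a single `*_test.go` file. This is a minimal example under stated assumptions, not the project's actual test layout: `NormalizeRuntimeName` is a hypothetical function under test, and the case struct fields are chosen for illustration.

```go
package main

import (
	"errors"
	"strings"
	"testing"
)

// NormalizeRuntimeName is a hypothetical function under test.
func NormalizeRuntimeName(name string) (string, error) {
	trimmed := strings.ToLower(strings.TrimSpace(name))
	if trimmed == "" {
		return "", errors.New("runtime name must not be empty")
	}
	return trimmed, nil
}

// TestNormalizeRuntimeName follows the Test<FunctionName> naming rule and
// keeps its cases in a map keyed by a descriptive name.
func TestNormalizeRuntimeName(t *testing.T) {
	cases := map[string]struct {
		input   string
		want    string
		wantErr bool
	}{
		"trims and lowercases":   {input: "  Torch ", want: "torch"},
		"empty name is rejected": {input: "   ", wantErr: true},
	}
	for name, tc := range cases {
		// Each map entry runs as its own named subtest.
		t.Run(name, func(t *testing.T) {
			got, err := NormalizeRuntimeName(tc.input)
			if (err != nil) != tc.wantErr {
				t.Fatalf("unexpected error state: %v", err)
			}
			if got != tc.want {
				t.Errorf("got %q, want %q", got, tc.want)
			}
		})
	}
}
```

Saved with a `_test.go` suffix, the file runs with `go test`. Keying cases by name makes failures readable, and because Go randomizes map iteration order, it also discourages tests that depend on one another.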
- File names must have the `*_test.py` suffix
- Use pytest with fixtures
- Every new function must have a corresponding test function prefixed with `test_`
  - Example: `def calculate_total()` -> `def test_calculate_total()`
- Use `pytest.mark.parametrize` with the `TestCase` dataclass for multiple test scenarios:

# TestCase, SUCCESS, and FAILED are assumed to come from the project's shared test helpers.
@pytest.mark.parametrize(
    "test_case",
    [
        TestCase(
            name="valid dataset URI",
            expected_status=SUCCESS,
            config={"uri": "hf://meta-llama/model"},
            expected_output={"scheme": "hf"},
        ),
        TestCase(
            name="invalid URI format",
            expected_status=FAILED,
            config={"uri": "invalid"},
            expected_error=ValueError,
        ),
    ],
)
def test_parse_dataset_uri(test_case):
    # Failure cases assert on the raised exception; success cases on the output.
    if test_case.expected_status == FAILED:
        with pytest.raises(test_case.expected_error):
            parse_dataset_uri(**test_case.config)
        return
    result = parse_dataset_uri(**test_case.config)
    assert result == test_case.expected_output