trainer/AGENTS.md at master · kubeflow/trainer

Who This Is For

AI agents: Automate repository tasks with minimal context
Contributors: Humans using AI assistants or working directly
Maintainers: Ensure assistants follow project conventions and CI rules

Agent Behavior Policy

AI agents should:

Make atomic, minimal, and reversible changes.
Prefer local analysis (e.g. make generate, make fmt, make test, make test-python) before proposing commits.
NEVER modify configuration, CI/CD, or release automation unless explicitly requested.
Scan generated code for vulnerabilities and for dependencies that need upgrades.
Avoid non-deterministic code or random seeds without fixtures.
Use AGENTS.md and Makefile as the source of truth for development commands.
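
The rule about non-deterministic code can be illustrated with a small Python sketch. It assumes a hypothetical helper, `shuffled_batches`, that takes its random number generator as an argument, and a hypothetical `make_rng` function standing in for a test fixture. Neither name comes from this repository.

```python
import random


def make_rng(seed: int = 1234) -> random.Random:
    """Stand-in for a test fixture: returns a seeded, isolated RNG."""
    return random.Random(seed)


def shuffled_batches(items, batch_size: int, rng: random.Random) -> list[list]:
    """Shuffle with an injected RNG so the caller controls determinism."""
    order = list(items)
    rng.shuffle(order)
    return [order[i : i + batch_size] for i in range(0, len(order), batch_size)]


# Two runs with the same seed produce the same batches.
first = shuffled_batches(range(10), 4, make_rng())
second = shuffled_batches(range(10), 4, make_rng())
assert first == second
```

Injecting the generator keeps the seed out of the production code path and avoids touching the global `random` state, so tests stay reproducible without affecting other callers.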

Agents must NOT:

Bypass tests or linters
Introduce dependencies without updating go.mod (Go) or pyproject.toml (Python) or Cargo.toml (Rust)
Generate or commit large autogenerated files
Modify CRD schemas or API versions without explicit instruction

Context Awareness

Before writing code, agents should:

Read existing test cases and docstrings for pattern alignment
Match import patterns from neighboring files
Preserve existing logging and error-handling conventions
Understand the plugin architecture and extension framework before modifying runtime code
Review CRD schemas in pkg/apis/ before changing API structures
Call out any breaking changes they introduce and follow the deprecation policy

For additional context see the Kubeflow Trainer docs.

Repository Map

kubeflow/trainer/
├── .github/                         # GitHub actions for CI/CD
├── charts/                          # Helm charts for deployment
├── cmd/                             # Command-line applications and binaries
│   ├── trainer-controller-manager/    # Main Trainer controller (Go)
│   ├── initializers/                  # Dataset/model initializers (Python)
│   │   ├── dataset/
│   │   └── model/
│   ├── runtimes/                      # Builtin ML training runtimes
│   │   ├── deepspeed/                   # DeepSpeed runtime
│   │   └── mlx/                         # MLX runtime
│   ├── trainers/                      # Builtin trainers for LLM fine-tuning
│   │   └── torchtune/                   # TorchTune fine-tuning trainer
│   └── data_cache/                    # Distributed data cache service (Rust)
├── docs/                            # Documentation and proposals
├── examples/                        # Examples with TrainJobs
├── hack/                            # Scripts to manage CI/CD and installation
├── manifests/                       # Kustomize manifests for deployment
├── pkg/                             # Core library packages (Go)
│   ├── apis/                          # Kubernetes CRD API definitions
│   │   ├── trainer/v1alpha1/            # TrainJob, TrainingRuntime, and ClusterTrainingRuntime APIs
│   │   └── config/v1alpha1/             # Trainer config APIs
│   ├── config/                        # Trainer config logic
│   ├── controller/                    # Trainer Kubernetes controllers logic
│   ├── runtime/                     # Trainer Extension Framework
│   │   ├── core/                      # Core runtime implementation
│   │   └── framework/                 # Implementation for the framework
│   │       ├── plugins/                 # Implementation for the builtin plugins
│   │       │   ├── torch/                 # PyTorch plugin
│   │       │   ├── mpi/                   # MPI plugin
│   │       │   ├── jobset/                # JobSet plugin
│   │       │   └── ...
│   │       ├── interface.go           # Framework interface definitions
│   │       └── runtime.go             # Implementation of the Info object, which carries information through the plugin chain
│   ├── webhooks/                    # Kubernetes validation/mutation webhooks for Trainer
│   ├── data_cache/                  # Distributed in-memory cache (Rust)
│   ├── initializers/                # Dataset and model initializers (Python)
│   └── util/                        # Utility functions (Go)
└── test/                          # Integration and E2E tests
    ├── integration/                 # Ginkgo integration tests
    └── e2e/                         # End-to-end tests

Environment & Tooling

Go: primary language for controller, APIs, plugins
Python: dataset and model initializers
Rust: data cache
Build: make (orchestration), go build, cargo, docker
Lint/format: golangci-lint, gofmt (Go), ruff (Python), cargo fmt (Rust)
Tests: go test, ginkgo (integration), pytest (Python), cargo test (Rust)
Code generation: controller-gen, openapi-gen
Pre-commit: Config provided and enforced in CI

Commands

Build

Use an available container runtime to build an image. For example:

docker build . -f cmd/trainer-controller-manager/Dockerfile -t trainer-controller-manager:test
docker build . -f cmd/runtimes/deepspeed/Dockerfile -t deepspeed-runtime:test

Testing

make test                     # Go unit tests
make test-integration         # Go integration tests
make test-python              # Python unit tests
make test-python-integration  # Python integration tests
make test-rust                # Rust unit tests
make test-e2e                 # End-to-end tests (requires Kind cluster)

# Targeted tests
go test ./pkg/controller/...                              # Run all controller tests
go test -v -run TestTrainJobController ./pkg/controller/  # Run specific test function

Local lint/format

make fmt                      # Format Go code
make vet                      # Vet the Go code
make golangci-lint            # Verify the Go code

Code generation (always run after modifying the APIs):

make generate                # Generate the required files

Pre-commit:

pre-commit install            # Install hooks
pre-commit run --all-files    # Run all hooks manually

Development Workflow for AI Agents

Preferred commands: Use make targets to ensure consistency with CI

Before making changes:

Read existing code patterns, comments, and tests for alignment
Check the Core Development Principles below
Run the quick start command for validation and testing

Commit/PR hygiene:

Follow Conventional Commits in titles and messages.
See check-pr-title.yaml for PR title conventions.
Include rationale ("why") in commit messages/PR descriptions
Do not push secrets or change git config
Scope discipline: only modify files relevant to the task; keep diffs minimal
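
As a rough illustration of the title format, the sketch below parses the generic Conventional Commits shape `type(scope)!: description`. It is an assumption-laden example: the allowed types and scopes for this repository are defined in check-pr-title.yaml and are not reproduced here, and `parse_title` is a hypothetical helper.

```python
import re
from typing import Optional

# Generic Conventional Commits header; project-specific types/scopes are NOT checked.
TITLE_RE = re.compile(
    r"^(?P<type>[a-z]+)"            # e.g. feat, fix
    r"(?:\((?P<scope>[^()]+)\))?"   # optional "(scope)"
    r"(?P<breaking>!)?"             # optional breaking-change marker
    r": (?P<description>\S.*)$"     # colon, one space, non-empty description
)


def parse_title(title: str) -> Optional[dict]:
    """Return the parsed parts of a title, or None if it does not match."""
    match = TITLE_RE.match(title)
    return match.groupdict() if match else None


# A scoped fix parses into its parts; a free-form title is rejected.
parsed = parse_title("fix(runtime): handle missing plugin")
rejected = parse_title("Update stuff")
```

A check like this only validates the overall shape; the CI workflow remains the source of truth for which types and scopes are accepted.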

Core Development Principles

1. Maintain Stable Public Interfaces ⚠️ CRITICAL

Always preserve API compatibility for released versions. APIs are in alpha and evolving.

API Stability Rules:

CRD schemas (pkg/apis/trainer/v1alpha1): Changes require careful review
- Adding fields: Use +optional marker and provide defaults
- ALWAYS use CEL validation where applicable
- Removing/renaming fields: Requires API version bump and migration plan
- Changing field types: Breaking change, requires deprecation period
Go public APIs: Exported types, functions, interfaces
- Check if exported (capitalized names)
- Look for usage in examples, tests, and documentation
- Use deprecation comments for gradual removal
Plugin interfaces (pkg/runtime/framework/interface.go): Breaking changes affect all plugins

❌ Bad - Breaking Change:

// Changed field name in CRD without migration
type TrainJobSpec struct {
    // Changed from `Suspend` to `Paused`
    Paused *bool `json:"paused,omitempty"`
}

✅ Good - Backward Compatible:

type TrainJobSpec struct {
    // Suspend pauses job execution without deleting resources.
    // Useful for debugging or resource optimization.
    // +optional
    Suspend *bool `json:"suspend,omitempty"`

    // NewFeature enables experimental capability
    // +optional
    NewFeature *bool `json:"newFeature,omitempty"`
}

2. Code Quality Standards

ALWAYS follow the existing patterns in the codebase.

Go Code Standards

❌ Bad:

func p(u, d interface{}) interface{} {
    return u
}

✅ Good:

// ReconcileTrainJob reconciles a TrainJob object
func (r *TrainJobReconciler) ReconcileTrainJob(ctx context.Context, trainJob *trainv1alpha1.TrainJob) error {
    log := ctrl.LoggerFrom(ctx)
    log.V(1).Info("Reconciling TrainJob", "name", trainJob.Name, "namespace", trainJob.Namespace)

    // Implementation...
    return nil
}

Go Style Requirements:

Follow Kubernetes code conventions, Effective Go, and Kubernetes API best practices.
Use structured logging with ctrl.LoggerFrom(ctx) (Zap-based)
Error handling: Always check errors; wrap them with fmt.Errorf and the %w verb
Naming: camelCase for unexported, PascalCase for exported
Package names: Short, lowercase, no underscores

Python Code Standards

❌ Bad - Missing provider pattern:

class CustomModel:  # Not inheriting from ModelProvider(ABC)
    def download(self):
        pass

✅ Good - Following provider pattern:

class HuggingFace(utils.ModelProvider):
    """HuggingFace model initializer."""

    def load_config(self) -> None:
        config_dict = utils.get_config_from_env(types.HuggingFaceModelInitializer)
        self.config = types.HuggingFaceModelInitializer(**config_dict)

    def download_model(self) -> None:
        """Download model from HuggingFace Hub."""
        # Implementation...
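
To show why inheriting from the provider base class matters, here is a minimal, self-contained sketch. The `ModelProvider` class below is an assumed shape inferred from the example above (abstract `load_config` and `download_model` methods), not a copy of the repository's `utils.ModelProvider`. `InMemoryModel` and `IncompleteModel` are hypothetical.

```python
import abc


class ModelProvider(abc.ABC):
    """Assumed shape of the provider base class (illustrative only)."""

    @abc.abstractmethod
    def load_config(self) -> None:
        """Populate self.config from the environment."""

    @abc.abstractmethod
    def download_model(self) -> None:
        """Fetch the model described by self.config."""


class InMemoryModel(ModelProvider):
    """Hypothetical provider that implements the full interface."""

    def load_config(self) -> None:
        self.config = {"storage_uri": "example://model"}

    def download_model(self) -> None:
        self.downloaded = True


class IncompleteModel(ModelProvider):
    """Hypothetical provider that forgets download_model()."""

    def load_config(self) -> None:
        self.config = {}


provider = InMemoryModel()
provider.load_config()
provider.download_model()

# The ABC rejects a subclass that misses an abstract method at instantiation time.
try:
    IncompleteModel()
    instantiated = True
except TypeError:
    instantiated = False
```

A class that does not inherit from the base class gets no such check, which is why the "Bad" example above can silently drift from the expected interface.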

Python Style Requirements:

Line length 100, Python 3.11+, double quotes, spaces for indentation
Imports: isort via ruff; prefer absolute imports
Naming: snake_case for functions/vars, PascalCase for classes, UPPER_SNAKE_CASE for constants
Use descriptive variable names; break up complex functions (>20 lines)
Use logging module (not print statements) for output
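
The logging requirement might look like the following in practice. This is a sketch under assumptions: the logger name `initializer_example` and the `report_download` function are hypothetical, and the format string is only one reasonable choice.

```python
import logging

# Hypothetical logger name; real modules typically use logging.getLogger(__name__).
logger = logging.getLogger("initializer_example")


def report_download(storage_uri: str) -> None:
    """Report progress through the logging module instead of print()."""
    # %-style arguments are formatted only if the record is actually emitted.
    logger.info("Downloading model from %s", storage_uri)


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    report_download("hf://example-org/example-model")
```

Configuring handlers once at the entry point, rather than inside library functions, lets callers and tests control verbosity and output destination.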

Rust Code Standards

Follow Cargo conventions and rustfmt defaults

/// Distributed cache server implementation
pub struct CacheServer {
    config: ServerConfig,
    state: Arc<RwLock<CacheState>>,
}

impl CacheServer {
    /// Create new cache server instance
    pub fn new(config: ServerConfig) -> Result<Self> {
        Ok(Self {
            config,
            state: Arc::new(RwLock::new(CacheState::default())),
        })
    }
}

3. Testing Requirements

Every new feature or bugfix MUST be covered by tests
Every new test MUST follow the existing test structure
Unit tests should live in the same folder as the source code they cover
Integration tests should go to the test/integration/ directory

Go Testing Patterns

File names must have the *_test.go suffix
Use a map of named test cases (table-driven tests) to define test cases
Every new function must have a corresponding test function prefixed with Test
- Example: func RunEnforceMLPolicyPlugins() -> func TestRunEnforceMLPolicyPlugins()
Integration tests use Ginkgo framework

Python Testing Patterns

File names must have the *_test.py suffix
Use pytest with fixtures
Every new function must have a corresponding test function prefixed with test_
- Example: def calculate_total() -> def test_calculate_total()
Use pytest.mark.parametrize with TestCase dataclass for multiple test scenarios:

@pytest.mark.parametrize(
    "test_case",
    [
        TestCase(
            name="valid dataset URI",
            expected_status=SUCCESS,
            config={"uri": "hf://meta-llama/model"},
            expected_output={"scheme": "hf"},
        ),
        TestCase(
            name="invalid URI format",
            expected_status=FAILED,
            config={"uri": "invalid"},
            expected_error=ValueError,
        ),
    ],
)
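def test_parse_dataset_uri(test_case):
    # Failure cases must raise the declared error instead of returning a value.
    if test_case.expected_status == FAILED:
        with pytest.raises(test_case.expected_error):
            parse_dataset_uri(**test_case.config)
        return

    # Success cases compare the parsed result with the expected output.
    result = parse_dataset_uri(**test_case.config)
    assert result == test_case.expected_output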
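
The example above relies on a `TestCase` dataclass and status constants that are not shown. A minimal sketch of what they could look like follows. The field names mirror the example, but the default values, the `SUCCESS`/`FAILED` strings, and the `parse_dataset_uri` body are assumptions made for illustration. The check is written as a plain function so it runs without pytest.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Assumed status constants; the real values may differ.
SUCCESS = "success"
FAILED = "failed"


@dataclass
class TestCase:
    """One named scenario for a parametrized test (assumed shape)."""

    name: str
    expected_status: str = SUCCESS
    config: dict[str, Any] = field(default_factory=dict)
    expected_output: Optional[Any] = None
    expected_error: Optional[type[Exception]] = None

    # Not a dataclass field: tells pytest not to collect this class as a test.
    __test__ = False


def parse_dataset_uri(uri: str) -> dict[str, str]:
    """Hypothetical function under test: extract the scheme from '<scheme>://<path>'."""
    scheme, sep, path = uri.partition("://")
    if not sep or not scheme or not path:
        raise ValueError(f"invalid dataset URI: {uri!r}")
    return {"scheme": scheme}


def check_case(test_case: TestCase) -> None:
    """Body of the parametrized test, usable with or without pytest."""
    if test_case.expected_status == FAILED:
        try:
            parse_dataset_uri(**test_case.config)
        except test_case.expected_error:
            return
        raise AssertionError(f"{test_case.name}: expected {test_case.expected_error}")
    assert parse_dataset_uri(**test_case.config) == test_case.expected_output, test_case.name


CASES = [
    TestCase(
        name="valid dataset URI",
        config={"uri": "hf://meta-llama/model"},
        expected_output={"scheme": "hf"},
    ),
    TestCase(
        name="invalid URI format",
        expected_status=FAILED,
        config={"uri": "invalid"},
        expected_error=ValueError,
    ),
]

for case in CASES:
    check_case(case)
```

In a real test module, `CASES` would be passed to `pytest.mark.parametrize("test_case", CASES)` and the failure branch would typically use `pytest.raises`.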
