Who This Is For

  • AI agents: Automate repository tasks with minimal context
  • Contributors: Humans using AI assistants or working directly
  • Maintainers: Ensure assistants follow project conventions and CI rules

Agent Behavior Policy

AI agents should:

  • Make atomic, minimal, and reversible changes.
  • Prefer local analysis (e.g. make generate, make fmt, make test, make test-python) before proposing commits.
  • NEVER modify configuration, CI/CD, or release automation unless explicitly requested.
  • Scan generated code for vulnerabilities and for dependencies that need upgrading.
  • Avoid non-deterministic code or random seeds without fixtures.
  • Use AGENTS.md and Makefile as the source of truth for development commands.
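
A minimal sketch of the determinism guideline above, assuming Go and the standard math/rand package. The generator is passed in as a parameter, so a test can supply a fixed seed as its fixture. The helpers newSeededRNG, sampleIndices, and equalInts are hypothetical and not part of this repository.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newSeededRNG returns a generator with a fixed seed (hypothetical helper).
func newSeededRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// sampleIndices picks k distinct indices from [0, n). It assumes 0 <= k <= n.
// Taking the generator as a parameter keeps the result reproducible under test.
func sampleIndices(rng *rand.Rand, n, k int) []int {
	return rng.Perm(n)[:k]
}

// equalInts reports whether two slices hold the same values in the same order.
func equalInts(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	first := sampleIndices(newSeededRNG(42), 10, 3)
	second := sampleIndices(newSeededRNG(42), 10, 3)
	// The same seed yields the same sample on every run.
	fmt.Println(equalInts(first, second))
}
```

Code that seeds from the clock or uses the global generator directly cannot be pinned down this way, which is why the seed belongs in a fixture.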

Agents must NOT:

  • Bypass tests or linters
  • Introduce dependencies without updating go.mod (Go), pyproject.toml (Python), or Cargo.toml (Rust)
  • Generate or commit large autogenerated files
  • Modify CRD schemas or API versions without explicit instruction

Context Awareness

Before writing code, agents should:

  • Read existing test cases and docstrings for pattern alignment
  • Match import patterns from neighboring files
  • Preserve existing logging and error-handling conventions
  • Understand the plugin architecture and extension framework before modifying runtime code
  • Review CRD schemas in pkg/apis/ before changing API structures
  • Call out any breaking changes they introduce and follow the deprecation policy

For additional context see the Kubeflow Trainer docs.

Repository Map

kubeflow/trainer/
├── .github/                         # GitHub actions for CI/CD
├── charts/                          # Helm charts for deployment
├── cmd/                             # Command-line applications and binaries
│   ├── trainer-controller-manager/    # Main Trainer controller (Go)
│   ├── initializers/                  # Dataset/model initializers (Python)
│   │   ├── dataset/
│   │   └── model/
│   ├── runtimes/                      # Builtin ML training runtimes
│   │   ├── deepspeed/                   # DeepSpeed runtime
│   │   └── mlx/                         # MLX runtime
│   ├── trainers/                      # Builtin trainers for LLM fine-tuning
│   │   └── torchtune/                   # TorchTune fine-tuning trainer
│   └── data_cache/                    # Distributed data cache service (Rust)
├── docs/                            # Documentation and proposals
├── examples/                        # Examples with TrainJobs
├── hack/                            # Scripts to manage CI/CD and installation
├── manifests/                       # Kustomize manifests for deployment
├── pkg/                             # Core library packages (Go)
│   ├── apis/                          # Kubernetes CRD API definitions
│   │   ├── trainer/v1alpha1/            # TrainJob, TrainingRuntime, and ClusterTrainingRuntime APIs
│   │   └── config/v1alpha1/             # Trainer config APIs
│   ├── config/                        # Trainer config logic
│   ├── controller/                    # Trainer Kubernetes controllers logic
│   ├── runtime/                     # Trainer Extension Framework
│   │   ├── core/                      # Core runtime implementation
│   │   └── framework/                 # Implementation for the framework
│   │       ├── plugins/                 # Implementation for the builtin plugins
│   │       │   ├── torch/                 # PyTorch plugin
│   │       │   ├── mpi/                   # MPI plugin
│   │       │   ├── jobset/                # JobSet plugin
│   │       │   └── ...
│   │       ├── interface.go           # Framework interface definitions
│   │       └── runtime.go             # Implementation of the Info object, which carries information through the plugin chain.
│   ├── webhooks/                    # Kubernetes validation/mutation webhooks for Trainer
│   ├── data_cache/                  # Distributed in-memory cache (Rust)
│   ├── initializers/                # Dataset and model initializers (Python)
│   └── util/                        # Utility functions (Go)
└── test/                          # Integration and E2E tests
    ├── integration/                 # Ginkgo integration tests
    └── e2e/                         # End-to-end tests

Environment & Tooling

  • Go: primary language for controller, APIs, plugins
  • Python: dataset and model initializers
  • Rust: data cache
  • Build: make (orchestration), go build, cargo, docker
  • Lint/format: golangci-lint, gofmt (Go), ruff (Python), cargo fmt (Rust)
  • Tests: go test, ginkgo (integration), pytest (Python), cargo test (Rust)
  • Code generation: controller-gen, openapi-gen
  • Pre-commit: Config provided and enforced in CI

Commands

Build

Use any available container runtime to build an image. For example, with Docker:

docker build . -f cmd/trainer-controller-manager/Dockerfile -t trainer-controller-manager:test
docker build . -f cmd/runtimes/deepspeed/Dockerfile -t deepspeed-runtime:test

Testing

make test                     # Go unit tests
make test-integration         # Go integration tests
make test-python              # Python unit tests
make test-python-integration  # Python integration tests
make test-rust                # Rust unit tests
make test-e2e                 # End-to-end tests (requires Kind cluster)

# Targeted tests
go test ./pkg/controller/...                              # Run all controller tests
go test -v -run TestTrainJobController ./pkg/controller/  # Run specific test function

Local lint/format

make fmt                      # Format Go code
make vet                      # Vet the Go code
make golangci-lint            # Verify the Go code

Code generation (always run after modifying the APIs):

make generate                # Generate the required files

Pre-commit:

pre-commit install            # Install hooks
pre-commit run --all-files    # Run all hooks manually

Development Workflow for AI Agents

Preferred commands: Use make targets to ensure consistency with CI

Before making changes:

  1. Read existing code patterns, comments, and tests for alignment
  2. Check the Core Development Principles below
  3. Run the quick-start commands for validation and testing

Commit/PR hygiene:

  • Follow Conventional Commits in titles and messages.
  • See check-pr-title.yaml for PR title conventions.
  • Include rationale ("why") in commit messages/PR descriptions
  • Do not push secrets or change git config
  • Scope discipline: only modify files relevant to the task; keep diffs minimal

Core Development Principles

1. Maintain Stable Public Interfaces ⚠️ CRITICAL

Always preserve API compatibility for released versions. APIs are in alpha and evolving.

API Stability Rules:

  • CRD schemas (pkg/apis/trainer/v1alpha1): Changes require careful review
    • Adding fields: Use +optional marker and provide defaults
    • ALWAYS use CEL validation where applicable
    • Removing/renaming fields: Requires API version bump and migration plan
    • Changing field types: Breaking change, requires deprecation period
  • Go public APIs: Exported types, functions, interfaces
    • Check if exported (capitalized names)
    • Look for usage in examples, tests, and documentation
    • Use deprecation comments for gradual removal
  • Plugin interfaces (pkg/runtime/framework/interface.go): Breaking changes affect all plugins

Bad - Breaking Change:

// Changed field name in CRD without migration
type TrainJobSpec struct {
    // Changed from `Suspend` to `Paused`
    Paused *bool `json:"paused,omitempty"`
}

Good - Backward Compatible:

type TrainJobSpec struct {
    // Suspend pauses job execution without deleting resources.
    // Useful for debugging or resource optimization.
    // +optional
    Suspend *bool `json:"suspend,omitempty"`

    // NewFeature enables experimental capability
    // +optional
    NewFeature *bool `json:"newFeature,omitempty"`
}
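
The "deprecation comments" rule for Go public APIs can be sketched as follows. This assumes the standard Go convention of a doc-comment paragraph starting with "Deprecated:", which common Go tooling recognizes. BuildJobName and JobName are hypothetical names used only for illustration, not functions from this repository.

```go
package main

import "fmt"

// BuildJobName joins a prefix and a name. Hypothetical replacement API.
func BuildJobName(prefix, name string) string {
	return prefix + "-" + name
}

// JobName returns the name with the default "trainjob" prefix.
//
// Deprecated: use BuildJobName, which takes the prefix explicitly.
func JobName(name string) string {
	// Delegate to the replacement so both entry points stay consistent.
	return BuildJobName("trainjob", name)
}

func main() {
	fmt.Println(JobName("demo")) // prints "trainjob-demo"
}
```

Keeping the old function as a thin wrapper lets existing callers compile unchanged while the comment steers new code to the replacement.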

2. Code Quality Standards

ALWAYS follow the existing patterns in the codebase.

Go Code Standards

Bad:

func p(u, d interface{}) interface{} {
    return u
}

Good:

// ReconcileTrainJob reconciles a TrainJob object
func (r *TrainJobReconciler) ReconcileTrainJob(ctx context.Context, trainJob *trainv1alpha1.TrainJob) error {
    log := ctrl.LoggerFrom(ctx)
    log.V(1).Info("Reconciling TrainJob", "name", trainJob.Name, "namespace", trainJob.Namespace)

    // Implementation...
    return nil
}

Go Style Requirements:

  • Follow Kubernetes code conventions, Effective Go, and Kubernetes API best practices.
  • Use structured logging with ctrl.LoggerFrom(ctx) (Zap-based)
  • Error handling: Always check errors, use fmt.Errorf for wrapping
  • Naming: camelCase for unexported, PascalCase for exported
  • Package names: Short, lowercase, no underscores
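
A minimal sketch of the error-handling rule, assuming wrapping is done with the %w verb so callers can still match the cause with errors.Is. ErrRuntimeNotFound, lookupRuntime, resolveRuntime, and isNotFound are hypothetical names, not identifiers from this repository.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRuntimeNotFound is a hypothetical sentinel error.
var ErrRuntimeNotFound = errors.New("runtime not found")

// lookupRuntime is a hypothetical helper that fails for unknown names.
func lookupRuntime(name string) error {
	if name != "torch-distributed" {
		return ErrRuntimeNotFound
	}
	return nil
}

// resolveRuntime checks the error and wraps it with context using %w.
func resolveRuntime(name string) error {
	if err := lookupRuntime(name); err != nil {
		return fmt.Errorf("resolving runtime %q: %w", name, err)
	}
	return nil
}

// isNotFound reports whether err wraps ErrRuntimeNotFound anywhere in its chain.
func isNotFound(err error) bool {
	return errors.Is(err, ErrRuntimeNotFound)
}

func main() {
	err := resolveRuntime("unknown")
	fmt.Println(err)             // prints: resolving runtime "unknown": runtime not found
	fmt.Println(isNotFound(err)) // prints: true
}
```

Using %v instead of %w would keep the message but drop the link to the original error, so errors.Is would no longer match.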

Python Code Standards

Bad - Missing provider pattern:

class CustomModel:  # Not inheriting from ModelProvider(ABC)
    def download(self):
        pass

Good - Following provider pattern:

class HuggingFace(utils.ModelProvider):
    """HuggingFace model initializer."""

    def load_config(self) -> None:
        config_dict = utils.get_config_from_env(types.HuggingFaceModelInitializer)
        self.config = types.HuggingFaceModelInitializer(**config_dict)

    def download_model(self) -> None:
        """Download model from HuggingFace Hub."""
        # Implementation...

Python Style Requirements:

  • Line length 100, Python 3.11+, double quotes, spaces indent
  • Imports: isort via ruff; prefer absolute imports
  • Naming: snake_case for functions/vars, PascalCase for classes, UPPER_SNAKE_CASE for constants
  • Use descriptive variable names; break up complex functions (>20 lines)
  • Use logging module (not print statements) for output

Rust Code Standards

  • Follow Cargo conventions and rustfmt defaults

/// Distributed cache server implementation
pub struct CacheServer {
    config: ServerConfig,
    state: Arc<RwLock<CacheState>>,
}

impl CacheServer {
    /// Create new cache server instance
    pub fn new(config: ServerConfig) -> Result<Self> {
        Ok(Self {
            config,
            state: Arc::new(RwLock::new(CacheState::default())),
        })
    }
}

3. Testing Requirements

  • Every new feature or bugfix MUST be covered by tests
  • Every new test MUST follow the existing test structure
  • Unit tests should live in the same folder as the source code
  • Integration tests should go to the test/integration/ directory

Go Testing Patterns

  • File names must end with the _test.go suffix
  • Use a map to define table-driven test cases
  • Every new function must have a corresponding test function prefixed with Test
    • Example: func RunEnforceMLPolicyPlugins() -> func TestRunEnforceMLPolicyPlugins()
  • Integration tests use Ginkgo framework
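
The unit-test rules above can be sketched as a table-driven test whose cases live in a map keyed by case name. clampNumNodes is a hypothetical function invented for this example. The function, its test, and a main are kept in one file only for brevity; in the repository the test would live in a separate *_test.go file next to the source.

```go
package main

import (
	"fmt"
	"testing"
)

// clampNumNodes is a hypothetical function under test: values below 1 become 1.
func clampNumNodes(n int32) int32 {
	if n < 1 {
		return 1
	}
	return n
}

// TestClampNumNodes follows the naming rule: Test plus the function name.
func TestClampNumNodes(t *testing.T) {
	cases := map[string]struct {
		input int32
		want  int32
	}{
		"negative value is clamped to one": {input: -3, want: 1},
		"zero is clamped to one":           {input: 0, want: 1},
		"positive value is kept":           {input: 4, want: 4},
	}
	for name, tc := range cases {
		// Each map entry runs as its own named subtest.
		t.Run(name, func(t *testing.T) {
			if got := clampNumNodes(tc.input); got != tc.want {
				t.Errorf("clampNumNodes(%d) = %d, want %d", tc.input, got, tc.want)
			}
		})
	}
}

func main() {
	// Present only so the single-file sketch also runs with `go run`.
	fmt.Println(clampNumNodes(0)) // prints 1
}
```

Naming each case in the map key makes a failing subtest identifiable from the test output without reading the table.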

Python Testing Patterns

  • File names must end with the _test.py suffix
  • Use pytest with fixtures
  • Every new function must have a corresponding test function prefixed with test_
    • Example: def calculate_total() -> def test_calculate_total()
  • Use pytest.mark.parametrize with a TestCase dataclass for multiple test scenarios:

@pytest.mark.parametrize(
    "test_case",
    [
        TestCase(
            name="valid dataset URI",
            expected_status=SUCCESS,
            config={"uri": "hf://meta-llama/model"},
            expected_output={"scheme": "hf"},
        ),
        TestCase(
            name="invalid URI format",
            expected_status=FAILED,
            config={"uri": "invalid"},
            expected_error=ValueError,
        ),
    ],
)
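def test_parse_dataset_uri(test_case):
    # Failure cases must raise the expected error instead of returning a value.
    if test_case.expected_status == FAILED:
        with pytest.raises(test_case.expected_error):
            parse_dataset_uri(**test_case.config)
        return

    # Success cases compare the parsed result with the expected output.
    result = parse_dataset_uri(**test_case.config)
    assert result == test_case.expected_output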