This repo hosts a Kubernetes operator responsible for creating and managing Llama Stack servers. Key features:
- Automated deployment of Llama Stack servers
- Support for multiple distributions (includes Ollama, vLLM, and others)
- Customizable server configurations
- Volume management for model storage
- Kubernetes-native resource management
You can install the operator directly from a released version or from the latest `main` branch using `kubectl apply -f`.

To install the latest version from the `main` branch:

```bash
kubectl apply -f https://raw.githubusercontent.com/llamastack/llama-stack-k8s-operator/main/release/operator.yaml
```

To install a specific released version (e.g., v1.0.0), replace `main` with the desired tag:

```bash
kubectl apply -f https://raw.githubusercontent.com/llamastack/llama-stack-k8s-operator/v1.0.0/release/operator.yaml
```

- Deploy the inference provider server (ollama, vllm).
  Ollama examples:

  Deploy Ollama with the default model (llama3.2:1b):

  ```bash
  ./hack/deploy-quickstart.sh
  ```

  Deploy Ollama with another model:

  ```bash
  ./hack/deploy-quickstart.sh --provider ollama --model llama3.2:7b
  ```

  vLLM examples:
  This requires a secret named `hf-token-secret` containing a HuggingFace token (needed for downloading models) to be created in advance in the `vllm-dist` namespace.
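  For example, the secret could be created from a manifest like this (the `HF_TOKEN` key name is an assumption; check the quickstart script for the exact key it expects):

  ```yaml
  apiVersion: v1
  kind: Secret
  metadata:
    name: hf-token-secret
    namespace: vllm-dist
  type: Opaque
  stringData:
    HF_TOKEN: <your HuggingFace token>  # assumed key name
  ```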
  Deploy vLLM with the default model (meta-llama/Llama-3.2-1B):

  ```bash
  ./hack/deploy-quickstart.sh --provider vllm
  ```

  Deploy vLLM with GPU support:

  ```bash
  ./hack/deploy-quickstart.sh --provider vllm --runtime-env "VLLM_TARGET_DEVICE=gpu,CUDA_VISIBLE_DEVICES=0"
  ```

- Create a LlamaStackDistribution CR to get the server running. Example:
  ```yaml
  apiVersion: llamastack.io/v1alpha1
  kind: LlamaStackDistribution
  metadata:
    name: llamastackdistribution-sample
  spec:
    replicas: 1
    server:
      distribution:
        name: starter
      containerSpec:
        env:
          - name: OLLAMA_INFERENCE_MODEL
            value: "llama3.2:1b"
          - name: OLLAMA_URL
            value: "http://ollama-server-service.ollama-dist.svc.cluster.local:11434"
      storage:
        size: "20Gi"
        mountPath: "/home/lls/.lls"
  ```
- Verify the server pod is running in the user-defined namespace.
A ConfigMap can be used to store the config.yaml configuration for each LlamaStackDistribution. Updates to the ConfigMap restart the Pod so it loads the new data.

Example: create a config.yaml ConfigMap and a LlamaStackDistribution that references it:

```bash
kubectl apply -f config/samples/example-with-configmap.yaml
```
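As a sketch of the shapes involved (the ConfigMap name here is illustrative, and the `userConfig.configMapName` reference field is an assumption, so treat the applied sample as authoritative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-stack-config              # illustrative name
data:
  config.yaml: |
    # Llama Stack server configuration goes here
---
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: llamastackdistribution-sample
spec:
  server:
    distribution:
      name: starter
    userConfig:
      configMapName: llama-stack-config  # assumed field name; check the sample for the exact schema
```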
The operator can create an ingress-only NetworkPolicy for each LlamaStackDistribution. By default, traffic is limited to:
- Pods with the label `app.kubernetes.io/part-of: llama-stack` in the same namespace
- The operator namespace (`llama-stack-k8s-operator-system`)
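The default behavior corresponds roughly to a NetworkPolicy like the following (a hedged sketch; the exact name, selectors, and ports are managed by the operator and may differ):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: llama-stack-ingress            # illustrative name
spec:
  podSelector: {}                      # the operator scopes this to the distribution's pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Pods labeled as part of llama-stack in the same namespace
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: llama-stack
        # The operator's own namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: llama-stack-k8s-operator-system
```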
Network policies are disabled by default. Enable via ConfigMap:
```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-stack-operator-config
  namespace: llama-stack-k8s-operator-system
data:
  featureFlags: |
    enableNetworkPolicy:
      enabled: true
EOF
```

Use `spec.network` to customize access controls:
```yaml
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: my-llsd
spec:
  server:
    distribution:
      name: starter
  network:
    exposeRoute: false # Set true to create an Ingress for external access
    allowedFrom:
      namespaces: # Explicit namespace names
        - my-app-namespace
        - monitoring
      labels: # Namespaces matching these label keys
        - team=frontend
```

| Field | Description |
|---|---|
| `network.exposeRoute` | When true, creates an Ingress for external access (default: false) |
| `network.allowedFrom.namespaces` | List of namespace names allowed to access the service. Use `"*"` to allow all namespaces |
| `network.allowedFrom.labels` | List of namespace label keys. Namespaces with these labels are allowed |
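For example, to allow access from every namespace, the wildcard from the table can be used (a fragment of the `spec` shown above):

```yaml
network:
  allowedFrom:
    namespaces:
      - "*" # any namespace may reach the service
```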
Set `enabled: false` in the ConfigMap to disable the feature; the operator will then delete the managed policies.
The operator supports ConfigMap-driven image updates for LLS Distribution images. This allows independent patching for security fixes or bug fixes without requiring a new operator version.
Create or update the operator ConfigMap with an `image-overrides` key:

```yaml
image-overrides: |
  starter-gpu: quay.io/custom/llama-stack:starter-gpu
  starter: quay.io/custom/llama-stack:starter
```

Use the distribution name directly as the key (e.g., `starter-gpu`, `starter`). The operator applies these overrides automatically.
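Putting this together, the operator ConfigMap (using the name and namespace shown earlier in this document) might look like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-stack-operator-config
  namespace: llama-stack-k8s-operator-system
data:
  image-overrides: |
    starter-gpu: quay.io/custom/llama-stack:starter-gpu
    starter: quay.io/custom/llama-stack:starter
```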
To update the LLS Distribution image for all starter distributions:

```bash
kubectl patch configmap llama-stack-operator-config -n llama-stack-k8s-operator-system \
  --type merge -p '{"data":{"image-overrides":"starter: quay.io/opendatahub/llama-stack:latest"}}'
```

This causes all LlamaStackDistribution resources using the starter distribution to restart with the new image.
- Kubernetes cluster (v1.20 or later)
- Go version go1.24
- operator-sdk v1.39.2 (v4 layout) or newer
- kubectl configured to access your cluster
- A running inference server. For local development, you can use the provided script:

  ```bash
  ./hack/deploy-quickstart.sh
  ```
- Prepare release files with specific versions:

  ```bash
  make release VERSION=0.2.1 LLAMASTACK_VERSION=0.2.12
  ```

  This command updates distribution configurations and generates release manifests with the specified versions.
- A custom operator image can be built from your local repository:

  ```bash
  make image IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  ```

  The default image `quay.io/llamastack/llama-stack-k8s-operator:latest` is used when no argument is supplied to `make image`. You can create a local file `local.mk` with env variables to override the default values set in the `Makefile`.
- Building multi-architecture images (ARM64, AMD64, etc.)

  The operator supports building for multiple architectures, including ARM64. To build and push multi-arch images:

  ```bash
  make image-buildx IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  ```

  By default, this builds for `linux/amd64,linux/arm64`. You can customize the platforms by setting the `PLATFORMS` variable:

  ```bash
  # Build for specific platforms
  make image-buildx IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag> PLATFORMS=linux/amd64,linux/arm64

  # Add more architectures (e.g., for future support)
  make image-buildx IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag> PLATFORMS=linux/amd64,linux/arm64,linux/s390x,linux/ppc64le
  ```

  Note:
  - The `image-buildx` target works with both Docker and Podman. It will automatically detect which tool is being used.
  - Native builds in CI: CI workflows use a matrix strategy with native runners for each architecture (AMD64 and ARM64). Each architecture is built on its own runner, avoiding QEMU emulation entirely. Per-architecture images are pushed separately, then combined into a single multi-arch manifest list. This ensures `CGO_ENABLED=1` with full OpenSSL FIPS support for all architectures.
  - Local cross-compilation: For local development, the Dockerfile uses `--platform=$BUILDPLATFORM` to run Go compilation natively on the build host. When cross-compiling (e.g., building ARM64 on an AMD64 host), `CGO_ENABLED=0` is used with pure Go FIPS (via `GOEXPERIMENT=strictfipsruntime`). Native local builds use `CGO_ENABLED=1` with full OpenSSL FIPS support.
  - FIPS adherence: All CI-produced images use `CGO_ENABLED=1` with full OpenSSL FIPS support via native builds on architecture-matched runners.
  - For Docker: Multi-arch builds require Docker Buildx. Ensure Docker Buildx is set up:

    ```bash
    docker buildx create --name x-builder --use
    ```

  - For Podman: Podman 4.0+ supports `podman buildx` (experimental). If buildx is unavailable, the Makefile will automatically fall back to Podman's native manifest-based multi-arch build approach.
  - The resulting images are multi-arch manifest lists, which means Kubernetes will automatically select the correct architecture when pulling the image.
  CI Build Targets:

  The CI workflows use the following Makefile targets for the matrix-based build strategy:

  ```bash
  # Build and push a single-arch image (used by each matrix job on its native runner)
  make image-build-push-single PLATFORM=linux/amd64 IMG=quay.io/<username>/llama-stack-k8s-operator:<tag>-amd64

  # Create a multi-arch manifest from per-arch images (used by the final manifest job)
  make image-create-manifest IMG=quay.io/<username>/llama-stack-k8s-operator:<tag> \
    ARCH_IMGS="quay.io/<username>/llama-stack-k8s-operator:<tag>-amd64 quay.io/<username>/llama-stack-k8s-operator:<tag>-arm64"
  ```
- Building ARM64-only images

  To build a single ARM64 image (useful for testing or ARM-native systems):

  ```bash
  make image-build-arm IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  make image-push IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  ```

  This works with both Docker and Podman.
Once the image is created, the operator can be deployed directly. For each deployment method, a kubeconfig should be exported:

```bash
export KUBECONFIG=<path to kubeconfig>
```
Deploying the operator locally:

- Deploy the created image in your cluster using the following command:

  ```bash
  make deploy IMG=quay.io/<username>/llama-stack-k8s-operator:<custom-tag>
  ```

- To remove resources created during installation, use:

  ```bash
  make undeploy
  ```
The operator includes end-to-end (E2E) tests that verify its complete functionality. To run the E2E tests:
- Ensure you have a running Kubernetes cluster
- Run the E2E tests using one of the following commands:
  - If you want to deploy the operator and run tests:

    ```bash
    make deploy test-e2e
    ```

  - If the operator is already deployed:

    ```bash
    make test-e2e
    ```

The make target will handle prerequisites, including deploying the Ollama server.
Please refer to the API documentation.