| sidebar-title | Local Installation |
|---|---|
| description | Install and run Dynamo on a local machine or VM with containers or PyPI |
This guide walks through installing and running Dynamo on a local machine or VM with one or more GPUs. By the end, you'll have a working OpenAI-compatible endpoint serving a model.
For production multi-node clusters, see the Kubernetes Deployment Guide. To build from source for development, see Building from Source.
| Requirement | Supported |
|---|---|
| GPU | NVIDIA Ampere, Ada Lovelace, Hopper, Blackwell |
| OS | Ubuntu 22.04, Ubuntu 24.04 |
| Architecture | x86_64, ARM64 (ARM64 requires Ubuntu 24.04) |
| CUDA | 12.9+ or 13.0+ (B300/GB300 require CUDA 13) |
| Python | 3.10, 3.12 |
| Driver | 575.51.03+ (CUDA 12) or 580.00.03+ (CUDA 13) |
TensorRT-LLM does not support Python 3.11.
For the full compatibility matrix including backend framework versions, see the Support Matrix.
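If you want to check programmatically whether an installed driver meets these minimums, dotted version strings can be compared numerically. A minimal sketch (the `driver_ok` helper and the `MIN_DRIVER` table are illustrative, not part of Dynamo; the thresholds mirror the requirements table above):

```python
def parse_version(v: str) -> tuple[int, ...]:
    """Turn a dotted driver version like '575.51.03' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

# Minimum driver versions from the requirements table above, keyed by CUDA major version.
MIN_DRIVER = {12: "575.51.03", 13: "580.00.03"}

def driver_ok(installed: str, cuda_major: int) -> bool:
    """Check an installed driver version against the minimum for a CUDA major version."""
    return parse_version(installed) >= parse_version(MIN_DRIVER[cuda_major])
```

For example, `driver_ok("580.65.06", 13)` tells you whether that driver is new enough for CUDA 13. The installed version itself comes from `nvidia-smi` (see Troubleshooting below).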
Containers have all dependencies pre-installed. No setup required.
```bash
# SGLang
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0

# TensorRT-LLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.0

# vLLM
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0
```

To run the frontend and worker in the same container, either:

- run processes in the background with `&` (see the Run Dynamo section below), or
- open a second terminal and attach with `docker exec -it <container_id> bash`.

See Release Artifacts for available versions, and the backend guides for run instructions: SGLang | TensorRT-LLM | vLLM
```bash
# Install uv (recommended Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment
uv venv venv
source venv/bin/activate
uv pip install pip
```

Install system dependencies and the Dynamo wheel for your chosen backend:
SGLang

```bash
sudo apt install python3-dev
uv pip install --prerelease=allow "ai-dynamo[sglang]"
```

For CUDA 13 (B300/GB300), the container is recommended. See the SGLang install docs for details.
TensorRT-LLM

```bash
sudo apt install python3-dev
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
pip install --pre --extra-index-url https://pypi.nvidia.com "ai-dynamo[trtllm]"
```

TensorRT-LLM requires pip due to a transitive Git URL dependency that uv doesn't resolve. We recommend using the TensorRT-LLM container for broader compatibility. See the TRT-LLM backend guide for details.
vLLM

```bash
sudo apt install python3-dev libxcb1
uv pip install --prerelease=allow "ai-dynamo[vllm]"
```

Dynamo components discover each other through a shared backend. Two options are available:
| Backend | When to Use | Setup |
|---|---|---|
| File | Single machine, local development | No setup -- pass `--discovery-backend file` to all components |
| etcd | Multi-node, production | Requires a running etcd instance (the default if no flag is specified) |

This guide uses `--discovery-backend file`. For etcd setup, see Service Discovery.
Verify the CLI is installed and callable:

```bash
python3 -m dynamo.frontend --help
```

If you cloned the repository, you can run additional system checks:

```bash
python3 deploy/sanity_check.py
```

Start the OpenAI-compatible frontend (default port is 8000):

```bash
python3 -m dynamo.frontend --discovery-backend file
```

To run everything in a single terminal (useful in containers), append `> logfile.log 2>&1 &` to run processes in the background:

```bash
python3 -m dynamo.frontend --discovery-backend file > dynamo.frontend.log 2>&1 &
```

In another terminal (or the same terminal if using background mode), start a worker for your chosen backend:
SGLang

```bash
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file
```

TensorRT-LLM

```bash
python3 -m dynamo.trtllm --model-path Qwen/Qwen3-0.6B --discovery-backend file
```

The warning `Cannot connect to ModelExpress server/transport error. Using direct download.` is expected in local deployments and can be safely ignored.

vLLM

```bash
python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file \
    --kv-events-config '{"enable_kv_cache_events": false}'
```

For dependency-free local development, disable KV event publishing (this avoids needing NATS):
- vLLM: add `--kv-events-config '{"enable_kv_cache_events": false}'`
- SGLang: no flag needed (KV events disabled by default)
- TensorRT-LLM: no flag needed (KV events disabled by default)

For SGLang and TensorRT-LLM, pass `--kv-events-config` explicitly only when you want KV event publishing enabled.
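Before sending test requests, you can confirm the frontend is actually listening on its port. A minimal sketch (the `port_open` and `wait_for_port` helpers are illustrative, not part of Dynamo):

```python
import socket
import time

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_for_port(host: str, port: int, retries: int = 30, delay: float = 1.0) -> bool:
    """Poll until the port accepts connections or retries are exhausted."""
    for _ in range(retries):
        if port_open(host, port):
            return True
        time.sleep(delay)
    return False

# Example: the frontend listens on port 8000 by default.
# wait_for_port("localhost", 8000)
```

Once the port is up, exercise the endpoint: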
```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 50}'
```

CUDA/driver version mismatch
Run `nvidia-smi` to check your driver version. Dynamo requires driver 575.51.03+ for CUDA 12 or 580.00.03+ for CUDA 13. B300/GB300 GPUs require CUDA 13. See the Support Matrix for full requirements.
Model doesn't fit on GPU (OOM)
The default model Qwen/Qwen3-0.6B requires ~2GB of GPU memory. Larger models need more VRAM:
| Model Size | Approximate VRAM |
|---|---|
| 7B | 14-16 GB |
| 13B | 26-28 GB |
| 70B | 140+ GB (multi-GPU) |
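The rough rule behind this table is about 2 bytes per parameter for FP16/BF16 weights, plus some headroom for the KV cache and activations. A back-of-the-envelope sketch (the `estimate_vram_gb` helper and its 10% overhead factor are assumptions for illustration, not a Dynamo formula):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: int = 2,
                     overhead: float = 0.1) -> float:
    """Rough serving VRAM estimate: weight bytes plus a fractional
    overhead for KV cache and activations (assumed, not exact)."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes/GB
    return weights_gb * (1 + overhead)

for size in (0.6, 7, 13, 70):
    print(f"{size}B params -> ~{estimate_vram_gb(size):.1f} GB")
```

Real usage depends on context length, batch size, and quantization, so treat these numbers as a lower bound when sizing hardware.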
Start with a small model and scale up based on your hardware.
Python 3.11 with TensorRT-LLM
TensorRT-LLM does not support Python 3.11. If you see installation failures with TensorRT-LLM, check your Python version with `python3 --version`. Use Python 3.10 or 3.12 instead.
Container runs but GPU not detected
Ensure you passed `--gpus all` to `docker run`. Without this flag, the container won't have access to GPUs:

```bash
# Correct
docker run --gpus all --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0

# Wrong -- no GPU access
docker run --network host --rm -it nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.0
```

- Backend Guides -- Backend-specific configuration and features
- Disaggregated Serving -- Scale prefill and decode independently
- KV Cache Aware Routing -- Smart request routing
- Kubernetes Deployment -- Production multi-node deployments