Description
Describe the Bug
The nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1 image exhibits two critical issues when deployed on ARM64 systems with NVIDIA A100 GPUs:
- Broken Dependency Path: The image fails to load CuPy by default, falling back to CPU-based operations unless manually re-installed/configured.
- Missing Kernel Images: Even when CuPy is functional, the engine crashes with `cudaErrorNoKernelImageForDevice`. This occurs in both the default "Graph" mode and "Eager" mode, indicating that the core vLLM/PyTorch binaries lack `sm_80` support for the `aarch64` platform.
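The missing-image hypothesis can be checked from inside the container. Below is a minimal sketch; the helper is illustrative and not part of vLLM or Dynamo. On a live system you would pass it the values of `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`:

```python
def has_kernel_image(arch_list, major, minor):
    """Return True if arch_list (the format used by torch.cuda.get_arch_list(),
    e.g. ['sm_70', 'sm_80', 'compute_90']) contains a compiled SASS image for
    compute capability (major, minor). 'compute_XY' entries are PTX-only."""
    return f"sm_{major}{minor}" in arch_list

# An A100 is compute capability 8.0, so it needs sm_80:
print(has_kernel_image(["sm_90", "compute_90"], 8, 0))      # False
print(has_kernel_image(["sm_70", "sm_80", "sm_90"], 8, 0))  # True
```

On a working build, `has_kernel_image(torch.cuda.get_arch_list(), *torch.cuda.get_device_capability())` should return `True`; the behavior reported here suggests the `aarch64` image would return `False` for an A100.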
Steps to Reproduce
- Launch the `vllm-runtime:1.0.1` container on an ARM64 A100 host.
- Run the worker: `python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file` and observe the CuPy load failure in the logs.
- Manually install `cupy-cuda12x`.
- Run the worker again: `python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file` and observe the crash during CUDA Graph capture.
- Re-run with `--enforce-eager` and observe the crash during the `linear` layer execution in the profile run.
Expected Behavior
The runtime should include pre-compiled CUDA kernels for the major NVIDIA architectures (`sm_70`, `sm_80`, `sm_90`) in the `aarch64` build of the image.
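For builds from source, PyTorch and vLLM read the `TORCH_CUDA_ARCH_LIST` environment variable at compile time. A sketch of a multi-arch setting that would cover the architectures above (whether the official image build consumes this variable directly is an assumption):

```shell
# Build-time sketch: emit SASS for Volta/Ampere/Hopper, plus PTX for 9.0
# as a forward-compatibility fallback on newer GPUs.
export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0+PTX"
```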
Actual Behavior
Issue 1: CuPy Initialization Failure
On initial launch, the runtime logs a failure to load CuPy, which Dynamo's `nixl_connect` requires for GPU acceleration:

```
[2026-03-24 05:30:31] WARNING __init__.py:58: dynamo.nixl_connect: Failed to load CuPy for GPU acceleration, utilizing numpy to provide CPU based operations.
```
Observation: Manually installing `cupy-cuda12x` resolved this specific warning, but triggered a cascade of NumPy dependency conflicts (CuPy pulling in NumPy 2.x) that are incompatible with the `aiconfigurator` and `scipy` versions pinned in the image.
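The fallback the warning describes amounts to a guarded import. A stdlib-only sketch of that selection logic (illustrative, not Dynamo's actual code):

```python
import importlib.util

def pick_array_backend(candidates=("cupy", "numpy")):
    """Return the first installed backend name, mirroring the
    CuPy -> NumPy fallback reported in the nixl_connect warning."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None  # neither backend available
```

Note that `find_spec` only checks that the package is installed, so a CuPy wheel whose NumPy requirement conflicts with the image's pins can still fail later at import or call time, which matches the dependency cascade described above.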
Issue 2: CUDA "No Kernel Image" (Fatal)
The engine fails to execute any GPU kernels. The crash occurs at different stages depending on the mode:
Scenario A: Default Mode (CUDA Graphs Enabled)
Crashes during the warmup/capture phase inside `flash_attn`.
- Error: `torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device`
- Traceback location: `vllm/v1/attention/backends/flash_attn.py`, line 634, in `forward` (`return output.fill_(0)`)
Scenario B: Eager Mode (--enforce-eager)
Crashes during the initial model profile run, even without graph capture.
- Error: `torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device`
- Traceback location: `vllm/model_executor/layers/linear.py`, line 604, in `forward`, during `torch.nn.functional.linear`
Environment
- Hardware: ARM64 (aarch64) - 1x NVIDIA A100
- CUDA: 12.9
- Host OS: Rocky Linux (Kernel 5.x+)
- Container Image:
nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1 - Software: Python 3.12, vLLM 0.16.0 (V1 Engine)
- Model:
Qwen/Qwen3-0.6B
Additional Context
No response
Screenshots
No response