
[BUG]: Binary Incompatibility on ARM64 + A100 (sm_80) #7594

@madhur-fujitsu

Description


Describe the Bug

The nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1 image exhibits two critical issues when deployed on ARM64 systems with NVIDIA A100 GPUs:

  1. Broken Dependency Path: The image fails to load CuPy by default, falling back to CPU-based operations unless CuPy is manually reinstalled and reconfigured.
  2. Missing Kernel Images: Even when CuPy is functional, the engine crashes with cudaErrorNoKernelImageForDevice. This occurs in both the default "Graph" mode and "Eager" mode, indicating that the core vLLM/PyTorch binaries lack sm_80 support for the aarch64 platform.

Steps to Reproduce

  1. Launch the vllm-runtime:1.0.1 container on an ARM64 A100 host.
  2. Run the worker in the background:
    python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file &
    Observe the CuPy load failure in the logs.
  3. Manually install cupy-cuda12x.
  4. Re-run the worker:
    python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file &
    Observe the crash during CUDA Graph capture.
  5. Re-run with --enforce-eager and observe the crash during linear layer execution in the profile run.
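To confirm the missing-kernel hypothesis, one can compare the device's compute capability against the architecture list the installed wheel was compiled with (inside the container, `torch.cuda.get_arch_list()` returns that list). A minimal sketch of the check; the `wheel_archs` value below is a hypothetical example, not taken from the actual image:

```python
# Sketch: does the installed wheel ship a binary kernel image for this GPU?
# In the container, replace `wheel_archs` with torch.cuda.get_arch_list().

def sm_arch(compute_cap: str) -> str:
    """Map a compute capability like '8.0' (A100) to an arch tag like 'sm_80'."""
    major, minor = compute_cap.split(".")
    return f"sm_{major}{minor}"

def is_supported(compute_cap: str, arch_list: list[str]) -> bool:
    """True if the wheel embeds a binary kernel image for this device."""
    return sm_arch(compute_cap) in arch_list

# Hypothetical arch list from an aarch64 wheel that omits sm_80:
wheel_archs = ["sm_90", "compute_90"]
print(is_supported("8.0", wheel_archs))  # A100 -> False: matches the observed crash
```

If the check returns False for compute capability 8.0, every kernel launch will raise `cudaErrorNoKernelImageForDevice`, regardless of Graph or Eager mode.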

Expected Behavior

The runtime should include pre-compiled CUDA kernels for major NVIDIA architectures (sm_70, sm_80, sm_90) specifically for the aarch64 build of the image.
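For reference, PyTorch-based builds select the embedded kernel images at compile time via the TORCH_CUDA_ARCH_LIST environment variable; a build covering the architectures above might be configured as follows (the exact flags used for the Dynamo image are an assumption):

```shell
# Build-time selection of embedded CUDA kernel images (PyTorch convention).
# "+PTX" also embeds PTX so newer GPUs can JIT-compile a fallback.
export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0+PTX"
```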

Actual Behavior

Issue 1: CuPy Initialization Failure

On initial launch, the runtime logs a failure to load CuPy, which is required for GPU acceleration in Dynamo's nixl_connect:

[2026-03-24 05:30:31] WARNING __init__.py:58: dynamo.nixl_connect: Failed to load CuPy for GPU acceleration, utilizing numpy to provide CPU based operations.

Observation: Manually installing cupy-cuda12x resolved this specific warning, but triggered a cascade of NumPy dependency conflicts (a NumPy 2.x requirement) incompatible with the aiconfigurator and scipy versions pinned in the image.
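A possible workaround sketch is to install CuPy while constraining NumPy below 2.0, so the image's pinned aiconfigurator/scipy remain importable; the version bounds here are assumptions, not tested against the image's full pin set:

```shell
# Workaround sketch: install CuPy while holding NumPy below 2.0 to avoid
# the NumPy 2.x cascade. Bounds are assumptions, not verified pins.
pip install "cupy-cuda12x" "numpy<2"
python3 -c "import cupy, numpy; print(numpy.__version__)"
```

Note that even if this resolves the import conflict, it does not address Issue 2: the missing sm_80 kernel images are baked into the compiled binaries.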

Issue 2: CUDA "No Kernel Image" (Fatal)

The engine fails to execute any GPU kernels. The crash occurs at different stages depending on the mode:

Scenario A: Default Mode (CUDA Graphs Enabled)

Crashes during the warmup/capture phase inside flash_attn.

  • Error: torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
  • Traceback Location: vllm/v1/attention/backends/flash_attn.py, line 634, in forward (return output.fill_(0))

Scenario B: Eager Mode (--enforce-eager)

Crashes during the initial model profile run, even without graph capture.

  • Error: torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
  • Traceback Location: vllm/model_executor/layers/linear.py, line 604, in forward during torch.nn.functional.linear.

Environment

  • Hardware: ARM64 (aarch64) - 1x NVIDIA A100
  • CUDA: 12.9
  • Host OS: Rocky Linux (Kernel 5.x+)
  • Container Image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1
  • Software: Python 3.12, vLLM 0.16.0 (V1 Engine)
  • Model: Qwen/Qwen3-0.6B

Additional Context

No response

Screenshots

No response
