
[BUG]: Binary Incompatibility on ARM64 + A100 (sm_80) #7594

@madhur-fujitsu

Description


Describe the Bug

The nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1 image exhibits two critical issues when deployed on ARM64 systems with NVIDIA A100 GPUs:

  1. Broken Dependency Path: The image fails to load CuPy by default, falling back to CPU-based operations unless CuPy is manually reinstalled and reconfigured.
  2. Missing Kernel Images: Even when CuPy is functional, the engine crashes with cudaErrorNoKernelImageForDevice. This occurs in both the default "Graph" mode and "Eager" mode, indicating that the core vLLM/PyTorch binaries lack sm_80 support for the aarch64 platform.

Steps to Reproduce

  1. Launch the vllm-runtime:1.0.1 container on an ARM64 A100 host.
  2. Run the worker in the background:
    python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file &
    Observe the CuPy load failure in the logs.
  3. Manually install cupy-cuda12x.
  4. Re-run the worker:
    python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --discovery-backend file &
    Observe the crash during CUDA Graph capture.
  5. Re-run with --enforce-eager and observe the crash during linear layer execution in the profile run.
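To confirm the missing-kernel hypothesis, one can compare the device's compute capability against the architecture list the installed wheel was compiled with (inside the container, `torch.cuda.get_arch_list()` returns that list). A minimal sketch of the check; the `wheel_archs` value below is a hypothetical example, not taken from the actual image:

```python
# Sketch: does the installed wheel ship a binary kernel image for this GPU?
# In the container, replace `wheel_archs` with torch.cuda.get_arch_list().

def sm_arch(compute_cap: str) -> str:
    """Map a compute capability like '8.0' (A100) to an arch tag like 'sm_80'."""
    major, minor = compute_cap.split(".")
    return f"sm_{major}{minor}"

def is_supported(compute_cap: str, arch_list: list[str]) -> bool:
    """True if the wheel embeds a binary kernel image for this device."""
    return sm_arch(compute_cap) in arch_list

# Hypothetical arch list from an aarch64 wheel that omits sm_80:
wheel_archs = ["sm_90", "compute_90"]
print(is_supported("8.0", wheel_archs))  # A100 -> False: matches the observed crash
```

If the check returns False for compute capability 8.0, every kernel launch will raise `cudaErrorNoKernelImageForDevice`, regardless of Graph or Eager mode.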

Expected Behavior

The runtime should include pre-compiled CUDA kernels for major NVIDIA architectures (sm_70, sm_80, sm_90) specifically for the aarch64 build of the image.
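For reference, PyTorch-based builds select the embedded kernel images at compile time via the TORCH_CUDA_ARCH_LIST environment variable; a build covering the architectures above might be configured as follows (the exact flags used for the Dynamo image are an assumption):

```shell
# Build-time selection of embedded CUDA kernel images (PyTorch convention).
# "+PTX" also embeds PTX so newer GPUs can JIT-compile a fallback.
export TORCH_CUDA_ARCH_LIST="7.0;8.0;9.0+PTX"
```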

Actual Behavior

Issue 1: CuPy Initialization Failure

On initial launch, the runtime logs a failure to load CuPy, which is required for GPU acceleration in Dynamo's nixl_connect:

[2026-03-24 05:30:31] WARNING __init__.py:58: dynamo.nixl_connect: Failed to load CuPy for GPU acceleration, utilizing numpy to provide CPU based operations.

Observation: Manually installing cupy-cuda12x resolved this specific warning, but triggered a cascade of NumPy dependency conflicts (a NumPy 2.x requirement) incompatible with the aiconfigurator and scipy versions pinned in the image.
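A possible workaround sketch is to install CuPy while constraining NumPy below 2.0, so the image's pinned aiconfigurator/scipy remain importable; the version bounds here are assumptions, not tested against the image's full pin set:

```shell
# Workaround sketch: install CuPy while holding NumPy below 2.0 to avoid
# the NumPy 2.x cascade. Bounds are assumptions, not verified pins.
pip install "cupy-cuda12x" "numpy<2"
python3 -c "import cupy, numpy; print(numpy.__version__)"
```

Note that even if this resolves the import conflict, it does not address Issue 2: the missing sm_80 kernel images are baked into the compiled binaries.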

Issue 2: CUDA "No Kernel Image" (Fatal)

The engine fails to execute any GPU kernels. The crash occurs at different stages depending on the mode:

Scenario A: Default Mode (CUDA Graphs Enabled)

Crashes during the warmup/capture phase inside flash_attn.

  • Error: torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
  • Traceback Location: vllm/v1/attention/backends/flash_attn.py, line 634, in forward (return output.fill_(0))

Scenario B: Eager Mode (--enforce-eager)

Crashes during the initial model profile run, even without graph capture.

  • Error: torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device
  • Traceback Location: vllm/model_executor/layers/linear.py, line 604, in forward during torch.nn.functional.linear.

Environment

  • Hardware: ARM64 (aarch64) - 1x NVIDIA A100
  • CUDA: 12.9
  • Host OS: Rocky Linux (Kernel 5.x+)
  • Container Image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1
  • Software: Python 3.12, vLLM 0.16.0 (V1 Engine)
  • Model: Qwen/Qwen3-0.6B

Additional Context

No response

Screenshots

No response
