
[BUG]: CUDA error: no kernel image is available for execution on the device #7569

@george-kuanli-peng

Description


Describe the Bug

When I deployed a Qwen3 model using the NGC Dynamo vLLM 1.0.1 image with LMCache, the workers started up and registered the model successfully, but the prefill worker later crashed with `CUDA error: no kernel image is available for execution on the device` when I made a chat request.
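For context: `cudaErrorNoKernelImageForDevice` typically means a CUDA extension (here, LMCache's `c_ops`) was built without a kernel image matching the GPU's compute capability. A minimal sketch of that compatibility check, assuming the `sm_XX`/`compute_XX` naming used by `torch.cuda.get_arch_list()` (the helper name is hypothetical):

```python
def kernel_image_available(device_cc, arch_list):
    """Return True if a CUDA binary built for `arch_list` can run on a GPU
    whose compute capability is the (major, minor) tuple `device_cc`.

    Entries follow the torch.cuda.get_arch_list() convention:
    "sm_90" is a native cubin; "compute_90" is PTX that the driver can
    JIT-compile for that architecture or newer ones.
    """
    for arch in arch_list:
        kind, num = arch.split("_")
        cc = (int(num) // 10, int(num) % 10)
        if kind == "sm" and cc == device_cc:
            return True  # exact native kernel image present
        if kind == "compute" and cc <= device_cc:
            return True  # PTX can be JIT-compiled forward
    return False

# On the failing host one would compare (hypothetically):
#   torch.cuda.get_device_capability(0)  vs.  the arch list lmcache was built with
print(kernel_image_available((9, 0), ["sm_80", "sm_90"]))
```

If the device capability is absent from the build's arch list, any wheel of LMCache compiled for that GPU generation would be needed.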

Steps to Reproduce

  1. Create a DynamoGraphDeployment that specifies disaggregated serving:
    1. using the nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1 image
    2. using LMCache as the KV cache offloading backend
  2. Wait until both workers finish the startup process.
  3. Make a chat request via the OpenAI API.
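Step 3 can be reproduced with a minimal OpenAI-compatible request; the frontend URL is a placeholder, and the helper below is only an illustrative sketch:

```python
import json

def build_chat_request(model: str, user_message: str, max_tokens: int = 64) -> dict:
    """Build a minimal OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Qwen/Qwen3-235B-A22B-GPTQ-Int4", "Hello")
print(json.dumps(payload))
# POST this body to http://<frontend>/v1/chat/completions,
# e.g. with curl or requests; the crash occurs while serving this request.
```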

Expected Behavior

The system should return a successful chat response.

Actual Behavior

The vLLM prefill worker started up successfully.

2026-03-20T09:25:00.641663Z  INFO _core: Registered base model 'Qwen/Qwen3-235B-A22B-GPTQ-Int4' MDC
2026-03-20T09:25:00.643076Z  INFO dynamo_runtime::pipeline::network::ingress::shared_tcp_endpoint: Registered endpoint 'clear_kv_blocks' with shared TCP server on 10.233.112.57:44633
2026-03-20T09:25:00.643930Z  INFO dynamo_runtime::discovery::kube: Registering endpoint: namespace=dynamo-system-g4--disagg-kvbm-mem--qwen3-235b-6237ecc1, component=prefill, endpoint=clear_kv_blocks, instance_id=80354c2356f60
2026-03-20T09:25:00.644057Z  INFO dynamo_runtime::pipeline::network::ingress::shared_tcp_endpoint: Registered endpoint 'generate' with shared TCP server on 10.233.112.57:44633
2026-03-20T09:25:00.655272Z  INFO dynamo_runtime::discovery::kube: Registering endpoint: namespace=dynamo-system-g4--disagg-kvbm-mem--qwen3-235b-6237ecc1, component=prefill, endpoint=generate, instance_id=80354c2356f60
2026-03-20T09:25:08.094897Z  INFO dynamo_runtime::discovery::metadata: Snapshot (seq=4): 2 instances, added=["80354c2356f60"], removed=[], updated=[]

However, while processing the request, the vLLM prefill worker crashed with the following logs:

(EngineCore_DP0 pid=1168) [2026-03-20 09:25:24,346] LMCache INFO: Reqid: 32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141, Total tokens 462, Inference Engine computed tokens: 0, LMCache hit tokens: 0, need to load: 0 (vllm_v1_adapter.py:1304:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=1168) INFO 03-20 09:26:24 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,055] LMCache INFO: list_depth: 1, tensor_dim: 5 (utils.py:146:lmcache.v1.gpu_connector.utils)
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,056] LMCache INFO: GPU KV Cache Dimensions: [94][2, 17838, 16, 2, 128] (utils.py:157:lmcache.v1.gpu_connector.utils)
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,056] LMCache INFO: GPU KV Format: List[num_layers] of [2, num_blocks, block_size, num_heads, head_size] (utils.py:73:lmcache.v1.gpu_connector.utils)
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,056] LMCache INFO: Currently used by:
(Worker_TP0 pid=1430)   - vLLM non-MLA flash attention (utils.py:78:lmcache.v1.gpu_connector.utils)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,055] LMCache INFO: list_depth: 1, tensor_dim: 5 (utils.py:146:lmcache.v1.gpu_connector.utils)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,056] LMCache INFO: GPU KV Cache Dimensions: [94][2, 17838, 16, 2, 128] (utils.py:157:lmcache.v1.gpu_connector.utils)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,057] LMCache INFO: GPU KV Format: List[num_layers] of [2, num_blocks, block_size, num_heads, head_size] (utils.py:73:lmcache.v1.gpu_connector.utils)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,057] LMCache INFO: Currently used by:
(Worker_TP1 pid=1435)   - vLLM non-MLA flash attention (utils.py:78:lmcache.v1.gpu_connector.utils)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] WorkerProc hit an exception.
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Traceback (most recent call last):
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 858, in worker_busy_loop
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     output = func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 361, in execute_model
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     return self.worker.execute_model(scheduler_output)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     return func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 652, in execute_model
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     output = self.model_runner.execute_model(
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     return func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3523, in execute_model
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     with (
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     next(self.gen)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 101, in _get_kv_connector_output
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     kv_connector.wait_for_save()
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py", line 242, in wait_for_save
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     c.wait_for_save()
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py", line 187, in wait_for_save
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     self._lmcache_engine.wait_for_save()
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/integration/vllm/vllm_v1_adapter.py", line 1152, in wait_for_save
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     self.lmcache_engine.store(
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     return func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/v1/cache_engine.py", line 500, in store
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     self.gpu_connector.batched_from_gpu(memory_objs, starts, ends, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/v1/gpu_connector/gpu_connectors.py", line 377, in batched_from_gpu
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     self.from_gpu(memory_obj, start, end, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]   File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/v1/gpu_connector/gpu_connectors.py", line 346, in from_gpu
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]     lmc_ops.multi_layer_kv_transfer(
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] RuntimeError: CUDA error: no kernel image is available for execution on the device
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7e4a8a37cb80 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #1: <unknown function> + 0x11fb7 (0x7e4a8a74bfb7 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #2: void multi_layer_kv_transfer_templated<long>(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x6d8 (0x7e3303298c27 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #3: multi_layer_kv_transfer(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x7d (0x7e330328b1e4 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #4: <unknown function> + 0x9802b (0x7e33032c102b in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #5: <unknown function> + 0x8e2f6 (0x7e33032b72f6 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #6: VLLM::Worker_TP0() [0x581fcf]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #7: _PyObject_MakeTpCall + 0x75 (0x548f35 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #8: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #9: VLLM::Worker_TP0() [0x54cb34]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #10: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #11: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #12: VLLM::Worker_TP0() [0x54cb34]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #13: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #14: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #15: VLLM::Worker_TP0() [0x5551f6]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #16: VLLM::Worker_TP0() [0x5d430c]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #17: _PyEval_EvalFrameDefault + 0x212e (0x5d898e in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #18: VLLM::Worker_TP0() [0x54cb34]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #19: PyObject_Vectorcall + 0x35 (0x549935 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #20: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #21: VLLM::Worker_TP0() [0x54ca6d]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #22: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #23: PyEval_EvalCode + 0x15b (0x5d582b in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #24: PyRun_StringFlags + 0xd3 (0x6087b3 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #25: PyRun_SimpleStringFlags + 0x3e (0x6b392e in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #26: Py_RunMain + 0x481 (0x6bc5f1 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #27: Py_BytesMain + 0x2d (0x6bc00d in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #28: <unknown function> + 0x2a1ca (0x7e4a8b04a1ca in /lib/x86_64-linux-gnu/libc.so.6)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #29: __libc_start_main + 0x8b (0x7e4a8b04a28b in /lib/x86_64-linux-gnu/libc.so.6)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #30: _start + 0x25 (0x657445 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,079] LMCache WARNING: MemoryObj at 0 is being garbage collected with ref_count=1, pin_count=0. This indicates ref_count_down()/unpin() was not called properly. (memory_management.py:470:lmcache.v1.memory_management)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,080] LMCache WARNING: MemoryObj at 0 is being garbage collected with ref_count=1, pin_count=0. This indicates ref_count_down()/unpin() was not called properly. (memory_management.py:470:lmcache.v1.memory_management)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.16.0) with config: model='Qwen/Qwen3-235B-A22B-GPTQ-Int4', speculative_config=None, tokenizer='Qwen/Qwen3-235B-A22B-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-235B-A22B-GPTQ-Int4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 
'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []},
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141,prompt_token_ids_len=462,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643, 151645], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=1, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args={'kv_transfer_params': {'do_remote_decode': True, 'do_remote_prefill': False, 'remote_engine_id': None, 'remote_block_ids': None, 'remote_host': None, 'remote_port': None}}),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141: 462}, total_num_scheduled_tokens=462, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[29], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=PdConnectorMetadata(metadata=[LMCacheConnectorMetadata(requests=[ReqMeta(req_id='32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141', token_ids=[151644, 872, 198, 5501, 62079, 279, 2701, 1467, 510, 17, 13, 16, 13, 34807, 198, 2121, 6839, 304, 19037, 220, 16, 11, 2925, 1520, 374, 264, 34687, 4938, 4688, 6188, 
311, 48706, 279, 4058, 4741, 55427, 553, 2036, 41924, 49445, 1099, 5383, 5819, 504, 8741, 34447, 13, 3377, 745, 11, 2661, 458, 1946, 8500, 81250, 239, 222, 284, 320, 147460, 16, 28675, 11, 147460, 146505, 8, 323, 8177, 5302, 472, 7, 147525, 8, 46363, 431, 146505, 79029, 148372, 518, 6193, 81250, 239, 225, 11, 279, 4688, 11364, 1817, 2309, 81250, 239, 227, 304, 1378, 15629, 34430, 25, 56370, 323, 36508, 13, 5512, 11, 438, 11682, 304, 11113, 220, 17, 13, 17, 11, 582, 8649, 323, 24611, 20525, 81250, 238, 123, 12, 50770, 311, 6315, 37110, 17179, 1099, 39088, 22879, 4566, 72355, 13, 3719, 38642, 11, 304, 11113, 220, 17, 13, 18, 11, 1493, 30403, 70547, 525, 42011, 1463, 7757, 553, 279, 1482, 8177, 1584, 323, 37191, 4566, 264, 29144, 55712, 13, 17375, 11, 582, 4263, 279, 17590, 448, 7299, 12, 17940, 77235, 304, 11113, 220, 17, 13, 19, 323, 279, 1849, 11591, 2884, 304, 11113, 220, 17, 13, 20, 624, 17, 13, 17, 13, 71794, 19470, 831, 4566, 6531, 291, 81250, 238, 123, 12, 50770, 198, 785, 1156, 10262, 14043, 2205, 37597, 311, 1099, 4938, 10695, 11, 15860, 45958, 25111, 323, 48224, 70547, 4566, 72349, 72355, 624, 37434, 66161, 5976, 81250, 238, 123, 12, 1520, 6507, 11136, 14476, 5961, 389, 45958, 16275, 11, 5297, 1186, 1158, 3950, 12230, 62552, 4709, 1717, 42638, 11, 3545, 60753, 84784], slot_mapping=Tensor(shape=torch.Size([256]), device=cpu,dtype=torch.int64), is_last_prefill=true, save_spec=SaveSpec(skip_leading_tokens=0, can_save=true), load_spec=null, disagg_spec=null, request_configs=null)]), NixlConnectorMetadata(reqs_to_recv={}, reqs_to_save={}, reqs_to_send={}, reqs_in_batch=['32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141'], reqs_not_processed=[])], extra_async_saves=null), ec_connector_metadata=null)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0016258339406850508, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=462, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=462, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] Traceback (most recent call last):
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 999, in run_engine_core
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1026, in run_busy_loop
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]     self._process_engine_step()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1060, in _process_engine_step
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 514, in step_with_batch_queue
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]     model_output = future.result()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 81, in result
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]     return super().result()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]            ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]     return self.__get_result()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]     raise self._exception
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 85, in wait_for_response
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]     response = self.aggregate(get_response())
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]   File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 358, in get_response
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]     raise RuntimeError(
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] RuntimeError: Worker failed with error 'CUDA error: no kernel image is available for execution on the device
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x760c5dd13b80 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #1: <unknown function> + 0x11fb7 (0x760cd8366fb7 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #2: void multi_layer_kv_transfer_templated<long>(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x6d8 (0x75f505a8cc27 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #3: multi_layer_kv_transfer(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x7d (0x75f505a7f1e4 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #4: <unknown function> + 0x9802b (0x75f505ab502b in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #5: <unknown function> + 0x8e2f6 (0x75f505aab2f6 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #6: VLLM::Worker_TP1() [0x581fcf]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #7: _PyObject_MakeTpCall + 0x75 (0x548f35 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #8: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #9: VLLM::Worker_TP1() [0x54cb34]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #10: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #11: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #12: VLLM::Worker_TP1() [0x54cb34]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #13: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #14: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #15: VLLM::Worker_TP1() [0x5551f6]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #16: VLLM::Worker_TP1() [0x5d430c]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #17: _PyEval_EvalFrameDefault + 0x212e (0x5d898e in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #18: VLLM::Worker_TP1() [0x54cb34]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #19: PyObject_Vectorcall + 0x35 (0x549935 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #20: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #21: VLLM::Worker_TP1() [0x54ca6d]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #22: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #23: PyEval_EvalCode + 0x15b (0x5d582b in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #24: PyRun_StringFlags + 0xd3 (0x6087b3 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #25: PyRun_SimpleStringFlags + 0x3e (0x6b392e in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #26: Py_RunMain + 0x481 (0x6bc5f1 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #27: Py_BytesMain + 0x2d (0x6bc00d in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #28: <unknown function> + 0x2a1ca (0x760cd8f441ca in /lib/x86_64-linux-gnu/libc.so.6)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #29: __libc_start_main + 0x8b (0x760cd8f4428b in /lib/x86_64-linux-gnu/libc.so.6)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #30: _start + 0x25 (0x657445 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] ', please check the stack trace above for the root cause
2026-03-20T09:26:31.089928Z ERROR async_llm.output_handler: AsyncLLM output_handler failed.

(Worker_TP0 pid=1430) INFO 03-20 09:26:31 [multiproc_executor.py:732] Parent process exited, terminating worker
(Worker_TP1 pid=1435) INFO 03-20 09:26:31 [multiproc_executor.py:732] Parent process exited, terminating worker
Traceback (most recent call last):
  File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/handlers.py", line 1527, in generate
(Worker_TP0 pid=1430) INFO 03-20 09:26:31 [multiproc_executor.py:785] WorkerProc shutting down.
(Worker_TP1 pid=1435) INFO 03-20 09:26:31 [multiproc_executor.py:785] WorkerProc shutting down.
    async for chunk in self._generate_token_mode(request, context, request_id):
  File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/handlers.py", line 1606, in _generate_token_mode
    async for res in gen:
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 589, in generate
    out = q.get_nowait() or await q.get()
                            ^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 85, in get
    raise output
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 666, in output_handler
    outputs = await engine_core.get_output_async()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 910, in get_output_async
    raise self._format_exception(outputs) from None
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
2026-03-20T09:26:32.392842Z ERROR engine_monitor._check_engine_health: Traceback: Traceback (most recent call last):
  File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/engine_monitor.py", line 92, in _check_engine_health
    await self.engine_client.check_health()
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 911, in check_health
    raise self.dead_error
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

2026-03-20T09:26:32.392981Z ERROR engine_monitor._check_engine_health: vLLM AsyncLLM health check failed: EngineCore encountered an issue. See stack trace (above) for the root cause.
2026-03-20T09:26:32.393048Z  WARN engine_monitor._check_engine_health: Initiating Dynamo Runtime shutdown.
2026-03-20T09:26:37.416992Z  INFO dynamo_runtime::runtime: Runtime shutdown initiated
2026-03-20T09:26:37.417171Z  INFO dynamo_runtime::runtime: Phase 1: Cancelling endpoint shutdown token
2026-03-20T09:26:37.417320Z  INFO dynamo_runtime::runtime: Phase 2: Waiting for graceful endpoints to complete
2026-03-20T09:26:37.417331Z  INFO dynamo_runtime::runtime: Active graceful endpoints: 3
2026-03-20T09:26:37.417665Z  INFO dynamo_runtime::pipeline::network::ingress::shared_tcp_endpoint: Unregistered TCP endpoint handler endpoint_name=worker_kv_indexer_query_dp0 endpoint_path=80354c2356f60/worker_kv_indexer_query_dp0
2026-03-20T09:26:37.418527Z  INFO dynamo_runtime::pipeline::network::ingress::shared_tcp_endpoint: Unregistered TCP endpoint handler endpoint_name=clear_kv_blocks endpoint_path=80354c2356f60/clear_kv_blocks

Environment

  1. Dynamo platform deployed using the v1.0.1 helm chart.
  2. Image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1
    • CUDA library: 12.9
  3. RTX PRO 6000 Blackwell Server GPU
  4. NVIDIA driver: 590.48.01 (pre-installed on host)
  5. CUDA driver: 13.1
  6. GPU Operator: v25.10.1

Additional Context

  • The same error also occurs when using the nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0 image instead.
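
  • The crash originates in `lmcache/c_ops.cpython-312-x86_64-linux-gnu.so` (`multi_layer_kv_transfer`), which suggests the prebuilt LMCache CUDA extension may not include a kernel image for the GPU's compute capability (the RTX PRO 6000 Blackwell is sm_120). A minimal sketch of that check — the helper below and its example arch lists are illustrative, not taken from the report; on the affected node the real values would come from `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`:

```python
# Hypothetical diagnostic: decide whether a binary's compiled CUDA
# architectures can serve a GPU of a given compute capability.
# 'sm_XY' entries are exact-match cubins; 'compute_XY' entries are PTX
# that the driver can JIT-compile for newer (>=) architectures.

def covers_device(compiled_archs: list[str], major: int, minor: int) -> bool:
    """Return True if any compiled arch can run on an sm_{major}{minor} GPU."""
    target = major * 10 + minor
    for arch in compiled_archs:
        kind, _, ver = arch.partition("_")
        cc = int(ver)
        if kind == "sm" and cc == target:
            return True  # exact cubin present for this GPU
        if kind == "compute" and cc <= target:
            return True  # PTX can be JIT-compiled forward by the driver
    return False

# A build shipping only sm_80/sm_90 cubins cannot run on sm_120 (Blackwell):
print(covers_device(["sm_80", "sm_90"], 12, 0))       # False
# Including compute_90 PTX allows forward JIT compilation:
print(covers_device(["sm_80", "compute_90"], 12, 0))  # True
```

  If the extension covers neither an `sm_120` cubin nor any forward-compatible PTX, that would be consistent with `cudaErrorNoKernelImageForDevice` firing on the first kernel launch rather than at startup.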

Screenshots

No response
