Labels: bug
Describe the Bug
When I deployed a Qwen3 model using the NGC Dynamo vLLM 1.0.1 image with LMCache, the decode worker started and registered the model, but it crashed with `CUDA error: no kernel image is available for execution on the device` when I later made a chat request.
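This error usually means the prebuilt LMCache CUDA kernels (`lmcache.c_ops`) were not compiled for the GPU's compute capability. A minimal diagnostic sketch, to be run on the worker pod: `describe_arch_mismatch` is a hypothetical helper, and `torch.cuda.get_arch_list()` reflects PyTorch's own build rather than the LMCache extension, so the result is only indicative.

```python
# Sketch: compare the GPU's compute capability against a compiled arch
# list. "no kernel image is available" usually means the CUDA extension
# was built without the device's sm_XX target.
def describe_arch_mismatch(device_cc, arch_list):
    """device_cc like (9, 0); arch_list like ["sm_80", "sm_90"]."""
    target = f"sm_{device_cc[0]}{device_cc[1]}"
    status = "covered" if target in arch_list else "NOT covered"
    return f"device {target}: {status} by {arch_list}"

try:
    # Run this on the affected worker, where torch and a GPU are present.
    import torch
    if torch.cuda.is_available():
        cc = torch.cuda.get_device_capability(0)
        # Note: this arch list is PyTorch's; the lmcache c_ops wheel may
        # target a different (narrower) set of architectures.
        print(describe_arch_mismatch(cc, torch.cuda.get_arch_list()))
except ImportError:
    pass
```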
Steps to Reproduce
- Create a `DynamoGraphDeployment` that specifies disaggregated serving:
  - using the `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1` image
  - using LMCache as the KV cache offloading backend
- Wait until both workers finish the startup process.
- Make a chat request with the OpenAI API.
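The last step can be sketched as follows; the frontend URL below is a placeholder for this deployment, not a value from the report:

```python
# Minimal OpenAI-compatible chat request to the Dynamo frontend.
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"  # placeholder address

payload = {
    "model": "Qwen/Qwen3-235B-A22B-GPTQ-Int4",
    "messages": [{"role": "user", "content": "Please summarize the following text: ..."}],
    "max_tokens": 32,
}

def send_chat_request(url=URL):
    # POST the JSON payload; in this bug, the prefill worker crashes while
    # handling the request, so no completion is ever returned.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```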
Expected Behavior
The system should return a successful chat response.
Actual Behavior
The vLLM prefill worker started up successfully:
2026-03-20T09:25:00.641663Z INFO _core: Registered base model 'Qwen/Qwen3-235B-A22B-GPTQ-Int4' MDC
2026-03-20T09:25:00.643076Z INFO dynamo_runtime::pipeline::network::ingress::shared_tcp_endpoint: Registered endpoint 'clear_kv_blocks' with shared TCP server on 10.233.112.57:44633
2026-03-20T09:25:00.643930Z INFO dynamo_runtime::discovery::kube: Registering endpoint: namespace=dynamo-system-g4--disagg-kvbm-mem--qwen3-235b-6237ecc1, component=prefill, endpoint=clear_kv_blocks, instance_id=80354c2356f60
2026-03-20T09:25:00.644057Z INFO dynamo_runtime::pipeline::network::ingress::shared_tcp_endpoint: Registered endpoint 'generate' with shared TCP server on 10.233.112.57:44633
2026-03-20T09:25:00.655272Z INFO dynamo_runtime::discovery::kube: Registering endpoint: namespace=dynamo-system-g4--disagg-kvbm-mem--qwen3-235b-6237ecc1, component=prefill, endpoint=generate, instance_id=80354c2356f60
2026-03-20T09:25:08.094897Z  INFO dynamo_runtime::discovery::metadata: Snapshot (seq=4): 2 instances, added=["80354c2356f60"], removed=[], updated=[]

However, while processing the request, the vLLM prefill worker crashed with the following logs:
(EngineCore_DP0 pid=1168) [2026-03-20 09:25:24,346] LMCache INFO: Reqid: 32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141, Total tokens 462, Inference Engine computed tokens: 0, LMCache hit tokens: 0, need to load: 0 (vllm_v1_adapter.py:1304:lmcache.integration.vllm.vllm_v1_adapter)
(EngineCore_DP0 pid=1168) INFO 03-20 09:26:24 [shm_broadcast.py:542] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,055] LMCache INFO: list_depth: 1, tensor_dim: 5 (utils.py:146:lmcache.v1.gpu_connector.utils)
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,056] LMCache INFO: GPU KV Cache Dimensions: [94][2, 17838, 16, 2, 128] (utils.py:157:lmcache.v1.gpu_connector.utils)
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,056] LMCache INFO: GPU KV Format: List[num_layers] of [2, num_blocks, block_size, num_heads, head_size] (utils.py:73:lmcache.v1.gpu_connector.utils)
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,056] LMCache INFO: Currently used by:
(Worker_TP0 pid=1430) - vLLM non-MLA flash attention (utils.py:78:lmcache.v1.gpu_connector.utils)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,055] LMCache INFO: list_depth: 1, tensor_dim: 5 (utils.py:146:lmcache.v1.gpu_connector.utils)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,056] LMCache INFO: GPU KV Cache Dimensions: [94][2, 17838, 16, 2, 128] (utils.py:157:lmcache.v1.gpu_connector.utils)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,057] LMCache INFO: GPU KV Format: List[num_layers] of [2, num_blocks, block_size, num_heads, head_size] (utils.py:73:lmcache.v1.gpu_connector.utils)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,057] LMCache INFO: Currently used by:
(Worker_TP1 pid=1435) - vLLM non-MLA flash attention (utils.py:78:lmcache.v1.gpu_connector.utils)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] WorkerProc hit an exception.
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Traceback (most recent call last):
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 858, in worker_busy_loop
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] output = func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 361, in execute_model
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] return self.worker.execute_model(scheduler_output)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] return func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 652, in execute_model
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] output = self.model_runner.execute_model(
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] return func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3523, in execute_model
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] with (
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] next(self.gen)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 101, in _get_kv_connector_output
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] kv_connector.wait_for_save()
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py", line 242, in wait_for_save
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] c.wait_for_save()
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py", line 187, in wait_for_save
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] self._lmcache_engine.wait_for_save()
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/integration/vllm/vllm_v1_adapter.py", line 1152, in wait_for_save
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] self.lmcache_engine.store(
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] return func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/v1/cache_engine.py", line 500, in store
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] self.gpu_connector.batched_from_gpu(memory_objs, starts, ends, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/v1/gpu_connector/gpu_connectors.py", line 377, in batched_from_gpu
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] self.from_gpu(memory_obj, start, end, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/v1/gpu_connector/gpu_connectors.py", line 346, in from_gpu
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] lmc_ops.multi_layer_kv_transfer(
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] RuntimeError: CUDA error: no kernel image is available for execution on the device
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7e4a8a37cb80 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #1: <unknown function> + 0x11fb7 (0x7e4a8a74bfb7 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #2: void multi_layer_kv_transfer_templated<long>(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x6d8 (0x7e3303298c27 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #3: multi_layer_kv_transfer(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x7d (0x7e330328b1e4 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #4: <unknown function> + 0x9802b (0x7e33032c102b in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #5: <unknown function> + 0x8e2f6 (0x7e33032b72f6 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #6: VLLM::Worker_TP0() [0x581fcf]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #7: _PyObject_MakeTpCall + 0x75 (0x548f35 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #8: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #9: VLLM::Worker_TP0() [0x54cb34]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #10: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #11: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #12: VLLM::Worker_TP0() [0x54cb34]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #13: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #14: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #15: VLLM::Worker_TP0() [0x5551f6]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #16: VLLM::Worker_TP0() [0x5d430c]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #17: _PyEval_EvalFrameDefault + 0x212e (0x5d898e in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #18: VLLM::Worker_TP0() [0x54cb34]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #19: PyObject_Vectorcall + 0x35 (0x549935 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #20: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #21: VLLM::Worker_TP0() [0x54ca6d]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #22: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #23: PyEval_EvalCode + 0x15b (0x5d582b in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #24: PyRun_StringFlags + 0xd3 (0x6087b3 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #25: PyRun_SimpleStringFlags + 0x3e (0x6b392e in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #26: Py_RunMain + 0x481 (0x6bc5f1 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #27: Py_BytesMain + 0x2d (0x6bc00d in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #28: <unknown function> + 0x2a1ca (0x7e4a8b04a1ca in /lib/x86_64-linux-gnu/libc.so.6)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #29: __libc_start_main + 0x8b (0x7e4a8b04a28b in /lib/x86_64-linux-gnu/libc.so.6)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #30: _start + 0x25 (0x657445 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Traceback (most recent call last):
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 858, in worker_busy_loop
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] output = func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 361, in execute_model
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] return self.worker.execute_model(scheduler_output)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] return func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 652, in execute_model
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] output = self.model_runner.execute_model(
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] return func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3523, in execute_model
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] with (
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] next(self.gen)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/worker/kv_connector_model_runner_mixin.py", line 101, in _get_kv_connector_output
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] kv_connector.wait_for_save()
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py", line 242, in wait_for_save
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] c.wait_for_save()
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py", line 187, in wait_for_save
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] self._lmcache_engine.wait_for_save()
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/integration/vllm/vllm_v1_adapter.py", line 1152, in wait_for_save
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] self.lmcache_engine.store(
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] return func(*args, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/v1/cache_engine.py", line 500, in store
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] self.gpu_connector.batched_from_gpu(memory_objs, starts, ends, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/v1/gpu_connector/gpu_connectors.py", line 377, in batched_from_gpu
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] self.from_gpu(memory_obj, start, end, **kwargs)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] File "/opt/dynamo/venv/lib/python3.12/site-packages/lmcache/v1/gpu_connector/gpu_connectors.py", line 346, in from_gpu
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] lmc_ops.multi_layer_kv_transfer(
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] RuntimeError: CUDA error: no kernel image is available for execution on the device
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7e4a8a37cb80 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #1: <unknown function> + 0x11fb7 (0x7e4a8a74bfb7 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #2: void multi_layer_kv_transfer_templated<long>(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x6d8 (0x7e3303298c27 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #3: multi_layer_kv_transfer(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x7d (0x7e330328b1e4 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #4: <unknown function> + 0x9802b (0x7e33032c102b in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #5: <unknown function> + 0x8e2f6 (0x7e33032b72f6 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #6: VLLM::Worker_TP0() [0x581fcf]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #7: _PyObject_MakeTpCall + 0x75 (0x548f35 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #8: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #9: VLLM::Worker_TP0() [0x54cb34]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #10: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #11: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #12: VLLM::Worker_TP0() [0x54cb34]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #13: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #14: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #15: VLLM::Worker_TP0() [0x5551f6]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #16: VLLM::Worker_TP0() [0x5d430c]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #17: _PyEval_EvalFrameDefault + 0x212e (0x5d898e in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #18: VLLM::Worker_TP0() [0x54cb34]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #19: PyObject_Vectorcall + 0x35 (0x549935 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #20: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #21: VLLM::Worker_TP0() [0x54ca6d]
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #22: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #23: PyEval_EvalCode + 0x15b (0x5d582b in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #24: PyRun_StringFlags + 0xd3 (0x6087b3 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #25: PyRun_SimpleStringFlags + 0x3e (0x6b392e in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #26: Py_RunMain + 0x481 (0x6bc5f1 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #27: Py_BytesMain + 0x2d (0x6bc00d in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #28: <unknown function> + 0x2a1ca (0x7e4a8b04a1ca in /lib/x86_64-linux-gnu/libc.so.6)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #29: __libc_start_main + 0x8b (0x7e4a8b04a28b in /lib/x86_64-linux-gnu/libc.so.6)
(Worker_TP0 pid=1430) ERROR 03-20 09:26:31 [multiproc_executor.py:863] frame #30: _start + 0x25 (0x657445 in VLLM::Worker_TP0)
(Worker_TP0 pid=1430) [2026-03-20 09:26:31,079] LMCache WARNING: MemoryObj at 0 is being garbage collected with ref_count=1, pin_count=0. This indicates ref_count_down()/unpin() was not called properly. (memory_management.py:470:lmcache.v1.memory_management)
(Worker_TP1 pid=1435) [2026-03-20 09:26:31,080] LMCache WARNING: MemoryObj at 0 is being garbage collected with ref_count=1, pin_count=0. This indicates ref_count_down()/unpin() was not called properly. (memory_management.py:470:lmcache.v1.memory_management)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.16.0) with config: model='Qwen/Qwen3-235B-A22B-GPTQ-Int4', speculative_config=None, tokenizer='Qwen/Qwen3-235B-A22B-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-235B-A22B-GPTQ-Int4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 
'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []},
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141,prompt_token_ids_len=462,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643, 151645], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=1, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args={'kv_transfer_params': {'do_remote_decode': True, 'do_remote_prefill': False, 'remote_engine_id': None, 'remote_block_ids': None, 'remote_host': None, 'remote_port': None}}),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141: 462}, total_num_scheduled_tokens=462, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[29], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=PdConnectorMetadata(metadata=[LMCacheConnectorMetadata(requests=[ReqMeta(req_id='32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141', token_ids=[151644, 872, 198, 5501, 62079, 279, 2701, 1467, 510, 17, 13, 16, 13, 34807, 198, 2121, 6839, 304, 19037, 220, 16, 11, 2925, 1520, 374, 264, 34687, 4938, 4688, 6188, 
311, 48706, 279, 4058, 4741, 55427, 553, 2036, 41924, 49445, 1099, 5383, 5819, 504, 8741, 34447, 13, 3377, 745, 11, 2661, 458, 1946, 8500, 81250, 239, 222, 284, 320, 147460, 16, 28675, 11, 147460, 146505, 8, 323, 8177, 5302, 472, 7, 147525, 8, 46363, 431, 146505, 79029, 148372, 518, 6193, 81250, 239, 225, 11, 279, 4688, 11364, 1817, 2309, 81250, 239, 227, 304, 1378, 15629, 34430, 25, 56370, 323, 36508, 13, 5512, 11, 438, 11682, 304, 11113, 220, 17, 13, 17, 11, 582, 8649, 323, 24611, 20525, 81250, 238, 123, 12, 50770, 311, 6315, 37110, 17179, 1099, 39088, 22879, 4566, 72355, 13, 3719, 38642, 11, 304, 11113, 220, 17, 13, 18, 11, 1493, 30403, 70547, 525, 42011, 1463, 7757, 553, 279, 1482, 8177, 1584, 323, 37191, 4566, 264, 29144, 55712, 13, 17375, 11, 582, 4263, 279, 17590, 448, 7299, 12, 17940, 77235, 304, 11113, 220, 17, 13, 19, 323, 279, 1849, 11591, 2884, 304, 11113, 220, 17, 13, 20, 624, 17, 13, 17, 13, 71794, 19470, 831, 4566, 6531, 291, 81250, 238, 123, 12, 50770, 198, 785, 1156, 10262, 14043, 2205, 37597, 311, 1099, 4938, 10695, 11, 15860, 45958, 25111, 323, 48224, 70547, 4566, 72349, 72355, 624, 37434, 66161, 5976, 81250, 238, 123, 12, 1520, 6507, 11136, 14476, 5961, 389, 45958, 16275, 11, 5297, 1186, 1158, 3950, 12230, 62552, 4709, 1717, 42638, 11, 3545, 60753, 84784], slot_mapping=Tensor(shape=torch.Size([256]), device=cpu,dtype=torch.int64), is_last_prefill=true, save_spec=SaveSpec(skip_leading_tokens=0, can_save=true), load_spec=null, disagg_spec=null, request_configs=null)]), NixlConnectorMetadata(reqs_to_recv={}, reqs_to_save={}, reqs_to_send={}, reqs_in_batch=['32ead612-2bc6-43f8-b0d8-cbdc69d962ac-ab28c141'], reqs_not_processed=[])], extra_async_saves=null), ec_connector_metadata=null)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0016258339406850508, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=462, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=462, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] Traceback (most recent call last):
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 999, in run_engine_core
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] engine_core.run_busy_loop()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1026, in run_busy_loop
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] self._process_engine_step()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1060, in _process_engine_step
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 514, in step_with_batch_queue
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] model_output = future.result()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 81, in result
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] return super().result()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] return self.__get_result()
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] raise self._exception
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 85, in wait_for_response
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] response = self.aggregate(get_response())
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 358, in get_response
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] raise RuntimeError(
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] RuntimeError: Worker failed with error 'CUDA error: no kernel image is available for execution on the device
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] Search for `cudaErrorNoKernelImageForDevice' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x760c5dd13b80 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #1: <unknown function> + 0x11fb7 (0x760cd8366fb7 in /opt/dynamo/venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #2: void multi_layer_kv_transfer_templated<long>(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x6d8 (0x75f505a8cc27 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #3: multi_layer_kv_transfer(at::Tensor&, at::Tensor const&, at::Tensor const&, c10::Device const&, int, TransferDirection, GPUKVFormat, int) + 0x7d (0x75f505a7f1e4 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #4: <unknown function> + 0x9802b (0x75f505ab502b in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #5: <unknown function> + 0x8e2f6 (0x75f505aab2f6 in /opt/dynamo/venv/lib/python3.12/site-packages/lmcache/c_ops.cpython-312-x86_64-linux-gnu.so)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #6: VLLM::Worker_TP1() [0x581fcf]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #7: _PyObject_MakeTpCall + 0x75 (0x548f35 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #8: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #9: VLLM::Worker_TP1() [0x54cb34]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #10: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #11: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #12: VLLM::Worker_TP1() [0x54cb34]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #13: PyObject_Call + 0x115 (0x54b155 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #14: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #15: VLLM::Worker_TP1() [0x5551f6]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #16: VLLM::Worker_TP1() [0x5d430c]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #17: _PyEval_EvalFrameDefault + 0x212e (0x5d898e in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #18: VLLM::Worker_TP1() [0x54cb34]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #19: PyObject_Vectorcall + 0x35 (0x549935 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #20: _PyEval_EvalFrameDefault + 0xadf (0x5d733f in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #21: VLLM::Worker_TP1() [0x54ca6d]
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #22: _PyEval_EvalFrameDefault + 0x4cb0 (0x5db510 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #23: PyEval_EvalCode + 0x15b (0x5d582b in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #24: PyRun_StringFlags + 0xd3 (0x6087b3 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #25: PyRun_SimpleStringFlags + 0x3e (0x6b392e in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #26: Py_RunMain + 0x481 (0x6bc5f1 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #27: Py_BytesMain + 0x2d (0x6bc00d in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #28: <unknown function> + 0x2a1ca (0x760cd8f441ca in /lib/x86_64-linux-gnu/libc.so.6)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #29: __libc_start_main + 0x8b (0x760cd8f4428b in /lib/x86_64-linux-gnu/libc.so.6)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] frame #30: _start + 0x25 (0x657445 in VLLM::Worker_TP1)
(EngineCore_DP0 pid=1168) ERROR 03-20 09:26:31 [core.py:1008] ', please check the stack trace above for the root cause
2026-03-20T09:26:31.089928Z ERROR async_llm.output_handler: AsyncLLM output_handler failed.
(Worker_TP0 pid=1430) INFO 03-20 09:26:31 [multiproc_executor.py:732] Parent process exited, terminating worker
(Worker_TP1 pid=1435) INFO 03-20 09:26:31 [multiproc_executor.py:732] Parent process exited, terminating worker
Traceback (most recent call last):
File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/handlers.py", line 1527, in generate
(Worker_TP0 pid=1430) INFO 03-20 09:26:31 [multiproc_executor.py:785] WorkerProc shutting down.
(Worker_TP1 pid=1435) INFO 03-20 09:26:31 [multiproc_executor.py:785] WorkerProc shutting down.
async for chunk in self._generate_token_mode(request, context, request_id):
File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/handlers.py", line 1606, in _generate_token_mode
async for res in gen:
File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 589, in generate
out = q.get_nowait() or await q.get()
^^^^^^^^^^^^^
File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 85, in get
raise output
File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 666, in output_handler
outputs = await engine_core.get_output_async()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 910, in get_output_async
raise self._format_exception(outputs) from None
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
2026-03-20T09:26:32.392842Z ERROR engine_monitor._check_engine_health: Traceback: Traceback (most recent call last):
File "/opt/dynamo/venv/lib/python3.12/site-packages/dynamo/vllm/engine_monitor.py", line 92, in _check_engine_health
await self.engine_client.check_health()
File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 911, in check_health
raise self.dead_error
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
2026-03-20T09:26:32.392981Z ERROR engine_monitor._check_engine_health: vLLM AsyncLLM health check failed: EngineCore encountered an issue. See stack trace (above) for the root cause.
2026-03-20T09:26:32.393048Z WARN engine_monitor._check_engine_health: Initiating Dynamo Runtime shutdown.
2026-03-20T09:26:37.416992Z INFO dynamo_runtime::runtime: Runtime shutdown initiated
2026-03-20T09:26:37.417171Z INFO dynamo_runtime::runtime: Phase 1: Cancelling endpoint shutdown token
2026-03-20T09:26:37.417320Z INFO dynamo_runtime::runtime: Phase 2: Waiting for graceful endpoints to complete
2026-03-20T09:26:37.417331Z INFO dynamo_runtime::runtime: Active graceful endpoints: 3
2026-03-20T09:26:37.417665Z INFO dynamo_runtime::pipeline::network::ingress::shared_tcp_endpoint: Unregistered TCP endpoint handler endpoint_name=worker_kv_indexer_query_dp0 endpoint_path=80354c2356f60/worker_kv_indexer_query_dp0
2026-03-20T09:26:37.418527Z INFO dynamo_runtime::pipeline::network::ingress::shared_tcp_endpoint: Unregistered TCP endpoint handler endpoint_name=clear_kv_blocks endpoint_path=80354c2356f60/clear_kv_blocks
Environment
- Dynamo platform deployed using the `v1.0.1` helm chart.
- Image: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1`
  - CUDA library: `12.9`
- GPU: RTX PRO 6000 Blackwell Server GPU
  - NVIDIA driver: `590.48.01` (pre-installed on host)
  - CUDA driver: `13.1`
  - GPU Operator: `v25.10.1`
Additional Context
- The same error occurs if I use the `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.0` image instead.
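- The failing frame is `multi_layer_kv_transfer` inside `lmcache/c_ops.cpython-312-x86_64-linux-gnu.so`, and `cudaErrorNoKernelImageForDevice` usually means the binary ships no kernel image (SASS or PTX) usable on the device. Below is a hedged sketch of the CUDA runtime's image-selection rule that I suspect is at play: the RTX PRO 6000 Blackwell reports compute capability 12.0 (`sm_120`), so if the prebuilt LMCache wheel targets only older architectures without embedded PTX, every `c_ops` launch fails this way. The arch lists in the example are hypothetical, not the wheel's actual build flags.

```python
def kernel_available(device_cc: int, built_archs: list[str]) -> bool:
    """Rough model of CUDA kernel-image selection:
    - a cubin (sm_XY) is binary-compatible only within the same major
      compute capability, for equal-or-lower minor revisions;
    - embedded PTX (compute_XY) can be JIT-compiled forward to any
      equal-or-newer compute capability.
    """
    dev_major, dev_minor = divmod(device_cc, 10)
    for arch in built_archs:
        kind, cc = arch.split("_")
        major, minor = divmod(int(cc), 10)
        if kind == "compute" and int(cc) <= device_cc:
            return True  # PTX present: the driver can JIT for this device
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True  # a compatible precompiled binary image exists
    return False

# RTX PRO 6000 Blackwell: compute capability 12.0 -> sm_120.
print(kernel_available(120, ["sm_80", "sm_90"]))        # no usable image
print(kernel_available(120, ["sm_90", "compute_90"]))   # PTX can be JIT-ed
```

If this hypothesis holds, checking `torch.cuda.get_device_capability()` against the arch list the LMCache extension was compiled for (or rebuilding it with `sm_120`/PTX included) would confirm or rule it out.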
Screenshots
No response