feat(vllm): add CPU Encode for dual/multiple encoder EPD #7667

ZhengHongming888 wants to merge 2 commits into ai-dynamo:main from
Conversation
Add CPU encoder support for disaggregated multimodal EPD:
- Add device parameter to load_vision_model() for CPU/GPU selection
- Add DYN_ENCODER_DEVICE environment variable
- Fix spatial_merge_size attribute access for HuggingFace models
- Add device verification logging
- Add cpu_encoder.sh launch script
- Fix script path in disagg_multimodal_epd_xpu.sh

Signed-off-by: Hongming Zheng <[email protected]>
Relocated the CPU encoder launch script to the xpu subdirectory and updated relative paths to the common utilities (../../../../common/).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Signed-off-by: Hongming Zheng <[email protected]>
👋 Hi ZhengHongming888! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough

These changes implement an environment-driven device override for vision encoders in vLLM multimodal serving. The modifications add device parameter support to model loading, extensive runtime logging for device placement verification, and a new launch script for CPU-based vision encoder deployment in disaggregated encode/prefill/decode configurations.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 1
🧹 Nitpick comments (2)
components/src/dynamo/vllm/multimodal_utils/encode_utils.py (1)

109-117: Remove redundant logger assignment.

Line 110 shadows the module-level logger (defined at line 25) with an identical assignment. This is unnecessary and creates shadowing.

♻️ Proposed fix

```diff
 with torch.no_grad():
     # Log encoder device during inference
-    logger = logging.getLogger(__name__)
     try:
         encoder_device = next(vision_encoder.parameters()).device
         logger.info(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `components/src/dynamo/vllm/multimodal_utils/encode_utils.py` around lines 109-117: the local assignment `logger = logging.getLogger(__name__)` in the encoder-device logging block shadows the module-level logger; remove that redundant assignment and use the existing module-level `logger` when logging the vision encoder device in the try/except around `vision_encoder.parameters().device` (keep the same try/except and log messages, just delete the duplicate `logging.getLogger` call).

examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh (1)
132-148: Consider using dynamic port allocation to avoid collisions.

The hardcoded ports (20097-20099 for NIXL side channels, 20080-20082 for KV events) are identical to disagg_multimodal_epd_xpu.sh. Running both scripts simultaneously on the same host would cause port binding failures.

Consider using `alloc_port` from the common utilities for dynamic port allocation, or at minimum, parameterize these ports via environment variables (e.g., `VLLM_NIXL_SIDE_CHANNEL_PORT=${VLLM_NIXL_SIDE_CHANNEL_PORT:-20097}`).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh` around lines 132 - 148, The script hardcodes ports (VLLM_NIXL_SIDE_CHANNEL_PORT values 20097–20099 and KV events endpoints tcp://*:20080–20082) which can collide with other scripts; update the launch commands in cpu_encoder_for_epd.sh (the VLLM_NIXL_SIDE_CHANNEL_PORT assignments and the --kv-events-config endpoints) to obtain ports dynamically or from env vars: call the shared alloc_port helper to allocate unique ports at runtime (or fallback to environment variables like VLLM_NIXL_SIDE_CHANNEL_PORT, KV_EVENTS_PORT_*), then inject those allocated/env ports into the VLLM_NIXL_SIDE_CHANNEL_PORT assignments and the --kv-events-config JSON strings used by the python -m dynamo.vllm invocations so the script no longer uses the fixed 20097–20099 and 20080–20082 values.
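A minimal sketch of the environment-variable fallback the comment suggests; only `VLLM_NIXL_SIDE_CHANNEL_PORT` appears in the script itself, and `KV_EVENTS_PORT` is an assumed name for illustration:

```shell
# Keep today's defaults but let the environment override the ports.
NIXL_PORT=${VLLM_NIXL_SIDE_CHANNEL_PORT:-20097}
KV_EVENTS_PORT=${KV_EVENTS_PORT:-20080}  # assumed variable name, not from the script
KV_EVENTS_ENDPOINT="tcp://*:${KV_EVENTS_PORT}"
echo "side channel: $NIXL_PORT, kv events: $KV_EVENTS_ENDPOINT"
```

A caller could then run two stacks side by side by exporting distinct port values before launching each script.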
ℹ️ Review info

⚙️ Run configuration
- Configuration used: .coderabbit.yaml
- Review profile: CHILL
- Plan: Pro
- Run ID: e7361a41-901b-4f11-aaed-a4d5fdf8a6f3
📒 Files selected for processing (5)
- components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py
- components/src/dynamo/vllm/multimodal_utils/encode_utils.py
- components/src/dynamo/vllm/multimodal_utils/model.py
- examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh
- examples/backends/vllm/launch/xpu/disagg_multimodal_epd_xpu.sh
```shell
# Start decode worker
echo "Starting decode worker on GPU $DYN_DECODE_WORKER_GPU (GPU mem: $DYN_DECODE_GPU_MEM)..."
VLLM_NIXL_SIDE_CHANNEL_PORT=20099 \
  env $DEVICE_AFFINITY_ENV=$DYN_DECODE_WORKER_GPU \
  python -m dynamo.vllm --multimodal-decode-worker --enable-multimodal --enable-mm-embeds --model $MODEL_NAME --gpu-memory-utilization $DYN_DECODE_GPU_MEM $EXTRA_ARGS $PD_EXTRA_ARGS --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device": "'"$DEVICE_PLATFORM"'"}' --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20082"}' &
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Check how other decode workers are configured in the repo
rg -n --type=sh 'multimodal-decode-worker|disaggregation-mode decode' examples/
```

Repository: ai-dynamo/dynamo
Length of output: 3720

🏁 Script executed:

```shell
rg -n 'multimodal-decode-worker|disaggregation-mode' --type=py dynamo/vllm | head -50
```

Repository: ai-dynamo/dynamo
Length of output: 111

🏁 Script executed:

```shell
find . -type f -name "*.py" -path "*/vllm/*" | head -20
```

Repository: ai-dynamo/dynamo
Length of output: 1127

🏁 Script executed:

```shell
rg -n 'multimodal-decode-worker|disaggregation-mode' ./components/src/dynamo/vllm/
```

Repository: ai-dynamo/dynamo
Length of output: 2884

🏁 Script executed:

```shell
sed -n '20,110p' ./components/src/dynamo/vllm/backend_args.py
```

Repository: ai-dynamo/dynamo
Length of output: 3683

🏁 Script executed:

```shell
sed -n '220,300p' ./components/src/dynamo/vllm/backend_args.py
```

Repository: ai-dynamo/dynamo
Length of output: 3609

🏁 Script executed:

```shell
rg -n 'multimodal_decode_worker' ./components/src/dynamo/vllm/ -A 3 -B 1
```

Repository: ai-dynamo/dynamo
Length of output: 6465

🏁 Script executed:

```shell
sed -n '150,180p' ./components/src/dynamo/vllm/args.py
```

Repository: ai-dynamo/dynamo
Length of output: 1296
Add --disaggregation-mode decode to the decode worker command.
The decode worker at line 148 uses `--multimodal-decode-worker` but omits `--disaggregation-mode decode`. These flags are independent: `--multimodal-decode-worker` only sets the component type, not the disaggregation mode. Without the explicit flag, `disaggregation_mode` defaults to AGGREGATED, which conflicts with the intended decode-only behavior. All other decode workers in the repository explicitly specify `--disaggregation-mode decode` for consistency. Add this flag to match the sibling script disagg_multimodal_epd_xpu.sh:142.
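A minimal sketch of the suggested invocation with the flag added; the flags shown come from the review, the default `MODEL_NAME` value is only illustrative, and the real script's remaining flags (`--kv-transfer-config`, `--kv-events-config`, etc.) are omitted:

```shell
# Build the decode-worker command with the explicit disaggregation mode added.
MODEL_NAME=${MODEL_NAME:-Qwen/Qwen2.5-VL-3B-Instruct}
CMD="python -m dynamo.vllm --multimodal-decode-worker --disaggregation-mode decode \
--enable-multimodal --enable-mm-embeds --model $MODEL_NAME"
echo "$CMD"
```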
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh` around lines 144 -
148, Update the decode worker launch command that starts with
"VLLM_NIXL_SIDE_CHANNEL_PORT=20099 env
$DEVICE_AFFINITY_ENV=$DYN_DECODE_WORKER_GPU python -m dynamo.vllm
--multimodal-decode-worker ..." to include the explicit flag
"--disaggregation-mode decode" so the worker runs in decode-only disaggregation
mode (matching other decode worker scripts); ensure the new flag is placed among
the existing CLI flags (alongside --enable-multimodal, --model $MODEL_NAME,
etc.) so disaggregation_mode is not left at the default AGGREGATED.
Overview:
This PR adds CPU encoding for the EPD disaggregation case, so the CPU can offload encoding work in dual/multiple-encoder EPD scenarios. This can improve performance compared with a purely GPU/XPU encoding setup.

The problem solved here: the default encoder in dynamo automatically discovers the device platform, so you cannot set up an additional CPU encoder for offloading under, for example, a CUDA/XPU device environment. With this PR you can set up an additional CPU-backed encoder for encoding offload in the multiple-encoder case.
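To illustrate the mechanism, here is a small sketch of env-driven device selection. `DYN_ENCODER_DEVICE` is the variable this PR introduces, but the resolution logic below is an assumption for illustration, not the actual dynamo implementation:

```python
import os

def resolve_encoder_device(default: str = "cuda") -> str:
    """Pick the vision-encoder device, honoring a DYN_ENCODER_DEVICE override."""
    # Assumed behavior: an explicit override beats platform auto-detection.
    override = os.environ.get("DYN_ENCODER_DEVICE", "").strip().lower()
    if override in {"cpu", "cuda", "xpu"}:
        return override
    return default

os.environ["DYN_ENCODER_DEVICE"] = "cpu"
print(resolve_encoder_device())  # prints: cpu
```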
Details:
You can use the example command below to test CPU encoding:

```shell
DEVICE_PLATFORM='xpu' bash examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh --model Qwen/Qwen2.5-VL-3B-Instruct
```

You will see the encoding device in the terminal, like:

Also with the output -

Thanks.
Summary by CodeRabbit

- New Features
- Improvements
- Chores