feat(vllm): add CPU Encode for dual/multiple encoder EPD#7667

Open
ZhengHongming888 wants to merge 2 commits into ai-dynamo:main from ZhengHongming888:cpu_encode_for_epd

Conversation

@ZhengHongming888
Contributor

@ZhengHongming888 ZhengHongming888 commented Mar 27, 2026

Overview:

This PR adds CPU encoding for the EPD disaggregation case, so the CPU can offload encoder work in dual/multiple-encoder EPD scenarios. This can improve performance compared with a purely GPU/XPU encoding setup.

The problem solved here: the default encoder in Dynamo automatically discovers the device platform, so you cannot set up an additional CPU encoder for offloading under, for example, a CUDA/XPU device environment. With this PR you can configure an additional CPU-based encoder for encoding offload in the multiple-encoder case.
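As a rough illustration of how such an environment-driven device override can work (the helper name and fallback behavior below are assumptions for illustration, not the PR's exact code):

```python
import os

def resolve_encoder_device(default: str = "auto") -> str:
    """Return the device the vision encoder should load on.

    DYN_ENCODER_DEVICE is the environment variable this PR introduces;
    this helper and its fallback logic are an illustrative sketch only.
    """
    return os.environ.get("DYN_ENCODER_DEVICE", default)

# With DYN_ENCODER_DEVICE=cpu set, an additional encoder can be pinned
# to CPU even on a CUDA/XPU host.
os.environ["DYN_ENCODER_DEVICE"] = "cpu"
print(resolve_encoder_device())  # prints "cpu"
```

When the variable is unset, the helper falls back to the caller's default ("auto" here), leaving normal platform discovery in charge.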

Details:

You can use the command below to test CPU encoding:
DEVICE_PLATFORM='xpu' bash examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh --model Qwen/Qwen2.5-VL-3B-Instruct

You will see the encoding device in the terminal output:
[screenshot]

Also in the generated output:
[screenshot]

Thanks.

Summary by CodeRabbit

  • New Features

    • Added environment-driven configuration to control vision encoder device placement (CPU or GPU).
    • Introduced new disaggregated serving deployment option combining CPU-based vision encoder with GPU workers.
  • Improvements

    • Enhanced debugging with detailed logging for device placement and CUDA configuration.
    • Improved robustness for Qwen model vision configurations with flexible attribute resolution.
  • Chores

    • Updated launch script path references.

ZhengHongming888 and others added 2 commits March 26, 2026 22:47
Add CPU encoder support for disaggregated multimodal EPD:
- Add device parameter to load_vision_model() for CPU/GPU selection
- Add DYN_ENCODER_DEVICE environment variable
- Fix spatial_merge_size attribute access for HuggingFace models
- Add device verification logging
- Add cpu_encoder.sh launch script
- Fix script path in disagg_multimodal_epd_xpu.sh

Signed-off-by: Hongming Zheng <[email protected]>
Relocated CPU encoder launch script to xpu subdirectory and updated
relative paths to common utilities (../../../../common/).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Signed-off-by: Hongming Zheng <[email protected]>
@ZhengHongming888 ZhengHongming888 requested review from a team as code owners March 27, 2026 18:13
@copy-pr-bot

copy-pr-bot bot commented Mar 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Contributor

👋 Hi ZhengHongming888! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added external-contribution Pull request is from an external contributor feat backend::vllm Relates to the vllm backend multimodal labels Mar 27, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 27, 2026

Walkthrough

These changes implement environment-driven device override for vision encoders in vLLM multimodal serving. The modifications add device parameter support to model loading, extensive runtime logging for device placement verification, and a new launch script for CPU-based vision encoder deployment in disaggregated encode/prefill/decode configurations.

Changes

  • Vision Model Loading (components/src/dynamo/vllm/multimodal_utils/model.py): Updated load_vision_model() signature to accept a device parameter (defaulting to "auto"); bypasses the vLLM encoder path when device is explicitly "cpu" and passes the requested device to HuggingFace loading via device_map=device.
  • Encoder Device & Logging (components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py, components/src/dynamo/vllm/multimodal_utils/encode_utils.py): Added environment-driven device override reading DYN_ENCODER_DEVICE; inserted runtime logging to verify encoder device placement and CUDA availability; updated Qwen spatial merge size resolution with a fallback chain checking multiple attribute locations.
  • Disaggregated Serving Scripts (examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh, examples/backends/vllm/launch/xpu/disagg_multimodal_epd_xpu.sh): Added a new launch script for disaggregated E/P/D serving with a CPU-loaded vision encoder, environment variable configuration for worker GPU assignment, KV cache setup, and conditional flags for single-GPU/XPU modes; updated existing script source paths.
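The "fallback chain checking multiple attribute locations" for the Qwen spatial merge size can be pictured as a getattr chain; the attribute locations and default below are illustrative assumptions, not the PR's exact code:

```python
def resolve_spatial_merge_size(hf_config, default: int = 2) -> int:
    # Try several plausible attribute locations in order; which
    # locations the PR actually checks is not shown here, so these
    # are illustrative.
    candidates = (
        hf_config,                                  # top-level config
        getattr(hf_config, "vision_config", None),  # nested vision config
    )
    for cfg in candidates:
        value = getattr(cfg, "spatial_merge_size", None)
        if value is not None:
            return value
    return default

class FakeVisionConfig:
    spatial_merge_size = 4

class FakeConfig:
    vision_config = FakeVisionConfig()

print(resolve_spatial_merge_size(FakeConfig()))  # prints 4
print(resolve_spatial_merge_size(object()))      # prints 2 (the default)
```

This shape tolerates HuggingFace configs that expose the attribute at different nesting levels instead of hard-coding a single location.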

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 60.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title accurately describes the main feature added: CPU encoding support for disaggregated EPD scenarios with dual/multiple encoders.
  • Description check ✅ Passed: The description covers the overview, details with example usage, and includes images demonstrating the feature works as expected.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
components/src/dynamo/vllm/multimodal_utils/encode_utils.py (1)

109-117: Remove redundant logger assignment.

Line 110 shadows the module-level logger (defined at line 25) with an identical assignment; the duplicate is unnecessary.

♻️ Proposed fix
     with torch.no_grad():
         # Log encoder device during inference
-        logger = logging.getLogger(__name__)
         try:
             encoder_device = next(vision_encoder.parameters()).device
             logger.info(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/multimodal_utils/encode_utils.py` around lines 109
- 117, The local assignment logger = logging.getLogger(__name__) in the
encode_device logging block is shadowing the module-level logger; remove that
redundant assignment and use the existing module-level logger variable (logger)
when logging the vision encoder device in the try/except around
vision_encoder.parameters().device (keep the same try/except and log messages,
just delete the duplicate logging.getLogger call).
examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh (1)

132-148: Consider using dynamic port allocation to avoid collisions.

The hardcoded ports (20097-20099 for NIXL side channels, 20080-20082 for KV events) are identical to disagg_multimodal_epd_xpu.sh. Running both scripts simultaneously on the same host would cause port binding failures.

Consider using alloc_port from the common utilities for dynamic port allocation, or at minimum, parameterize these ports via environment variables (e.g., VLLM_NIXL_SIDE_CHANNEL_PORT=${VLLM_NIXL_SIDE_CHANNEL_PORT:-20097}).
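One way to implement such an alloc_port helper is to let the OS hand out a free ephemeral port (the helper name comes from the review comment; this Python implementation is a sketch, and a port released this way can in principle be reclaimed by another process before the worker binds it):

```python
import socket

def alloc_port() -> int:
    # Bind to port 0 so the OS picks any free port, then close the
    # socket and return the number for the launch script to use.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = alloc_port()
print(port)  # a free port chosen by the OS
```

Combined with the env-var fallback the review suggests (e.g. use the variable if set, otherwise allocate), two scripts can run side by side without colliding on fixed ports.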

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh` around lines 132 -
148, The script hardcodes ports (VLLM_NIXL_SIDE_CHANNEL_PORT values 20097–20099
and KV events endpoints tcp://*:20080–20082) which can collide with other
scripts; update the launch commands in cpu_encoder_for_epd.sh (the
VLLM_NIXL_SIDE_CHANNEL_PORT assignments and the --kv-events-config endpoints) to
obtain ports dynamically or from env vars: call the shared alloc_port helper to
allocate unique ports at runtime (or fallback to environment variables like
VLLM_NIXL_SIDE_CHANNEL_PORT, KV_EVENTS_PORT_*), then inject those allocated/env
ports into the VLLM_NIXL_SIDE_CHANNEL_PORT assignments and the
--kv-events-config JSON strings used by the python -m dynamo.vllm invocations so
the script no longer uses the fixed 20097–20099 and 20080–20082 values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh`:
- Around line 144-148: Update the decode worker launch command that starts with
"VLLM_NIXL_SIDE_CHANNEL_PORT=20099 env
$DEVICE_AFFINITY_ENV=$DYN_DECODE_WORKER_GPU python -m dynamo.vllm
--multimodal-decode-worker ..." to include the explicit flag
"--disaggregation-mode decode" so the worker runs in decode-only disaggregation
mode (matching other decode worker scripts); ensure the new flag is placed among
the existing CLI flags (alongside --enable-multimodal, --model $MODEL_NAME,
etc.) so disaggregation_mode is not left at the default AGGREGATED.

---

Nitpick comments:
In `@components/src/dynamo/vllm/multimodal_utils/encode_utils.py`:
- Around line 109-117: The local assignment logger = logging.getLogger(__name__)
in the encode_device logging block is shadowing the module-level logger; remove
that redundant assignment and use the existing module-level logger variable
(logger) when logging the vision encoder device in the try/except around
vision_encoder.parameters().device (keep the same try/except and log messages,
just delete the duplicate logging.getLogger call).

In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh`:
- Around line 132-148: The script hardcodes ports (VLLM_NIXL_SIDE_CHANNEL_PORT
values 20097–20099 and KV events endpoints tcp://*:20080–20082) which can
collide with other scripts; update the launch commands in cpu_encoder_for_epd.sh
(the VLLM_NIXL_SIDE_CHANNEL_PORT assignments and the --kv-events-config
endpoints) to obtain ports dynamically or from env vars: call the shared
alloc_port helper to allocate unique ports at runtime (or fallback to
environment variables like VLLM_NIXL_SIDE_CHANNEL_PORT, KV_EVENTS_PORT_*), then
inject those allocated/env ports into the VLLM_NIXL_SIDE_CHANNEL_PORT
assignments and the --kv-events-config JSON strings used by the python -m
dynamo.vllm invocations so the script no longer uses the fixed 20097–20099 and
20080–20082 values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e7361a41-901b-4f11-aaed-a4d5fdf8a6f3

📥 Commits

Reviewing files that changed from the base of the PR and between 310f8ca and dd07f4a.

📒 Files selected for processing (5)
  • components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py
  • components/src/dynamo/vllm/multimodal_utils/encode_utils.py
  • components/src/dynamo/vllm/multimodal_utils/model.py
  • examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh
  • examples/backends/vllm/launch/xpu/disagg_multimodal_epd_xpu.sh

Comment on lines +144 to +148
# Start decode worker
echo "Starting decode worker on GPU $DYN_DECODE_WORKER_GPU (GPU mem: $DYN_DECODE_GPU_MEM)..."
VLLM_NIXL_SIDE_CHANNEL_PORT=20099 \
env $DEVICE_AFFINITY_ENV=$DYN_DECODE_WORKER_GPU \
python -m dynamo.vllm --multimodal-decode-worker --enable-multimodal --enable-mm-embeds --model $MODEL_NAME --gpu-memory-utilization $DYN_DECODE_GPU_MEM $EXTRA_ARGS $PD_EXTRA_ARGS --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device": "'"$DEVICE_PLATFORM"'"}' --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20082"}' &
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how other decode workers are configured in the repo
rg -n --type=sh 'multimodal-decode-worker|disaggregation-mode decode' examples/

Repository: ai-dynamo/dynamo

Length of output: 3720


🏁 Script executed:

rg -n 'multimodal-decode-worker|disaggregation-mode' --type=py dynamo/vllm | head -50

Repository: ai-dynamo/dynamo

Length of output: 111


🏁 Script executed:

find . -type f -name "*.py" -path "*/vllm/*" | head -20

Repository: ai-dynamo/dynamo

Length of output: 1127


🏁 Script executed:

rg -n 'multimodal-decode-worker|disaggregation-mode' ./components/src/dynamo/vllm/

Repository: ai-dynamo/dynamo

Length of output: 2884


🏁 Script executed:

sed -n '20,110p' ./components/src/dynamo/vllm/backend_args.py

Repository: ai-dynamo/dynamo

Length of output: 3683


🏁 Script executed:

sed -n '220,300p' ./components/src/dynamo/vllm/backend_args.py

Repository: ai-dynamo/dynamo

Length of output: 3609


🏁 Script executed:

rg -n 'multimodal_decode_worker' ./components/src/dynamo/vllm/ -A 3 -B 1

Repository: ai-dynamo/dynamo

Length of output: 6465


🏁 Script executed:

sed -n '150,180p' ./components/src/dynamo/vllm/args.py

Repository: ai-dynamo/dynamo

Length of output: 1296


Add --disaggregation-mode decode to the decode worker command.

The decode worker at line 148 uses --multimodal-decode-worker but omits --disaggregation-mode decode. These flags are independent—--multimodal-decode-worker only sets the component type, not the disaggregation mode. Without the explicit flag, disaggregation_mode defaults to AGGREGATED, which conflicts with the intended decode-only behavior. All other decode workers in the repository explicitly specify --disaggregation-mode decode for consistency. Add this flag to match the sibling script disagg_multimodal_epd_xpu.sh:142.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh` around lines 144 -
148, Update the decode worker launch command that starts with
"VLLM_NIXL_SIDE_CHANNEL_PORT=20099 env
$DEVICE_AFFINITY_ENV=$DYN_DECODE_WORKER_GPU python -m dynamo.vllm
--multimodal-decode-worker ..." to include the explicit flag
"--disaggregation-mode decode" so the worker runs in decode-only disaggregation
mode (matching other decode worker scripts); ensure the new flag is placed among
the existing CLI flags (alongside --enable-multimodal, --model $MODEL_NAME,
etc.) so disaggregation_mode is not left at the default AGGREGATED.


Labels

backend::vllm Relates to the vllm backend external-contribution Pull request is from an external contributor feat multimodal size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant