
feat(vllm): Add Per Request Scheduler for dual/multiple encoders EPD #7668

Open

ZhengHongming888 wants to merge 6 commits into ai-dynamo:main from ZhengHongming888:add_per_request_scheduler_for_encoders

Conversation

Contributor

@ZhengHongming888 ZhengHongming888 commented Mar 27, 2026

Overview:

This PR adds a per-request scheduler for dual/multiple encoders in the EPD disaggregation scenario.

The problem solved here:

The default encoder scheduler in EPD with the vLLM backend works per image within one request. For example, given one request with 2 images and 2 encoders (1 for XPU and 1 for CPU), one image goes to the XPU encoder and one to the CPU encoder, and the two results are then merged into one embedding for the KV cache. In this case the XPU encoder must wait for the CPU encoder to finish, which slows down the whole process considerably.

We provide a per-request scheduler instead. For example, with 8 requests routed to the XPU encoder for every 1 request routed to the CPU encoder, their processing times can be similar when run in parallel, which greatly offloads the encoding work with the help of the CPU encoder.

In an XPU/CUDA environment, the CPU can still offload work from the XPU/GPU in this EPD scenario. One example result is shown below.
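The weighted split described above can be sketched as a simple weighted round-robin over encoders. This is an illustrative Python sketch, not the PR's actual implementation; the encoder names ("xpu", "cpu") and the 8:1 ratio are placeholders taken from the example:

```python
from itertools import cycle


def make_weighted_round_robin(encoders, ratios):
    """Yield encoder names in proportion to their weights, e.g. 8:1."""
    # Expand the ratio into a flat schedule, then cycle it forever.
    schedule = [name for name, weight in zip(encoders, ratios) for _ in range(weight)]
    return cycle(schedule)


rr = make_weighted_round_robin(["xpu", "cpu"], [8, 1])
assignments = [next(rr) for _ in range(18)]
# Two full cycles: 8 requests to the XPU encoder for every 1 to the CPU encoder.
```

With this split, whole requests (rather than individual images) are pinned to one encoder, so neither encoder blocks on the other's output.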

Details:

  1. Per-image scheduler:
     ./examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh
     (benchmark result screenshot)
  2. Per-request scheduler:
     ./examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh
     (benchmark result screenshot)

You will see that the per-request scheduler outperforms the per-image scheduler in throughput, TTFT, and TPOT.

Thanks.

Summary by CodeRabbit

  • New Features

    • Added environment variable support for encoder device override and scheduling mode configuration
    • Introduced configurable encoder scheduling with per-request and per-image modes
    • Added disaggregated multimodal serving pipeline launch utilities
    • Introduced benchmark scripts for performance testing with different encoder configurations
  • Improvements

    • Enhanced runtime device verification and logging for better diagnostics
    • Improved robustness of embedding splitting logic with fallback mechanisms

ZhengHongming888 and others added 6 commits March 26, 2026 22:47
Add CPU encoder support for disaggregated multimodal EPD:
- Add device parameter to load_vision_model() for CPU/GPU selection
- Add DYN_ENCODER_DEVICE environment variable
- Fix spatial_merge_size attribute access for HuggingFace models
- Add device verification logging
- Add cpu_encoder.sh launch script
- Fix script path in disagg_multimodal_epd_xpu.sh

Signed-off-by: Hongming Zheng <[email protected]>
Relocated CPU encoder launch script to xpu subdirectory and updated
relative paths to common utilities (../../../../common/).

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Signed-off-by: Hongming Zheng <[email protected]>
Co-authored-by: Mittal Kushal <[email protected]>
Co-authored-by: Robin Zhang <[email protected]>
Signed-off-by: Hongming Zheng <[email protected]>
@ZhengHongming888 ZhengHongming888 requested review from a team as code owners March 27, 2026 20:04

copy-pr-bot bot commented Mar 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Contributor

👋 Hi ZhengHongming888! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added external-contribution Pull request is from an external contributor backend::vllm Relates to the vllm backend multimodal labels Mar 27, 2026
Contributor

coderabbitai bot commented Mar 27, 2026

Walkthrough

The changes implement a disaggregated multimodal encoding pipeline with environment-driven device selection, improved logging, and configurable encoder scheduling. They add support for CPU/accelerator device overrides, runtime device verification, and per-request versus per-image scheduler modes. New launch scripts demonstrate the complete pipeline configuration.

Changes

Cohort / File(s) Summary
Vision encoder device management
components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py, components/src/dynamo/vllm/multimodal_utils/encode_utils.py, components/src/dynamo/vllm/multimodal_utils/model.py
Added environment-driven encoder device override via DYN_ENCODER_DEVICE with fallback to "auto", passed through load_vision_model(..., device=...). Introduced runtime device verification logging (requested device, actual model device, CUDA status). Updated Qwen spatial merge size detection with fallback chain for robustness.
Prefill worker scheduler and embedding accumulation
components/src/dynamo/vllm/multimodal_utils/prefill_worker_utils.py
Implemented configurable encoder scheduling modes (ENCODER_SCHEDULER_MODE: "per_request" or "per_image") and split ratio configuration. Refactored _fetch_from_encode_workers() to support weighted round-robin selection in per-request mode and per-batch routing in per-image mode. Updated multi-modal embedding accumulation to use defaultdict(list) and reworked image merging logic to conditionally assign or concatenate tensors.
Disaggregated dual-encoder pipeline scripts
examples/backends/vllm/launch/dual_encoder_epd.sh, examples/backends/vllm/launch/xpu/dual_encoders_for_epd.sh
New launch scripts orchestrating disaggregated E/P/D pipelines with dual encoders (CPU and platform-device), prefill, and decode workers. Include device affinity configuration, NIXL/ZMQ side-channel setup, scheduler mode/split ratio environment variable passing, and process lifecycle management.
CPU-based encoder and platform-specific launch scripts
examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh, examples/backends/vllm/launch/xpu/disagg_multimodal_epd_xpu.sh, examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh, examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh
Added CPU-based encoder launch script with DYN_ENCODER_DEVICE=cpu configuration, XPU-specific disaggregation script, and benchmark automation scripts for per-image and per-request scheduler modes with configurable split ratios. Updated relative import paths in disaggregation script.
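The DYN_ENCODER_DEVICE override with an "auto" fallback described above follows a common env-var-with-default pattern. The sketch below is illustrative only; the helper name `resolve_encoder_device` is hypothetical and omits the actual `load_vision_model(..., device=...)` plumbing:

```python
import os


def resolve_encoder_device(default: str = "auto") -> str:
    """Read the encoder device from DYN_ENCODER_DEVICE, falling back to a default."""
    # An unset or empty variable falls through to the default.
    device = os.environ.get("DYN_ENCODER_DEVICE", "").strip().lower()
    return device if device else default


os.environ["DYN_ENCODER_DEVICE"] = "cpu"
print(resolve_encoder_device())  # cpu
del os.environ["DYN_ENCODER_DEVICE"]
print(resolve_encoder_device())  # auto
```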

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: adding a per-request scheduler for dual/multiple encoders in EPD disaggregation, which is the primary focus of the PR across code changes and new launch scripts.
Description check ✅ Passed The description covers the Overview section explaining the problem and solution, provides Details with benchmark examples, and includes Related Issues section. However, it lacks the 'Where should the reviewer start?' section which is part of the template.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (13)
components/src/dynamo/vllm/multimodal_utils/encode_utils.py (1)

109-117: Remove redundant logger declaration.

Line 110 re-declares logger = logging.getLogger(__name__) which shadows the module-level logger already defined at line 25. Since getLogger(__name__) returns the same logger instance, this line is unnecessary.

♻️ Proposed fix
     with torch.no_grad():
         # Log encoder device during inference
-        logger = logging.getLogger(__name__)
         try:
             encoder_device = next(vision_encoder.parameters()).device
             logger.info(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/multimodal_utils/encode_utils.py` around lines 109
- 117, Remove the redundant local logger declaration inside the encode block:
delete the line "logger = logging.getLogger(__name__)" so the module-level
logger (defined earlier) is used; keep the try/except that accesses
vision_encoder.parameters() and the logger.info calls unchanged, referencing the
existing module-level logger and the vision_encoder symbol in encode_utils.py.
examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh (3)

15-15: Hard-coded /workspace path reduces portability.

Consider using relative paths via $SCRIPT_DIR for consistency with other launch scripts.

Based on learnings: "Flag hard-coded portability-reducing constants in shell/scripts across the repository."

Also applies to: 40-41

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh`
at line 15, The OUTPUT_DIR assignment uses a hard-coded /workspace path which
reduces portability; change the OUTPUT_DIR expression in this script
(referencing the OUTPUT_DIR variable) to build the path relative to the script
directory (use the existing SCRIPT_DIR variable like other launch scripts) and
update the other occurrences mentioned (around the other OUTPUT_DIR uses at
lines ~40-41) so all output paths are derived from SCRIPT_DIR instead of
/workspace, preserving the timestamp suffix behavior.

7-7: Add EXIT trap for cleanup consistency.

Same as the per-request benchmark script, this lacks cleanup on exit.

♻️ Proposed fix
 set -e
+trap 'echo Cleaning up...; kill 0' EXIT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh`
at line 7, The script currently only has "set -e" and lacks an EXIT trap for
consistent cleanup; add a cleanup function (e.g., cleanup) that undoes temporary
state and kills any background processes started by
run_per_image_encode_schedule_benchmark.sh, register it with trap to run on EXIT
(trap 'cleanup' EXIT), and ensure the cleanup function is defined before
starting any background work so it reliably runs on any exit path.

1-90: Consider extracting shared logic to reduce duplication.

This script shares significant structure with run_per_request_encode_schedule_benchmark.sh. Consider extracting the common benchmark logic into a shared helper script that accepts scheduler mode and split ratio as parameters.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh`
around lines 1 - 90, The two scripts duplicate startup, readiness, wait/warmup
and benchmarking logic—extract that common flow into a shared helper (e.g.,
launch_benchmark.sh) and call it from run_per_image_encode_schedule_benchmark.sh
and run_per_request_encode_schedule_benchmark.sh, passing parameters like
DYN_ENCODER_SCHEDULER (per_image / per_request), DEVICE_PLATFORM, PORT, MODEL,
DATASET, NUM_PROMPTS, SEED and optional split ratio; move the server start
sequence (export DYN_ENCODER_SCHEDULER..., bash dual_encoders_for_epd.sh,
capturing SERVER_PID), the readiness loop (curl health check), the warmup/sleep
handling, and the vllm bench serve invocation into reusable functions inside the
helper so the two scripts only set args and delegate to functions such as
start_server_and_wait, run_benchmark, and stop_server.
examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh (3)

16-16: Hard-coded /workspace path reduces portability.

The script uses absolute paths (/workspace/...) which ties it to a specific environment. Consider using $SCRIPT_DIR relative paths like other scripts in this directory.

♻️ Proposed fix
+SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
+
-OUTPUT_DIR="/workspace/benchmark_results_per_request_8_1_$(date +%Y%m%d_%H%M%S)"
+OUTPUT_DIR="${BENCHMARK_OUTPUT_DIR:-$PWD}/benchmark_results_per_request_8_1_$(date +%Y%m%d_%H%M%S)"
 mkdir -p "$OUTPUT_DIR"

And for the launch script path:

-bash /workspace/examples/backends/vllm/launch/xpu/dual_encoders_for_epd.sh \
+bash "$SCRIPT_DIR/dual_encoders_for_epd.sh" \
     --model "$MODEL" > "$OUTPUT_DIR/server.log" 2>&1 &

Based on learnings: "Flag hard-coded portability-reducing constants in shell/scripts across the repository."

Also applies to: 43-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh`
at line 16, The OUTPUT_DIR variable is hard-coded to /workspace which reduces
portability; change its assignment to build the path relative to the script
directory (use the existing SCRIPT_DIR variable or compute SCRIPT_DIR="$(cd
"$(dirname "${BASH_SOURCE[0]}")" && pwd)") and then set
OUTPUT_DIR="${SCRIPT_DIR}/benchmark_results_per_request_8_1_$(date
+%Y%m%d_%H%M%S)"; also update any other hard-coded launch script path
occurrences (the launch path referenced later in the script) to use SCRIPT_DIR
similarly so all paths are relative to the script location.

7-7: Add EXIT trap for cleanup consistency.

The script starts background processes but lacks an EXIT trap to clean them up on failure or interruption. This differs from other launch scripts in the repo that use trap 'echo Cleaning up...; kill 0' EXIT.

♻️ Proposed fix
 set -e
+trap 'echo Cleaning up...; kill 0' EXIT
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh`
at line 7, Add an EXIT trap to ensure background processes are cleaned up on
failure/interrupt; in the run_per_request_encode_schedule_benchmark.sh script,
immediately after the existing set -e line register a trap like the repo's
pattern (e.g., trap 'echo Cleaning up...; kill 0' EXIT) so any spawned
background jobs are killed when the script exits; ensure the trap runs before
any background processes are started so it covers failures during startup and
normal execution.

77-77: Consider reducing or parameterizing the 100-second sleep.

The sleep 100 is a lengthy fixed delay. Consider making this configurable via an environment variable or reducing it based on actual server readiness signals.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh`
at line 77, The fixed 100-second delay in
run_per_request_encode_schedule_benchmark.sh (the line with "sleep 100") should
be made configurable and/or reduced; replace the hardcoded sleep with a
configurable environment variable (e.g., SLEEP_SECONDS) that defaults to a
smaller value and/or implement a readiness check loop that polls server
readiness (health endpoint or port) before proceeding, updating references to
"sleep 100" accordingly in the script.
components/src/dynamo/vllm/multimodal_utils/prefill_worker_utils.py (3)

46-60: Avoid catching blind Exception; catch specific error types.

The except Exception is overly broad. The parsing logic can only raise ValueError (from invalid format or int conversion) and IndexError (from split). Catching these specifically provides clearer error handling.

♻️ Proposed fix
 def _parse_split_ratio(ratio_str: str) -> tuple[int, int]:
     """Parse split ratio string like '7:3' into tuple (7, 3)."""
     try:
         parts = ratio_str.split(":")
         if len(parts) != 2:
             raise ValueError(f"Invalid format: {ratio_str}")
         ratio1, ratio2 = int(parts[0]), int(parts[1])
         if ratio1 <= 0 or ratio2 <= 0:
             raise ValueError(f"Ratios must be positive: {ratio_str}")
         return (ratio1, ratio2)
-    except Exception as e:
+    except (ValueError, IndexError) as e:
         logger.warning(
             f"Failed to parse DYN_ENCODER_SPLIT_RATIO='{ratio_str}': {e}. Using default 1:1"
         )
         return (1, 1)

As per coding guidelines: "fail fast—avoid broad/bare exception swallowing (except:, except Exception: pass)".
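Assembled into a self-contained sketch (function name and log wording approximate; the real helper lives in prefill_worker_utils.py), the narrowed exception handler looks like this. Note that in practice `str.split` cannot raise `IndexError` here and `int()` raises `ValueError`, so `ValueError` alone would suffice; both are caught to match the suggestion:

```python
import logging

logger = logging.getLogger(__name__)


def parse_split_ratio(ratio_str: str) -> tuple[int, int]:
    """Parse a split ratio string like '7:3' into (7, 3); fall back to (1, 1)."""
    try:
        parts = ratio_str.split(":")
        if len(parts) != 2:
            raise ValueError(f"Invalid format: {ratio_str}")
        ratio1, ratio2 = int(parts[0]), int(parts[1])
        if ratio1 <= 0 or ratio2 <= 0:
            raise ValueError(f"Ratios must be positive: {ratio_str}")
        return (ratio1, ratio2)
    except (ValueError, IndexError) as e:
        # Only the expected parsing errors are swallowed; anything else propagates.
        logger.warning(
            "Failed to parse split ratio %r: %s. Using default 1:1", ratio_str, e
        )
        return (1, 1)
```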

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/multimodal_utils/prefill_worker_utils.py` around
lines 46 - 60, The _parse_split_ratio function currently uses a broad except
Exception which hides other issues; update its exception handler to catch only
the expected parsing errors (e.g., ValueError and IndexError) so that only
invalid-format/int-conversion/split errors are handled and logged, preserving
the existing logger.warning and default return (1,1) behavior; reference
_parse_split_ratio to locate the try/except block and replace except Exception
as e with except (ValueError, IndexError) as e.

128-151: Use truthiness checks instead of == [] for emptiness.

Comparing multi_modal_data["image"] == [] is less idiomatic than checking emptiness directly. However, note that defaultdict(list) returns [] for missing keys, so the current approach works correctly.

♻️ Proposed fix for readability
-        if multi_modal_data["image"] == []:
+        if not multi_modal_data["image"]:
             multi_modal_data["image"] = mm_data["image"]

Apply similarly to line 146.
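The truthiness pattern with `defaultdict(list)` can be illustrated as below; plain lists stand in for the tensors that the real code conditionally assigns or concatenates, and `merge_images` is a hypothetical name for the merging step:

```python
from collections import defaultdict

multi_modal_data = defaultdict(list)  # missing keys read back as []


def merge_images(multi_modal_data, mm_data):
    """Assign on the first batch, extend on later ones, using truthiness for emptiness."""
    if not multi_modal_data["image"]:  # idiomatic: an empty list is falsy
        multi_modal_data["image"] = mm_data["image"]
    else:
        multi_modal_data["image"].extend(mm_data["image"])


merge_images(multi_modal_data, {"image": ["emb_a"]})
merge_images(multi_modal_data, {"image": ["emb_b"]})
# multi_modal_data["image"] now holds ["emb_a", "emb_b"]
```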

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/multimodal_utils/prefill_worker_utils.py` around
lines 128 - 151, Replace the explicit list equality checks with truthiness
checks to test emptiness: change both occurrences of `multi_modal_data["image"]
== []` to `if not multi_modal_data["image"]:` (and the corresponding else branch
to `if multi_modal_data["image"]:` if needed) so the code in
prefill_worker_utils.py that manipulates `multi_modal_data["image"]` and
`mm_data["image"]` uses idiomatic emptiness checks for lists/tensors in the
`multi_modal_data` handling logic.

65-66: Module-level mutable state is acceptable here but add a note about concurrency.

The comment on line 65 correctly notes GIL safety for a single event loop. Consider adding a brief note that this counter is not process-safe if multiple prefill workers run in the same process.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/vllm/multimodal_utils/prefill_worker_utils.py` around
lines 65 - 66, The module-level counter _encoder_round_robin_index is currently
documented as GIL/thread-safe for a single event loop but lacks a note that it
is not safe across multiple processes; update the comment above
_encoder_round_robin_index to explicitly state it is not process-safe if
multiple prefill workers run in the same process or across processes and suggest
using a process-safe primitive (e.g., multiprocessing.Value/Manager or an
external coordinator) or redesigning scheduling if multi-process concurrency is
required.
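If the counter ever did need to be shared across processes, one option the suggested comment could point to is `multiprocessing.Value`. This is a hedged sketch of that alternative, not code from the PR; the class name is illustrative:

```python
from multiprocessing import Value


class SharedRoundRobin:
    """Process-safe round-robin counter over a fixed number of encoders."""

    def __init__(self, num_encoders: int):
        self._index = Value("i", 0)  # shared integer with a built-in lock
        self._n = num_encoders

    def next_index(self) -> int:
        # The lock makes read-increment-wrap atomic across processes.
        with self._index.get_lock():
            i = self._index.value
            self._index.value = (i + 1) % self._n
            return i


rr = SharedRoundRobin(2)
picks = [rr.next_index() for _ in range(4)]  # [0, 1, 0, 1]
```

For the single-event-loop case described in the comment, the existing module-level integer remains simpler and adequate.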
examples/backends/vllm/launch/dual_encoder_epd.sh (1)

99-125: Hard-coded NIXL ports may cause conflicts.

The VLLM_NIXL_SIDE_CHANNEL_PORT values (20097, 20100) and ZMQ endpoint ports (20080, 20081, 20082, 20083) are hard-coded. In multi-instance or shared-node scenarios, these could conflict. Consider using alloc_port from launch_utils.sh or making these configurable via environment variables.

Based on learnings: "Flag hard-coded portability-reducing constants in shell/scripts across the repository... use alloc_port for dynamic ports."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/backends/vllm/launch/dual_encoder_epd.sh` around lines 99 - 125,
Replace hard-coded ports (VLLM_NIXL_SIDE_CHANNEL_PORT values 20097/20100 and the
ZMQ endpoints tcp://*:20080..20083) with dynamically allocated ports using
alloc_port from launch_utils.sh or by reading port env vars; update the two
launch blocks that set VLLM_NIXL_SIDE_CHANNEL_PORT and the --kv-events-config
endpoints to call alloc_port (or use provided env vars) and export those port
values so both the VLLM_NIXL_SIDE_CHANNEL_PORT variable and the
--kv-events-config JSON strings use the allocated ports, ensuring the script
uses alloc_port where available and falls back to an env-provided port if set.
examples/backends/vllm/launch/xpu/dual_encoders_for_epd.sh (1)

127-127: Fixed sleep for encoder registration may be unreliable.

The sleep 5 assumes encoders register within 5 seconds, which may be insufficient on slower systems or under memory pressure, or excessive on fast systems. Consider using a readiness probe or health check instead of a fixed delay.

For example, poll for encoder availability before starting the prefill worker:

# Wait for encoders to be ready (up to 30s)
for i in {1..30}; do
    if check_encoder_ready; then break; fi
    sleep 1
done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/backends/vllm/launch/xpu/dual_encoders_for_epd.sh` at line 127,
Replace the fixed "sleep 5" pause with a readiness polling loop that waits for
encoder registration before starting the prefill worker: implement a helper
(e.g., check_encoder_ready) that probes each encoder's health/registration
endpoint or socket, poll it every 1s for up to 30s (break when all encoders
report ready), and fail/exit with a clear error if the timeout is reached;
update the startup sequence that currently relies on "sleep 5" to call this
polling helper and only proceed to start the prefill worker when readiness is
confirmed.
examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh (1)

132-148: Consider making NIXL/ZMQ ports configurable for multi-instance scenarios.

The script uses hard-coded ports (20097-20099 for NIXL side channels, 20080-20082 for ZMQ endpoints). While acceptable for example/benchmark scripts, this prevents running multiple instances on the same node.

For future flexibility, consider exposing these as environment variables with defaults:

NIXL_PORT_ENCODE=${NIXL_PORT_ENCODE:-20097}
NIXL_PORT_PREFILL=${NIXL_PORT_PREFILL:-20098}
NIXL_PORT_DECODE=${NIXL_PORT_DECODE:-20099}

Based on learnings: "Flag hard-coded portability-reducing constants in shell/scripts across the repository (e.g., fixed memory sizes, fixed memory fractions, static ports...)"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh` around lines 132 -
148, Replace the hard-coded NIXL and ZMQ ports with configurable environment
variables and defaults: introduce variables like NIXL_PORT_ENCODE,
NIXL_PORT_PREFILL, NIXL_PORT_DECODE and ZMQ_PORT_ENCODE, ZMQ_PORT_PREFILL,
ZMQ_PORT_DECODE (with sensible default values matching current ports) and use
those variables wherever VLLM_NIXL_SIDE_CHANNEL_PORT and the hard-coded
tcp://*:2008X endpoints are set in the script (affecting the blocks that set
VLLM_NIXL_SIDE_CHANNEL_PORT and the --kv-events-config strings in the
encode/prefill/decode launches); ensure the existing variable names
(VLLM_NIXL_SIDE_CHANNEL_PORT, --kv-events-config) are preserved but sourced from
the new env vars so multiple instances can run without port collisions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh`:
- Around line 122-125: PD_EXTRA_ARGS is being overwritten when DEVICE_PLATFORM
== "xpu", which discards previously set single-GPU flags; instead append the XPU
option to preserve earlier settings by updating the branch that checks
DEVICE_PLATFORM to add to PD_EXTRA_ARGS (e.g., PD_EXTRA_ARGS="$PD_EXTRA_ARGS
--max-model-len 10240") rather than replacing it, and ensure PD_EXTRA_ARGS is
initialized before use so the append works when --single-gpu flags (KV cache
bytes, limit-mm-per-prompt, max-model-len 4096) are present.

---

Nitpick comments:
In `@components/src/dynamo/vllm/multimodal_utils/encode_utils.py`:
- Around line 109-117: Remove the redundant local logger declaration inside the
encode block: delete the line "logger = logging.getLogger(__name__)" so the
module-level logger (defined earlier) is used; keep the try/except that accesses
vision_encoder.parameters() and the logger.info calls unchanged, referencing the
existing module-level logger and the vision_encoder symbol in encode_utils.py.

In `@components/src/dynamo/vllm/multimodal_utils/prefill_worker_utils.py`:
- Around line 46-60: The _parse_split_ratio function currently uses a broad
except Exception which hides other issues; update its exception handler to catch
only the expected parsing errors (e.g., ValueError and IndexError) so that only
invalid-format/int-conversion/split errors are handled and logged, preserving
the existing logger.warning and default return (1,1) behavior; reference
_parse_split_ratio to locate the try/except block and replace except Exception
as e with except (ValueError, IndexError) as e.
- Around line 128-151: Replace the explicit list equality checks with truthiness
checks to test emptiness: change both occurrences of `multi_modal_data["image"]
== []` to `if not multi_modal_data["image"]:` (and the corresponding else branch
to `if multi_modal_data["image"]:` if needed) so the code in
prefill_worker_utils.py that manipulates `multi_modal_data["image"]` and
`mm_data["image"]` uses idiomatic emptiness checks for lists/tensors in the
`multi_modal_data` handling logic.
- Around line 65-66: The module-level counter _encoder_round_robin_index is
currently documented as GIL/thread-safe for a single event loop but lacks a note
that it is not safe across multiple processes; update the comment above
_encoder_round_robin_index to explicitly state it is not process-safe if
multiple prefill workers run in the same process or across processes and suggest
using a process-safe primitive (e.g., multiprocessing.Value/Manager or an
external coordinator) or redesigning scheduling if multi-process concurrency is
required.

In `@examples/backends/vllm/launch/dual_encoder_epd.sh`:
- Around line 99-125: Replace hard-coded ports (VLLM_NIXL_SIDE_CHANNEL_PORT
values 20097/20100 and the ZMQ endpoints tcp://*:20080..20083) with dynamically
allocated ports using alloc_port from launch_utils.sh or by reading port env
vars; update the two launch blocks that set VLLM_NIXL_SIDE_CHANNEL_PORT and the
--kv-events-config endpoints to call alloc_port (or use provided env vars) and
export those port values so both the VLLM_NIXL_SIDE_CHANNEL_PORT variable and
the --kv-events-config JSON strings use the allocated ports, ensuring the script
uses alloc_port where available and falls back to an env-provided port if set.

In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh`:
- Around line 132-148: Replace the hard-coded NIXL and ZMQ ports with
configurable environment variables and defaults: introduce variables like
NIXL_PORT_ENCODE, NIXL_PORT_PREFILL, NIXL_PORT_DECODE and ZMQ_PORT_ENCODE,
ZMQ_PORT_PREFILL, ZMQ_PORT_DECODE (with sensible default values matching current
ports) and use those variables wherever VLLM_NIXL_SIDE_CHANNEL_PORT and the
hard-coded tcp://*:2008X endpoints are set in the script (affecting the blocks
that set VLLM_NIXL_SIDE_CHANNEL_PORT and the --kv-events-config strings in the
encode/prefill/decode launches); ensure the existing variable names
(VLLM_NIXL_SIDE_CHANNEL_PORT, --kv-events-config) are preserved but sourced from
the new env vars so multiple instances can run without port collisions.

In `@examples/backends/vllm/launch/xpu/dual_encoders_for_epd.sh`:
- Line 127: Replace the fixed "sleep 5" pause with a readiness polling loop that
waits for encoder registration before starting the prefill worker: implement a
helper (e.g., check_encoder_ready) that probes each encoder's
health/registration endpoint or socket, poll it every 1s for up to 30s (break
when all encoders report ready), and fail/exit with a clear error if the timeout
is reached; update the startup sequence that currently relies on "sleep 5" to
call this polling helper and only proceed to start the prefill worker when
readiness is confirmed.

In
`@examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh`:
- Line 15: The OUTPUT_DIR assignment uses a hard-coded /workspace path which
reduces portability; change the OUTPUT_DIR expression in this script
(referencing the OUTPUT_DIR variable) to build the path relative to the script
directory (use the existing SCRIPT_DIR variable like other launch scripts) and
update the other occurrences mentioned (around the other OUTPUT_DIR uses at
lines ~40-41) so all output paths are derived from SCRIPT_DIR instead of
/workspace, preserving the timestamp suffix behavior.
- Line 7: The script currently only has "set -e" and lacks an EXIT trap for
consistent cleanup; add a cleanup function (e.g., cleanup) that undoes temporary
state and kills any background processes started by
run_per_image_encode_schedule_benchmark.sh, register it with trap to run on EXIT
(trap 'cleanup' EXIT), and ensure the cleanup function is defined before
starting any background work so it reliably runs on any exit path.
- Around line 1-90: The two scripts duplicate startup, readiness, wait/warmup
and benchmarking logic—extract that common flow into a shared helper (e.g.,
launch_benchmark.sh) and call it from run_per_image_encode_schedule_benchmark.sh
and run_per_request_encode_schedule_benchmark.sh, passing parameters like
DYN_ENCODER_SCHEDULER (per_image / per_request), DEVICE_PLATFORM, PORT, MODEL,
DATASET, NUM_PROMPTS, SEED and optional split ratio; move the server start
sequence (export DYN_ENCODER_SCHEDULER..., bash dual_encoders_for_epd.sh,
capturing SERVER_PID), the readiness loop (curl health check), the warmup/sleep
handling, and the vllm bench serve invocation into reusable functions inside the
helper so the two scripts only set args and delegate to functions such as
start_server_and_wait, run_benchmark, and stop_server.

In
`@examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh`:
- Line 16: The OUTPUT_DIR variable is hard-coded to /workspace which reduces
portability; change its assignment to build the path relative to the script
directory (use the existing SCRIPT_DIR variable or compute SCRIPT_DIR="$(cd
"$(dirname "${BASH_SOURCE[0]}")" && pwd)") and then set
OUTPUT_DIR="${SCRIPT_DIR}/benchmark_results_per_request_8_1_$(date
+%Y%m%d_%H%M%S)"; also update any other hard-coded launch script path
occurrences (the launch path referenced later in the script) to use SCRIPT_DIR
similarly so all paths are relative to the script location.
- Line 7: Add an EXIT trap to ensure background processes are cleaned up on
failure or interrupt. In run_per_request_encode_schedule_benchmark.sh,
immediately after the existing set -e line, register a trap following the repo's
pattern (e.g., trap 'echo Cleaning up...; kill 0' EXIT) so any spawned
background jobs are killed when the script exits. Register the trap before
any background processes are started so it covers failures during startup as
well as normal execution.
- Line 77: The fixed 100-second delay in
run_per_request_encode_schedule_benchmark.sh (the line with "sleep 100") should
be made configurable and/or reduced. Replace the hardcoded sleep with a
configurable environment variable (e.g., SLEEP_SECONDS) that defaults to a
smaller value, and/or implement a readiness loop that polls the server (health
endpoint or port) before proceeding, updating the "sleep 100" references in
the script accordingly.
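One way to realize this, as a sketch: the variable names, defaults, and health URL below are assumptions, not values from the PR.

```shell
# Configurable delay plus a readiness poll, replacing the fixed `sleep 100`.
SLEEP_SECONDS="${SLEEP_SECONDS:-5}"                        # poll interval
MAX_WAIT_SECONDS="${MAX_WAIT_SECONDS:-120}"                # give-up budget
HEALTH_URL="${HEALTH_URL:-http://localhost:8000/health}"   # assumed endpoint

wait_for_server() {
    local elapsed=0
    until curl -sf "$HEALTH_URL" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$MAX_WAIT_SECONDS" ]; then
            echo "server not ready after ${MAX_WAIT_SECONDS}s" >&2
            return 1
        fi
        sleep "$SLEEP_SECONDS"
        elapsed=$((elapsed + SLEEP_SECONDS))
    done
}
```

The script then calls `wait_for_server` where `sleep 100` used to be, and CI or local runs can tune the interval via the environment.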


📥 Commits

Reviewing files that changed from the base of the PR and between 310f8ca and 3766d5f.

📒 Files selected for processing (10)
  • components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py
  • components/src/dynamo/vllm/multimodal_utils/encode_utils.py
  • components/src/dynamo/vllm/multimodal_utils/model.py
  • components/src/dynamo/vllm/multimodal_utils/prefill_worker_utils.py
  • examples/backends/vllm/launch/dual_encoder_epd.sh
  • examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh
  • examples/backends/vllm/launch/xpu/disagg_multimodal_epd_xpu.sh
  • examples/backends/vllm/launch/xpu/dual_encoders_for_epd.sh
  • examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh
  • examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh

Comment on lines +122 to +125
if [[ "${DEVICE_PLATFORM,,}" == "xpu" ]]; then
EXTRA_ARGS="$EXTRA_ARGS --block-size 64"
PD_EXTRA_ARGS="--max-model-len 10240"
fi
Contributor


⚠️ Potential issue | 🟡 Minor

PD_EXTRA_ARGS is overwritten instead of appended for XPU mode.

When DEVICE_PLATFORM is xpu, line 124 overwrites PD_EXTRA_ARGS entirely. If --single-gpu mode is combined with XPU, the single-GPU settings (KV cache bytes, limit-mm-per-prompt, max-model-len 4096) from line 119 will be lost.

🐛 Proposed fix
 if [[ "${DEVICE_PLATFORM,,}" == "xpu" ]]; then
     EXTRA_ARGS="$EXTRA_ARGS --block-size 64"
-    PD_EXTRA_ARGS="--max-model-len 10240"
+    PD_EXTRA_ARGS="$PD_EXTRA_ARGS --max-model-len 10240"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh` around lines 122 -
125, PD_EXTRA_ARGS is overwritten when DEVICE_PLATFORM == "xpu", which
discards previously set single-GPU flags. Instead, append the XPU option so
earlier settings are preserved: update the DEVICE_PLATFORM branch to add to
PD_EXTRA_ARGS (e.g., PD_EXTRA_ARGS="$PD_EXTRA_ARGS --max-model-len 10240")
rather than replacing it, and ensure PD_EXTRA_ARGS is initialized before use so
the append works when the --single-gpu flags (KV cache bytes,
limit-mm-per-prompt, max-model-len 4096) are present.
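The difference between overwriting and appending can be seen in isolation; the single-GPU flag value here is illustrative, not the script's exact setting, and with argparse-style parsers a repeated flag typically resolves to the last occurrence:

```shell
# Overwriting discards the earlier single-GPU setting entirely:
PD_EXTRA_ARGS="--max-model-len 4096"      # set earlier (illustrative value)
PD_EXTRA_ARGS="--max-model-len 10240"     # XPU branch: previous value is gone
echo "overwrite: $PD_EXTRA_ARGS"          # -> overwrite: --max-model-len 10240

# Appending preserves the earlier flags alongside the XPU one:
PD_EXTRA_ARGS="--max-model-len 4096"
PD_EXTRA_ARGS="$PD_EXTRA_ARGS --max-model-len 10240"
echo "append:    $PD_EXTRA_ARGS"
```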


Labels

backend::vllm Relates to the vllm backend external-contribution Pull request is from an external contributor feat multimodal size/XL
