feat(vllm): Add Per Request Scheduler for dual/multiple encoders EPD#7668
ZhengHongming888 wants to merge 6 commits into ai-dynamo:main from
Conversation
Add CPU encoder support for disaggregated multimodal EPD:
- Add device parameter to load_vision_model() for CPU/GPU selection
- Add DYN_ENCODER_DEVICE environment variable
- Fix spatial_merge_size attribute access for HuggingFace models
- Add device verification logging
- Add cpu_encoder.sh launch script
- Fix script path in disagg_multimodal_epd_xpu.sh

Signed-off-by: Hongming Zheng <[email protected]>
Relocated CPU encoder launch script to xpu subdirectory and updated relative paths to common utilities (../../../../common/). Co-Authored-By: Claude Sonnet 4.5 <[email protected]> Signed-off-by: Hongming Zheng <[email protected]>
Co-authored-by: Mittal Kushal <[email protected]> Co-authored-by: Robin Zhang <[email protected]> Signed-off-by: Hongming Zheng <[email protected]>
Walkthrough

The changes implement a disaggregated multimodal encoding pipeline with environment-driven device selection, improved logging, and configurable encoder scheduling. They add support for CPU/accelerator device overrides, runtime device verification, and per-request versus per-image scheduler modes. New launch scripts demonstrate the complete pipeline configuration.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🧹 Nitpick comments (13)
components/src/dynamo/vllm/multimodal_utils/encode_utils.py (1)
109-117: Remove redundant logger declaration.

Line 110 re-declares `logger = logging.getLogger(__name__)`, which shadows the module-level logger already defined at line 25. Since `getLogger(__name__)` returns the same logger instance, this line is unnecessary.

♻️ Proposed fix
```diff
     with torch.no_grad():
         # Log encoder device during inference
-        logger = logging.getLogger(__name__)
         try:
             encoder_device = next(vision_encoder.parameters()).device
             logger.info(
```
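The claim that `getLogger(__name__)` returns the same instance can be checked directly; a minimal sketch (the logger name here is illustrative, not the module's actual name):

```python
import logging

# getLogger caches loggers by name, so repeated calls with the same
# name return the identical object. Re-declaring it inside a function
# only shadows the module-level reference; it never creates a new logger.
a = logging.getLogger("dynamo.vllm.encode_utils")
b = logging.getLogger("dynamo.vllm.encode_utils")
print(a is b)  # True
```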
examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh (3)
15-15: Hard-coded `/workspace` path reduces portability.

Consider using `$SCRIPT_DIR`-relative paths for consistency with other launch scripts.

Based on learnings: "Flag hard-coded portability-reducing constants in shell/scripts across the repository."
Also applies to: 40-41
7-7: Add EXIT trap for cleanup consistency.

As with the per-request benchmark script, this one lacks cleanup on exit.
♻️ Proposed fix
```diff
 set -e
+trap 'echo Cleaning up...; kill 0' EXIT
```
1-90: Consider extracting shared logic to reduce duplication.

This script shares significant structure with run_per_request_encode_schedule_benchmark.sh. Consider extracting the common benchmark logic into a shared helper script that accepts the scheduler mode and split ratio as parameters.
examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh (3)
16-16: Hard-coded `/workspace` path reduces portability.

The script uses absolute paths (`/workspace/...`) that tie it to a specific environment. Consider using `$SCRIPT_DIR`-relative paths like other scripts in this directory.

♻️ Proposed fix
```diff
+SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
+
-OUTPUT_DIR="/workspace/benchmark_results_per_request_8_1_$(date +%Y%m%d_%H%M%S)"
+OUTPUT_DIR="${BENCHMARK_OUTPUT_DIR:-$PWD}/benchmark_results_per_request_8_1_$(date +%Y%m%d_%H%M%S)"
 mkdir -p "$OUTPUT_DIR"
```

And for the launch script path:

```diff
-bash /workspace/examples/backends/vllm/launch/xpu/dual_encoders_for_epd.sh \
+bash "$SCRIPT_DIR/dual_encoders_for_epd.sh" \
     --model "$MODEL" > "$OUTPUT_DIR/server.log" 2>&1 &
```
Also applies to: 43-44
7-7: Add EXIT trap for cleanup consistency.

The script starts background processes but lacks an EXIT trap to clean them up on failure or interruption. This differs from other launch scripts in the repo, which use `trap 'echo Cleaning up...; kill 0' EXIT`.

♻️ Proposed fix
```diff
 set -e
+trap 'echo Cleaning up...; kill 0' EXIT
```
77-77: Consider reducing or parameterizing the 100-second sleep.

The `sleep 100` is a lengthy fixed delay. Consider making it configurable via an environment variable, or replacing it with an actual server-readiness check.
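One way to do both, sketched here with a hypothetical `SLEEP_SECONDS` variable and health URL (neither name comes from this PR):

```shell
#!/usr/bin/env bash
# Sketch: replace a fixed sleep with a bounded readiness poll.
# SLEEP_SECONDS and HEALTH_URL are illustrative assumptions, not
# values defined by the benchmark scripts in this PR.
SLEEP_SECONDS="${SLEEP_SECONDS:-5}"
HEALTH_URL="${HEALTH_URL:-http://localhost:8000/health}"

ready=0
for _ in $(seq 1 "$SLEEP_SECONDS"); do
  # Poll once per second instead of sleeping blindly for the full budget.
  if curl -sf "$HEALTH_URL" > /dev/null 2>&1; then
    ready=1
    break
  fi
  sleep 1
done
echo "ready=$ready"
```

With a live server this exits as soon as the health endpoint responds; without one it gives up after `SLEEP_SECONDS` seconds instead of hanging.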
components/src/dynamo/vllm/multimodal_utils/prefill_worker_utils.py (3)
46-60: Avoid catching blind `Exception`; catch specific error types.

The `except Exception` is overly broad. The parsing logic can only raise `ValueError` (from an invalid format or int conversion) and `IndexError` (from split). Catching these specifically provides clearer error handling.

♻️ Proposed fix
```diff
 def _parse_split_ratio(ratio_str: str) -> tuple[int, int]:
     """Parse split ratio string like '7:3' into tuple (7, 3)."""
     try:
         parts = ratio_str.split(":")
         if len(parts) != 2:
             raise ValueError(f"Invalid format: {ratio_str}")
         ratio1, ratio2 = int(parts[0]), int(parts[1])
         if ratio1 <= 0 or ratio2 <= 0:
             raise ValueError(f"Ratios must be positive: {ratio_str}")
         return (ratio1, ratio2)
-    except Exception as e:
+    except (ValueError, IndexError) as e:
         logger.warning(
             f"Failed to parse DYN_ENCODER_SPLIT_RATIO='{ratio_str}': {e}. Using default 1:1"
         )
         return (1, 1)
```
except:,except Exception: pass)".🤖 Prompt for AI Agents
128-151: Use truthiness checks instead of `== []` for emptiness.

Comparing `multi_modal_data["image"] == []` is less idiomatic than checking emptiness directly. Note, however, that `defaultdict(list)` returns `[]` for missing keys, so the current approach works correctly.

♻️ Proposed fix for readability
```diff
-        if multi_modal_data["image"] == []:
+        if not multi_modal_data["image"]:
             multi_modal_data["image"] = mm_data["image"]
```

Apply similarly to line 146.
65-66: Module-level mutable state is acceptable here, but add a note about concurrency.

The comment on line 65 correctly notes GIL safety for a single event loop. Consider adding a brief note that this counter is not safe if multiple prefill workers run as separate processes.
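If the index ever did need to be shared across worker processes, a process-safe counter could look like this sketch using `multiprocessing.Value` (an illustration of the alternative, not what the PR implements):

```python
from multiprocessing import Value

# A shared integer with an internal lock; get_lock() serializes
# increments across processes, unlike a plain module-level int.
counter = Value("i", 0)

def next_index(num_encoders: int) -> int:
    """Return the next round-robin encoder index, process-safely."""
    with counter.get_lock():
        idx = counter.value % num_encoders
        counter.value += 1
        return idx

print([next_index(2) for _ in range(5)])  # [0, 1, 0, 1, 0]
```

The trade-off is the cross-process lock on every scheduling decision, which is why a plain module-level counter is reasonable while there is only one event loop per process.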
examples/backends/vllm/launch/dual_encoder_epd.sh (1)
99-125: Hard-coded NIXL ports may cause conflicts.

The VLLM_NIXL_SIDE_CHANNEL_PORT values (20097, 20100) and ZMQ endpoint ports (20080-20083) are hard-coded. In multi-instance or shared-node scenarios, these could conflict. Consider using `alloc_port` from `launch_utils.sh` or making them configurable via environment variables.

Based on learnings: "Flag hard-coded portability-reducing constants in shell/scripts across the repository... use alloc_port for dynamic ports."
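Where the repo's `alloc_port` helper is not sourced, a generic free-port probe can be sketched by binding to port 0; this is an assumption-level stand-in, not the actual `launch_utils.sh` implementation:

```shell
#!/usr/bin/env bash
# Sketch: ask the OS for an ephemeral free port by binding to port 0.
# This is a generic fallback, not the repository's alloc_port helper.
alloc_port() {
  python3 - <<'EOF'
import socket
s = socket.socket()
s.bind(("", 0))        # port 0 => kernel picks a free ephemeral port
print(s.getsockname()[1])
s.close()
EOF
}

# Prefer an operator-provided port, fall back to a dynamic one.
NIXL_PORT="${NIXL_PORT:-$(alloc_port)}"
echo "NIXL side-channel port: $NIXL_PORT"
```

Note the usual caveat: the port is free at probe time but could be taken before the worker binds it, so env-var overrides remain useful for deterministic deployments.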
examples/backends/vllm/launch/xpu/dual_encoders_for_epd.sh (1)
127-127: Fixed sleep for encoder registration may be unreliable.

The `sleep 5` assumes encoders register within 5 seconds, which may be insufficient on slower systems or under memory pressure, and excessive on fast systems. Consider using a readiness probe or health check instead of a fixed delay.

For example, poll for encoder availability before starting the prefill worker:
```shell
# Wait for encoders to be ready (up to 30s)
for i in {1..30}; do
    if check_encoder_ready; then break; fi
    sleep 1
done
```
examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh (1)
132-148: Consider making NIXL/ZMQ ports configurable for multi-instance scenarios.

The script uses hard-coded ports (20097-20099 for NIXL side channels, 20080-20082 for ZMQ endpoints). While acceptable for example/benchmark scripts, this prevents running multiple instances on the same node.
For future flexibility, consider exposing these as environment variables with defaults:
```shell
NIXL_PORT_ENCODE=${NIXL_PORT_ENCODE:-20097}
NIXL_PORT_PREFILL=${NIXL_PORT_PREFILL:-20098}
NIXL_PORT_DECODE=${NIXL_PORT_DECODE:-20099}
```

Based on learnings: "Flag hard-coded portability-reducing constants in shell/scripts across the repository (e.g., fixed memory sizes, fixed memory fractions, static ports...)"
📒 Files selected for processing (10)

- components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py
- components/src/dynamo/vllm/multimodal_utils/encode_utils.py
- components/src/dynamo/vllm/multimodal_utils/model.py
- components/src/dynamo/vllm/multimodal_utils/prefill_worker_utils.py
- examples/backends/vllm/launch/dual_encoder_epd.sh
- examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh
- examples/backends/vllm/launch/xpu/disagg_multimodal_epd_xpu.sh
- examples/backends/vllm/launch/xpu/dual_encoders_for_epd.sh
- examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh
- examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh
```shell
if [[ "${DEVICE_PLATFORM,,}" == "xpu" ]]; then
    EXTRA_ARGS="$EXTRA_ARGS --block-size 64"
    PD_EXTRA_ARGS="--max-model-len 10240"
fi
```
PD_EXTRA_ARGS is overwritten instead of appended for XPU mode.
When `DEVICE_PLATFORM` is `xpu`, line 124 overwrites `PD_EXTRA_ARGS` entirely. If `--single-gpu` mode is combined with XPU, the single-GPU settings (KV cache bytes, limit-mm-per-prompt, max-model-len 4096) from line 119 will be lost.
🐛 Proposed fix
```diff
 if [[ "${DEVICE_PLATFORM,,}" == "xpu" ]]; then
     EXTRA_ARGS="$EXTRA_ARGS --block-size 64"
-    PD_EXTRA_ARGS="--max-model-len 10240"
+    PD_EXTRA_ARGS="$PD_EXTRA_ARGS --max-model-len 10240"
 fi
```
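The overwrite-versus-append difference is easy to demonstrate in isolation (the flag values below just mirror the ones discussed in this comment):

```shell
#!/usr/bin/env bash
# Demonstrate why plain assignment loses earlier flags while
# self-referencing assignment preserves them.
PD_EXTRA_ARGS="--max-model-len 4096"   # e.g. set by the --single-gpu branch

# The bug: a plain assignment discards whatever was set before.
OVERWRITTEN="--max-model-len 10240"

# The fix: append, so both sets of flags survive.
PD_EXTRA_ARGS="$PD_EXTRA_ARGS --max-model-len 10240"

echo "$OVERWRITTEN"     # --max-model-len 10240
echo "$PD_EXTRA_ARGS"   # --max-model-len 4096 --max-model-len 10240
```

With the append, both `--max-model-len` values reach the worker command line; whether the duplicate is acceptable then depends on the consumer's last-flag-wins behavior.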
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if [[ "${DEVICE_PLATFORM,,}" == "xpu" ]]; then | |
| EXTRA_ARGS="$EXTRA_ARGS --block-size 64" | |
| PD_EXTRA_ARGS="--max-model-len 10240" | |
| fi | |
| if [[ "${DEVICE_PLATFORM,,}" == "xpu" ]]; then | |
| EXTRA_ARGS="$EXTRA_ARGS --block-size 64" | |
| PD_EXTRA_ARGS="$PD_EXTRA_ARGS --max-model-len 10240" | |
| fi |
Overview:
This PR adds a per-request scheduler for dual/multiple encoders in the EPD disaggregation scenario.
The problem solved here:
The default encoder scheduler in EPD with the vLLM backend works per image within one request. For example, given one request with 2 images and 2 encoders (one XPU, one CPU), one image goes to the XPU encoder and one to the CPU encoder, and the two results are then merged into one embedding for the KV cache. In that case the XPU encoder must wait for the slower CPU encoder to finish, which slows the request considerably.
We provide a per-request scheduler instead: for example, 8 requests go to the XPU encoder for every 1 request to the CPU encoder, so their processing times are similar when running in parallel, and the CPU encoder meaningfully offloads encoding work.
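A per-request 8:1 split can be sketched as a simple weighted round-robin over encoders; the names below are hypothetical, and the PR's actual scheduling logic lives in prefill_worker_utils.py:

```python
from itertools import cycle

def build_schedule(weights: dict[str, int]):
    """Expand weights like {'xpu': 8, 'cpu': 1} into a repeating
    assignment sequence: ~8 of every 9 requests go to the XPU encoder."""
    order = [name for name, w in weights.items() for _ in range(w)]
    return cycle(order)

schedule = build_schedule({"xpu": 8, "cpu": 1})
assignments = [next(schedule) for _ in range(9)]
print(assignments.count("xpu"), assignments.count("cpu"))  # 8 1
```

Each incoming request pulls the next entry from the cycle, so whole requests (not individual images) are routed to one encoder, which is what lets the fast and slow encoders proceed independently.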
In an XPU/CUDA environment, the CPU can still offload encoding work from the XPU/GPU in this EPD scenario. One example result is shown below.
Details:
Run the two benchmark scripts:

- ./examples/backends/vllm/launch/xpu/run_per_image_encode_schedule_benchmark.sh
- ./examples/backends/vllm/launch/xpu/run_per_request_encode_schedule_benchmark.sh

You will see that the per-request scheduler beats per-image on throughput, TTFT, and TPOT.
Thanks.
Summary by CodeRabbit
New Features
Improvements