feat(vllm): add CPU Encode for dual/multiple encoder EPD #7667
Open · ZhengHongming888 wants to merge 2 commits into ai-dynamo:main from ZhengHongming888:cpu_encode_for_epd
+230 −7
examples/backends/vllm/launch/xpu/cpu_encoder_for_epd.sh (156 additions, 0 deletions)
```bash
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
set -e
trap 'echo Cleaning up...; kill 0' EXIT

SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../../../common/gpu_utils.sh"
source "$SCRIPT_DIR/../../../../common/launch_utils.sh"

# Default values
MODEL_NAME="llava-hf/llava-1.5-7b-hf"

# --single-gpu: Packs all 3 workers (encode, prefill, decode) onto a single GPU.
# This is intended for functional testing with small models (e.g. 2B) where CI
# only has 1 GPU available. It reduces performance by:
#   - Enabling --enforce-eager (disables torch.compile and CUDA graph capture)
#   - Hardcoding P/D KV cache to 512 MB (skips all memory profiling)
#   - Limiting --max-model-len to 4096 tokens on P/D workers
#   - Limiting P/D workers to image=1,video=0,audio=0 (--limit-mm-per-prompt)
#   - Using lower gpu-memory-utilization fractions to share the GPU
SINGLE_GPU=false

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --model)
            MODEL_NAME=$2
            shift 2
            ;;
        --single-gpu)
            SINGLE_GPU=true
            shift
            ;;
        -h|--help)
            echo "Usage: $0 [OPTIONS]"
            echo ""
            echo "Disaggregated multimodal serving with separate Encode/Prefill/Decode workers"
            echo ""
            echo "Options:"
            echo "  --model <model_name>  Specify the VLM model to use (default: $MODEL_NAME)"
            echo "                        LLaVA 1.5 7B, Qwen2.5-VL, and Phi3V models have predefined templates"
            echo "  --single-gpu          Pack all 3 workers on 1 GPU (for small models, e.g. 2B)"
            echo "  -h, --help            Show this help message"
            echo ""
            echo "Examples:"
            echo "  $0 --model llava-hf/llava-1.5-7b-hf"
            echo "  $0 --model microsoft/Phi-3.5-vision-instruct"
            echo "  $0 --model Qwen/Qwen2.5-VL-7B-Instruct"
            echo "  $0 --model Qwen/Qwen3-VL-2B-Instruct --single-gpu"
            echo ""
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            echo "Use --help for usage information"
            exit 1
            ;;
    esac
done

# Device platform and affinity env name.
# DEVICE_PLATFORM supports: cuda, xpu
DEVICE_PLATFORM="${DEVICE_PLATFORM:-cuda}"
if [[ -z "${DEVICE_AFFINITY_ENV:-}" ]]; then
    if [[ "${DEVICE_PLATFORM,,}" == "xpu" ]]; then
        DEVICE_AFFINITY_ENV="ZE_AFFINITY_MASK"
    else
        DEVICE_AFFINITY_ENV="CUDA_VISIBLE_DEVICES"
    fi
fi

HTTP_PORT="${DYN_HTTP_PORT:-8000}"
if [[ "$SINGLE_GPU" == "true" ]]; then
    GPU_LABEL="1 GPU"
else
    GPU_LABEL="3 GPUs"
fi
print_launch_banner --multimodal "Launching Disaggregated Multimodal E/P/D ($GPU_LABEL)" "$MODEL_NAME" "$HTTP_PORT"

# Start frontend (no router mode)
echo "Starting frontend..."
# dynamo.frontend accepts either --http-port flag or DYN_HTTP_PORT env var (defaults to 8000)
python -m dynamo.frontend &

EXTRA_ARGS=""
PD_EXTRA_ARGS=""

# GPU assignments (override via environment variables)
# Encoder uses GPU 0 for vLLM infrastructure, but vision model loads on CPU via DYN_ENCODER_DEVICE
DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
DYN_PREFILL_WORKER_GPU=${DYN_PREFILL_WORKER_GPU:-1}
DYN_DECODE_WORKER_GPU=${DYN_DECODE_WORKER_GPU:-2}

# GPU memory utilization for workers.
# NOTE: --kv-cache-memory-bytes (set below for P/D workers) overrides
# --gpu-memory-utilization for KV cache sizing. Per vLLM CacheConfig:
#   "kv_cache_memory_bytes (when not-None) ignores gpu_memory_utilization"
# Ref: https://docs.vllm.ai/en/stable/api/vllm/config/cache/
# Therefore _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect on actual VRAM
# usage when --kv-cache-memory-bytes is set.
if [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" ]]; then
    echo "WARNING: _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is set but has no effect here because" >&2
    echo "         --kv-cache-memory-bytes overrides --gpu-memory-utilization in vLLM." >&2
fi
DYN_ENCODE_GPU_MEM=${DYN_ENCODE_GPU_MEM:-0.9}
DYN_PREFILL_GPU_MEM=${DYN_PREFILL_GPU_MEM:-0.9}
DYN_DECODE_GPU_MEM=${DYN_DECODE_GPU_MEM:-0.9}

# 512 MB KV cache per P/D worker. Setting --kv-cache-memory-bytes bypasses vLLM's
# memory profiling entirely (both language model and multimodal encoder), which avoids
# OOM during profiling when 3 workers share a GPU. 512 MB covers the
# minimum vLLM requires for max_model_len=4096 on Qwen3-VL-2B.
PD_KV_CACHE_BYTES=$((512 * 1024 * 1024))

if [[ "$SINGLE_GPU" == "true" ]]; then
    EXTRA_ARGS="--enforce-eager"
    PD_EXTRA_ARGS="--max-model-len 4096 --kv-cache-memory-bytes $PD_KV_CACHE_BYTES --limit-mm-per-prompt {\"image\":1,\"video\":0,\"audio\":0}"
fi

if [[ "${DEVICE_PLATFORM,,}" == "xpu" ]]; then
    EXTRA_ARGS="$EXTRA_ARGS --block-size 64"
    PD_EXTRA_ARGS="--max-model-len 10240"
fi

# Start encode worker with CPU vision model
echo "Starting encode worker with CPU vision model (vLLM on GPU $DYN_ENCODE_WORKER_GPU)..."
# DYN_ENCODER_DEVICE=cpu forces the vision model to load on CPU (device_map="cpu")
# VLLM_ENCODER=0 ensures HuggingFace encoding path is used (not vLLM encoder)
# vLLM infrastructure still runs on GPU to maintain compatibility
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 \
DYN_ENCODER_DEVICE=cpu \
VLLM_ENCODER=0 \
env $DEVICE_AFFINITY_ENV=$DYN_ENCODE_WORKER_GPU \
python -m dynamo.vllm --multimodal-encode-worker --enable-multimodal --enable-mm-embeds --model $MODEL_NAME --gpu-memory-utilization $DYN_ENCODE_GPU_MEM $EXTRA_ARGS --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device": "cpu"}' --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20080"}' &

# Start prefill worker (also handles encode routing via --route-to-encoder)
echo "Starting prefill worker on GPU $DYN_PREFILL_WORKER_GPU (GPU mem: $DYN_PREFILL_GPU_MEM)..."
VLLM_NIXL_SIDE_CHANNEL_PORT=20098 \
env $DEVICE_AFFINITY_ENV=$DYN_PREFILL_WORKER_GPU \
python -m dynamo.vllm --multimodal-worker --route-to-encoder --disaggregation-mode prefill --enable-multimodal --enable-mm-embeds --model $MODEL_NAME --gpu-memory-utilization $DYN_PREFILL_GPU_MEM $EXTRA_ARGS $PD_EXTRA_ARGS --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device": "'"$DEVICE_PLATFORM"'"}' --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081"}' &

# Start decode worker
echo "Starting decode worker on GPU $DYN_DECODE_WORKER_GPU (GPU mem: $DYN_DECODE_GPU_MEM)..."
VLLM_NIXL_SIDE_CHANNEL_PORT=20099 \
env $DEVICE_AFFINITY_ENV=$DYN_DECODE_WORKER_GPU \
python -m dynamo.vllm --multimodal-decode-worker --enable-multimodal --enable-mm-embeds --model $MODEL_NAME --gpu-memory-utilization $DYN_DECODE_GPU_MEM $EXTRA_ARGS $PD_EXTRA_ARGS --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device": "'"$DEVICE_PLATFORM"'"}' --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20082"}' &

echo "=================================================="
echo "All components started. Waiting for initialization..."
echo "=================================================="

# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
wait_any_exit
```
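The script's case-insensitive platform check relies on bash's lowercase parameter expansion `${var,,}` (bash 4.0+) to choose between the Intel and NVIDIA device-affinity variables. A standalone sketch of that selection logic, with `DEVICE_PLATFORM` hardcoded here purely for illustration:

```shell
# Sketch of the platform -> affinity-env selection used by the script.
# ${DEVICE_PLATFORM,,} lowercases the value (bash 4+) before comparison,
# so "XPU", "Xpu", and "xpu" all select the Intel path.
DEVICE_PLATFORM="XPU"   # hardcoded for this demo; the script defaults to "cuda"
if [[ "${DEVICE_PLATFORM,,}" == "xpu" ]]; then
    DEVICE_AFFINITY_ENV="ZE_AFFINITY_MASK"      # Intel XPU device selection
else
    DEVICE_AFFINITY_ENV="CUDA_VISIBLE_DEVICES"  # NVIDIA GPU device selection
fi
echo "$DEVICE_AFFINITY_ENV"
```

The chosen variable name is then injected per worker via `env $DEVICE_AFFINITY_ENV=$DYN_..._WORKER_GPU`, which is what lets one script pin workers to devices on both CUDA and XPU hosts.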
🧩 Analysis chain: the reviewer verified the flag semantics by running searches against the ai-dynamo/dynamo repository, including:

```shell
rg -n 'multimodal-decode-worker|disaggregation-mode' ./components/src/dynamo/vllm/
sed -n '20,110p' ./components/src/dynamo/vllm/backend_args.py
sed -n '220,300p' ./components/src/dynamo/vllm/backend_args.py
rg -n 'multimodal_decode_worker' ./components/src/dynamo/vllm/ -A 3 -B 1
sed -n '150,180p' ./components/src/dynamo/vllm/args.py
```
**Add `--disaggregation-mode decode` to the decode worker command.**

The decode worker at line 148 uses `--multimodal-decode-worker` but omits `--disaggregation-mode decode`. These flags are independent: `--multimodal-decode-worker` only sets the component type, not the disaggregation mode. Without the explicit flag, `disaggregation_mode` defaults to AGGREGATED, which conflicts with the intended decode-only behavior. All other decode workers in the repository explicitly specify `--disaggregation-mode decode` for consistency. Add this flag to match the sibling script `disagg_multimodal_epd_xpu.sh:142`.
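For illustration, a sketch of the decode worker launch line with the suggested flag applied. This mirrors the decode command in the diff with only `--disaggregation-mode decode` inserted; it shows the reviewer's suggestion, not the author's committed change:

```shell
# Decode worker launch with the explicit disaggregation mode (suggested fix).
VLLM_NIXL_SIDE_CHANNEL_PORT=20099 \
env $DEVICE_AFFINITY_ENV=$DYN_DECODE_WORKER_GPU \
python -m dynamo.vllm --multimodal-decode-worker --disaggregation-mode decode \
    --enable-multimodal --enable-mm-embeds --model $MODEL_NAME \
    --gpu-memory-utilization $DYN_DECODE_GPU_MEM $EXTRA_ARGS $PD_EXTRA_ARGS \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_buffer_device": "'"$DEVICE_PLATFORM"'"}' \
    --kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20082"}' &
```

This matches the pattern already used by the prefill worker in the same script, which passes `--disaggregation-mode prefill` explicitly.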