refactor(vLLM): Move video support from example to backend#7663

Draft
rmccorm4 wants to merge 4 commits into main from rmccormick/vllm-video

Conversation

@rmccorm4
Contributor

Overview:

  • Replace model-name allowlists with capability-driven vision loading and multimodal handling
  • Add native video_url loading in the standard TokensPrompt multi_modal_data flow
  • Move the video agg/disagg launch scripts under examples/backends/vllm and update docs/tests
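To illustrate the first bullet, here is a minimal sketch (not the PR's actual code) of what replacing a name allowlist with a capability check looks like. The `supported_modalities` config field and function names below are illustrative assumptions, not the real vLLM or Dynamo API.

```python
# Hypothetical sketch: decide multimodal support from a model's declared
# capabilities instead of a hard-coded model-name allowlist.

ALLOWLIST = {"Qwen/Qwen2-VL-2B-Instruct"}  # the old, brittle approach


def is_multimodal_allowlist(model_name: str) -> bool:
    # Breaks for any new vision model not manually added to the set.
    return model_name in ALLOWLIST


def is_multimodal_capability(model_config: dict) -> bool:
    # Capability-driven: any model whose config declares image/video
    # modalities is handled, with no per-model special-casing.
    modalities = set(model_config.get("supported_modalities", []))
    return bool(modalities & {"image", "video"})


config = {"model": "SomeNew/VL-Model", "supported_modalities": ["image", "video"]}
print(is_multimodal_allowlist(config["model"]))  # False: allowlist misses it
print(is_multimodal_capability(config))          # True: capability check works
```

The upshot is that new vision models work out of the box as long as their config advertises the modality, which is the behavior change this refactor targets.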

Details:
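For the second bullet, the shape of the data flow can be sketched as below. The dict fields mirror vLLM's TokensPrompt / multi_modal_data convention; the frame-fetching helper is a hypothetical placeholder (a real implementation would download and decode the video_url into frame arrays, e.g. with OpenCV or decord).

```python
# Hedged sketch of passing a video alongside token IDs in the standard
# TokensPrompt multi_modal_data flow. `fetch_video_frames` is illustrative.

def fetch_video_frames(video_url: str, num_frames: int = 8):
    # Placeholder: stand-in for decoding the URL into per-frame arrays.
    return [("frame", 480, 640, 3)] * num_frames


def build_tokens_prompt(prompt_token_ids, video_url):
    # Same dict shape as vllm.inputs.TokensPrompt with a "video" entry
    # under multi_modal_data.
    return {
        "prompt_token_ids": prompt_token_ids,
        "multi_modal_data": {"video": fetch_video_frames(video_url)},
    }


prompt = build_tokens_prompt([1, 2, 3], "https://example.com/clip.mp4")
print(len(prompt["multi_modal_data"]["video"]))  # 8
```

The point of doing this natively in the backend is that video requests ride the same prompt path as image and text requests, rather than being handled by example-only glue code.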

Quick Benchmark: Dynamo vs vllm serve for Video Inference

I ran a quick apples-to-apples comparison between Dynamo aggregate mode (examples/backends/vllm/launch/video_agg.sh) and plain vllm serve, both serving Qwen/Qwen2-VL-2B-Instruct on the same machine and GPU configuration.

Benchmark command:

aiperf profile \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --url localhost:8000 \
  --video-width 640 \
  --video-height 480 \
  --video-fps 4 \
  --video-duration 5.0 \
  --request-count 20 \
  --osl 1200 \
  --osl-stddev 0 \
  --extra-inputs '{"ignore_eos": true, "min_tokens": 1200}' \
  --use-server-token-count \
  --ui none \
  --no-server-metrics \
  --no-gpu-telemetry

Both runs completed successfully with identical prompt/completion lengths:

  • Average ISL: 962
  • Average OSL: 1200
  • Success rate: 20/20
| Metric | Dynamo (video_agg.sh) | vllm serve | Delta |
| --- | --- | --- | --- |
| Request throughput | 0.17110 req/s | 0.17150 req/s | vLLM +0.23% |
| Avg latency | 5842.53 ms | 5829.02 ms | vLLM -0.23% |
| P50 latency | 5648.22 ms | 5631.58 ms | vLLM -0.30% |
| P90 latency | 5688.17 ms | 5665.62 ms | vLLM -0.40% |
| P99 latency | 8735.89 ms | 8833.25 ms | vLLM +1.11% |
| Output token throughput | 205.32 tok/s | 205.79 tok/s | vLLM +0.23% |
| Total token throughput | 369.92 tok/s | 370.77 tok/s | vLLM +0.23% |
| Benchmark duration | 116.89 s | 116.62 s | vLLM -0.23% |
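For clarity on how the Delta column is computed, it appears to be the signed percent difference of vllm serve relative to Dynamo:

```python
# Delta = (vllm - dynamo) / dynamo * 100, rounded to 2 decimal places.

def delta_pct(dynamo: float, vllm: float) -> float:
    return round((vllm - dynamo) / dynamo * 100, 2)


print(delta_pct(0.17110, 0.17150))  # request throughput: 0.23
print(delta_pct(5842.53, 5829.02))  # avg latency: -0.23
print(delta_pct(8735.89, 8833.25))  # P99 latency: 1.11
```

In other words, every row is well within about 1% either way, consistent with the claim that Dynamo's video path adds no meaningful overhead over plain vllm serve.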

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

github-actions bot added the refactor, documentation, backend::vllm, and multimodal labels on Mar 27, 2026
