refactor(vLLM): Move video support from example to backend #7663
Draft
Overview:

- Replace model-name allowlists with capability-driven vision loading and multimodal handling.
- Add native `video_url` loading in the standard `TokensPrompt` `multi_modal_data` flow; see the request sketch below.
- Move the video agg/disagg launch scripts under `examples/backends/vllm` and update docs/tests.
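As a rough end-to-end illustration of the native `video_url` path, a standard OpenAI-style chat request with a `video_url` content part should exercise it (a sketch, not code from this PR: the port, clip URL, and payload values are placeholder assumptions):

```bash
# Hypothetical request against a worker started via the launch scripts
# (assumes the frontend listens on localhost:8000; the clip URL is a placeholder).
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2-VL-2B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
        {"type": "text", "text": "Describe what happens in this video."}
      ]
    }],
    "max_tokens": 128
  }'
```

Because vision support is now detected from model capabilities rather than a name allowlist, any multimodal model the backend loads should accept this shape of request without further configuration.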
Details:
Quick Benchmark: Dynamo vs `vllm serve` for Video Inference

I ran a quick apples-to-apples comparison between Dynamo aggregate mode (`examples/backends/vllm/launch/video_agg.sh`) and plain `vllm serve`, both serving `Qwen/Qwen2-VL-2B-Instruct` on the same machine and GPU configuration.

Benchmark command:
```
aiperf profile \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --url localhost:8000 \
  --video-width 640 \
  --video-height 480 \
  --video-fps 4 \
  --video-duration 5.0 \
  --request-count 20 \
  --osl 1200 \
  --osl-stddev 0 \
  --extra-inputs '{"ignore_eos": true, "min_tokens": 1200}' \
  --use-server-token-count \
  --ui none \
  --no-server-metrics \
  --no-gpu-telemetry
```
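For context, the two servers under comparison were launched roughly as follows (a sketch: the Dynamo script path is the one moved by this PR, while the baseline invocation and flags are assumptions):

```bash
# Dynamo aggregated serving (script relocated by this PR):
bash examples/backends/vllm/launch/video_agg.sh

# Plain vLLM baseline on the same port (assumed invocation):
vllm serve Qwen/Qwen2-VL-2B-Instruct --port 8000
```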
Both runs completed successfully with identical prompt/completion lengths (962 prompt tokens, 1200 completion tokens, 20/20 requests completed):

| Metric | Dynamo (`video_agg.sh`) | `vllm serve` | Delta |
| --- | --- | --- | --- |
| Request throughput | 0.1711 req/s | 0.1715 req/s | vLLM +0.23% |
| Request latency (avg) | 5842.53 ms | 5829.02 ms | vLLM -0.23% |
| Request latency (p50) | 5648.22 ms | 5631.58 ms | vLLM -0.30% |
| Request latency (p90) | 5688.17 ms | 5665.62 ms | vLLM -0.40% |
| Request latency (p99) | 8735.89 ms | 8833.25 ms | vLLM +1.11% |
| Output token throughput | 205.32 tok/s | 205.79 tok/s | vLLM +0.23% |
| Total token throughput | 369.92 tok/s | 370.77 tok/s | vLLM +0.23% |
| Benchmark duration | 116.89 s | 116.62 s | vLLM -0.23% |

Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)