
feat(vllm): add TTS audio generation endpoint via vLLM Omni #7661

Draft
hatemfaheem wants to merge 1 commit into ai-dynamo:main from hatemfaheem:hatemfaheem/vllm-tts-support

Conversation

@hatemfaheem hatemfaheem commented Mar 26, 2026

DEP: ai-dynamo/enhancements#78

Overview:

Add a Text-to-Speech (TTS) audio generation endpoint (POST /v1/audio/speech) to Dynamo, powered by vLLM Omni. The endpoint accepts text input and returns a complete WAV or PCM audio file, following the same architectural patterns already established by the image and video generation modalities. This proposal covers the first working version (V1) that delivers complete audio responses, while laying the groundwork for future streaming and additional codecs.

Example

Request

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        "input": "Hello, this is a test of text to speech.",
        "voice": "aiden"
    }' \
    --output speech.wav

Output: speech.wav

Details

                        RUST                              PYTHON
                  ──────────────                    ────────────────

  Client ──POST──> Axum Router
                   /v1/audio/speech
                        |
                   HTTP Handler
                        |
                   NvCreateAudioSpeechRequest
                        |
                   ModelManager -> WorkerSet
                   -> audios_engine
                        |
                   engine.generate()  ────────────>  OmniHandler.generate()
                                                           |
                                                      Parse request type
                                                      -> AudioGeneration
                                                           |
                                                      Build TTS model inputs
                                                      (model-specific prompt
                                                       construction)
                                                           |
                                                      vLLM engine generates
                                                      audio (cumulative PCM)
                                                           |
                                                      Extract & format audio
                                                      -> WAV or PCM + base64
                                                           |
                   <──── NvAudiosResponse ────────  yield NvAudiosResponse
                   (stream, base64 audio)
                        |
                   Aggregator
                   -> fold stream (take latest)
                        |
                   base64 decode -> raw bytes
                   Set Content-Type header
                        |
  Client <──200──  raw binary audio (WAV/PCM)

Where should the reviewer start?

Most of the files follow the existing images and videos setup. The audio/vLLM-specific logic lives in components/src/dynamo/vllm/omni/omni_handler.py
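The handler's "extract & format audio" step (raw cumulative PCM wrapped as WAV, then base64-encoded for the response) can be sketched with the standard library. The 24 kHz sample rate and mono/16-bit layout below are assumptions for illustration, not values taken from the PR.

```python
import base64
import io
import wave

def pcm_to_wav_base64(pcm: bytes, sample_rate: int = 24000,
                      channels: int = 1, sample_width: int = 2) -> str:
    """Wrap raw PCM samples in a WAV container and base64-encode it.

    Sketch of the "extract & format audio" step; sample rate and
    channel layout are illustrative assumptions.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```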

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

To fix/improve (self review)

  • Move instructions out of nvnext to main API to match OpenAI definition.
  • The Omni handler logic, especially the prompt-length estimation, is coupled to Qwen-TTS.
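As a hypothetical illustration of what a model-agnostic prompt-length estimate could look like (the function name, the 12 Hz codec rate hinted at by the model name, and the characters-per-second heuristic are all assumptions, not code from this PR):

```python
def estimate_audio_token_budget(text: str, codec_hz: int = 12,
                                chars_per_second: float = 15.0) -> int:
    """Hypothetical heuristic: size the audio-token budget from text length.

    Assumes a codec emitting `codec_hz` tokens per second of audio and a
    rough speaking rate of `chars_per_second`; neither value comes from
    the PR, whose actual estimate is Qwen-TTS-specific.
    """
    est_seconds = max(1.0, len(text) / chars_per_second)
    return int(est_seconds * codec_hz)
```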


copy-pr-bot bot commented Mar 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Contributor

👋 Hi hatemfaheem! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added labels on Mar 26, 2026: external-contribution (Pull request is from an external contributor), backend::vllm (Relates to the vllm backend), frontend (`python -m dynamo.frontend` and `dynamo-run in=http|text|grpc`), container
Add POST /v1/audio/speech endpoint supporting text-to-speech with
vLLM Omni, following the OpenAI audio API convention. Includes Rust
protocol types, stream aggregation, model discovery for audio-capable
workers, and a Python handler that builds TTS engine inputs with
prompt length estimation and WAV encoding.

Signed-off-by: Hatem Elseidy <[email protected]>
