
feat(vllm): add TTS audio generation endpoint via vLLM Omni #7661

Draft
hatemfaheem wants to merge 1 commit into ai-dynamo:main from hatemfaheem:hatemfaheem/vllm-tts-support

Conversation

@hatemfaheem hatemfaheem commented Mar 26, 2026

DEP: ai-dynamo/enhancements#78

Overview:

Add a Text-to-Speech (TTS) audio generation endpoint (POST /v1/audio/speech) to Dynamo, powered by vLLM Omni. The endpoint accepts text input and returns a complete WAV or PCM audio file, following the same architectural patterns already established by the image and video generation modalities. This proposal covers the first working version (V1) that delivers complete audio responses, while laying the groundwork for future streaming and additional codecs.

Example

Request

curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        "input": "Hello, this is a test of text to speech.",
        "voice": "aiden"
    }' \
    --output speech.wav

Output: speech.wav

Details

                        RUST                              PYTHON
                  ──────────────                    ────────────────

  Client ──POST──> Axum Router
                   /v1/audio/speech
                        |
                   HTTP Handler
                        |
                   NvCreateAudioSpeechRequest
                        |
                   ModelManager -> WorkerSet
                   -> audios_engine
                        |
                   engine.generate()  ────────────>  OmniHandler.generate()
                                                           |
                                                      Parse request type
                                                      -> AudioGeneration
                                                           |
                                                      Build TTS model inputs
                                                      (model-specific prompt
                                                       construction)
                                                           |
                                                      vLLM engine generates
                                                      audio (cumulative PCM)
                                                           |
                                                      Extract & format audio
                                                      -> WAV or PCM + base64
                                                           |
                   <──── NvAudiosResponse ────────  yield NvAudiosResponse
                   (stream, base64 audio)
                        |
                   Aggregator
                   -> fold stream (take latest)
                        |
                   base64 decode -> raw bytes
                   Set Content-Type header
                        |
  Client <──200──  raw binary audio (WAV/PCM)

Where should the reviewer start?

Most of the files follow the existing images and videos setup. The audio/vLLM-specific logic lives in components/src/dynamo/vllm/omni/omni_handler.py
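The handler's "extract & format audio" step (raw cumulative PCM wrapped as WAV, then base64-encoded for the response) can be sketched with the standard library. The 24 kHz sample rate and mono/16-bit layout below are assumptions for illustration, not values taken from the PR.

```python
import base64
import io
import wave

def pcm_to_wav_base64(pcm: bytes, sample_rate: int = 24000,
                      channels: int = 1, sample_width: int = 2) -> str:
    """Wrap raw PCM samples in a WAV container and base64-encode it.

    Sketch of the "extract & format audio" step; sample rate and
    channel layout are illustrative assumptions.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```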

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

To fix/improve (self review)

  • Move instructions out of nvnext to main API to match OpenAI definition.
  • The Omni handler logic, especially the prompt-length estimation, is coupled to Qwen-TTS.
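As a hypothetical illustration of what a model-agnostic prompt-length estimate could look like (the function name, the 12 Hz codec rate hinted at by the model name, and the characters-per-second heuristic are all assumptions, not code from this PR):

```python
def estimate_audio_token_budget(text: str, codec_hz: int = 12,
                                chars_per_second: float = 15.0) -> int:
    """Hypothetical heuristic: size the audio-token budget from text length.

    Assumes a codec emitting `codec_hz` tokens per second of audio and a
    rough speaking rate of `chars_per_second`; neither value comes from
    the PR, whose actual estimate is Qwen-TTS-specific.
    """
    est_seconds = max(1.0, len(text) / chars_per_second)
    return int(est_seconds * codec_hz)
```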


copy-pr-bot bot commented Mar 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Contributor

👋 Hi hatemfaheem! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: the NVIDIA Test GitHub Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added labels on Mar 26, 2026: external-contribution (Pull request is from an external contributor), backend::vllm (Relates to the vllm backend), frontend (`python -m dynamo.frontend` and `dynamo-run in=http|text|grpc`), container
Add POST /v1/audio/speech endpoint supporting text-to-speech with
vLLM Omni, following the OpenAI audio API convention. Includes Rust
protocol types, stream aggregation, model discovery for audio-capable
workers, and a Python handler that builds TTS engine inputs with
prompt length estimation and WAV encoding.

Signed-off-by: Hatem Elseidy <[email protected]>
