Description
Type of Change
New feature
Problem Statement
Dynamo supports multimodal generation through its vLLM Omni integration: image generation (/v1/images/generations) and video generation (/v1/videos) already ship. Text-to-speech (TTS) is a natural next modality. vLLM Omni already supports TTS models (e.g. Qwen3-TTS), and the Dynamo codebase contains partial scaffolding for audio (protocol stubs, endpoint type enum, router placeholders), but no working end-to-end implementation. Users deploying TTS models through Dynamo today have no supported path.
Proposed Solution
Add a new Text-to-Speech audio generation endpoint (POST /v1/audio/speech) to Dynamo, powered by vLLM Omni (sglang support is out of scope here). The endpoint accepts text input and returns a complete WAV or PCM audio file, following the OpenAI TTS API contract and the same architectural patterns established by the image and video generation modalities.
V1 scope is a working end-to-end pipeline: HTTP handler, protocol types, model discovery, metrics, and a single supported model (Qwen3-TTS) producing non-streaming audio responses.
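To make the proposed contract concrete, here is a minimal client-side sketch of a request to the new endpoint. It assumes the endpoint accepts the field names used by the OpenAI TTS API (`model`, `input`, `voice`, `response_format`). The base URL, the model identifier string, the `"default"` voice, and the helper name `build_speech_request` are illustrative placeholders, not confirmed parts of the implementation.

```python
import json
import urllib.request

# V1 of this proposal limits output to WAV and PCM.
SUPPORTED_FORMATS = {"wav", "pcm"}


def build_speech_request(base_url, model, text, voice="default", response_format="wav"):
    """Build (but do not send) a POST /v1/audio/speech request.

    Field names follow the OpenAI TTS API; whether Dynamo accepts exactly
    this set of fields is an assumption of this sketch.
    """
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    payload = {
        "model": model,
        "input": text,
        "voice": voice,  # placeholder voice name
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Against a running deployment, the request could be sent with `urllib.request.urlopen(req)`. Because V1 is non-streaming, the response body would be the complete audio file and could be written straight to disk.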
Explicitly out of scope for V1:
- Support for additional TTS models (e.g., Voxtral TTS)
- Audio streaming to clients
- Voice cloning (ref_audio / ref_text)
- Additional codecs beyond WAV and PCM (MP3, FLAC, Opus, AAC, OGG)
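For context on the two formats that remain in scope: raw PCM is headerless sample data, while WAV wraps the same samples in a RIFF header that records sample rate, channel count, and sample width. The sketch below shows that relationship using Python's standard `wave` module. The default of 24 kHz, mono, 16-bit audio is an assumption for illustration. The actual output parameters of Qwen3-TTS in Dynamo are not stated here, and `pcm_to_wav` is a hypothetical helper, not part of the PR.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container and return the file bytes.

    Defaults (24 kHz, mono, 16-bit) are assumed values for illustration.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buf.getvalue()
```

A client that requests `pcm` must know these parameters out of band to play the audio back, whereas a `wav` response is self-describing.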
More details are in this DEP:
ai-dynamo/enhancements#78
Estimated PR Size
XXL (1000+ lines)
Files/Components Affected
Draft PR: #7661
components/src/dynamo/vllm/omni/omni_handler.py
components/src/dynamo/common/protocols/audio_protocol.py
components/src/dynamo/common/utils/output_modalities.py
lib/bindings/python/rust/lib.rs
lib/llm/src/discovery/model.rs
lib/llm/src/discovery/model_manager.rs
lib/llm/src/discovery/watcher.rs
lib/llm/src/discovery/worker_set.rs
lib/llm/src/http/service/metrics.rs
lib/llm/src/http/service/openai.rs
lib/llm/src/http/service/service_v2.rs
lib/llm/src/protocols/openai/audios/aggregator.rs
lib/llm/src/protocols/openai/audios/nvext.rs
lib/llm/src/protocols/openai/audios.rs
lib/llm/src/protocols/openai.rs