This repository provides optimized inference deployment for ZipVoice text-to-speech models using NVIDIA Triton Inference Server and PyTriton, with TensorRT acceleration for production environments.
Launch the service directly using Docker Compose:
```bash
# For the standard ZipVoice model
MODEL=zipvoice docker compose up

# For the distilled ZipVoice model (faster inference)
MODEL=zipvoice_distill docker compose up
```
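To keep the service running in the background, you can use standard Docker Compose flags to detach and follow the logs:

```bash
# Start in the background and follow the service logs
MODEL=zipvoice_distill docker compose up -d
docker compose logs -f
```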
Build and run the Docker container manually:

```bash
# Build the Docker image
docker build . -f Dockerfile.server -t soar97/triton-zipvoice:24.12

# Create and run the Docker container
your_mount_dir=/your/host/path:/your/container/path
docker run -it --name "zipvoice-server" --gpus all --net host \
    -v $your_mount_dir --shm-size=2g soar97/triton-zipvoice:24.12
```
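Before proceeding, you can confirm that the container sees the GPUs:

```bash
# Verify GPU visibility from inside the running container
docker exec -it zipvoice-server nvidia-smi
```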
The run.sh script automates the entire deployment workflow through numbered stages. Run specific stages with:

```bash
bash run.sh <start_stage> <stop_stage> [model_name]
```

- `<start_stage>`: starting stage number (1-8)
- `<stop_stage>`: ending stage number (1-8)
- `[model_name]`: optional model name (`zipvoice` or `zipvoice_distill`; default: `zipvoice_distill`)
Available Stages:
- Stage 1: Downloads ZipVoice models from HuggingFace
- Stage 2: Exports models to TensorRT format and builds optimized engines
- Stage 3: Creates Triton model repository and configuration files
- Stage 4: Launches Triton Inference Server
- Stage 5: Runs gRPC benchmark tests with multiple concurrency levels
- Stage 6: Tests HTTP client with sample audio
- Stage 7: Launches PyTriton server with speaker caching
- Stage 8: Tests PyTriton server with speaker cache benchmarks
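For example, to run only the model download and TensorRT export stages:

```bash
# Stages 1-2: download checkpoints and build TensorRT engines
bash run.sh 1 2 zipvoice
```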
Build TensorRT engines and launch the Triton server:
```bash
# Complete setup and launch (stages 1-4)
bash run.sh 1 4 zipvoice_distill
```

> **Note:** To change the default NFE (number of function evaluations) steps, edit `model_repo/zipvoice/1/model.py` manually.
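Once stage 4 has launched the server, you can verify that it is ready to accept requests with Triton's standard HTTP health endpoint (assuming the default HTTP port 8000):

```bash
# Returns HTTP 200 once the server and its models are ready
curl -v localhost:8000/v2/health/ready
```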
Launch the PyTriton server with speaker caching for improved performance:
```bash
# Launch the PyTriton server (stage 7)
bash run.sh 7 7 zipvoice_distill
```

> **Note:** The PyTriton server does not require the Docker environment; you can install PyTriton directly with `pip install nvidia-pytriton`.
Test the server with a simple HTTP client:
```bash
python3 client_http.py --reference-audio prompt.wav \
    --reference-text "Your reference text here" \
    --target-text "Text to synthesize" \
    --output-audio "./output.wav"
```

Run performance benchmarks using the gRPC client:
```bash
# Single-task benchmark
python3 client_grpc.py --num-tasks 1 --huggingface-dataset yuekai/seed_tts_cosy2 \
    --split-name wenetspeech4tts

# Multi-task benchmark
num_task=8
python3 client_grpc.py --num-tasks $num_task --huggingface-dataset yuekai/seed_tts_cosy2 \
    --split-name wenetspeech4tts
```

Run automated benchmarks across multiple concurrency levels:
```bash
# Benchmark the Triton server (stage 5)
bash run.sh 5 5 zipvoice_distill

# Benchmark the PyTriton server with speaker cache (stage 8)
bash run.sh 8 8 zipvoice_distill
```

Performance metrics on a single NVIDIA L20 GPU, measured over 26 different prompt-text pairs with ZipVoice Distill (4 NFE steps):
| Concurrency | Processing Time (s) | P50 Latency (ms) | Avg Latency (ms) |
|---|---|---|---|
| 1 | 3.011 | 98.73 | 103.34 |
| 1 (with 3s prompt speaker cache) | 2.652 | 88.78 | 88.34 |
| 2 | 2.261 | 158.71 | 159.49 |
| 2 (with 3s prompt speaker cache) | 1.729 | 116.53 | 119.74 |
| 4 | 1.872 | 272.16 | 261.75 |
| 4 (with 3s prompt speaker cache) | 1.330 | 184.19 | 179.32 |
| 8 | 1.710 | 468.29 | 470.20 |
| 8 (with 3s prompt speaker cache) | 1.220 | 300.48 | 306.35 |
Deploy an OpenAI-compatible TTS API service:
```bash
# Clone the OpenAI bridge repository
git clone https://github.com/yuekaizhang/Triton-OpenAI-Speech.git
cd Triton-OpenAI-Speech
pip install -r requirements.txt

# Start the FastAPI bridge (after the Triton service is running)
python3 tts_server.py --url http://localhost:8000 \
    --ref_audios_dir ./ref_audios/ \
    --port 10086 \
    --default_sample_rate 24000
```
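Once the bridge is running, you can send it an OpenAI-style speech request. The sketch below is illustrative only: the exact route, JSON fields, and available voice names depend on the Triton-OpenAI-Speech bridge, so treat them as assumptions and check that repository's documentation:

```bash
# Hypothetical request following the OpenAI audio API convention;
# the route, payload fields, and voice name here are assumptions.
curl http://localhost:10086/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input": "Text to synthesize", "voice": "your_reference_voice", "response_format": "wav"}' \
    --output output.wav
```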
The PyTriton server supports speaker caching to improve performance for repeated synthesis with the same reference audio:

- Enabled with the `--use_speaker_cache` flag
- Reduces latency by using short-duration prompt audio (e.g., 3 seconds)
This work originates from the NVIDIA CISI project. For additional multimodal AI resources, visit mair-hub.