
$\Large \boldsymbol{\mathsf{\color{#6366f1}M\color{#a855f7}3\color{#ec4899}\text{-}TTS}}: \mathsf{\color{#de2910}M\color{black}\text{ulti-}\color{#de2910}M\color{black}\text{odal\ DiT\ Alignment}\ \color{black}\&\ \color{#de2910}M\color{black}\text{el-latent}}$

📅 Roadmap

Release model code
Release training and inference code
Release pre-trained model weights

🔥 Key Features

No Pseudo-Alignment: Achieves stable alignment implicitly via Joint-DiT attention.
Mel-VAE Codec: Efficient latent representation for faster training and high-fidelity reconstruction.
Unified Architecture: A simple, end-to-end framework without complex multi-stage pipelines.
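
The first feature above can be illustrated with a minimal sketch of joint attention, in the style of multi-modal DiT blocks. This is not the repository's implementation: the function name `joint_attention`, the single-head layout, the per-modality projection dictionaries, and all shapes are assumptions made for illustration. The sketch only shows the general idea that text tokens and speech-latent frames can be concatenated into one sequence, so every speech frame may attend to any text token and the text-to-speech alignment is carried by the attention weights. That is one plausible reading of "no pseudo-alignment".

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(text_tokens, speech_tokens, w_text, w_speech):
    """Single-head attention over the concatenation of two modalities.

    Hypothetical helper: each modality keeps its own Q/K/V projections
    (dicts with keys "q", "k", "v"), but attention runs over the joint
    sequence [text; speech].
    """
    n_text = text_tokens.shape[0]
    # Project each modality separately, then concatenate along the sequence axis.
    q = np.concatenate([text_tokens @ w_text["q"], speech_tokens @ w_speech["q"]])
    k = np.concatenate([text_tokens @ w_text["k"], speech_tokens @ w_speech["k"]])
    v = np.concatenate([text_tokens @ w_text["v"], speech_tokens @ w_speech["v"]])
    # Scaled dot-product attention over all (text + speech) positions.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    out = weights @ v
    # Split the joint output back into the two streams.
    return out[:n_text], out[n_text:], weights


# Toy sizes, chosen arbitrarily: 5 text tokens, 20 speech-latent frames.
rng = np.random.default_rng(0)
dim, n_text, n_speech = 16, 5, 20
text = rng.normal(size=(n_text, dim))
speech = rng.normal(size=(n_speech, dim))
w_text = {n: rng.normal(size=(dim, dim)) / np.sqrt(dim) for n in ("q", "k", "v")}
w_speech = {n: rng.normal(size=(dim, dim)) / np.sqrt(dim) for n in ("q", "k", "v")}

text_out, speech_out, weights = joint_attention(text, speech, w_text, w_speech)

# Rows are speech frames, columns are text tokens: a soft alignment map.
speech_to_text = weights[n_text:, :n_text]
```

In this sketch no duration predictor or length-padded text sequence is involved. The `speech_to_text` block of the attention matrix is the only place where frame-to-token correspondence appears, and with random weights it is of course not meaningful.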

🙌 Acknowledgements

This project builds on the excellent work of F5-TTS, MMAudio, and ZipVoice. We thank the authors for their open-source contributions.

📝 Citation

If you find our work helpful for your research, please consider giving the repository a star ⭐ and citing it 📝:

@article{wang2025m3tts,
  title={M3-TTS: Multi-modal DiT Alignment \& Mel-latent for Zero-shot High-fidelity Speech Synthesis},
  author={Wang, Xiaopeng and Qiang, Chunyu and Fu, Ruibo and others},
  journal={arXiv preprint arXiv:2512.04720},
  year={2025}
}
