Support multiple lm_heads (multi-codebook) for LLM-based TTS models like Qwen3-TTS #21057
HaujetZhao started this conversation in Ideas.
Motivation
Modern neural codec-based TTS models (e.g., Qwen3-TTS, which uses the Mimi neural audio codec) generate audio via multiple codebooks — typically 8–16 parallel vocabulary heads. Each codebook has its own lm_head projection layer mapping the hidden state to a per-codebook vocabulary.
Current Workaround and Its Limitation
When porting such models to llama.cpp, a common workaround is to merge all codebook heads into a single large `output` tensor and then sample from a specific offset at inference time. However, this causes severe computational waste: even though only one codebook is needed per step, the full `ggml_mul_mat` computes logits for all `n_codebooks × vocab_size` tokens. For a 15-codebook model, that is ~14× redundant computation on every forward pass.
Proposed Solution
Add native support for multiple output heads in llama.cpp, allowing a model to declare `n_codebooks` separate `output.weight` tensors (e.g., `output.0.weight`, `output.1.weight`, …) and select the active head at inference time.
At the graph-build level, this would look like:

```cpp
// only multiply against the active codebook head
cur = ggml_mul_mat(ctx0, model.output_heads[active_codebook], cur);
```

This eliminates the redundancy entirely, making inference proportionally faster for multi-codebook models.
Alternatively (Minimal Change)
Even without a new multi-head architecture, exposing a way to pass a `ggml_view`-based slice of the `output` tensor through the API would let users achieve the same result without modifying the internal graph builder.
Affected Models: Qwen3-TTS, and other neural codec-based TTS models that generate audio through multiple codebooks.