Support multiple lm_heads (multi-codebook) for LLM-based TTS models like Qwen3-TTS #21057
HaujetZhao started this conversation in Ideas.
Motivation
Modern neural codec-based TTS models (e.g., Qwen3-TTS, which uses the Mimi neural audio codec) generate audio via multiple codebooks — typically 8–16 parallel vocabulary heads. Each codebook has its own lm_head projection layer mapping the hidden state to a per-codebook vocabulary.
Current Workaround and Its Limitation
When porting such models to llama.cpp, a common workaround is to merge all codebook heads into a single large `output` tensor and then sample from a specific offset at inference time. However, this causes severe computational waste: even though only one codebook is needed per step, the full `ggml_mul_mat` computes logits for all `n_codebooks × vocab_size` tokens. For a 15-codebook model, that is ~14× redundant computation on every forward pass.
Proposed Solution
Add native support for multiple output heads in llama.cpp, allowing a model to declare `n_codebooks` separate `output.weight` tensors (e.g., `output.0.weight`, `output.1.weight`, …) and select the active head at inference time.
At the graph-build level, this would look like:

```cpp
// only multiply against the active codebook head
cur = ggml_mul_mat(ctx0, model.output_heads[active_codebook], cur);
```

This eliminates the redundancy entirely, making inference proportionally faster for multi-codebook models.
Alternatively (Minimal Change)
Even without a new multi-head architecture, exposing a way to pass a `ggml_view`-based slice of the `output` tensor through the API would let users achieve the same result without modifying the internal graph builder.
Affected Models: Qwen3-TTS, and other neural codec-based TTS models that generate audio through multiple codebooks.