Speculative decoding fails (almost) silently instead of clearly notifying users about draft model loading issues #280

@vnicolici

Description

When I enabled the option to visualize the accepted tokens, nothing was highlighted. Furthermore, after the assistant reply, I got no statistics about the accepted tokens. That was the only indication something was not quite right, but it wasn't obvious what the problem was.

I'm using the qwen/qwen3-235b-a22b-2507 Q3_K_L GGUF as the main model, and I tried using the qwen/qwen3-30b-a3b-2507 and qwen/qwen3-4b-2507 Q3_K_L GGUFs as draft models for it; neither worked.

First of all, when I initially loaded the main model, even though I had linked the draft model to it in the main model's settings, only the main model was loaded, not the draft model.

This is how I configured the main model:

[Screenshot: main model configuration]

Then, in the model parameters configuration for the chat, I have these settings, with the draft model setting inherited from the main model (the behavior is exactly the same if I don't configure it at the model level and just set it here manually):

[Screenshot: chat model parameters, draft model setting inherited from the main model]

Then, when I enter the first prompt in the chat, LM Studio seems to detect that the main model wasn't initially loaded with the draft model, and reloads the main model along with the draft model.

However, even after both models finally seem to be loaded together following the first prompt, there is no evidence of speculative decoding actually being used. This is how the first assistant reply looks:

[Screenshot: first assistant reply, with no accepted-token highlighting or statistics]

Relevant sections from the logs (initially I missed the errors while scrolling through the logs):

2026-02-20 03:38:44 [DEBUG]
 srv    load_model: loading draft model 'C:\Users\vladn\.lmstudio\models\lmstudio-community\Qwen3-30B-A3B-Instruct-2507-GGUF\Qwen3-30B-A3B-Instruct-2507-Q3_K_L.gguf

[...]

2026-02-20 03:38:53 [DEBUG]
 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7168.00 MiB on device 1: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA1 buffer of size 7516192768
2026-02-20 03:38:53 [DEBUG]
 llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
failed to create draft context: Failed to initialize the context: failed to allocate buffer for kv cache
slot   load_model: id  0 | task -1 | speculative decoding context not initialized

[...]

2026-02-20 03:38:54 [DEBUG]
 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7168.00 MiB on device 1: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA1 buffer of size 7516192768
2026-02-20 03:38:54 [DEBUG]
 llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
failed to create draft context: Failed to initialize the context: failed to allocate buffer for kv cache
slot   load_model: id  1 | task -1 | speculative decoding context not initialized

[...]

2026-02-20 03:38:54 [DEBUG]
 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7168.00 MiB on device 1: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA1 buffer of size 7516192768
2026-02-20 03:38:54 [DEBUG]
 llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
failed to create draft context: Failed to initialize the context: failed to allocate buffer for kv cache
slot   load_model: id  2 | task -1 | speculative decoding context not initialized

[...]

2026-02-20 03:38:54 [DEBUG]
 ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7168.00 MiB on device 1: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA1 buffer of size 7516192768
2026-02-20 03:38:54 [DEBUG]
 llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
failed to create draft context: Failed to initialize the context: failed to allocate buffer for kv cache
slot   load_model: id  3 | task -1 | speculative decoding context not initialized

[...]

2026-02-20 03:45:45 [DEBUG]
 no implementations specified for speculative decoding
slot   load_model: id  0 | task -1 | speculative decoding context not initialized
slot   load_model: id  0 | task -1 | new slot, n_ctx = 131072
no implementations specified for speculative decoding
slot   load_model: id  1 | task -1 | speculative decoding context not initialized
slot   load_model: id  1 | task -1 | new slot, n_ctx = 131072
no implementations specified for speculative decoding
slot   load_model: id  2 | task -1 | speculative decoding context not initialized
slot   load_model: id  2 | task -1 | new slot, n_ctx = 131072
no implementations specified for speculative decoding
slot   load_model: id  3 | task -1 | speculative decoding context not initialized
slot   load_model: id  3 | task -1 | new slot, n_ctx = 131072

So, while reporting this issue I found the problem. Loading the draft model actually fails due to lack of video memory, and the UI just ignores this and apparently proceeds to load the main model a third time, this time without the draft model, instead of clearly notifying the user about the issue.
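Since the failure is currently only visible in the server log, a quick sanity check is to scan the log for the failure markers. Below is a minimal sketch; the marker strings are taken verbatim from the log excerpts above, and pointing it at a real log file (rather than the inlined sample) is left as an exercise:

```python
# Marker strings copied verbatim from the LM Studio / llama.cpp log excerpts above.
FAILURE_MARKERS = [
    "cudaMalloc failed: out of memory",
    "failed to create draft context",
    "speculative decoding context not initialized",
]

def find_draft_failures(log_text: str) -> list[str]:
    """Return every log line containing a known draft-model failure marker."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]

# Sample lines copied from the log excerpts in this issue:
SAMPLE = """\
srv    load_model: loading draft model '...'
failed to create draft context: Failed to initialize the context: failed to allocate buffer for kv cache
slot   load_model: id  0 | task -1 | speculative decoding context not initialized
"""

for hit in find_draft_failures(SAMPLE):
    print(hit)
```

Of course, users shouldn't have to grep logs for this; surfacing these lines in the UI is exactly what this issue is asking for.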

In my opinion, it would be better to prevent the main model from being used when the selected draft model can't be loaded, instead of silently running inference without it.

As for the error messages shown while loading models, they are very easy to miss, because they only persist on screen for a few seconds before disappearing. For main model loading errors this is not such a big issue: even if you miss the notification, it's obvious the model hasn't loaded (though I have doubts the error is always shown in the UI at all, even temporarily).

For draft models, however, it's a much bigger problem: if you look away from the screen for a few seconds while a big model loads and miss the error message, nothing obvious indicates the failure. This cost me a lot of time troubleshooting.

And a less technical user would probably have even more trouble getting to the bottom of this.

Anyway, after a lot of lost time, it finally works; the solution was reducing the context size so everything fits in the available video memory:

[Screenshot: working configuration with reduced context size]
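For reference, the out-of-memory error in the logs comes from the KV cache, whose size grows linearly with context length, which is why shrinking the context helped. A rough back-of-the-envelope sketch follows; the layer/head counts are illustrative placeholders, not values read from the actual GGUF, and llama.cpp's real allocation also depends on KV cache quantization and how layers are split across GPUs:

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: a K and a V tensor per layer, per token,
    with f16 (2 bytes) per element by default."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Placeholder model shape (NOT read from the real model files):
print(kv_cache_bytes(48, 131072, 4, 128) / 1024**2, "MiB at 131072 context")
print(kv_cache_bytes(48, 8192, 4, 128) / 1024**2, "MiB at 8192 context")
```

The point is simply that cutting the context from 131072 to a few thousand tokens shrinks the draft model's KV cache by the same factor, which is what let it fit alongside the main model here.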
