Replies: 3 comments 5 replies
There is a draft PR open, #18665, though I'm not sure about its status. I've tried converting to GGUF (S1M0N38/Qwen3-VL-Embedding-2B-Q4_K_M-GGUF), and it seems that Qwen has fixed the issue.
I got it kind of working. I checked cosine similarity on similar images manually and it looks like it is working correctly. What I don't know yet is how to get the same vectors back when sending an array of multiple objects inside "content": if I send several items with the same prompt and data, each comes back slightly different (about 0.08%) when they should be identical.
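The manual cosine check described above can be done in a few lines of NumPy (the embedding values here are placeholders, not real model output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical inputs should yield identical embeddings (similarity ~ 1.0).
# A relative drift of ~0.08% between runs of the same input suggests
# non-determinism (e.g. batching or accumulation order), not a bad conversion.
e1 = [0.12, -0.40, 0.88]
e2 = [0.12, -0.40, 0.88]
print(cosine_similarity(e1, e2))
```

Running the same input twice through the server and comparing the vectors this way separates "wrong embeddings" from "slightly non-deterministic embeddings".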
I've fixed the Qwen3-VL-Embedding issues in llama.cpp and verified the fix with regression tests. Check out the code here: https://github.com/Tokimorphling/qwen3-vl-embedding
Motivation
Qwen recently released Qwen3-VL-Embedding (2B and 8B variants) — a state-of-the-art multimodal embedding model built on the Qwen3-VL architecture. It achieves 77.8 on MMEB-V2 (ranked #1 as of January 2025) and supports text, images, screenshots, videos, and mixed-modal inputs in a unified embedding space.
Why this matters
Currently, llama.cpp supports:
- Vision-language models for generation (via `--mmproj`)
- Text-only embedding models (via `--embedding --pooling last`)

There's no way to run a multimodal embedding model that takes both images and text as input and produces a single embedding vector. This is a fundamentally different use case from VLM generation: it's about retrieval, RAG with images/documents, cross-modal search, and clustering.
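The gap can be seen from the two launch modes that exist today (model paths below are placeholders):

```shell
# 1. VLM generation: text + image in, text out
llama-server -m qwen3-vl.gguf --mmproj mmproj-qwen3-vl.gguf

# 2. Text-only embeddings: text in, vector out
llama-server -m embedding-model.gguf --embedding --pooling last

# There is no mode combining both: text + image in, embedding vector out.
```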
What would need to happen
The model architecture is essentially Qwen3-VL (same vision encoder + merger) but used in embedding/pooling mode rather than generation mode. The key pieces:
- `convert_hf_to_gguf.py` should handle the embedding variant (the text backbone converts fine since it's the same Qwen3 architecture, but the mmproj needs to work in embedding mode)
- Inference should run in embedding/pooling mode (`--embedding --pooling last`) instead of autoregressive generation
- `llama-server` should support `/v1/embeddings` with image inputs (base64 or URL), similar to how `/v1/chat/completions` handles images today

Key model details
References
Use cases
Would love to hear if this is on the roadmap or if anyone has started working on this. Happy to help test!
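For concreteness, here is one possible shape for a multimodal `/v1/embeddings` request, modeled on how `/v1/chat/completions` accepts image content parts today. This is purely a sketch of the proposal: llama.cpp does not support this, and the field names are assumptions.

```python
import json

# Hypothetical payload: one "input" entry mixing text and an image,
# reusing the content-part format from /v1/chat/completions.
payload = {
    "model": "qwen3-vl-embedding-2b",
    "input": [
        {
            "content": [
                {"type": "text", "text": "a red bicycle leaning against a wall"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/bike.jpg"},
                },
            ]
        }
    ],
}

print(json.dumps(payload, indent=2))
```

The server would run the vision encoder + merger on the image part, interleave it with the text tokens, and return a single pooled vector per input entry, mirroring the text-only embeddings response format.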