Replies: 3 comments 5 replies
There is a draft PR open, #18665, though I'm not sure about its status. I've tried converting to GGUF (S1M0N38/Qwen3-VL-Embedding-2B-Q4_K_M-GGUF), and it seems that Qwen has fixed the issue.
I got it kind of working. I checked cosine similarity on similar images manually and it looks like it is working correctly. What I don't know yet is how to get the same vectors back when sending an array of multiple objects inside "content": if I send several items with the same prompt and data, each comes back slightly different (about 0.08%) when they should be identical.
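The manual cosine check described above can be done in a few lines of NumPy (the embedding values here are placeholders, not real model output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical inputs should yield identical embeddings (similarity ~ 1.0).
# A relative drift of ~0.08% between runs of the same input suggests
# non-determinism (e.g. batching or accumulation order), not a bad conversion.
e1 = [0.12, -0.40, 0.88]
e2 = [0.12, -0.40, 0.88]
print(cosine_similarity(e1, e2))
```

Running the same input twice through the server and comparing the vectors this way separates "wrong embeddings" from "slightly non-deterministic embeddings".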
I've fixed the Qwen3-VL-Embedding issues in llama.cpp and verified the fix with regression tests. Check out the code here: https://github.com/Tokimorphling/qwen3-vl-embedding
Motivation
Qwen recently released Qwen3-VL-Embedding (2B and 8B variants) — a state-of-the-art multimodal embedding model built on the Qwen3-VL architecture. It achieves 77.8 on MMEB-V2 (ranked #1 as of January 2025) and supports text, images, screenshots, videos, and mixed-modal inputs in a unified embedding space.
Why this matters
Currently, llama.cpp supports:
- Vision-language models for generation (via `--mmproj`)
- Text-only embedding models (via `--embedding --pooling last`)

There's no way to run a multimodal embedding model that takes both images and text as input and produces a single embedding vector. This is a fundamentally different use case from VLM generation: it's about retrieval, RAG with images/documents, cross-modal search, and clustering.
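The gap can be seen from the two launch modes that exist today (model paths below are placeholders):

```shell
# 1. VLM generation: text + image in, text out
llama-server -m qwen3-vl.gguf --mmproj mmproj-qwen3-vl.gguf

# 2. Text-only embeddings: text in, vector out
llama-server -m embedding-model.gguf --embedding --pooling last

# There is no mode combining both: text + image in, embedding vector out.
```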
What would need to happen
The model architecture is essentially Qwen3-VL (same vision encoder + merger) but used in embedding/pooling mode rather than generation mode. The key pieces:
- `convert_hf_to_gguf.py` should handle the embedding variant (the text backbone converts fine since it's the same Qwen3 architecture, but the mmproj needs to work in embedding mode)
- Inference should run in embedding/pooling mode (`--embedding --pooling last`) instead of autoregressive generation
- `llama-server` should support `/v1/embeddings` with image inputs (base64 or URL), similar to how `/v1/chat/completions` handles images today

Key model details
References
Use cases
Would love to hear if this is on the roadmap or if anyone has started working on this. Happy to help test!
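For concreteness, here is one possible shape for a multimodal `/v1/embeddings` request, modeled on how `/v1/chat/completions` accepts image content parts today. This is purely a sketch of the proposal: llama.cpp does not support this, and the field names are assumptions.

```python
import json

# Hypothetical payload: one "input" entry mixing text and an image,
# reusing the content-part format from /v1/chat/completions.
payload = {
    "model": "qwen3-vl-embedding-2b",
    "input": [
        {
            "content": [
                {"type": "text", "text": "a red bicycle leaning against a wall"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/bike.jpg"},
                },
            ]
        }
    ],
}

print(json.dumps(payload, indent=2))
```

The server would run the vision encoder + merger on the image part, interleave it with the text tokens, and return a single pooled vector per input entry, mirroring the text-only embeddings response format.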