Description
Context
The README helpfully notes that vLLM requires 24GB+ VRAM and points users with lower-VRAM GPUs toward Ollama/LM Studio with GGUF quantized models. However, I ran into some difficulties getting this path to work end-to-end and wanted to share feedback that might help other users.
Experience
I tried hosting the GGUF model via Ollama on an 8GB laptop GPU. While the server started, fara-cli failed when making its first model call. Since Fara-7B is a vision-language model (Qwen2.5-VL) that sends base64-encoded screenshots via the OpenAI image_url content type on every step, it's possible Ollama's OpenAI-compatible endpoint doesn't fully support this for Qwen2.5-VL GGUF models — though I'm not 100% sure this was the root cause vs. VRAM constraints.
It would help to know whether the team has validated this path end-to-end, and if so, what configuration was used.
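To help separate "endpoint rejects vision input" from "ran out of VRAM," here is a minimal probe sketch that builds the same OpenAI-style message shape fara-cli sends (a base64 screenshot as an `image_url` content part). The port (11434, Ollama's default), the `/v1/chat/completions` path, and the tiny embedded PNG are my assumptions; the model name is a placeholder.

```python
import base64

# A 1x1 transparent PNG (assumed valid), so the probe needs no screenshot file.
TINY_PNG_B64 = ("iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJ"
                "AAAADUlEQVR42mP8z8BQDwAEhQGAhKmMIQAAAABJRU5ErkJggg==")

def vision_probe_payload(model: str) -> dict:
    """Build an OpenAI-style chat request with an image_url content part,
    mirroring the per-step screenshot message a vision agent would send."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one word."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{TINY_PNG_B64}"}},
            ],
        }],
        "max_tokens": 8,
    }

payload = vision_probe_payload("<model_name>")
# POSTing `payload` to http://localhost:11434/v1/chat/completions (assumed
# default Ollama endpoint) should make the failure mode visible: a schema or
# "images not supported" error points at the endpoint, an OOM points at VRAM.
```

If the tiny image succeeds but full screenshots fail, that would point toward memory rather than compatibility.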
Suggestions for the documentation
1. Example commands
The section says to specify the correct --base_url, --api_key, and --model but does not provide concrete values. Adding something like this would reduce trial and error:
```
ollama pull <exact_model_name>
fara-cli \
  --task "..." \
  --base_url http://localhost:11434/v1 \
  --api_key ollama \
  --model <model_name>
```
2. VRAM guidance
The advice to select the largest model that fits your GPU is reasonable, but a rough table would help users choose a quantization level more confidently:
| VRAM | Suggested quantization | Notes |
|---|---|---|
| 8GB | Q4_K_M (~4.5GB) | Tight with KV cache |
| 12GB | Q5_K_M / Q6_K | |
| 16GB | Q8_0 or FP16 | |
| 24GB+ | FP16 via vLLM | Recommended path |
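The table above can be sanity-checked with a back-of-envelope estimate: file size is roughly parameters times average bits per weight, plus the KV cache. The model shape below (28 layers, 4 KV heads via GQA, head dim 128, 7.6B parameters) and the effective bits-per-weight figures are my assumptions for a Qwen2.5-7B-class model, not values taken from this repo.

```python
def gguf_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate GGUF file size: parameter count x average bits per weight."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) x layers x kv_heads x head_dim x context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Assumed Qwen2.5-7B-class shape and an 8k context -- illustrative only.
kv = kv_cache_gb(28, 4, 128, 8192)
for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"{name}: ~{gguf_size_gb(7.6, bpw) + kv:.1f} GB (weights + KV cache)")
```

This ignores activations, the vision encoder, and runtime overhead, so real headroom requirements are somewhat higher, which is why 8GB is "tight" for Q4_K_M.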
3. Vision model compatibility note
Since GGUF quantization and llama.cpp-based servers may handle vision inputs differently than vLLM, it would help to clarify whether any quality or compatibility trade-offs should be expected compared to the vLLM path.
4. Modelfile reference
There is a Modelfile in the repository root that is not mentioned in the README. If it is intended for Ollama use, a short note linking to it would make the workflow clearer.
Not a blocker; just sharing this in case it helps improve onboarding for users who start with the Ollama or LM Studio path.