Below are the generic details and resources of the machine I used:
Operating System: Linux Mint 22.3 - MATE 64-bit (safe-graphics / boot with 'nomodeset') - Linux Kernel: 6.14.0-37-generic
GPU: not relevant / not used
CPU: 2 cores – Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2
RAM: 8 GB (DDR2)
Given my recent personal successes running CPU-only inference (no GPU) on local models of up to 4B–7B parameters, with more than acceptable speed and performance on really limited hardware and very little RAM, I decided to share the instructions and commands I used to compile llama.cpp on my machine, along with the arguments I currently use to launch it.
In my specific case, the configuration I share below allowed me to get the maximum out of my hardware, receive fast responses with no crashes at all, and even do other things while using the model (for example, browsing the web) while keeping everything fluid. My stubborn determination to run offline inference on a home computer and my initial frustration thinking it was impossible kept me awake for many nights until, after numerous attempts, I finally found the way to do it!
At the moment I am using MX Linux or Linux Mint (Debian) 64-bit as operating systems. I usually use Ubuntu, which I love and find very convenient, and it works fine too, but I switched to these two because they use much less RAM and, in my specific case, inference runs much better. In any case, the commands and configurations that follow work on any Debian-based operating system, including Ubuntu.
To get the maximum performance I recommend using a relatively lightweight OS like Linux Mint (easier to install) or even better MX Linux or AntiX Linux, although they can be a bit more complex to install and configure. This is my personal opinion and advice, but you are free to do whatever you want :)
Let’s get started step by step with the compilation and installation of llama.cpp on your Ubuntu/Debian system with CPU-only and 8 GB of RAM. These are the same steps I used, and I get responses with an average speed of 2 tokens/s on a 4B model.
The full guide is also available as a Public Gist here:
https://gist.github.com/0ut0flin3/a64a97d31eb3d0e9261f54206c3bacf3
[Phase 1: Installation]
Open a terminal.
Step 1: Install the necessary dependencies:
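The exact package list can vary, but on a fresh Debian/Ubuntu install something like the following covers the build tools llama.cpp needs (libcurl4-openssl-dev is optional, for downloading models directly with the llama.cpp tools):

```shell
sudo apt update
sudo apt install -y build-essential cmake git libcurl4-openssl-dev
```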
Step 2: Clone the repository (latest official version):
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Step 3: Compile optimized for CPU (no GPU):
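A typical CPU-only configure step looks like the following; the exact flags are an assumption (a plain Release build with no GPU backend), so adjust to taste:

```shell
# Configure a Release build; no CUDA/Vulkan flags means CPU-only.
cmake -B build -DCMAKE_BUILD_TYPE=Release
```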
Then build (use all available cores):
cmake --build build --config Release -j $(nproc)
Estimated time: 5–15 minutes depending on your CPU.
This is the configuration I used to get the most out of my CPU.
Step 4: Make sure everything went well:
You should see the version.
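For example, still inside the llama.cpp folder (the path assumes the default build directory):

```shell
# Prints the build version/commit if the compile succeeded.
./build/bin/llama-cli --version
```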
Step 5: (Optional but recommended) Add the binaries to your PATH so you can use
llama-cli, llama-server, etc. from any folder.

[Phase 2 – Important: Apply some tricks to improve performance with low RAM]
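For Step 5 above, one common approach (assuming llama.cpp was cloned into your home folder; adjust the path if yours differs) is appending the build directory to PATH in your shell profile:

```shell
echo 'export PATH="$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```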
Ok. Now, before using a model, let’s apply a few tricks to improve memory management during inference (they worked really well for me):
Step 1: Set vm.swappiness=10.
What it does: sets how “aggressive” the Linux kernel is in moving memory pages to swap (disk) when RAM fills up.
Default value: 60 (quite aggressive).
Why lower it to 10:
With only 8 GB and models that occupy 2.5–4 GB, the kernel tends to swap the model pages (which are mmap’ed) very easily. With low swappiness, the kernel prefers to drop filesystem cache (which is less critical) instead of swapping the model.
Result: fewer micro-freezes during token generation.
How to set it:
Temporary (until reboot):
sudo sysctl vm.swappiness=10
Permanent (recommended):
echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl -p /etc/sysctl.d/99-swappiness.conf
10 is great for low-RAM desktops. Some people set 5 or even 1, but 10 is a good compromise.
Step 2: Set memlock to unlimited.
Open the file:
sudo nano /etc/security/limits.conf
Add these lines:
* soft memlock unlimited
* hard memlock unlimited
Save and exit (CTRL+X → Enter).
* means it will be applied to all users; replace * with a specific username if you want to apply it only to one user.
Reboot your session (logout + login) to apply the changes.
Check that it was applied:
ulimit -l
It should return unlimited (IMPORTANT).

[Phase 3: Downloading a model and chatting completely offline (finally :))]
Step 1: Download a model in GGUF format from Hugging Face.
Create a dedicated folder for models:
mkdir -p ~/GGUF_models
Download (in this case we will download Qwen3.5 4B by Unsloth with Q4_K_M quantization — a good compromise between size and quality — but you can download any model you want; I recommend not going beyond 4B, or at most 6B–7B, which is still acceptable):
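A typical way to fetch a GGUF file from Hugging Face is with wget; the repository and file names below are placeholders, so copy the real URL from the model's "Files" tab:

```shell
# Placeholder repo/file names: replace <model> with the actual
# name shown on the Unsloth model page on Hugging Face.
wget -P ~/GGUF_models \
  "https://huggingface.co/unsloth/<model>-GGUF/resolve/main/<model>-Q4_K_M.gguf"
```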
Step 2: Launch the model with --mlock enabled (the most important point).
Create a bash script for convenience:
nano ~/chat.sh
Insert this content:
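A minimal sketch of what the script can contain, based on the flags described in this guide; the model filename is a placeholder, and the binary path assumes llama.cpp was built in your home folder:

```shell
#!/bin/bash
# Placeholder model filename: point -m at the GGUF file you downloaded.
~/llama.cpp/build/bin/llama-cli \
  -m ~/GGUF_models/<model>-Q4_K_M.gguf \
  --mlock \
  --ctx-size 2048 \
  -t 2 \
  -ngl 0
```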
Save and exit (CTRL+X → Enter).
Make it executable:
chmod +x ~/chat.sh

Important flags:
--mlock → locks the model in RAM (big performance boost in my case)
--ctx-size 2048 (or maximum 4096) → do not exaggerate with context
-t 2 → number of threads (adjust -t to the number of cores of your CPU — 2 in my case)
-ngl 0 → no GPU layers

Now you are ready to chat offline from your terminal :)
Start the chat:
~/chat.sh

I hope this has been useful to you :)