Below are the generic details and resources of the machine I used:
Operating System: Linux Mint 22.3 - MATE 64-bit (safe-graphics / boot with 'nomodeset') - Linux Kernel: 6.14.0-37-generic
GPU: not relevant / not used
CPU: 2 cores – Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2
RAM: 8 GB (DDR2)
Given my recent personal successes running CPU-only inference (no GPU) on local models of up to 4B–7B parameters, with more than acceptable speed and performance on really limited hardware and very little RAM, I decided to share the instructions and commands I used to compile llama.cpp on my machine, along with the arguments I currently use to launch it.
In my specific case, the configuration I share below allowed me to get the maximum out of my hardware, receive fast responses with no crashes at all, and even do other things while using the model (for example, browsing the web) while keeping everything fluid. My stubborn determination to run offline inference on a home computer and my initial frustration thinking it was impossible kept me awake for many nights until, after numerous attempts, I finally found the way to do it!
At the moment I am using MX Linux or Linux Mint (Debian) 64-bit as operating systems. I usually use Ubuntu, which I love and find very convenient, and it works fine too, but I switched to these two because they use much less RAM and, in my specific case, inference runs much better. In any case, the commands and configurations that follow work on any Debian-based operating system, including Ubuntu.
To get the maximum performance I recommend using a relatively lightweight OS like Linux Mint (easier to install) or even better MX Linux or AntiX Linux, although they can be a bit more complex to install and configure. This is my personal opinion and advice, but you are free to do whatever you want :)
Let’s get started step by step with the compilation and installation of llama.cpp on your Ubuntu/Debian system with CPU-only and 8 GB of RAM. These are the same steps I used, and I get responses with an average speed of 2 tokens/s on a 4B model.
The full guide is also available as a Public Gist here:
https://gist.github.com/0ut0flin3/a64a97d31eb3d0e9261f54206c3bacf3
[Phase 1: Installation]
Open a terminal.
Step 1: Install the necessary dependencies:
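The exact package list can vary, but on a fresh Debian/Ubuntu install something like the following covers the build tools llama.cpp needs (libcurl4-openssl-dev is optional, for downloading models directly with the llama.cpp tools):

```shell
sudo apt update
sudo apt install -y build-essential cmake git libcurl4-openssl-dev
```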
Step 2: Clone the repository (latest official version):
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Step 3: Compile optimized for CPU (no GPU):
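A typical CPU-only configure step looks like the following; the exact flags are an assumption (a plain Release build with no GPU backend), so adjust to taste:

```shell
# Configure a Release build; no CUDA/Vulkan flags means CPU-only.
cmake -B build -DCMAKE_BUILD_TYPE=Release
```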
Then build (use all available cores):
cmake --build build --config Release -j $(nproc)
Estimated time: 5–15 minutes depending on your CPU.
This is the configuration I used to get the most out of my CPU.
Step 4: Make sure everything went well:
You should see the version.
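For example, still inside the llama.cpp folder (the path assumes the default build directory):

```shell
# Prints the build version/commit if the compile succeeded.
./build/bin/llama-cli --version
```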
Step 5: (Optional but recommended) Add the binaries to your PATH so you can use
llama-cli, llama-server, etc. from any folder.

[Phase 2 – Important: Apply some tricks to improve performance with low RAM]
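For Step 5 above, one common approach (assuming llama.cpp was cloned into your home folder; adjust the path if yours differs) is appending the build directory to PATH in your shell profile:

```shell
echo 'export PATH="$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```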
Ok. Now, before using a model, let’s apply a few tricks to improve memory management during inference (they worked really well for me):
Step 1: Set vm.swappiness=10.
What it does: sets how “aggressive” the Linux kernel is in moving memory pages to swap (disk) when RAM fills up.
Default value: 60 (quite aggressive).
Why lower it to 10:
With only 8 GB and models that occupy 2.5–4 GB, the kernel tends to swap the model pages (which are mmap’ed) very easily. With low swappiness, the kernel prefers to drop filesystem cache (which is less critical) instead of swapping the model.
Result: fewer micro-freezes during token generation.
How to set it:
Temporary (until reboot):
sudo sysctl vm.swappiness=10
Permanent (recommended):
echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl -p /etc/sysctl.d/99-swappiness.conf
10 is great for low-RAM desktops. Some people set 5 or even 1, but 10 is a good compromise.
Step 2: Set memlock to unlimited.
Open the file:
sudo nano /etc/security/limits.conf
Add these lines:
* soft memlock unlimited
* hard memlock unlimited
Save and exit (CTRL+X → Enter).
* means it will be applied to all users; replace * with a specific username if you want to apply it only to one user.
Reboot your session (logout + login) to apply the changes.
Check that it was applied:
ulimit -l
It should return unlimited (IMPORTANT).

[Phase 3: Downloading a model and chatting completely offline (finally :))]
Step 1: Download a model in GGUF format from Hugging Face.
Create a dedicated folder for models:
mkdir -p ~/GGUF_models
Download (in this case we will download Qwen3.5 4B by Unsloth with Q4_K_M quantization — a good compromise between size and quality — but you can download any model you want; I recommend not going beyond 4B, or at most 6B–7B, which is still acceptable):
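A typical way to fetch a GGUF file from Hugging Face is with wget; the repository and file names below are placeholders, so copy the real URL from the model's "Files" tab:

```shell
# Placeholder repo/file names: replace <model> with the actual
# name shown on the Unsloth model page on Hugging Face.
wget -P ~/GGUF_models \
  "https://huggingface.co/unsloth/<model>-GGUF/resolve/main/<model>-Q4_K_M.gguf"
```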
Step 2: Launch the model with --mlock enabled (the most important point).
Create a bash script for convenience:
nano ~/chat.sh
Insert this content:
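A minimal sketch of what the script can contain, based on the flags described in this guide; the model filename is a placeholder, and the binary path assumes llama.cpp was built in your home folder:

```shell
#!/bin/bash
# Placeholder model filename: point -m at the GGUF file you downloaded.
~/llama.cpp/build/bin/llama-cli \
  -m ~/GGUF_models/<model>-Q4_K_M.gguf \
  --mlock \
  --ctx-size 2048 \
  -t 2 \
  -ngl 0
```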
Save and exit (CTRL+X → Enter).
Make it executable:
chmod +x ~/chat.sh

Important flags:
--mlock → locks the model in RAM (big performance boost in my case)
--ctx-size 2048 (or maximum 4096) → do not exaggerate with context
-t 2 → number of threads (adjust -t to the number of cores of your CPU — 2 in my case)
-ngl 0 → no GPU layers

Now you are ready to chat offline from your terminal :)
Start the chat:
~/chat.sh

I hope this has been useful to you :)