
Optimize Prompt Processing Chunk Size to 8192 → Boost Prefill Speed by Up to 1.5x for MLX Engine #507

@Linwei-Chen

Description


🚀 Optimize Prompt Processing Chunk Size to 8192 → Boost Prefill Speed by Up to 1.5x


Platform: M1 Pro (32 GB RAM), 10 CPU cores, 16-core GPU (MLX backend)
Model Tested: root4k/qwen3-coder-30b-a3b-instruct-mlx
Objective: Reduce prompt processing time for long inputs via chunk size tuning


✅ Summary

By increasing PROMPT_PROCESSING_CHUNK_SIZE and prefill_step_size from 512 → 8192, we achieved significant performance gains in the prefill phase (i.e., processing the initial prompt before generation starts). The best balance between memory use and throughput yields up to 1.5× faster prefill on long prompts.

🔥 Best performance: chunk_size = 8192, about 1.5× faster than the default on long contexts.

💡 Important: You must restart LM Studio after applying the patch for the changes to take effect.
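
For background: these engines fill the KV cache by pushing the prompt through the model in fixed-size slices, and the chunk size sets how many tokens go per forward pass. Below is a minimal sketch of that loop, modeled on mlx_lm-style APIs; the `prefill` helper, the `model(chunk, cache=...)` call, and the `.state` attribute are assumptions for illustration, not code copied from the patched files:

```python
import mlx.core as mx

PROMPT_PROCESSING_CHUNK_SIZE = 8192  # the constant this patch raises from 512

def prefill(model, tokens: mx.array, cache: list) -> None:
    """Push the prompt through the model chunk by chunk, filling the KV cache.

    Larger chunks mean fewer forward passes and better GPU utilization,
    at the cost of a larger temporary activation footprint.
    """
    for start in range(0, tokens.size, PROMPT_PROCESSING_CHUNK_SIZE):
        chunk = tokens[start : start + PROMPT_PROCESSING_CHUNK_SIZE]
        model(chunk[None], cache=cache)    # logits are discarded during prefill
        mx.eval([c.state for c in cache])  # materialize the cache per chunk
```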


📊 Performance Comparison Table

| Prompt Length | Chunk Size | Time (s) | Speedup vs 512 |
|---|---|---|---|
| 5,000 tokens | 512 | 23.4 | 1.0× |
| 5,000 tokens | 2,048 | 19.6 | 1.19× |
| 5,000 tokens | 4,096 | 19.2 | 1.22× |
| 5,000 tokens | 8,192 | 19.2 | 1.22× |
| 18,000 tokens | 512 | 164.3 | 1.0× |
| 18,000 tokens | 4,096 | 116.7 | 1.41× |
| 18,000 tokens | 8,192 | 105.1 | 1.56× |
| 18,000 tokens | 16,384 | 166.2 | 0.99× (≈1.0×) |

✅ Best throughput: 8,192 gives the fastest prefill at both tested lengths (tied with 4,096 at 5,000 tokens)
⚠️ Memory tradeoff: larger chunks improve speed but require more memory; on an M1 Pro with 32 GB, 8192 is optimal within safe limits.

💬 Note: at 16,384, performance regresses, likely due to memory allocation overhead or GPU kernel launch limits in MLX.
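
If you want to reproduce the table on your own hardware, a rough timing harness like the one below should work. This is a sketch, not the script used for the numbers above: it assumes your installed mlx_lm version forwards a `prefill_step_size` keyword from `generate()` down to the prefill loop (the patched `generate.py` shows the parameter exists, but check your version's signature), and the prompt-length estimate is ballpark only.

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("root4k/qwen3-coder-30b-a3b-instruct-mlx")
prompt = "lorem ipsum dolor sit amet " * 1000  # rough stand-in for a long prompt

for step in (512, 2048, 4096, 8192, 16384):
    t0 = time.perf_counter()
    # max_tokens=1 so the measurement is dominated by prefill, not decoding
    generate(model, tokenizer, prompt=prompt, max_tokens=1,
             prefill_step_size=step)
    print(f"prefill_step_size={step}: {time.perf_counter() - t0:.1f}s")
```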


πŸ› οΈ Patch Script (Auto-Adjusted for Any Size)


```bash
#!/bin/bash
# Patch LM Studio's MLX backends: raise the prompt-processing chunk size
# from 512 to 8192, cutting prefill time by up to ~1.5x on large prompts.

# Chunk size values
OLD_CHUNK_SIZE=512   # the default chunk size used by LM Studio
NEW_CHUNK_SIZE=8192  # the optimized chunk size we want to set

LMSTUDIO_DIR=~/.lmstudio/extensions/backends/vendor/_amphibian

if [ ! -d "$LMSTUDIO_DIR" ]; then
    echo "Error: LM Studio extensions not found at $LMSTUDIO_DIR"
    exit 1
fi

echo "=== Patching LM Studio MLX backends ==="

# Patch cache_wrapper.py (mlx_engine)
echo ""
echo "Patching mlx_engine/cache_wrapper.py..."
for f in "$LMSTUDIO_DIR"/*/lib/python3.11/site-packages/mlx_engine/cache_wrapper.py; do
    if [ -f "$f" ]; then
        sed -i '' "s/PROMPT_PROCESSING_CHUNK_SIZE = $OLD_CHUNK_SIZE/PROMPT_PROCESSING_CHUNK_SIZE = $NEW_CHUNK_SIZE/" "$f"
        echo "Patched: $(basename "$(dirname "$f")")"
    fi
done

# Patch generate.py (mlx_lm)
echo ""
echo "Patching mlx_lm/generate.py..."
for f in "$LMSTUDIO_DIR"/*/lib/python3.11/site-packages/mlx_lm/generate.py; do
    if [ -f "$f" ]; then
        sed -i '' "s/prefill_step_size: int = $OLD_CHUNK_SIZE/prefill_step_size: int = $NEW_CHUNK_SIZE/" "$f"
        echo "Patched: $(basename "$(dirname "$f")")"
    fi
done

# Clear Python cache
echo ""
echo "Clearing Python cache..."
find "$LMSTUDIO_DIR" -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
echo "Done"

# Verification output
echo ""
echo "=== Verification ==="
echo "cache_wrapper.py chunk sizes:"
grep -h "PROMPT_PROCESSING_CHUNK_SIZE = " "$LMSTUDIO_DIR"/*/lib/python3.11/site-packages/mlx_engine/cache_wrapper.py 2>/dev/null | sort | uniq -c

echo ""
echo "generate.py prefill_step_size (should all be $NEW_CHUNK_SIZE):"
grep -h "prefill_step_size: int = " "$LMSTUDIO_DIR"/*/lib/python3.11/site-packages/mlx_lm/generate.py 2>/dev/null | sort | uniq -c

echo ""
echo "=== Done! Restart LM Studio to apply changes ==="

✅ Run the script once, then restart LM Studio; you'll see fast prefill even for 18k+ token prompts. To revert, swap the two variables (OLD_CHUNK_SIZE=8192, NEW_CHUNK_SIZE=512) and rerun.
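
If you'd rather check the patched values without shell tools, here is a stdlib-only equivalent of the verification greps (paths and identifiers copied from the script above):

```python
import glob
import os
import re

base = os.path.expanduser("~/.lmstudio/extensions/backends/vendor/_amphibian")
targets = {
    "PROMPT_PROCESSING_CHUNK_SIZE":
        f"{base}/*/lib/python3.11/site-packages/mlx_engine/cache_wrapper.py",
    "prefill_step_size":
        f"{base}/*/lib/python3.11/site-packages/mlx_lm/generate.py",
}
for name, pattern in targets.items():
    for path in glob.glob(pattern):
        # match both "NAME = 512" and "NAME: int = 512" style definitions
        for m in re.finditer(rf"{name}(?:: int)? = (\d+)", open(path).read()):
            print(f"{path}: {name} = {m.group(1)}")  # expect 8192 after patching
```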


📌 Recommendations & Notes

  • βœ… Use 8192 as your default chunk size on M1 Pro 32GB β€” it offers best speed-per-memory ratio.
  • ❌ Avoid 16,384 and above unless you have 64GB+ RAM; performance may drop due to kernel launch overhead.
  • πŸ” Always restart LM Studio after patching β€” changes are not hot-reloadable.
  • πŸ”¬ Future: Add auto-detection of ideal chunk size based on available GPU memory.
