
Optimize Prompt Processing Chunk Size to 8192 → Boost Prefill Speed by Up to 1.5x for MLX Engine #507

@Linwei-Chen

Description


🚀 Optimize Prompt Processing Chunk Size to 8192 → Boost Prefill Speed by Up to 1.5x


Platform: M1 Pro (32 GB RAM), 10 CPU cores, 16-core GPU (MLX backend)
Model Tested: root4k/qwen3-coder-30b-a3b-instruct-mlx
Objective: Reduce prompt processing time for long inputs via chunk size tuning


✅ Summary

By increasing PROMPT_PROCESSING_CHUNK_SIZE and prefill_step_size from 512 → 8192, we achieved significant performance gains in the prefill phase (i.e., processing the initial prompt before generation starts). The best balance between memory use and throughput yields up to 1.5× faster prefill on long prompts.

🔥 Best performance: chunk_size = 8192, about 1.5× faster than the default on long contexts.

💡 Important: You must restart LM Studio after applying the patch for the changes to take effect.
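
For background: these engines fill the KV cache by pushing the prompt through the model in fixed-size slices, and the chunk size sets how many tokens go per forward pass. Below is a minimal sketch of that loop, modeled on mlx_lm-style APIs; the `prefill` helper, the `model(chunk, cache=...)` call, and the `.state` attribute are assumptions for illustration, not code copied from the patched files:

```python
import mlx.core as mx

PROMPT_PROCESSING_CHUNK_SIZE = 8192  # the constant this patch raises from 512

def prefill(model, tokens: mx.array, cache: list) -> None:
    """Push the prompt through the model chunk by chunk, filling the KV cache.

    Larger chunks mean fewer forward passes and better GPU utilization,
    at the cost of a larger temporary activation footprint.
    """
    for start in range(0, tokens.size, PROMPT_PROCESSING_CHUNK_SIZE):
        chunk = tokens[start : start + PROMPT_PROCESSING_CHUNK_SIZE]
        model(chunk[None], cache=cache)    # logits are discarded during prefill
        mx.eval([c.state for c in cache])  # materialize the cache per chunk
```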


📊 Performance Comparison Table

| Prompt Length | Chunk Size | Time (s) | Speedup vs 512 |
|---|---|---|---|
| 5,000 tokens | 512 | 23.4 | 1.0× |
| 5,000 tokens | 2,048 | 19.6 | 1.19× |
| 5,000 tokens | 4,096 | 19.2 | 1.22× |
| 5,000 tokens | 8,192 | 19.2 | 1.22× |
| 18,000 tokens | 512 | 164.3 | 1.0× |
| 18,000 tokens | 4,096 | 116.7 | 1.41× |
| 18,000 tokens | 8,192 | 105.1 | 1.56× |
| 18,000 tokens | 16,384 | 166.2 | 0.99× (≈1.0×) |

✅ Best throughput: 8,192 gives the fastest prefill at both tested lengths (tied with 4,096 at 5,000 tokens)
⚠️ Memory tradeoff: larger chunks improve speed but require more memory; on an M1 Pro with 32 GB, 8192 is optimal within safe limits.

💬 Note: at 16,384, performance regresses, likely due to memory allocation overhead or GPU kernel launch limits in MLX.
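
If you want to reproduce the table on your own hardware, a rough timing harness like the one below should work. This is a sketch, not the script used for the numbers above: it assumes your installed mlx_lm version forwards a `prefill_step_size` keyword from `generate()` down to the prefill loop (the patched `generate.py` shows the parameter exists, but check your version's signature), and the prompt-length estimate is ballpark only.

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("root4k/qwen3-coder-30b-a3b-instruct-mlx")
prompt = "lorem ipsum dolor sit amet " * 1000  # rough stand-in for a long prompt

for step in (512, 2048, 4096, 8192, 16384):
    t0 = time.perf_counter()
    # max_tokens=1 so the measurement is dominated by prefill, not decoding
    generate(model, tokenizer, prompt=prompt, max_tokens=1,
             prefill_step_size=step)
    print(f"prefill_step_size={step}: {time.perf_counter() - t0:.1f}s")
```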


πŸ› οΈ Patch Script (Auto-Adjusted for Any Size)


```bash
#!/bin/bash
# Patch LM Studio's MLX backends: raise the prompt-processing chunk size
# from 512 to 8192, cutting prefill time by up to ~1.5x on large prompts.

# Chunk size values
OLD_CHUNK_SIZE=512   # the default chunk size used by LM Studio
NEW_CHUNK_SIZE=8192  # the optimized chunk size we want to set

LMSTUDIO_DIR=~/.lmstudio/extensions/backends/vendor/_amphibian

if [ ! -d "$LMSTUDIO_DIR" ]; then
    echo "Error: LM Studio extensions not found at $LMSTUDIO_DIR"
    exit 1
fi

echo "=== Patching LM Studio MLX backends ==="

# Patch cache_wrapper.py (mlx_engine)
echo ""
echo "Patching mlx_engine/cache_wrapper.py..."
for f in "$LMSTUDIO_DIR"/*/lib/python3.11/site-packages/mlx_engine/cache_wrapper.py; do
    if [ -f "$f" ]; then
        sed -i '' "s/PROMPT_PROCESSING_CHUNK_SIZE = $OLD_CHUNK_SIZE/PROMPT_PROCESSING_CHUNK_SIZE = $NEW_CHUNK_SIZE/" "$f"
        echo "Patched: $(basename "$(dirname "$f")")"
    fi
done

# Patch generate.py (mlx_lm)
echo ""
echo "Patching mlx_lm/generate.py..."
for f in "$LMSTUDIO_DIR"/*/lib/python3.11/site-packages/mlx_lm/generate.py; do
    if [ -f "$f" ]; then
        sed -i '' "s/prefill_step_size: int = $OLD_CHUNK_SIZE/prefill_step_size: int = $NEW_CHUNK_SIZE/" "$f"
        echo "Patched: $(basename "$(dirname "$f")")"
    fi
done

# Clear Python cache
echo ""
echo "Clearing Python cache..."
find "$LMSTUDIO_DIR" -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
echo "Done"

# Verification output
echo ""
echo "=== Verification ==="
echo "cache_wrapper.py chunk sizes:"
grep -h "PROMPT_PROCESSING_CHUNK_SIZE = " "$LMSTUDIO_DIR"/*/lib/python3.11/site-packages/mlx_engine/cache_wrapper.py 2>/dev/null | sort | uniq -c

echo ""
echo "generate.py prefill_step_size (should all be $NEW_CHUNK_SIZE):"
grep -h "prefill_step_size: int = " "$LMSTUDIO_DIR"/*/lib/python3.11/site-packages/mlx_lm/generate.py 2>/dev/null | sort | uniq -c

echo ""
echo "=== Done! Restart LM Studio to apply changes ==="

✅ Run the script once, then restart LM Studio; you'll see fast prefill even for 18k+ token prompts. To revert, swap the two variables (OLD_CHUNK_SIZE=8192, NEW_CHUNK_SIZE=512) and rerun.
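
If you'd rather check the patched values without shell tools, here is a stdlib-only equivalent of the verification greps (paths and identifiers copied from the script above):

```python
import glob
import os
import re

base = os.path.expanduser("~/.lmstudio/extensions/backends/vendor/_amphibian")
targets = {
    "PROMPT_PROCESSING_CHUNK_SIZE":
        f"{base}/*/lib/python3.11/site-packages/mlx_engine/cache_wrapper.py",
    "prefill_step_size":
        f"{base}/*/lib/python3.11/site-packages/mlx_lm/generate.py",
}
for name, pattern in targets.items():
    for path in glob.glob(pattern):
        # match both "NAME = 512" and "NAME: int = 512" style definitions
        for m in re.finditer(rf"{name}(?:: int)? = (\d+)", open(path).read()):
            print(f"{path}: {name} = {m.group(1)}")  # expect 8192 after patching
```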


📌 Recommendations & Notes

  • βœ… Use 8192 as your default chunk size on M1 Pro 32GB β€” it offers best speed-per-memory ratio.
  • ❌ Avoid 16,384 and above unless you have 64GB+ RAM; performance may drop due to kernel launch overhead.
  • πŸ” Always restart LM Studio after patching β€” changes are not hot-reloadable.
  • πŸ”¬ Future: Add auto-detection of ideal chunk size based on available GPU memory.
