
Conversation

@rmarnold

Summary

Add a minimal extension point for custom memory (KV cache) implementations.

Motivation

  • KV cache optimization is an active research area (compression, semantic caching, etc.)
  • Currently requires forking llama.cpp to experiment with custom implementations
  • GGML backends already use similar factory patterns

Changes

  • Add a llama_memory_factory_fn typedef to llama.h (sketched below)
  • Add llama_set_memory_factory() to set a custom factory
  • Check the factory before default memory creation in the llama_context constructor
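
A minimal sketch of what these declarations might look like in llama.h, inferred from the example below. The exact signatures are not spelled out in this PR, and the LLAMA_API export macro is assumed:

// Hypothetical sketch, not the patch contents; the factory signature is
// inferred from the example in this PR.
typedef llama_memory_t (*llama_memory_factory_fn)(
        const struct llama_model          * model,
        const struct llama_context_params * params,
        void                              * user_data);

// Register a process-wide factory; user_data is passed back to the factory.
LLAMA_API void llama_set_memory_factory(
        llama_memory_factory_fn factory,
        void                  * user_data);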

Usage

  1. Implement a factory function returning llama_memory_t
  2. Call llama_set_memory_factory() before llama_init_from_model()
  3. The factory can return nullptr to fall back to the default implementation

Example

static llama_memory_t my_cache_factory(
    const struct llama_model * model,
    const struct llama_context_params * params,
    void * user_data
) {
    if (should_use_custom_cache(params)) {
        return create_my_custom_cache(model, params);
    }
    return nullptr;  // Fall back to default
}

// Register before context creation
llama_set_memory_factory(my_cache_factory, nullptr);
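
For reference, a hedged sketch of the constructor-side check described under Changes. The llama_context internals are not shown in this PR, so g_memory_factory, g_memory_factory_user_data, and create_default_memory below are illustrative names only:

// Illustrative sketch of the dispatch inside the llama_context constructor.
// g_memory_factory and g_memory_factory_user_data are hypothetical globals
// set by llama_set_memory_factory(); create_default_memory() stands in for
// the existing default KV-cache construction path.
llama_memory_t memory = nullptr;
if (g_memory_factory) {  // the single null-pointer check paid when unused
    memory = g_memory_factory(model, &params, g_memory_factory_user_data);
}
if (memory == nullptr) {
    memory = create_default_memory(model, params);  // factory declined; use default
}

This is consistent with the Impact note below: contexts created without a registered factory pay only one pointer comparison.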

Impact

  • Zero overhead when not used (single null pointer check)
  • No breaking changes to existing API
  • 35 lines total across 2 files

@linuxmagic-mp

const bool memory_enabled = parse_request_metadata_memory_flag(request);
This also relates to memory usage: something like the above would give a per-user option to enable or disable memory. I'm not sure, however, whether this belongs with the upcoming MCP integration.

@ggerganov
Member

Feel free to experiment, but this patch as it stands is not suitable for merging. If you want to try new cache implementations, implement them directly in libllama - there is no need to extract this logic into user code at this point.

@ggerganov closed this on Dec 29, 2025
