
Conversation

@rmarnold

Summary

Add a minimal extension point for custom memory (KV cache) implementations.

Motivation

  • KV cache optimization is an active research area (compression, semantic caching, etc.)
  • Currently requires forking llama.cpp to experiment with custom implementations
  • GGML backends already use similar factory patterns

Changes

  • Add a llama_memory_factory_fn typedef to llama.h (sketched below)
  • Add llama_set_memory_factory() to set a custom factory
  • Check the factory before default memory creation in the llama_context constructor
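
A minimal sketch of what these declarations might look like in llama.h, inferred from the example below. The exact signatures are not spelled out in this PR, and the LLAMA_API export macro is assumed:

// Hypothetical sketch, not the patch contents; the factory signature is
// inferred from the example in this PR.
typedef llama_memory_t (*llama_memory_factory_fn)(
        const struct llama_model          * model,
        const struct llama_context_params * params,
        void                              * user_data);

// Register a process-wide factory; user_data is passed back to the factory.
LLAMA_API void llama_set_memory_factory(
        llama_memory_factory_fn factory,
        void                  * user_data);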

Usage

  1. Implement a factory function returning llama_memory_t
  2. Call llama_set_memory_factory() before llama_init_from_model()
  3. The factory can return nullptr to fall back to the default implementation

Example

static llama_memory_t my_cache_factory(
    const struct llama_model * model,
    const struct llama_context_params * params,
    void * user_data
) {
    if (should_use_custom_cache(params)) {
        return create_my_custom_cache(model, params);
    }
    return nullptr;  // Fall back to default
}

// Register before context creation
llama_set_memory_factory(my_cache_factory, nullptr);
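
For reference, a hedged sketch of the constructor-side check described under Changes. The llama_context internals are not shown in this PR, so g_memory_factory, g_memory_factory_user_data, and create_default_memory below are illustrative names only:

// Illustrative sketch of the dispatch inside the llama_context constructor.
// g_memory_factory and g_memory_factory_user_data are hypothetical globals
// set by llama_set_memory_factory(); create_default_memory() stands in for
// the existing default KV-cache construction path.
llama_memory_t memory = nullptr;
if (g_memory_factory) {  // the single null-pointer check paid when unused
    memory = g_memory_factory(model, &params, g_memory_factory_user_data);
}
if (memory == nullptr) {
    memory = create_default_memory(model, params);  // factory declined; use default
}

This is consistent with the Impact note below: contexts created without a registered factory pay only one pointer comparison.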

Impact

  • Zero overhead when not used (single null pointer check)
  • No breaking changes to existing API
  • 35 lines total across 2 files

@linuxmagic-mp

const bool memory_enabled = parse_request_metadata_memory_flag(request);
This also relates to memory usage: something like the above would give a per-user option to enable or disable memory. I'm not sure, however, whether this belongs with the upcoming MCP integration.

@ggerganov
Member

Feel free to experiment, but this patch as it stands is not suitable for merging. If you want to try new cache implementations, implement them directly in libllama - there is no need to extract this logic into user code at this point.

@ggerganov closed this on Dec 29, 2025
