Conversation

nkanu17 (Contributor) commented Jan 6, 2026

Summary

This PR replaces the custom LLM wrapper classes in llms.py with a unified LLMClient abstraction layer backed by LiteLLM. All LLM operations—chat completions, embeddings, and LangChain Embeddings instances—now go through a single entry point, reducing maintenance burden and enabling support for multiple providers without code changes.

Issue: #105


Problem

The previous llms.py module had several issues:

| Issue | Impact |
|---|---|
| Multiple wrapper classes (OpenAIClientWrapper, AnthropicClientWrapper, BedrockClientWrapper) | Duplicated logic, maintenance overhead |
| Manual MODEL_NAME_MAP dictionary | Required updates for every new model |
| Different response parsing per provider | Inconsistent error handling |
| Adding a provider = new wrapper class | High friction for expansion |

Solution

Replace custom wrappers with LLMClient, a facade backed by LiteLLM:

| Before | After |
|---|---|
| 3 wrapper classes | 1 unified client |
| Manual provider detection | Automatic from model name |
| Custom response parsing | Standardized dataclasses |
| Adding a provider = new class | Adding a provider = just use it |

Why this reduces maintenance:

  1. Provider API changes are isolated - LiteLLM handles SDK updates; we just bump the dependency
  2. Zero-effort provider expansion - Gemini, Cohere, Ollama work without code changes
  3. Centralized bug fixes - One place for logging, retries, error handling
  4. Backend swappability - LLMBackend Protocol allows replacing LiteLLM without touching application code

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                              CALL SITES                                     │
│  extraction.py | summarization.py | memory_strategies.py | api.py | main.py│
│                                                                             │
│     await LLMClient.create_chat_completion(model, messages, ...)            │
│     await LLMClient.create_embedding(model, input_texts, ...)               │
│     LLMClient.create_embeddings()  → LangChain Embeddings instance          │
│     LLMClient.get_model_config(model_name)                                  │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            LLMClient                                        │
│                   (Facade - Public API, Never Changes)                      │
│                                                                             │
│  create_chat_completion() → ChatCompletionResponse                          │
│  create_embedding()       → EmbeddingResponse                               │
│  create_embeddings()      → LangChain Embeddings (for vector stores)        │
│  get_model_config()       → ModelConfig                                     │
│  set_backend() / reset()  → For testing                                     │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                     Internal: LiteLLM (hidden, swappable)                   │
│                                                                             │
│  acompletion() | aembedding() | get_model_info()                            │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  OpenAI  │  Anthropic  │  AWS Bedrock  │  Gemini  │  Ollama  │  100+ more   │
└─────────────────────────────────────────────────────────────────────────────┘

Key Design Decisions:

  1. Static class methods - No instantiation; call LLMClient.create_chat_completion() directly
  2. Standardized responses - ChatCompletionResponse and EmbeddingResponse dataclasses
  3. Protocol-based testing - LLMBackend Protocol for mock injection via set_backend()
  4. Exception hierarchy - LLMClientError, ModelValidationError, APIKeyMissingError
  5. Model config resolution - MODEL_CONFIGS → LiteLLM get_model_info() → fallback (see the sketch below)

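To make decision 5 concrete, here is a minimal sketch of the resolution chain. The names ModelConfig, MODEL_CONFIGS, and get_model_config come from this description; the fields, defaults, and bodies shown are illustrative assumptions, not the actual implementation.

# Illustrative sketch of the config resolution chain (not the real code).
# Assumes LiteLLM's get_model_info(), which returns a dict of model metadata
# and raises if the model is unknown.
from dataclasses import dataclass

from litellm import get_model_info


@dataclass(frozen=True)
class ModelConfig:
    provider: str
    max_tokens: int


# 1) Local overrides checked first
MODEL_CONFIGS: dict[str, ModelConfig] = {
    "gpt-4o": ModelConfig(provider="openai", max_tokens=128_000),
}

# 3) Conservative fallback when neither source knows the model (values assumed)
DEFAULT_CONFIG = ModelConfig(provider="openai", max_tokens=8_192)


def get_model_config(model_name: str) -> ModelConfig:
    if model_name in MODEL_CONFIGS:
        return MODEL_CONFIGS[model_name]
    try:
        # 2) Ask LiteLLM's model registry
        info = get_model_info(model_name)
        return ModelConfig(
            provider=info["litellm_provider"],
            max_tokens=info.get("max_input_tokens") or DEFAULT_CONFIG.max_tokens,
        )
    except Exception:
        return DEFAULT_CONFIG
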
Changes

New:

  • agent_memory_server/llm/ package:
    • client.py - LLMClient with create_chat_completion(), create_embedding(), create_embeddings(), get_model_config(), optimize_query()
    • types.py - ChatCompletionResponse, EmbeddingResponse, LLMBackend Protocol
    • exceptions.py - LLMClientError, ModelValidationError, APIKeyMissingError
    • __init__.py - Re-exports for clean imports

Removed:

  • agent_memory_server/llms.py

Updated:

  • main.py - Startup validation via LLMClient.get_model_config()
  • extraction.py - LLMClient.create_chat_completion()
  • summarization.py - LLMClient.create_chat_completion()
  • memory_strategies.py - LLMClient.create_chat_completion()
  • long_term_memory.py - LLMClient.create_embedding()
  • api.py - LLMClient.get_model_config()
  • vectorstore_factory.py - Delegates to LLMClient.create_embeddings()
  • All affected test files

Migration Guide

# Before
from agent_memory_server.llms import get_model_client
client = await get_model_client(model_name)
response = await client.create_chat_completion(model=model_name, prompt=messages)
text = response.choices[0].message.content

# After
from agent_memory_server.llm import LLMClient
response = await LLMClient.create_chat_completion(model=model_name, messages=messages)
text = response.content

# Embeddings - Before
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embeddings - After
from agent_memory_server.llm import LLMClient
embeddings = LLMClient.create_embeddings()  # Uses configured provider

# Custom endpoints / gateways
response = await LLMClient.create_chat_completion(
    model="gpt-4o",
    messages=messages,
    api_base="https://your-gateway.example.com",
    api_key="your-gateway-key"
)

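The new exception hierarchy can also be caught at call sites. A minimal, hedged sketch, assuming the exceptions are re-exported from agent_memory_server.llm; which exception a particular call raises is implementation-dependent:

# Hedged example: catching the new exception hierarchy around a call
from agent_memory_server.llm import (
    APIKeyMissingError,
    LLMClient,
    LLMClientError,
    ModelValidationError,
)

try:
    response = await LLMClient.create_chat_completion(
        model=model_name, messages=messages
    )
except ModelValidationError:
    ...  # unknown or unsupported model name
except APIKeyMissingError:
    ...  # required provider API key is not configured
except LLMClientError:
    ...  # base class: any other LLM client failure
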
Breaking Changes

None. This is an internal refactor. External APIs remain unchanged.


Testing

  • All tests pass
  • Added tests/test_llm_client.py with unit tests for the new client (mocking pattern sketched below)

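For reference, a sketch of the mocking pattern these tests follow after the refactor (per the commit notes, @patch on acompletion rather than a test-only backend). The patch target and response shape are assumptions, not copied from the actual test file; they presume client.py imports acompletion from LiteLLM directly.

# Sketch of a unit test that stubs the LiteLLM call instead of hitting real APIs.
from unittest.mock import AsyncMock, MagicMock, patch

import pytest

from agent_memory_server.llm import LLMClient


def make_mock_litellm_response(content: str) -> MagicMock:
    """Build an object shaped like LiteLLM's chat completion response."""
    response = MagicMock()
    response.choices = [MagicMock()]
    response.choices[0].message.content = content
    return response


@pytest.mark.asyncio
async def test_create_chat_completion_returns_content():
    mock_response = make_mock_litellm_response("hello")
    with patch(
        "agent_memory_server.llm.client.acompletion",  # assumed patch target
        new=AsyncMock(return_value=mock_response),
    ):
        result = await LLMClient.create_chat_completion(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "hi"}],
        )
    assert result.content == "hello"
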
Future Work

Option: Move to RedisVL Directly

Currently, langchain_redis.RedisVectorStore owns embedding calls—it expects an Embeddings object and calls embed_query()/embed_documents() internally. Moving to RedisVL directly would give us full control:

# Current: LangChain controls embedding
vectorstore = RedisVectorStore(embeddings=LLMClient.create_embeddings(), ...)
await vectorstore.aadd_documents(docs)  # LangChain embeds internally

# Future: We control embedding
vectors = await LLMClient.create_embedding(model, texts)
query = VectorQuery(vector=vectors.embeddings[0], ...)
results = await index.query(query)

Trade-offs:

| Aspect | LiteLLM (LLMClient) | RedisVL Vectorizers |
|---|---|---|
| Providers | 100+ | ~6 (OpenAI, Azure, Cohere, HuggingFace, VertexAI, Mistral) |
| Already in codebase | Yes | New dependency path |
| Unified interface | Single LLMClient | Separate vectorizer classes |
| Async support | Native | Varies |

Recommendation: If we migrate to RedisVL, continue using LLMClient.create_embedding() rather than RedisVL vectorizers—broader provider support and the abstraction is already built.

nkanu17 (Contributor, Author) commented Jan 6, 2026

We use langchain_redis.RedisVectorStore for the embedding calls. It expects the langchain Embeddings object and calls embed_query()/embed_documents() internally.

If we move away from LangChain VectorStore to RedisVL directly, we can use either LiteLLM (via LLMClient.create_embedding()) or RedisVL's built-in vectorizers for embeddings. Since the repo now has LLMClient built as a unified abstraction with native async support, it would be nice to stick with LiteLLM. It gives broader provider coverage without introducing another embedding interface into the codebase.

For now this is not implemented.

@nkanu17 nkanu17 marked this pull request as ready for review January 6, 2026 13:41
Copilot AI review requested due to automatic review settings January 6, 2026 13:41
Copilot AI left a comment

Pull request overview

This PR replaces custom LLM wrapper classes with a unified LLMClient abstraction layer backed by LiteLLM, significantly reducing maintenance burden and enabling support for multiple providers without code changes.

Key Changes:

  • Introduced new agent_memory_server/llm/ package with LLMClient facade class
  • Removed agent_memory_server/llms.py (454 lines) and replaced with cleaner abstraction
  • Updated all LLM-consuming modules to use LLMClient.create_chat_completion() and LLMClient.create_embedding()

Reviewed changes

Copilot reviewed 67 out of 75 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
|---|---|
| agent_memory_server/llm/client.py | New LLMClient facade with chat completion, embedding, and query optimization methods |
| agent_memory_server/llm/types.py | Standardized response dataclasses (ChatCompletionResponse, EmbeddingResponse) |
| agent_memory_server/llm/exceptions.py | Custom exception hierarchy for LLM operations |
| agent_memory_server/llm/__init__.py | Clean re-exports for package consumers |
| agent_memory_server/main.py | Updated model validation to use LLMClient.get_model_config() |
| agent_memory_server/summarization.py | Removed client parameter, uses LLMClient directly |
| agent_memory_server/memory_strategies.py | Updated to use LLMClient.create_chat_completion() |
| agent_memory_server/long_term_memory.py | Removed llm_client parameters throughout |
| agent_memory_server/vectorstore_factory.py | Delegates embedding creation to LLMClient.create_embeddings() |
| agent_memory_server/extraction.py | Updated to use LLMClient.create_chat_completion() |
| agent_memory_server/api.py | Uses LLMClient.get_model_config() for model configuration |
| tests/test_llm_client.py | New comprehensive test suite for LLMClient |
| uv.lock, pyproject.toml | Added litellm>=1.80.11 dependency |
| Multiple test files | Updated mocks to use ChatCompletionResponse dataclass |
| Java client files | Whitespace cleanup (trailing whitespace removal) |
| Documentation files | Updated references from llms.py to llm package |

The PR is well-structured and follows good software engineering practices with proper abstraction, testing, and documentation updates. No critical issues found.


nkanu17 added 15 commits January 6, 2026 17:44
- Create unified LLMClient class with static methods for all LLM operations
- Add async create_chat_completion() using litellm.acompletion()
- Add async create_embedding() using litellm.aembedding()
- Add get_model_config() with fallback chain: MODEL_CONFIGS → LiteLLM → defaults
- Add optimize_query() for vector search query optimization
- Define standardized ChatCompletionResponse and EmbeddingResponse dataclasses
- Add LLMBackend Protocol for test injection
- Add custom exception hierarchy:
  - LLMClientError (base)
  - ModelValidationError
  - APIKeyMissingError
- Remove llms.py module (replaced by llm_client.py)
- Update api.py to use LLMClient.get_model_config()
- Update extraction.py to use LLMClient.create_chat_completion()
- Update long_term_memory.py to use LLMClient
- Update memory_strategies.py to use LLMClient
- Update summarization.py to use LLMClient

BREAKING CHANGE: llms.py module removed, use llm_client.py instead
- Use LLMClient.get_model_config() for model resolution
- Add custom exception handling with ModelValidationError and APIKeyMissingError
- Validate API keys based on resolved provider
- Improve error messages with abstracted terminology
- Add test_llm_client.py with unit tests for LLMClient
- Update conftest.py with LLMClient mock fixtures
- Update test_llms.py imports for compatibility
- Update test_api.py imports
- Update test_contextual_grounding.py imports
- Update test_contextual_grounding_integration.py imports
- Update test_llm_judge_evaluation.py imports
- Update test_long_term_memory.py imports
- Update test_memory_compaction.py imports
- Update test_memory_strategies.py imports
- Update test_no_worker_mode.py imports
- Update test_query_optimization_errors.py imports
- Update test_summarization.py imports
- Update pyproject.toml with any new dependencies
- Sync uv.lock file
- Add LLMClient documentation for AI assistant context
- Update project structure references
- Exclude dev_docs directory from version control
- Keep development documentation local only
Split the monolithic llm_client.py into a proper package structure:
- llm/exceptions.py: LLMClientError, ModelValidationError, APIKeyMissingError
- llm/types.py: ChatCompletionResponse, EmbeddingResponse, LLMBackend Protocol
- llm/client.py: LLMClient class with all methods
- llm/__init__.py: Re-exports for clean imports

This improves code organization and makes it easier to:
- Find specific components (exceptions, types, client)
- Test individual modules
- Extend the package with new backends
Update all application code imports from:
  from agent_memory_server.llm_client import ...
to:
  from agent_memory_server.llm import ...

Files updated:
- api.py
- extraction.py
- long_term_memory.py
- main.py
- memory_strategies.py
- summarization.py
Update all test file imports from:
  from agent_memory_server.llm_client import ...
to:
  from agent_memory_server.llm import ...

Update mock patch paths from:
  agent_memory_server.llm_client.LLMClient
to:
  agent_memory_server.llm.client.LLMClient

Also removed duplicate mock_llm_client fixture in conftest.py.
- Add LLMClient.create_embeddings() classmethod that returns LangChain Embeddings
- Extract _create_openai_embeddings() and _create_bedrock_embeddings() helpers
- Update vectorstore_factory.create_embeddings() to delegate to LLMClient
- Update tests to patch agent_memory_server.config.settings
- Add Embeddings to TYPE_CHECKING imports for proper type hints

This makes LLMClient the single entry point for all LLM operations:
- create_chat_completion() for chat
- create_embedding() for async embedding vectors
- create_embeddings() for LangChain Embeddings instances
- Fix test_memory_prompt_with_model_name to patch agent_memory_server.api.LLMClient.get_model_config
- Remove commented-out code from tests/conftest.py, tests/test_long_term_memory.py, tests/test_llm_client.py

vishal-bala left a comment

I'd be curious to know your thinking behind opting for the Client/Backend abstraction as opposed to a general Client abstract base class (that has a LiteLLM implementation subclass). I think both approaches can work here, but I feel like the single global LLMClient object with a reconfigurable LLMBackend comes with some potential Python scoping gotchas.

This was more of a higher-level look through - I saw that the LLMClient currently is directly calling LiteLLM. Is that intentional and meant to be abstracted in the future, or actually part of the design?

# Merge provider-specific kwargs
call_kwargs.update(kwargs)

response = await acompletion(**call_kwargs)

If LiteLLM is supposed to be the backend to the agnostic LLMClient, wouldn't we want a generic call to an LLMBackend method here rather than locking in LiteLLM?

nkanu17 (Contributor, Author) commented Jan 7, 2026

LLMClient (facade)
    ├── _backend: LLMBackend | None  # Only set in tests
    │
    ├── create_chat_completion()
    │       if _backend: use it (tests)
    │       else: call litellm.acompletion() directly
    │
    └── create_embedding()
            if _backend: use it (tests)
            else: call litellm.aembedding() directly

The _backend is only for tests; we wouldn't swap LiteLLM for something else. The protocol lets tests inject mocks without hitting real APIs.

I did consider what you said; we'd only want a different backend if we wanted to swap LiteLLM entirely for something else, such as:

  • Direct SDK calls (openai, anthropic, boto3)
  • A different unified library/custom gateway client

nkanu17 (Contributor, Author) commented Jan 7, 2026

And my thought was to revisit this when we need it. I did not want to create placeholder abstraction without a need yet.

nkanu17 (Contributor, Author) commented:

Update: I removed the _backend/_test_handler and the LLMBackend Protocol and replaced them with standard mocks.

nkanu17 (Contributor, Author) commented Jan 7, 2026

> I'd be curious to know your thinking behind opting for the Client/Backend abstraction as opposed to a general Client abstract base class (that has a LiteLLM implementation subclass). I think both approaches can work here, but I feel like the single global LLMClient object with a reconfigurable LLMBackend comes with some potential Python scoping gotchas.
>
> This was more of a higher-level look through - I saw that the LLMClient currently is directly calling LiteLLM. Is that intentional and meant to be abstracted in the future, or actually part of the design?

@vishal-bala
Yea so the backend is required for tests only, allowing tests to mock LLM calls without hitting real APIs. You are also right that the _backend state could cause issues in concurrent tests. However, we mitigate this by:

  • Only setting _backend in tests, never in production
  • Using LLMClient.reset() in test teardown

Open to an alternative.


Actually, this was a good discussion - I realized the _backend and LLMBackend names are confusing. I am changing it to TestLLMHandler (the protocol tests implement to mock LLM calls) and renaming _backend to _test_handler.


Update:

I removed the _backend/_test_handler and the LLMBackend Protocol and replaced them with standard mocks.

@abrookins abrookins force-pushed the feat/litellm-llmclient-abstraction branch from b6db891 to 93b64af Compare January 7, 2026 17:04
…ptions

- Unknown embedding model now raises ModelValidationError instead of defaulting to OpenAI
- Anthropic embedding request raises ModelValidationError (Anthropic has no embedding models)
- Unsupported provider in _map_provider raises ModelValidationError instead of defaulting to OpenAI
- Removed vertex_ai and gemini fallback mappings
…eModel

- ChatCompletionResponse and EmbeddingResponse now use Pydantic BaseModel
- Maintains frozen=True behavior via ConfigDict(frozen=True)
- Aligns with project's consistent use of Pydantic models
- Update frozen tests to expect Pydantic ValidationError instead of AttributeError
- Change None content test to empty string (Pydantic validates str type)
…anism

- Remove TestLLMHandler Protocol from types.py
- Remove _test_handler class variable from LLMClient
- Remove set_test_handler() and clear_test_handler() methods
- Remove conditional dispatch in create_chat_completion() and create_embedding()
- Remove TestLLMHandler from __init__.py exports

This eliminates test-only code paths from production code and removes
global mutable state from the otherwise stateless LLMClient facade.
- Remove MockLLMHandler class and clear_llm_client_test_handler fixture
- Remove TestLLMClientClearTestHandler and TestLLMClientTestHandlerInjection classes
- Refactor TestLLMClientChatCompletion to use @patch on acompletion
- Refactor TestLLMClientEmbedding to use @patch on aembedding
- Add helper functions for creating mock LiteLLM responses

This aligns LLM client tests with the standard mocking patterns used
throughout the rest of the test suite.
- test_create_embeddings_unsupported_provider: expect ModelValidationError
  instead of ValueError to match actual exception raised by LLMClient
- test_optimization_with_none_content_response: mock at LiteLLM level
  instead of creating invalid ChatCompletionResponse with content=None
Replace manual single-quote wrapping with Python's !r (repr) format
specifier for more idiomatic string formatting that also handles
edge cases like values containing quotes.

Changed locations:
- Unknown embedding model error message
- Unsupported LiteLLM provider error message
- Model not found warning message
- Query optimization debug/warning messages

abrookins (Collaborator) left a comment

Looks amazing! LGTM 👍

- Add TestLLMClientIntegrationHuggingFace class with tests for:
  - Chat completion via huggingface/meta-llama/Llama-3.3-70B-Instruct
  - Embeddings via huggingface/microsoft/codebert-base
- Add TestLLMClientIntegrationGemini class with test for:
  - Chat completion via gemini/gemini-1.5-flash
- Tests require HF_TOKEN and GEMINI_API_KEY respectively
- Tests skip gracefully when API keys are not set
- Run with: uv run pytest tests/test_llm_client.py --run-api-tests
@nkanu17 nkanu17 merged commit 4baa5a7 into main Jan 8, 2026
18 checks passed