Plugin development for LiteLLM stats - token usage and context window info #1382
Description
We tried to build a plugin for A0 v1.3 that injects information from LiteLLM into the header. What we would like to see is the usual: number of tokens processed, size of the context window, and percentage of usage. This LiteLLM info should update at least after every prompt or response (not necessarily in real time). However, whatever we tried, we hit a brick wall.
Summary
Two approaches were attempted to create a plugin that displays token usage and context window information in the A0 header bar. Neither approach succeeded. This report documents the attempts, failures, and missing developer information that would be needed to build this or similar plugins.
Approach 1: Core Modifications [rejected]
Goal
Extract response.usage (prompt_tokens, completion_tokens, total_tokens) from LiteLLM API responses and display them in the header.
What was done
- Modified `models.py`:
  - Added `_extract_usage()` function to parse LiteLLM response objects
  - Added `usage` attribute to `ChatGenerationResult`
  - Changed `unified_call()` return type from `Tuple[str, str]` to `Tuple[str, str, dict | None]`
  - Tracked usage in both streaming and non-streaming paths
- Modified `agent.py`:
  - Changed `call_chat_model` and `call_utility_model` to unpack 3 return values
  - Passed `usage` to the `chat_model_call_after` extension
- Updated all callers of `unified_call()` across the codebase (16 changes in 5 files)
Result
- Token data was correctly extracted and displayed in the header
- A0 broke with `ValueError: Tool request must be a dictionary`; the agent could no longer process any messages
- Root cause: changing the `unified_call()` return signature from 2 to 3 values broke A0's response processing in ways that were not immediately visible in the code
Conclusion
Core modifications to `unified_call()` are too invasive. Changing a fundamental return signature affects the entire framework and is fragile against future A0 updates.
Approach 2: Isolated Plugin [no core modifications]
Goal
Use A0's existing `agent.get_data("ctx_window")` (which stores an approximate token count from `prepare_prompt()`) and `get_chat_model_config()` (which provides `ctx_length`) to display token usage without modifying any core files.
What was done
- Created API handler (`api/token_info.py`) that reads from `agent.get_data(Agent.DATA_NAME_CTX_WINDOW)`
- Fixed an agent lookup bug: the API now checks both `context.agent0` and `context.streaming_agent`
- Created Alpine store (`webui/token-store.js`) that polls the API every 2 seconds
- Created WebUI widget (`webui/token-widget.html`) with badge display
- Created WebUI extension (`extensions/webui/chat-top-start/token-display.html`)
- Fixed path issues: user plugins must use `/usr/plugins/...` paths (not `/plugins/...`)
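A minimal sketch of what the `api/token_info.py` handler computes, assuming `ctx_window` holds a plain token count and using the `streaming_agent`/`agent0` fallback described above (the handler class wiring A0 actually requires is omitted, and the field names are illustrative):

```python
def token_info(context) -> dict:
    """Build the token-usage payload for the (hypothetical) token_info API.

    Prefers the agent currently streaming a response, falling back to
    agent0 -- mirroring the lookup-bug fix described above.
    """
    agent = getattr(context, "streaming_agent", None) or context.agent0
    # Assumption: ctx_window is stored as a plain token count by prepare_prompt()
    tokens = agent.get_data("ctx_window") or 0
    ctx_length = agent.get_chat_model_config().ctx_length
    percent = round(100.0 * tokens / ctx_length, 1) if ctx_length else 0.0
    return {"tokens": tokens, "ctx_length": ctx_length, "percent": percent}
```

With the stale values from the Result below (53,859 tokens, 262,144 ctx_length) this yields exactly the 20.5% the widget kept displaying, which is consistent with the data source simply never being refreshed.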
Result
- The plugin loads and displays data in the header
- The API returns a 200 response with token data
- Token data never updates: it always shows the same value (53,859 tokens / 262,144 ctx_length / 20.5%)
- Debug logging in `prepare_prompt()` never produces output; the function is not called during normal chat operation
Investigation
- Added `print()` with `flush=True` to `communicate()`, `monologue()`, and `prepare_prompt()` in `agent.py`
- None of these debug statements produced output in `docker logs`
- This means A0 uses a different code path for chat processing, one that does not go through `communicate → _process_chain → monologue → prepare_prompt`
- Without understanding which code path A0 uses, it is impossible to know when/where token data is updated
Conclusion
The isolated plugin approach works structurally (API, store, and widget all function), but cannot display useful data because:
- The data source (`ctx_window`) is only updated by `prepare_prompt()`
- `prepare_prompt()` is not called during normal chat operation (for unknown reasons)
- Without access to the actual chat processing code path, there is no way to get live token data
Missing Developer Information
The following information would be needed to develop this plugin (and similar plugins) for Agent Zero:
1. Chat Processing Code Path
Question: What is the exact call chain when a user sends a message and A0 responds?
- `communicate()` → `_process_chain()` → `monologue()` → `prepare_prompt()` was assumed, but debug output never appeared
- Is there a different entry point for chat messages?
- Does the WebUI use a different mechanism (WebSocket, direct API call) that bypasses `communicate()`?
2. Token Data Availability
Question: When and where are token counts calculated and stored?
- `prepare_prompt()` stores data in `agent.get_data("ctx_window")`, but this function is apparently not called
- Is there a different place where token data is available?
- Does LiteLLM provide `response.usage` data that could be accessed without core modifications?
3. Extension Point Documentation
Question: What data is available at each extension point?
- The extension framework lists points like `monologue_start`, `response_stream`, `message_loop_start`, etc.
- But there is no documentation of what parameters/data each extension point receives
- Without knowing the available data, building useful extensions is trial and error
4. Plugin API Route Resolution
Question: How are plugin API routes resolved for user plugins vs. core plugins?
- User plugins in `usr/plugins/` use the route `/usr/plugins/<name>/...` for static assets
- But API routes use `/api/plugins/<name>/...` (not `/api/usr/plugins/<name>/...`)
- The resolution mechanism is not documented
5. WebUI Extension Loading
Question: How do WebUI extensions get loaded?
- `x-component path=` requires absolute paths starting with `/` for user plugins
- The `components.js` loader prepends `components/` if the path does not start with `/` or `components/`
- This is not documented and caused a 404 error that took significant debugging to find
6. Logging in Docker
Question: How to output debug information that appears in docker logs?
- `PrintStyle.debug()` output does not appear in docker logs
- `print()` with `flush=True` also did not appear (for code running in agent threads)
- There is no documented way to write log output that is visible in the container logs
7. Agent Instance Lifecycle
Question: When are agent0 and streaming_agent created/destroyed?
- API calls return different agent instances than the one processing the chat
- `agent0` is recreated in `reset_context()`; when is this called?
- Understanding the agent lifecycle is essential for plugins that need to access agent state
Recommendations for Agent Zero Maintainers
- Document the chat processing flow: a clear diagram of what happens when a user sends a message, including all code paths and entry points
- Document extension point parameters: each extension point should document exactly what `kwargs` are available
- Add a token usage API: a built-in API endpoint that returns current token usage would eliminate the need for plugins to reverse-engineer the data flow
- Document plugin routing: how API routes and static assets are resolved for plugins in different directories
- Fix logging in Docker: provide a documented way to write log output visible in container logs
- Add a plugin development guide: a step-by-step guide for creating plugins, including common patterns and pitfalls
Files
Plugin files (in `usr/plugins/litellm_info/`)
- `plugin.yaml`: plugin manifest
- `api/token_info.py`: API handler (reads ctx_window data)
- `extensions/webui/chat-top-start/token-display.html`: header extension
- `webui/token-store.js`: Alpine store (polls the API)
- `webui/token-widget.html`: header widget component
- `README.md`: plugin documentation
Core files (NOT modified — verified clean)
- `agent.py`
- `models.py`
- `helpers/tokens.py`
- All other core files