Plugin development for LiteLLM stats - token usage and context window info #1382

@asb-42

Description

We tried to build a plugin for A0 v1.3 that injects information from LiteLLM into the header: the usual metrics, i.e. number of tokens processed, context window size, and usage percentage. This LiteLLM info should update at least after every prompt or response (not necessarily in real time). However, we hit a brick wall with every approach we tried.

Summary

Two approaches were attempted to create a plugin that displays token usage and context window information in the A0 header bar. Neither approach succeeded. This report documents the attempts, failures, and missing developer information that would be needed to build this or similar plugins.


Approach 1: Core Modifications [rejected]

Goal

Extract response.usage (prompt_tokens, completion_tokens, total_tokens) from LiteLLM API responses and display them in the header.

What was done

  1. Modified models.py:
    • Added _extract_usage() function to parse LiteLLM response objects
    • Added usage attribute to ChatGenerationResult
    • Changed unified_call() return type from Tuple[str, str] to Tuple[str, str, dict | None]
    • Tracked usage in both streaming and non-streaming paths
  2. Modified agent.py:
    • Changed call_chat_model and call_utility_model to unpack 3 return values
    • Passed usage to chat_model_call_after extension
  3. Updated all callers of unified_call() across the codebase (16 changes in 5 files)

Result

  • Token data was correctly extracted and displayed in the header
  • A0 broke: ValueError: Tool request must be a dictionary — the agent could no longer process any messages
  • Root cause: Changing unified_call() return signature from 2 to 3 values broke A0's response processing in ways that were not immediately visible in the code

Conclusion

Core modifications to unified_call() are too invasive. Changing a fundamental return signature affects the entire framework and is fragile against future A0 updates.


Approach 2: Isolated Plugin [no core modifications]

Goal

Use A0's existing agent.get_data("ctx_window") (which stores approximate token count from prepare_prompt()) and get_chat_model_config() (which provides ctx_length) to display token usage without modifying any core files.

What was done

  1. Created API handler (api/token_info.py) that reads from agent.get_data(Agent.DATA_NAME_CTX_WINDOW)
  2. Fixed agent lookup bug: API now checks both context.agent0 and context.streaming_agent
  3. Created Alpine store (webui/token-store.js) that polls the API every 2 seconds
  4. Created WebUI widget (webui/token-widget.html) with badge display
  5. Created WebUI extension (extensions/webui/chat-top-start/token-display.html)
  6. Fixed path issues: User plugins must use /usr/plugins/... paths (not /plugins/...)
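For reference, the core computation in api/token_info.py can be sketched as below. The handler class and route wiring are omitted because A0's plugin API surface is undocumented, and the shape of the ctx_window data is an assumption (both a plain count and a dict with a tokens field are handled).

```python
# Sketch of the payload computation in api/token_info.py. The ctx_window shape
# is assumed; ctx_length comes from get_chat_model_config() per the report.

def build_token_info(agent, ctx_length: int) -> dict:
    """Compute the payload the header widget polls every 2 seconds."""
    raw = agent.get_data("ctx_window") if agent else None
    if isinstance(raw, dict):
        tokens = int(raw.get("tokens", 0))
    else:
        tokens = int(raw or 0)
    percent = round(100.0 * tokens / ctx_length, 1) if ctx_length else 0.0
    return {"tokens": tokens, "ctx_length": ctx_length, "percent": percent}
```

With the values from the Result section below, 53,859 tokens against a 262,144-token window yields the observed 20.5%.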

Result

  • The plugin loads and displays data in the header
  • The API returns a 200 response with token data
  • Token data never updates: it always shows the same value (53,859 tokens / 262,144 ctx_length / 20.5%)
  • Debug logging in prepare_prompt() never produces output — the function is not called during normal chat operation

Investigation

  • Added print() with flush=True to communicate(), monologue(), and prepare_prompt() in agent.py
  • None of these debug statements produced output in docker logs
  • This means A0 uses a different code path for chat processing that does not go through communicate → _process_chain → monologue → prepare_prompt
  • Without understanding which code path A0 uses, it's impossible to know when/where token data is updated
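One technique that could narrow this down without guessing where to place prints: log a stack trace from a hook that demonstrably is reached (e.g. an extension point that fires), so the real entry point appears in the output. A minimal sketch:

```python
import sys
import traceback

def log_call_chain(tag: str, file=None) -> None:
    """Print the current call stack so the real entry point becomes visible."""
    file = file or sys.stderr
    file.write(f"--- call chain at {tag} ---\n")
    traceback.print_stack(file=file)
    file.flush()
```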

Conclusion

The isolated plugin approach works structurally (API, store, widget all function), but cannot display useful data because:

  1. The data source (ctx_window) is only updated by prepare_prompt()
  2. prepare_prompt() is not called during normal chat operation (for unknown reasons)
  3. Without access to the actual chat processing code path, there's no way to get live token data

Missing Developer Information

The following information would be needed to develop this plugin (and similar plugins) for Agent Zero:

1. Chat Processing Code Path

Question: What is the exact call chain when a user sends a message and A0 responds?

  • communicate() → _process_chain() → monologue() → prepare_prompt() was assumed, but debug output never appeared
  • Is there a different entry point for chat messages?
  • Does the WebUI use a different mechanism (WebSocket, direct API call) that bypasses communicate()?

2. Token Data Availability

Question: When and where are token counts calculated and stored?

  • prepare_prompt() stores data in agent.get_data("ctx_window") — but this function is apparently not called
  • Is there a different place where token data is available?
  • Does LiteLLM provide response.usage data that could be accessed without core modifications?

3. Extension Point Documentation

Question: What data is available at each extension point?

  • The extension framework lists points like monologue_start, response_stream, message_loop_start etc.
  • But there's no documentation on what parameters/data each extension receives
  • Without knowing the available data, it's trial-and-error to build useful extensions

4. Plugin API Route Resolution

Question: How are plugin API routes resolved for user plugins vs. core plugins?

  • User plugins in usr/plugins/ use route /usr/plugins/<name>/... for static assets
  • But API routes use /api/plugins/<name>/... (not /api/usr/plugins/<name>/...)
  • The resolution mechanism is not documented

5. WebUI Extension Loading

Question: How do WebUI extensions get loaded?

  • x-component path= requires absolute paths starting with / for user plugins
  • The components.js loader prepends components/ if the path doesn't start with / or components/
  • This is not documented and caused a 404 error that took significant debugging to find

6. Logging in Docker

Question: How to output debug information that appears in docker logs?

  • PrintStyle.debug() output doesn't appear in docker logs
  • print() with flush=True also didn't appear (for code running in agent threads)
  • There's no documented way to write log output that's visible in the container logs
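One workaround we did not get to verify: configure the stdlib logging module with an explicit StreamHandler on stderr, bypassing PrintStyle entirely. Docker captures the stderr of PID 1, so this should surface in docker logs if A0's agent threads inherit normal stream handling; that remains an assumption.

```python
# Untested workaround: a dedicated plugin logger writing straight to stderr.
import logging
import sys

def get_plugin_logger(name: str = "litellm_info") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on re-import
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter("%(asctime)s %(name)s: %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
        logger.propagate = False
    return logger
```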

7. Agent Instance Lifecycle

Question: When are agent0 and streaming_agent created/destroyed?

  • API calls return different agent instances than the one processing the chat
  • agent0 is recreated in reset_context() — when is this called?
  • Understanding the agent lifecycle is essential for plugins that need to access agent state

Recommendations for Agent Zero Maintainers

  1. Document the chat processing flow: A clear diagram of what happens when a user sends a message, including all code paths and entry points

  2. Document extension point parameters: Each extension point should document exactly what kwargs are available

  3. Add a token usage API: A built-in API endpoint that returns current token usage would eliminate the need for plugins to reverse-engineer the data flow

  4. Document plugin routing: How API routes and static assets are resolved for plugins in different directories

  5. Fix logging in Docker: Provide a documented way to write log output visible in container logs

  6. Add a plugin development guide: A step-by-step guide for creating plugins, including common patterns and pitfalls


Files

Plugin files (in usr/plugins/litellm_info/)

  • plugin.yaml — Manifest
  • api/token_info.py — API handler (reads ctx_window data)
  • extensions/webui/chat-top-start/token-display.html — Header extension
  • webui/token-store.js — Alpine store (polls API)
  • webui/token-widget.html — Header widget component
  • README.md — Plugin documentation

Core files (NOT modified — verified clean)

  • agent.py
  • models.py
  • helpers/tokens.py
  • All other core files
