Framework-level prompt injection defense for external input pipelines #45

@spinnakergit

Description

While developing the Discord plugin (https://github.com/spinnakergit/a0-discord), I realized there is a prompt injection vulnerability that needed to be addressed. You can see how I handled it through security hardening at the LLM level as a default: the plugin defaults to `call_utility_model`, which forces a plain LLM call with no tools or access. However, this default is very limiting for what I would actually want to do within the Discord chat, since it strips context from the conversation and doesn't allow much detailed interaction back to Agent Zero. I kept it as the default because it provides a convenient, safe way to offer an LLM chat within Discord.

To make the plugin more robust, I hardened it with several layers of security that use Discord's built-in features. It is all detailed in the repo if you would like more, but in short: elevated sessions require a hash, and an allowUser list limits who can interact with the bot. I think that is a good approach for hardening within the existing plugin framework.

However, I wanted to raise this as an issue that may be better addressed architecturally within the plugin framework or the chat bridge itself. I bring this to the team's attention as a lesson learned from my experience, in the hope that it provides useful feedback. If the team decides this isn't the best architectural use of its time, then published guidelines on how to address the prompt injection vulnerability would also be helpful to others. The team has built an amazing platform, and I only want it to continue getting better each day; it is truly exciting to see it evolve.

I had Claude assist in assessing what could be built at the framework level of Agent Zero, and here is its take. My hope is that this is helpful and constructive. If not, please let me know, as I know AI-based feedback isn't always the best. :)

### Problem Statement

Agent Zero currently applies no sanitization or injection detection to messages before they enter the agent loop. Each plugin that accepts external input (chat bridges, API integrations) must independently implement its own defenses. This creates three problems:

1. **Inconsistent protection** — Plugin A may have robust defenses while Plugin B has none. The framework is only as secure as its weakest plugin.
2. **Duplicated effort** — every plugin author must independently research and implement injection patterns, Unicode normalization, delimiter escaping, etc.
3. **Bypassed defenses** — even well-defended plugins can be bypassed if an attacker finds an alternative input path that skips the plugin's sanitization layer (e.g., a scheduled task that replays unsanitized content).
### Specific Vulnerabilities Identified

| Location | Issue |
|---|---|
| `files.py:289-296` (`replace_placeholders_json`) | User input inserted via `str.replace()` after `json.dumps()`; no structural validation of the resulting JSON |
| `files.py:280-286` (`replace_placeholders_text`) | User input inserted via `str(value)` with zero escaping into markdown/text prompts |
| `agent.py:665-682` (`hist_add_user_message`) | No sanitization before template processing |
| `agent.py:854-950` (`process_tools`) | Tool names loaded dynamically; tool args passed directly to `execute()` with no schema validation |
| `extensions/python/message_loop_prompts_after/` | Extensions can modify `loop_data.system` with no validation |
### Proposed Approach: Sanitization Middleware in the Message Pipeline

Rather than expecting every plugin to implement its own defenses, add a sanitization layer at the framework boundary — the point where external input becomes an internal message.

**Recommended insertion point:** inside `hist_add_user_message()` in `agent.py`, before the message is templated and stored. This is the single chokepoint that all external input must pass through.

```
External Input
  → UserMessage(message, attachments, source="webui"|"plugin"|"scheduler"|"subordinate")
  → hist_add_user_message()
      → InputSanitizer.sanitize(message, context)   ← NEW
          → normalize Unicode (NFKC + zero-width strip)
          → detect injection patterns (configurable pattern set)
          → escape structural delimiters
          → enforce content length limits
          → flag/log/replace detected injections
      → parse_prompt(...)                           ← existing template processing
  → stored in history → sent to LLM
```
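To make the pipeline concrete, the proposed sanitizer could be sketched as below. This is a hypothetical illustration, not existing Agent Zero code: the `InputSanitizer` class, the pattern set, the replacement marker, and the length limit are all assumptions.

```python
# Hypothetical sketch of the proposed InputSanitizer (names and limits are assumptions).
import re
import unicodedata

# Subset of zero-width characters; a real implementation would strip more classes
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
# Illustrative patterns only; a real set would be loaded from config
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]
REPLACEMENT = "[blocked: suspected prompt injection]"
MAX_LEN = 8000

class InputSanitizer:
    @staticmethod
    def sanitize(message: str, source: str = "plugin") -> str:
        # 1. Normalize Unicode and strip zero-width characters
        text = unicodedata.normalize("NFKC", message)
        text = ZERO_WIDTH.sub("", text)
        # 2. Escape structural delimiters so users cannot spoof system markers
        text = text.replace("<", "&lt;").replace(">", "&gt;")
        # 3. Replace (not reject) detected injection attempts
        for pat in INJECTION_PATTERNS:
            text = pat.sub(REPLACEMENT, text)
        # 4. Enforce length limits last, after all replacements
        return text[:MAX_LEN]
```

The ordering matters: truncating before replacement could split a pattern across the cut boundary and let half of it survive.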
**Key design decisions:**

**Source-aware sanitization** — Add a source field to UserMessage so the sanitizer knows the trust level. WebUI input from the local user might get lighter treatment than a Discord message from an unknown user. Subordinate agent messages might skip injection detection but still get Unicode normalization.
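A per-source policy table is one way to express these trust levels. The policy fields, limits, and the idea of defaulting unknown sources to the strictest policy are assumptions for illustration:

```python
# Hypothetical per-source sanitization policy; field names and limits are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class SourcePolicy:
    normalize_unicode: bool
    detect_injections: bool
    max_length: int

# Trust levels keyed by the proposed UserMessage.source field
POLICIES = {
    "webui":       SourcePolicy(normalize_unicode=True, detect_injections=False, max_length=100_000),
    "subordinate": SourcePolicy(normalize_unicode=True, detect_injections=False, max_length=100_000),
    "scheduler":   SourcePolicy(normalize_unicode=True, detect_injections=True,  max_length=50_000),
    "plugin":      SourcePolicy(normalize_unicode=True, detect_injections=True,  max_length=8_000),
}

def policy_for(source: str) -> SourcePolicy:
    # Unknown sources fall back to the strictest (plugin) policy
    return POLICIES.get(source, POLICIES["plugin"])
```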

**Configurable, not hardcoded** — The injection pattern set should be loadable from a config file (e.g., prompts/fw.injection_patterns.yaml). Plugin authors should be able to register additional patterns for their domain. Users should be able to disable specific patterns if they cause false positives.
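A pattern registry along these lines could support loading, plugin registration, and per-pattern disabling. The `PatternRegistry` API is hypothetical, and JSON is used here only to keep the sketch dependency-free (the proposal suggests YAML):

```python
# Hypothetical configurable pattern registry; the API and config format are assumptions.
import json
import re

class PatternRegistry:
    def __init__(self) -> None:
        self._patterns: dict[str, re.Pattern] = {}

    def load(self, config_text: str) -> None:
        # Config maps a pattern id to a regex; ids let users disable false positives
        for pid, regex in json.loads(config_text).items():
            self._patterns[pid] = re.compile(regex, re.IGNORECASE)

    def register(self, pid: str, regex: str) -> None:
        # Plugins can add domain-specific patterns at runtime
        self._patterns[pid] = re.compile(regex, re.IGNORECASE)

    def disable(self, pid: str) -> None:
        self._patterns.pop(pid, None)

    def match(self, text: str) -> list[str]:
        # Return the ids of all patterns that fire, for logging and replacement
        return [pid for pid, pat in self._patterns.items() if pat.search(text)]

# Example config, analogous to a prompts/fw.injection_patterns.yaml file
CONFIG = '{"override": "ignore (all )?previous instructions", "role_hijack": "you are now"}'
```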

**Replace, don't reject** — The Discord plugin's approach of replacing detected injections with [blocked: suspected prompt injection] is better than silent rejection. The LLM still sees that something was attempted, which helps it respond appropriately, and the user still gets a response rather than mysterious silence.

**Log all detections** — Every detected injection should be logged with the source, the matched pattern, and the original content (for security audit). This is critical for tuning false positives.

**Tool argument validation** — Separately from message sanitization, add schema validation for tool arguments. Each tool should declare an argument schema (types, allowed values, constraints) in its prompt file or a companion schema file. The framework validates before calling execute().
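A minimal argument-schema check, run before `execute()`, could look like the following sketch. The schema format, the example tool schema, and the `validate_args` helper are assumptions, not Agent Zero APIs:

```python
# Hypothetical schema validation for tool arguments; format and names are assumptions.
from typing import Any

# A tool would declare something like this in a companion schema file
CODE_EXEC_SCHEMA = {
    "runtime": {"type": str, "allowed": {"python", "nodejs", "terminal"}},
    "code": {"type": str, "max_length": 20_000},
}

def validate_args(args: dict[str, Any], schema: dict[str, dict]) -> list[str]:
    errors = []
    for name, rule in schema.items():
        if name not in args:
            errors.append(f"missing argument: {name}")
            continue
        value = args[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: value not allowed")
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{name}: too long")
    # Unknown arguments are rejected rather than silently passed through
    errors += [f"unexpected argument: {n}" for n in args if n not in schema]
    return errors
```

The framework would refuse to call `execute()` when the returned error list is non-empty.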

### What Already Exists as a Reference Implementation

The a0-discord plugin's `sanitize.py` implements a battle-tested version of this at the plugin level:

- 30+ injection patterns covering instruction override, role hijacking, model-specific tokens, and chat role markers
- NFKC Unicode normalization to defeat compatibility-form tricks (e.g., fullwidth Ａ → Latin A); note that true homoglyphs such as Cyrillic а vs. Latin a are not folded by NFKC and need a confusables map on top
- 23 zero-width character classes stripped before pattern matching
- Delimiter tag escaping to prevent users from spoofing system-level markers
- Content length enforcement per content type (message, username, embed, filename)
- Truncation ordering — normalize → escape → replace → truncate — prevents boundary attacks
This module could serve as the basis for a framework-level InputSanitizer class. The patterns are domain-agnostic — they detect injection attempts regardless of whether the input comes from Discord, Telegram, email, or a web form.
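A small demonstration of why normalization must run before pattern matching. The evasion strings and the pattern are illustrative; NFKC folds compatibility forms such as fullwidth letters, while Cyrillic-style homoglyphs would additionally need a confusables mapping:

```python
# Why normalization must precede pattern matching (illustrative evasions).
import re
import unicodedata

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms
    # strip a subset of zero-width characters
    return re.sub(r"[\u200b\u200c\u200d\u2060\ufeff]", "", text)

pattern = re.compile(r"ignore previous instructions", re.IGNORECASE)

evasions = [
    "ig\u200bnore previous instructions",  # zero-width space inside "ignore"
    "\uff49\uff47\uff4e\uff4f\uff52\uff45 previous instructions",  # fullwidth "ignore"
]
for text in evasions:
    assert not pattern.search(text)         # raw text evades the pattern
    assert pattern.search(normalize(text))  # normalized text is caught
```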

### Impact on Plugin Architecture

If this is implemented at the framework level, plugins would:

- No longer need to implement their own injection detection, reducing duplicated code and inconsistency
- Still be able to add domain-specific sanitization on top (e.g., Discord-specific delimiter tags, Telegram-specific markup stripping)
- Benefit immediately — existing plugins that don't sanitize input would gain protection without code changes
