Skip to content

fix: Critical thread crash from tool validation errors#1363

Open
gdeyoung wants to merge 1 commit intoagent0ai:mainfrom
gdeyoung:fix/tool-request-validation-crash
Open

fix: Critical thread crash from tool validation errors#1363
gdeyoung wants to merge 1 commit intoagent0ai:mainfrom
gdeyoung:fix/tool-request-validation-crash

Conversation

@gdeyoung
Copy link
Copy Markdown

Critical Bug: Thread lockup from unhandled ValueError

Impact

Agent threads lock up completely when the LLM output cannot be parsed as valid JSON. This happens at a significant rate — any time a model produces non-JSON output (reasoning text, markdown with no JSON block, malformed JSON, etc.), the entire conversation thread dies with:

ValueError: Tool request must be a dictionary

The thread becomes unrecoverable — no retry, no graceful degradation, just a dead thread.

Root Cause

In agent.py process_tools(), validate_tool_request() is called before the None check on tool_request:

# BROKEN — current code (lines 849-854)
tool_request = extract_tools.json_parse_dirty(msg)  # Returns None when parsing fails

# basic validation + extensions
await self.validate_tool_request(tool_request)       # 💥 CRASHES — ValueError on None!

if tool_request is not None:                          # Never reached
    ...

The existing else branch (line ~939) already handles None gracefully with a misformat warning — but it is unreachable because the validation crashes first.

Fix

  1. Move validate_tool_request() inside the None check — so unparseable output falls through to the existing graceful handler
  2. Wrap validation in try/except ValueError — converts crashes into logged warnings that allow the message loop to continue
  3. Add defensive tool_args parsing — handles models that return tool_args as a JSON string instead of a dict (confirmed issue with MiniMax M2.5 and others)
# FIXED — after this PR
tool_request = extract_tools.json_parse_dirty(msg)

if tool_request is not None:
    try:
        await self.validate_tool_request(tool_request)
    except ValueError as e:
        # Graceful recovery instead of thread death
        PrintStyle(font_color="red", padding=True).print(f"Tool validation error: {e}")
        wmsg = self.hist_add_warning(str(e))
        self.context.log.log(type="warning", content=f"{self.agent_name}: {e}", id=wmsg.id)
        return  # gracefully continue the message loop
    
    raw_tool_name = tool_request.get("tool_name", tool_request.get("tool",""))
    tool_args = tool_request.get("tool_args", tool_request.get("args", {}))
    
    # Defensive: some models return tool_args as a JSON string instead of dict
    if isinstance(tool_args, str):
        try:
            import json as _json
            tool_args = _json.loads(tool_args)
        except (ValueError, TypeError):
            tool_args = {}
    if not isinstance(tool_args, dict):
        tool_args = {}
    ...
else:
    # Existing graceful handler for unparseable output
    warning_msg_misformat = self.read_prompt("fw.msg_misformat.md")
    ...

Testing

  • Verified the fix handles None from json_parse_dirty() gracefully
  • Verified ValueError from validation is caught and logged as warning
  • Verified the message loop continues after the error (no thread death)
  • Verified defensive tool_args string parsing works with JSON string inputs

Why This Is Safe

  • Minimal change: Only moves existing code inside a conditional + adds error handling
  • No behavior change for valid tool calls: The validation still runs exactly as before for valid requests
  • Only affects error paths: The new code only executes when something is already wrong
  • Matches existing patterns: Uses the same warning/logging pattern as the existing else branch
  • The else branch already existed: This was always the intended behavior — the bug was just an ordering error

- Move validate_tool_request() inside None check in process_tools()
- Wrap validation in try/except to convert ValueError to warning
- Add defensive tool_args parsing for models returning JSON strings

Fixes critical bug where malformed LLM output causes unhandled
ValueError that locks up the entire agent thread.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant