Skip to content

Fix OpenTelemetry tool spans missing parent ID in CodeAgent#2187

Open
Ricardo-M-L wants to merge 1 commit intohuggingface:mainfrom
Ricardo-M-L:fix/otel-tool-span-parent-context
Open

Fix OpenTelemetry tool spans missing parent ID in CodeAgent#2187
Ricardo-M-L wants to merge 1 commit intohuggingface:mainfrom
Ricardo-M-L:fix/otel-tool-span-parent-context

Conversation

@Ricardo-M-L
Copy link
Copy Markdown

Summary

  • Fix tool spans losing their parent span context when executed by CodeAgent, causing them to appear as root spans in OpenTelemetry traces
  • Root cause: the timeout decorator uses ThreadPoolExecutor which runs code in a new thread where contextvars (used by OpenTelemetry for span context) are not propagated
  • Fix: use contextvars.copy_context() + ctx.run() to propagate context to the worker thread — the same pattern already used in ToolCallingAgent.process_tool_calls for parallel tool execution

Details

When CodeAgent executes tools, it goes through evaluate_python_code() which is wrapped by the timeout decorator. This decorator submits the function to a ThreadPoolExecutor, creating a new thread. Since OpenTelemetry stores the active span in contextvars, the span context is lost in the new thread, and any tool spans created during execution have no parent.

The fix captures the current context before submitting to the thread pool and uses ctx.run() to execute within that context, preserving the full contextvars state including OTel spans.

Before (broken):

FinalAnswerTool (type tool, no parent)  ← root span
SimpleTool (type tool, no parent)       ← root span
CodeAgent.run (type agent)
  └─ Step 1 (type chain)
       └─ OpenAIModel.generate (type llm)

After (fixed):

CodeAgent.run (type agent)
  └─ Step 1 (type chain)
       ├─ OpenAIModel.generate (type llm)
       ├─ SimpleTool (type tool)          ← correct parent
       └─ FinalAnswerTool (type tool)     ← correct parent

Fixes #1961

Test plan

  • Added test_timeout_propagates_contextvars — verifies the timeout decorator propagates contextvars to the worker thread
  • Added test_timeout_propagates_contextvars_to_evaluate_python_code — verifies context propagation through evaluate_python_code with timeout
  • Added test_timeout_propagates_contextvars_with_local_executor — end-to-end test with LocalPythonExecutor simulating real CodeAgent tool execution
  • All existing TestTimeout tests pass (13/13)

🤖 Generated with Claude Code

The `timeout` decorator in `local_python_executor.py` uses
`ThreadPoolExecutor` to run code in a worker thread. Since
OpenTelemetry span context is stored in `contextvars`, it was
lost when crossing the thread boundary, causing tool spans
created during CodeAgent execution to become root spans with
no parent.

Fix by capturing the current context via `contextvars.copy_context()`
and using `ctx.run()` to execute the function in the worker thread,
preserving the span hierarchy. This is the same pattern already used
in `ToolCallingAgent.process_tool_calls` for parallel tool execution.

Fixes huggingface#1961

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@VANDRANKI VANDRANKI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root cause analysis is correct. ThreadPoolExecutor in Python does not copy the caller's contextvars context into worker threads, so copy_context() + ctx.run() is exactly the right fix.

A few observations:

Consistency with existing code - this is the same pattern already used in ToolCallingAgent.process_tool_calls for parallel tool execution. Good to have the two code paths aligned.

Test design - the two new tests cover the right things: (1) a unit test verifying contextvar propagation through the raw decorator, and (2) an integration-style test through evaluate_python_code. Both are clear about what they are asserting and why (the docstrings explain the OTel connection).

Return value - ctx.run(func, *args, **kwargs) returns the return value of func(*args, **kwargs), so existing callers are unaffected.

One question: is there an async version of the timeout helper? If there is and it also uses ThreadPoolExecutor, the same fix would apply there.

Overall this is a clean, well-motivated fix with good test coverage. LGTM from my side.

Copy link
Copy Markdown

@VANDRANKI VANDRANKI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, targeted fix. contextvars.copy_context() captures a snapshot of the calling thread's context, and ctx.run(func, ...) executes the worker in that snapshot. Without this, ThreadPoolExecutor starts each worker in an empty context, so OpenTelemetry's active span ContextVar (and any other per-request contextvars) is absent in the worker - tool spans end up with no parent.

Tests are the best part. test_timeout_propagates_contextvars directly verifies the propagation at the @timeout decorator level. test_timeout_propagates_contextvars_to_evaluate_python_code goes end-to-end through evaluate_python_code, which is the actual call site in CodeAgent. Both tests use a test ContextVar rather than OTel internals, which is the right approach - it keeps the tests dependency-free while still verifying the mechanism.

Interaction with PR #2199: PR #2199 also modifies the timeout function in local_python_executor.py (replacing the with ThreadPoolExecutor block with manual executor.shutdown(wait=not timed_out)). These two PRs touch the same lines and will conflict. The context propagation fix in this PR should be carried forward into whichever version of the timeout function lands. Worth coordinating with the author of #2199.

LGTM on the contextvars fix itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Missing parent id for OpenTelemetry tool spans when using CodeAgent

2 participants