Fix LocalPythonExecutor timeout waiting for worker completion #2199
shaun0927 wants to merge 8 commits into huggingface:main from
Conversation
The current timeout wrapper raises ExecutionTimeoutError only after the underlying worker finishes, because executor shutdown waits for the thread when leaving the context manager. This change uses explicit shutdown behavior so callers regain control at the configured timeout, while keeping Python's documented limitation that timed-out worker threads may continue in the background until completion.

Constraint: Python threads cannot be forcefully terminated safely
Rejected: Process-based execution for LocalPythonExecutor | much broader behavioral change than a bugfix
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep timeout semantics prompt for callers even though background worker cleanup remains best-effort
Tested: PYTHONPATH=src pytest -q tests/test_local_python_executor.py -k 'timeout'
Not-tested: Full test suite
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cda9f1e676
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
I checked the current CI state for this fork PR: both I also confirmed that
The prompt-return timeout fix exposed a follow-up race: once a timed-out worker keeps running in the background, it can still mutate shared execution dictionaries after the caller has moved on. That made later LocalPythonExecutor calls vulnerable to late variable/log overwrites and left reused evaluate_python_code state dicts open to the same top-level leakage.

This snapshots state/custom tool dictionaries for each execution and only commits them back after non-timeout completion. Timeouts still return promptly, but stale workers no longer write into the live dictionaries used by subsequent calls.

Constraint: Python threads still cannot be force-killed safely
Rejected: Revert prompt-return timeout behavior | restores the original availability bug
Rejected: Process-based isolation | too broad for this targeted bugfix
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep timeout fixes paired with reuse-after-timeout regressions because stale workers can outlive the caller
Tested: PYTHONPATH=src pytest -q tests/test_local_python_executor.py -k 'timeout'
Tested: targeted local repro confirming timed-out workers no longer overwrite reused state/logs
Not-tested: Full pytest suite / CI workflows still require maintainer approval on fork PRs
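The commit-on-success pattern described above can be sketched in isolation. This is a hypothetical helper (the name `run_with_snapshot` and the exception used are stand-ins, not smolagents API): run against a deep copy, and only write the copy back when the call finishes cleanly.

```python
import copy


def run_with_snapshot(state, execute):
    """Hypothetical sketch: run `execute` against a deep copy of `state` and
    commit the copy back only when the call finishes without raising."""
    snapshot = copy.deepcopy(state)
    result = execute(snapshot)  # a timed-out worker keeps mutating `snapshot`
    state.clear()               # commit happens only on non-timeout completion
    state.update(snapshot)
    return result


state = {"logs": []}
try:
    # ZeroDivisionError stands in for ExecutionTimeoutError here
    run_with_snapshot(state, lambda s: (s["logs"].append("late write"), 1 / 0))
except ZeroDivisionError:
    pass
print(state)  # {'logs': []} -- the live dict never saw the stale write
```

On the happy path the snapshot is committed back, so callers still observe normal stateful behavior across calls.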
Follow-up pushed in

What changed:
Validation:
CI note: the fork PR workflows still appear to require maintainer approval before GitHub Actions can run.
💡 Codex Review
Reviewed commit: d9028cb818
The previous timeout follow-up only copied the top-level execution mapping, so timed-out workers could still mutate nested lists, dicts, or objects shared with later calls. That left the P1 race open for common in-place mutations like list append even after prompt-return semantics were fixed. This switches timed executions to deep-copied state/custom tool snapshots and adds regressions covering nested state reuse for both and .

Constraint: Python threads still cannot be force-killed safely
Rejected: Keep shallow copies | nested mutable values still leak after timeout
Rejected: Process isolation for local executor | much broader behavior change than this bugfix
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any future timeout-isolation change must cover nested mutable state, not just top-level dict writes
Tested: PYTHONPATH=src pytest -q tests/test_local_python_executor.py -k 'timeout'
Tested: PYTHONPATH=src pytest -q tests/test_local_python_executor.py -k 'not test_can_import_sklearn_if_explicitly_authorized'
Tested: targeted nested-state timeout repro for evaluate_python_code and LocalPythonExecutor
Not-tested: tests/test_local_python_executor.py::test_can_import_sklearn_if_explicitly_authorized (local sklearn missing)
Not-tested: Full repo test suite / GitHub Actions still require maintainer approval on fork PRs
Fresh follow-up pushed in

What changed:
Validation:
Note: the only excluded test in this environment is the existing sklearn-specific import test because
💡 Codex Review
Reviewed commit: aa42324abc
```python
else:
    state = copy.deepcopy(state)
```
Stop deep-copying persisted state unconditionally
evaluate_python_code now deep-copies state on every call, which requires all previously stored values to be deepcopyable; that breaks common stateful flows where prior code imported modules, because evaluate_import stores module objects in state and copy.deepcopy on a ModuleType raises TypeError before execution starts. In practice, a successful call like import time can make the next call with the same LocalPythonExecutor/state fail immediately, which is a functional regression introduced by this change.
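The failure mode is reproducible with the standard library alone, no smolagents needed; storing any module object in the state dict makes an unconditional deepcopy fail up front:

```python
import copy
import time  # any module object stored in state will do

# what a successful `import time` leaves behind in the executor state
state = {"time": time}

try:
    copy.deepcopy(state)
except TypeError as exc:
    # module objects have no deepcopy support and cannot be pickled,
    # so the copy fails before any user code runs
    print(f"deepcopy failed: {exc}")
```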
Fresh follow-up pushed in

What changed:
Validation:
CI note: fork PR workflows may still require maintainer approval before Actions can run.
💡 Codex Review
Reviewed commit: cdc8139c3f
```python
if isinstance(value, dict):
    return {key: snapshot_execution_context(val) for key, val in value.items()}
if isinstance(value, list):
```
Preserve aliasing when cloning execution context
snapshot_execution_context rebuilds containers with fresh comprehensions, which drops shared-reference identity across keys/variables. If a prior run leaves state with aliases (e.g., b = a), the next evaluate_python_code call clones them into different objects, so mutating a no longer updates b and execution results become silently incorrect even without any timeout path. This is a behavior regression from the previous in-place state model and can corrupt multi-step executor sessions.
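The aliasing loss is easy to demonstrate with plain standard-library copies; `copy.deepcopy` avoids it because its memo dict maps each original object to exactly one clone:

```python
import copy

a = [1, 2]
state = {"a": a, "b": a}  # b aliases a, as after `b = a` in user code

# rebuilding containers with fresh comprehensions clones a and b separately
rebuilt = {key: list(value) for key, value in state.items()}
rebuilt["a"].append(3)
print(rebuilt["b"])  # [1, 2] -- mutating a no longer updates b

# deepcopy's memo keeps the shared reference intact
deep = copy.deepcopy(state)
assert deep["a"] is deep["b"]
deep["a"].append(3)
print(deep["b"])  # [1, 2, 3]
```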
```python
if isinstance(value, dict):
    return {key: snapshot_execution_context(val) for key, val in value.items()}
```
Handle cycles while snapshotting execution state
The new recursive snapshot logic has no memo/visited tracking, so cyclic inputs like s = {}; s['self'] = s recurse indefinitely and fail before user code runs. Previously, such states could be reused because no deep snapshot occurred; now they can trigger RecursionError on entry to evaluate_python_code, making valid Python object graphs unusable.
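For the cyclic case, `copy.deepcopy`'s memo argument is exactly the visited-tracking this comment asks for. A memo-less recursive clone (sketched here as a hypothetical `naive_snapshot`) blows the stack, while `deepcopy` preserves the self-reference:

```python
import copy

s = {}
s["self"] = s  # cyclic state such as `s = {}; s['self'] = s`


def naive_snapshot(value):
    # no memo/visited set: recursing into the cycle never terminates
    if isinstance(value, dict):
        return {key: naive_snapshot(val) for key, val in value.items()}
    return value


try:
    naive_snapshot(s)
except RecursionError:
    print("naive clone hit RecursionError")

# deepcopy records each visited object in its memo, so the cycle survives
c = copy.deepcopy(s)
assert c["self"] is c
```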
The timeout isolation follow-up fixed stale worker mutations, but the snapshot helper still broke shared-reference semantics and cyclic state graphs. This update keeps runtime objects like modules intact while preserving aliasing and cycles for reused executor state, so subsequent calls keep behaving like normal Python stateful sessions.

Constraint: Timed-out worker threads still cannot be forcefully terminated safely in-process
Rejected: Keep unconditional deepcopy semantics | breaks imported modules, aliases, and cyclic state reuse
Rejected: Drop per-call state snapshots | reintroduces stale-worker mutation races after timeout
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any future state snapshot change must preserve aliasing/cycles as well as timeout isolation
Tested: ruff check src/smolagents/local_python_executor.py tests/test_local_python_executor.py
Tested: ruff format --check src/smolagents/local_python_executor.py tests/test_local_python_executor.py
Tested: PYTHONPATH=src pytest -q tests/test_local_python_executor.py -k "timeout or imported_modules_after_timeout_fix or preserves_aliasing or preserves_cycles"
Not-tested: Full pytest suite / GitHub Actions still require maintainer approval on fork PRs
Fresh follow-up pushed in

What changed:
Validation:
The only remaining blocker I can see from my side is still maintainer approval for the fork PR workflows so CI can run on the latest commits.
💡 Codex Review
Reviewed commit: f5dddf4604
```python
except Exception:
    copied_value = value
```
Avoid sharing uncopyable context objects after timeout
When copy.deepcopy fails, snapshot_execution_context silently falls back to copied_value = value, which keeps the original object shared between the caller and the timed worker thread. This reintroduces late-mutation races for common non-deepcopyable mutables (for example queue.Queue, which fails deepcopy because of its internal lock): after ExecutionTimeoutError returns, the background thread can still mutate caller-visible state/custom tools and corrupt later executions. Because this change was intended to isolate timed-out workers, silently aliasing here undermines the fix.
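`queue.Queue` is a concrete standard-library example of a non-deepcopyable mutable: its internal lock cannot be pickled, so `deepcopy` raises before copying anything, and a silent fallback would leave the very same queue object shared with the timed-out worker:

```python
import copy
import queue

q = queue.Queue()
q.put("pending work")

try:
    copy.deepcopy(q)
except TypeError as exc:
    # the queue's internal threading primitives cannot be pickled/deep-copied
    print(f"deepcopy failed: {exc}")
```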
Addressed in 774565e.
What changed:
- snapshot queue.Queue-style state objects into independent queue instances instead of letting them fall back to shared references
- when timeouts are disabled, skip snapshotting entirely so ordinary reused state keeps the old in-place behavior
- if timeout isolation still cannot safely snapshot a remaining state/custom tool object after deepcopy + shallow-copy fallback, raise a clear error instead of silently aliasing it
- add regressions for reused queue state after timeout and for rejecting truly unsnapshotable timeout state objects

Validation:
- ruff check src/smolagents/local_python_executor.py tests/test_local_python_executor.py
- ruff format --check src/smolagents/local_python_executor.py tests/test_local_python_executor.py
- PYTHONPATH=src pytest -q tests/test_local_python_executor.py -k "timeout or imported_modules_after_timeout_fix or preserves_aliasing or preserves_cycles or queue_objects or uncopyable_state_objects"
- result: 23 passed
The latest review pointed out that silently aliasing state when deepcopy fails still lets timed-out workers mutate caller-visible objects. This change snapshots queue.Queue-based objects explicitly, falls back to shallow copy before giving up, and raises a clear error when timeout isolation cannot safely snapshot a remaining state object. It also avoids snapshot overhead when timeouts are disabled and locks the new behavior with queue and uncopyable-object regressions.

Constraint: Timed-out worker threads still run in the background, so shared state objects must not leak back into later executions
Rejected: Keep aliasing deepcopy failures | still lets timed-out workers mutate caller-visible queue state after timeout
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: Any future timeout snapshot helper must preserve aliasing/cycles for supported objects and fail loudly instead of silently sharing unsupported mutable state
Tested: ruff check src/smolagents/local_python_executor.py tests/test_local_python_executor.py; ruff format --check src/smolagents/local_python_executor.py tests/test_local_python_executor.py; PYTHONPATH=src pytest -q tests/test_local_python_executor.py -k "timeout or imported_modules_after_timeout_fix or preserves_aliasing or preserves_cycles or queue_objects or uncopyable_state_objects"
Not-tested: Full test suite; GitHub Actions on fork PR (still maintainer approval gated)
💡 Codex Review
Reviewed commit: 774565e4c8
```python
if isinstance(value, dict):
    copied_dict = {}
```
Preserve mapping subclasses when cloning execution context
The snapshot logic converts every dict subclass into a plain {} and then evaluate_python_code commits that cloned state back on successful timed runs, so subclass behavior is silently lost even when no timeout occurs. A common case is collections.defaultdict: after one execution with default timeout settings, missing-key access starts raising KeyError because the stored object has been downgraded to dict, which is a functional regression from the previous in-place state behavior.
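The defaultdict downgrade can be shown directly: rebuilding into a plain `{}` drops the subclass and its `default_factory`, while `copy.deepcopy` reconstructs the subclass through its `__reduce__` protocol:

```python
import copy
from collections import defaultdict

counts = defaultdict(int)
counts["a"] += 1

# rebuilding with a plain dict comprehension silently drops the subclass
plain = {key: value for key, value in counts.items()}
assert type(plain) is dict  # missing-key access now raises KeyError

# deepcopy preserves both the subclass and its default_factory
kept = copy.deepcopy(counts)
assert isinstance(kept, defaultdict)
assert kept["b"] == 0  # missing keys still default to 0
```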
```python
with value.mutex:
    items = list(value.queue)
for item in items:
    copied_queue.put_nowait(snapshot_execution_context(item, memo))
```
Preserve Queue task counters when snapshotting state
The queue.Queue snapshot only copies buffered items and reconstructs a new queue via put_nowait, but it does not preserve bookkeeping like unfinished_tasks. When this cloned queue is committed back after a timed execution, join()/task_done() semantics can become incorrect (for example, outstanding work can be dropped and join() may return early), which can corrupt queue-based workflows that persist queue objects in executor state.
The latest open PR huggingface#2199 review pointed out a real regression: the timeout-state snapshot helper special-cased plain dict construction in a way that downgraded dict subclasses like collections.defaultdict into regular dicts on the next successful timed execution. This preserves dict subclass identity by snapshotting plain dicts and dict subclasses separately, and adds focused regressions for reused defaultdict state through both evaluate_python_code and LocalPythonExecutor.

Constraint: Keep timeout isolation intact without regressing normal reusable Python state semantics
Rejected: Widen this follow-up to queue unfinished-task bookkeeping changes | lower-priority behavioral surface beyond the current P1 mapping-subclass regression
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Any future snapshot helper changes must preserve container subtype behavior in addition to aliasing/cycle isolation
Tested: ruff check src/smolagents/local_python_executor.py tests/test_local_python_executor.py
Tested: ruff format --check src/smolagents/local_python_executor.py tests/test_local_python_executor.py
Tested: PYTHONPATH=src pytest -q tests/test_local_python_executor.py -k "timeout or imported_modules_after_timeout_fix or preserves_aliasing or preserves_cycles or queue_objects or uncopyable_state_objects or defaultdict"
Not-tested: Full pytest suite; GitHub Actions still require maintainer approval on fork PRs
Related: PR huggingface#2199
Fresh follow-up pushed in

What changed:
Validation:
I did not widen this follow-up to the lower-priority queue bookkeeping comment yet. That one touches
💡 Codex Review
Reviewed commit: 4e7e534145
```python
if isinstance(value, (ModuleType, FunctionType, BuiltinFunctionType, MethodType, type)):
    memo[obj_id] = value
    return value
```
Preserve bound-method aliasing in state snapshots
snapshot_execution_context currently returns BuiltinFunctionType/MethodType objects as-is, even when their owner objects are being cloned in the same snapshot. With timeout-enabled executions (the default), a state like {"items": [...], "append_fn": items.append} is cloned into a new items list but keeps append_fn bound to the old list, so append_fn(...) mutates stale data while items appears unchanged. This silently breaks normal Python aliasing semantics across calls and can produce incorrect results without any timeout actually firing.
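The split-alias hazard can be sketched with a small hypothetical `Bag` class. Passing a bound method through unchanged while cloning its owner leaves the method targeting the stale object; `copy.deepcopy` avoids this for Python-level bound methods by rebinding them through its memo (note that builtin bound methods like `list.append` are treated as atomic by `deepcopy` and really are returned as-is):

```python
import copy


class Bag:
    def __init__(self):
        self.items = [1]

    def add(self, x):
        self.items.append(x)


bag = Bag()
state = {"bag": bag, "add_fn": bag.add}

# cloning the owner but passing the bound method through splits the alias
cloned = {"bag": copy.deepcopy(state["bag"]), "add_fn": state["add_fn"]}
cloned["add_fn"](2)         # mutates the *original* bag
print(cloned["bag"].items)  # [1] -- the clone looks untouched
print(bag.items)            # [1, 2] -- the stale object was mutated instead

# deepcopy of the whole mapping rebinds the method to the copied owner
deep = copy.deepcopy(state)
assert deep["add_fn"].__self__ is deep["bag"]
```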
VANDRANKI left a comment
The executor shutdown fix is correct and addresses a real usability problem. The old with ThreadPoolExecutor(...) as executor block calls executor.shutdown(wait=True) on exit. On a timeout, this means the ExecutionTimeoutError is raised but the caller blocks waiting for the timed-out thread to finish - defeating the purpose of the timeout. Switching to executor.shutdown(wait=not timed_out, cancel_futures=timed_out) returns control immediately on timeout. The docstring correctly notes that the worker thread may still run to completion in the background - Python cannot forcibly kill threads.
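A minimal standalone sketch of the shutdown pattern described above (the helper name and exception are stand-ins for smolagents' internals, not its actual API):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout


class ExecutionTimeoutError(Exception):
    """Stand-in for smolagents' timeout exception."""


def run_with_timeout(func, timeout_seconds, *args, **kwargs):
    executor = ThreadPoolExecutor(max_workers=1)
    timed_out = False
    try:
        future = executor.submit(func, *args, **kwargs)
        try:
            return future.result(timeout=timeout_seconds)
        except FuturesTimeout:
            timed_out = True
            raise ExecutionTimeoutError(
                f"Code execution exceeded the maximum execution time of {timeout_seconds} seconds"
            ) from None
    finally:
        # normal completion: wait for the worker; timeout: return immediately,
        # cancel queued work, and let the running thread finish in the background
        executor.shutdown(wait=not timed_out, cancel_futures=timed_out)
```

With this shape, a call like `run_with_timeout(time.sleep, 0.2, 1.0)` raises near the 0.2s mark instead of blocking for the full second (`cancel_futures` requires Python 3.9+).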
snapshot_execution_context: the function is well-designed for its stated purpose - it handles the standard copy hazards (cycles via memo, modules/functions passed by reference, Queue items copied under mutex). The MethodType addition to the reference-passthrough list is correct - bound methods should not be deep-copied.
Where is snapshot_execution_context called? I can see the function definition in the diff but not its call sites. If it is not yet called from evaluate_python_code or the executor main loop, the snapshot machinery is unreachable dead code. Could you point to where it gets invoked?
Interaction with PR #2187: PR #2187 adds ctx = contextvars.copy_context() and ctx.run(func, ...) to the same timeout function. Both PRs are valid independently, but they modify the same lines and will conflict. The merged solution should include both the cancel_futures shutdown fix from this PR and the copy_context() propagation from #2187.
The executor shutdown fix is LGTM. The snapshot utility needs call-site evidence before I can assess its impact.
@VANDRANKI thanks for the review. After dropping the snapshot machinery the call-site question no longer applies. On the #2187 overlap: both PRs only touch the

Keeping state isolation for timed-out workers out of scope here. The docstring already notes the worker thread may keep running in the background, so reusing state across a timeout is a caller concern and not something we can solve inside a pure-Python threading model.
Pushed 9914fdb to reduce scope back to the original shutdown fix. The snapshot_execution_context helper and the added state-handling tests are gone, so the earlier call-site question is moot. What remains is the timeout() decorator change plus two prompt-return regression tests (evaluate_python_code and LocalPythonExecutor). Local check: PYTHONPATH=src python3 -m pytest tests/test_local_python_executor.py -k timeout -q -> 12 passed.
💡 Codex Review
Reviewed commit: 9914fdbf86
```python
            f"Code execution exceeded the maximum execution time of {timeout_seconds} seconds"
        )
    finally:
        executor.shutdown(wait=not timed_out, cancel_futures=timed_out)
```
Snapshot custom tools before allowing timed worker to continue
Switching to executor.shutdown(wait=not timed_out, ...) allows a timed-out worker to keep running, but evaluate_python_code still executes against the caller’s custom_tools mapping; for LocalPythonExecutor.__call__, that is the shared self.custom_tools reused across calls. If timed-out code reaches a late def/class statement (for example after sleep), evaluate_function_def/evaluate_class_def can still mutate that shared tool context after ExecutionTimeoutError has been raised, making subsequent executions observe nondeterministic tool availability.
The reduced scope is the right call. Dropping the snapshot machinery makes the PR much easier to reason about and review. What remains is clean:
On the state-isolation concern that Codex also flagged: your reasoning is sound. A pure-Python threading model cannot forcibly terminate the background thread, so snapshotting state before the call and restoring it after a timeout would need to happen at the caller level anyway. The docstring note that the worker may keep running is the right place to document this. LGTM from my side.
Fixes #2197
Summary
Make LocalPythonExecutor timeouts return control promptly without letting timed-out workers mutate later executions' live state, including nested mutable values.

Problem / Root Cause
PR #1910 introduced a thread-based timeout mechanism for local Python execution. The original implementation wrapped future.result(timeout=...) in a with ThreadPoolExecutor(...) block. When the timeout is exceeded, future.result() raises as expected, but exiting the with block shuts the executor down in a way that waits for the worker thread to finish. In practice, callers do not regain control near the configured timeout boundary.

Minimal repro before this patch:
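A standalone equivalent using only the standard library (the pre-patch smolagents wrapper used the same `with ThreadPoolExecutor(...)` plus `future.result(timeout=...)` shape):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

start = time.monotonic()
try:
    # pre-patch shape: the `with` block's __exit__ calls shutdown(wait=True)
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(time.sleep, 3)
        future.result(timeout=1)  # raises after ~1s ...
except FuturesTimeout:
    pass
elapsed = time.monotonic() - start  # ... but ~3s elapse before control returns
print(f"elapsed: {elapsed:.2f}s")
```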
Observed before patch: elapsed is about 3.00s, not about 1.00s.

After switching to prompt-return shutdown semantics, there are two stale-worker races to account for:
state (for example lists/dicts/objects) could still be mutated in place after the timeout if only a shallow copy was used.

In particular, a reused LocalPythonExecutor could otherwise see late variable/log overwrites or nested list/dict mutations from the stale worker after a later call has already completed.

This PR does not treat LocalPythonExecutor as a security boundary. It fixes timeout semantics / caller behavior and isolates the live per-call execution dictionaries from stale timed-out workers as far as Python in-process threading allows.

Solution
Switch from context-manager-based executor lifetime to explicit shutdown behavior:

- wait=True on normal completion,
- wait=False and cancel_futures=True after a timeout.

Then isolate execution context per call by deep-copying state/custom_tools for timed executions and only committing them back after non-timeout completion. That keeps prompt timeout behavior while preventing stale timed-out workers from mutating the dictionaries and nested mutable values reused by later executions.

Regression tests
Added tests that verify:

- evaluate_python_code(..., timeout_seconds=1)
- LocalPythonExecutor(..., timeout_seconds=1)
- evaluate_python_code does not overwrite a reused state dict later
- LocalPythonExecutor does not overwrite later variables/logs after a subsequent successful call
- evaluate_python_code does not mutate nested reused state values later
- LocalPythonExecutor does not mutate nested reused state values later

The evaluate_python_code timing assertion runs in a fresh interpreter subprocess to avoid interference from background timed-out worker threads created by other tests.

Validation
Passed locally. The only skipped full-file coverage in this environment is the existing sklearn-specific test because sklearn is not installed locally.