Skip to content

Conversation

@kevalmahajan
Copy link
Member

@kevalmahajan kevalmahajan commented Jan 29, 2026

✨ Feature / Enhancement PR

πŸ”— Epic / Issue

Closes #2078


πŸš€ Summary (1-2 sentences)

This PR implements strict per-tool timeout enforcement for all transports (REST, SSE, StreamableHTTP) and significantly enhances the CircuitBreakerPlugin with half-open states, retry headers, and granular configuration overrides.

All failure logic is decoupled via the Plugin System:

ToolService: Executes requests, enforces strict limits (Timeouts), reports status.
CircuitBreakerPlugin: Decides if a request should be attempted based on history.


Configurations for timeout:

  1. Global Tool Call Timeout :
# Env Variable: 
TOOL_TIMEOUT 
  1. Per Tool Timeout in Tool Schema:
timeout_ms 

Example Precedence Calculation:

Tool timeout_ms value Global timeout used Effective timeout Notes
Tool A 10000 Ignored 10 seconds Explicit timeout overrides global
Tool B None TOOL_TIMEOUT = 60 60 seconds Falls back to global timeout
Tool C 0 TOOL_TIMEOUT = 60 60 seconds 0 treated as not set, uses global

πŸ§ͺ Checks

  • make lint passes
  • make test passes
  • CHANGELOG updated (if user-facing)

πŸ““ Notes (optional)

Key Changes

  1. Strict Timeout Enforcement (tool_service.py):
    • Wrapped all tool invocations (REST, SSE, StreamableHTTP, A2A) in asyncio.wait_for.
    • Ensures per-tool timeout_ms overrides global settings.
    • Fix: Modified get_httpx_client_factory
  2. Circuit Breaker Enhancements (plugins/circuit_breaker.py):
    • Half-Open State: Allows single "probe" request after cooldown. Success closes circuit; failure re-opens it.
    • Retry-After: Returns precise retry_after_seconds in violation response.
    • Timeout Handling: Timeouts now explicitly trigger the circuit breaker via cb_timeout_failure context flag.
  3. Metrics:
    • Added tool_timeout_total and circuit_breaker_open_total Prometheus counters.
flowchart TD
    Client -->|Invoke Tool| Gateway(ToolService)
    Gateway -->|1. check| CB[CircuitBreakerPlugin]
    CB -- Open? --> Reject[429 / Blocked]
    CB -- Closed? --> Gateway
    
    Gateway -->|2. Invoke w/ Timeout| Tool[External Tool]
    
    Tool -- Success --> Gateway
    Gateway -->|3. Record Success| CB
    
    Tool -- Timeout/Error --> Gateway
    Gateway -- Flag: cb_timeout_failure --> CB
    Gateway -->|4. Record Failure| CB
    CB -- Trip Threshold? --> CBState[Open Circuit]

Loading

Feature Map & Implementation Details:

Feature Request Implementation Mechanism Location
Strict Timeout Enforcement Wrapped invocations in asyncio.wait_for. Prioritizes tool-specific timeout_ms over global default. mcpgateway/services/tool_service.py
Circuit Breaker Logic Plugin-based implementation using sliding window for error rates and consecutive failure counters. plugins/circuit_breaker/circuit_breaker.py
Timeouts Count as Failures Core catches TimeoutError and sets context flag. Plugin reads flag and counts as failure even without HTTP status. tool_service.py (Setter), circuit_breaker.py (Getter)
Half-Open State Probe logic allows 1 request after cooldown. Success closes circuit; failure re-opens immediately. circuit_breaker.py (tool_pre_invoke)
Retry-After Header Calculates open_until - now and returns exact seconds in violation details. circuit_breaker.py
Granular / Per-Tool Config tool_overrides map allows custom thresholds for critical tools (e.g., higher patience for Payment APIs). plugins/config.yaml
Metrics Added tool_timeout_total and circuit_breaker_open_total Prometheus counters. mcpgateway/services/metrics.py

Verification

Unit Tests: Added TestToolTimeoutsAndRetries covering state transitions, timeout precedence, and plugin logic.

@kevalmahajan kevalmahajan marked this pull request as draft January 29, 2026 12:39
@kevalmahajan kevalmahajan force-pushed the 2078_circuit_breakers_tool_invocation branch from f09e918 to 2285ea2 Compare January 29, 2026 13:41
@kevalmahajan kevalmahajan marked this pull request as ready for review January 29, 2026 14:14
@crivetimihai
Copy link
Member

crivetimihai commented Jan 31, 2026

Code review

Found 1 issue:

  1. Inconsistent variable used for Prometheus counter labels - The timeout counter uses name while logging uses tool_name_computed or tool_name_original. When tools use name aliasing, metrics will not properly correlate with logs.

https://github.com/IBM/mcp-context-forge/blob/8fbd02da86ab3b8c7accf5712126b25a2fe01883/mcpgateway/services/tool_service.py#L2880-L2884

The counter at line 2882 uses name:

tool_timeout_counter.labels(tool_name=name).inc()

But the log at line 2870 uses tool_name_computed:

message=f"REST tool invocation timed out: {tool_name_computed}",

This same inconsistency appears in the SSE and StreamableHTTP timeout handlers (lines 3147 and 3291 use name for counter, while logs use tool_name_original).

Implement strict per-tool timeout enforcement for all transports
(REST, SSE, StreamableHTTP, A2A) and enhance the CircuitBreakerPlugin
with half-open states, retry headers, and granular configuration.

Changes:
- Wrap all tool invocations in asyncio.wait_for with effective_timeout
- Add per-tool timeout_ms support (ms to seconds conversion)
- Add half-open state for circuit breaker recovery testing
- Add half_open_in_flight flag to prevent concurrent probe requests
- Add retry_after_seconds in violation response for rate limiting
- Add tool_timeout_total and circuit_breaker_open_total Prometheus metrics
- Add cb_timeout_failure context flag for timeout detection in plugins
- Add tool_overrides for per-tool circuit breaker configuration
- Handle both asyncio.TimeoutError and httpx.TimeoutException
- Log actual elapsed time instead of configured timeout

Fixes applied during review:
- Fix _is_error() to detect camelCase isError from model_dump(by_alias=True)
- Fix half-open probe guard: only check when st.half_open is True
- Add stale-probe timeout to prevent permanent wedge if plugin blocks
- Add timeout enforcement to A2A tool invocations
- Call tool_post_invoke on exceptions so circuit breaker tracks failures
- Add ToolTimeoutError subclass to distinguish timeouts from other errors
- Only skip post_invoke for ToolTimeoutError (not all ToolInvocationError)
- Update README to document isError camelCase support

Timeout precedence:
1. Per-tool timeout_ms (if set and non-zero)
2. Global TOOL_TIMEOUT setting (default: 60s)

Closes #2078

Co-authored-by: Keval Mahajan <[email protected]>
Signed-off-by: Mihai Criveti <[email protected]>
@crivetimihai crivetimihai force-pushed the 2078_circuit_breakers_tool_invocation branch from 3ef8dbb to 5efaabb Compare February 1, 2026 01:11
@crivetimihai
Copy link
Member

Code Review Changes Applied

Rebased onto main and squashed commits. The following issues were identified and fixed during review:

High Severity Fixes

Issue Fix
isError vs is_error mismatch - model_dump(by_alias=True) produces isError but _is_error() only checked is_error, causing circuit breaker to miss most errors _is_error() now checks both is_error and isError
Exceptions bypass tool_post_invoke - HTTP 4xx/5xx and MCP failures skipped circuit breaker tracking Generic exception handler now calls tool_post_invoke with error result
Double-counting failures - Timeout handlers called post_invoke, then exception handler called it again Created ToolTimeoutError subclass; only skip post_invoke for ToolTimeoutError

Medium Severity Fixes

Issue Fix
A2A ignores per-tool timeouts - self._http_client.post() had no asyncio.wait_for wrapper Added timeout enforcement to A2A tool invocations
Half-open guard ineffective - Once open_until=0.0, condition was falsy and half_open_in_flight check was bypassed Moved check; only block when st.half_open is True
Half-open wedge scenario - If a later plugin blocks after half_open_in_flight=True is set, tool becomes permanently blocked Added half_open_started timestamp and stale-probe detection (auto-reset after cooldown)
Non-timeout errors skipped circuit breaker - OAuth failures, missing params raised ToolInvocationError which was caught and re-raised without post_invoke Only ToolTimeoutError skips post_invoke; other ToolInvocationError goes through normal handler

Low Severity Fixes

Issue Fix
README out of date - Docs said only is_error, but code now also checks isError Updated README to document both formats

New Tests Added

  • test_is_error_with_camel_case() - Verifies isError detection
  • test_stale_probe_detection_resets_half_open() - Verifies stale probe cleanup
  • Updated test_blocks_concurrent_probes_during_half_open() - Now requires st.half_open=True

Summary of Changes

 mcpgateway/services/tool_service.py        | +112 lines (timeout handling, ToolTimeoutError, exception handler)
 plugins/circuit_breaker/circuit_breaker.py | +45 lines (isError check, stale probe detection, half_open_started)
 plugins/circuit_breaker/README.md          | +1 line (isError documentation)
 tests/unit/plugins/test_circuit_breaker.py | +23 lines (new tests)

All 48 tests pass. Ready for merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Tool invocation timeouts and circuit breaker

3 participants