Auto-fall back to HTTP/1.1 on HTTP/2 protocol errors #410

Open
manan164 wants to merge 1 commit into main from fix/http2-auto-fallback

Conversation

Contributor

@manan164 manan164 commented Jun 3, 2026

Problem

Customers running workers behind proxies/load balancers that mishandle long-lived HTTP/2 (observed across GCP Cloud Run and AWS deployments) report workers that stop polling — queues back up, last-poll time drifts to tens of seconds, with no CPU or memory spike.

Root cause: HTTP/2 is enabled by default (CONDUCTOR_HTTP2_ENABLED=true). When a long-lived h2 connection produces a protocol-level error (GOAWAY storm, stale keep-alive reset), the existing self-healing reset rebuilds another HTTP/2 client that hits the same wall. The poll loop can cycle reset → fail → reset, and because the worker only polls when it has free capacity, stalled requests pin slots and polling effectively halts.
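
The slot-pinning effect described above can be illustrated with a toy model. This is only a sketch: `simulate_poll_loop` and its parameters are hypothetical names, and it assumes one request is dispatched per poll and that a stalled request never completes within the simulated window.

```python
def simulate_poll_loop(capacity, ticks, requests_stall):
    """Return, per tick, whether the worker polled.

    Toy model: the worker polls only when it has a free slot. Each poll
    dispatches one request. A stalled request keeps its slot for the rest
    of the simulation; a healthy one frees it before the next tick.
    """
    in_flight = 0
    polled = []
    for _ in range(ticks):
        if in_flight < capacity:
            polled.append(True)
            if requests_stall:
                in_flight += 1  # the stalled request pins this slot
        else:
            polled.append(False)  # no free capacity, so no poll
    return polled


# Healthy transport: the worker polls on every tick.
print(simulate_poll_loop(capacity=2, ticks=5, requests_stall=False))
# Stalling transport: polling stops once every slot is pinned.
print(simulate_poll_loop(capacity=2, ticks=5, requests_stall=True))
```

In this model the worker stays idle rather than busy once its slots are pinned, which matches the reported symptom of halted polling without a CPU or memory spike.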

Fix

On ProtocolError / ReadError / WriteError, if HTTP/2 is currently enabled, the connection reset now downgrades to HTTP/1.1 for the remainder of the process (sticky) instead of rebuilding HTTP/2. This breaks the failure cycle while keeping the worker self-healing — no process restart required.

  • Healthy environments are unaffected: HTTP/2 stays default-on and is only dropped after an actual protocol error.
  • New opt-out: CONDUCTOR_HTTP2_AUTO_FALLBACK=false keeps retrying on HTTP/2.
  • Applied to both the sync (rest.py, TaskRunner) and async (async_rest.py, AsyncTaskRunner) clients. A one-time WARNING is logged when the downgrade happens.
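
The sticky downgrade can be sketched as below. This is not the code in rest.py or async_rest.py: the exception classes are local stand-ins for the transport library's ProtocolError / ReadError / WriteError, and `FallbackClient`, `env_flag`, `build_client`, and the accepted truthy strings are assumptions made for illustration. Only the two environment variable names come from this PR.

```python
import logging
import os

logger = logging.getLogger("http2_fallback_sketch")


# Stand-ins for the transport library's exceptions (assumption: the real
# client catches the library's own classes with these names).
class ProtocolError(Exception):
    pass


class ReadError(Exception):
    pass


class WriteError(Exception):
    pass


FALLBACK_ERRORS = (ProtocolError, ReadError, WriteError)


def env_flag(name, default, environ=None):
    """Parse a boolean env var; unset means `default` (parsing rules assumed)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


class FallbackClient:
    """Hypothetical wrapper that rebuilds its client on connection reset."""

    def __init__(self, build_client, environ=None):
        # build_client is a hypothetical factory: callable(http2: bool) -> client
        self._build_client = build_client
        self.http2 = env_flag("CONDUCTOR_HTTP2_ENABLED", True, environ)
        self.auto_fallback = env_flag("CONDUCTOR_HTTP2_AUTO_FALLBACK", True, environ)
        self.client = build_client(self.http2)

    def reset(self, error=None):
        """Rebuild the client; downgrade to HTTP/1.1 after a protocol error."""
        if isinstance(error, FALLBACK_ERRORS) and self.http2 and self.auto_fallback:
            # Sticky: the flag is never set back to True in this process,
            # so this branch (and its warning) runs at most once.
            self.http2 = False
            logger.warning("HTTP/2 error (%r); using HTTP/1.1 from now on", error)
        self.client = self._build_client(self.http2)
```

Because the downgrade flips a process-wide flag instead of counting failures, a single protocol error is enough to leave HTTP/2, and setting CONDUCTOR_HTTP2_AUTO_FALLBACK=false restores the previous rebuild-on-HTTP/2 behavior.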

Tests

  • tests/unit/api_client/test_rest_client.py — downgrade-on-protocol-error and fallback-disabled cases.
  • tests/unit/api_client/test_async_rest_client.py — new file, same coverage for the async client.

All 17 tests in the two files pass.

Notes / scope

  • This is a transport-resilience fix. A related, separate issue is that update-retry backoff (sleep(attempt*10)) runs while holding a worker slot, which can also starve polling regardless of transport — not addressed here.
  • The lease_extend / memory-leak fixes are already shipped (LeaseManager, 1.3.11); this PR is only the HTTP/2 piece.
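
To give a sense of scale for the backoff issue in the first note, the arithmetic can be worked through as follows. The retry count and the attempt numbering are assumptions; only the sleep(attempt*10) formula comes from this PR, and `total_backoff_seconds` is a hypothetical helper.

```python
def total_backoff_seconds(attempts, step_seconds=10):
    """Sum the sleeps for sleep(attempt * step_seconds) over the given attempts."""
    return sum(attempt * step_seconds for attempt in attempts)


# Assuming three retries numbered 1, 2, 3: 10 + 20 + 30 seconds.
print(total_backoff_seconds(range(1, 4)))
```

Under those assumptions a single failing update keeps its worker slot occupied for a full minute of sleeping, on top of the request time itself.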

🤖 Generated with Claude Code

Long-lived HTTP/2 connections through some proxies/load balancers
(e.g. GCP Cloud Run, AWS ALB) can produce protocol-level errors
(GOAWAY storms, stale keep-alive resets). The existing self-healing
reset rebuilt another HTTP/2 client that hit the same wall, so the
poll loop could cycle reset->fail->reset and effectively stop polling
with no CPU/memory spike.

On a ProtocolError/ReadError/WriteError, if HTTP/2 is enabled, the
reset now downgrades to HTTP/1.1 for the remainder of the process
(sticky) instead of rebuilding HTTP/2. Default-on HTTP/2 behavior is
unchanged for healthy environments. Opt out of the fallback with
CONDUCTOR_HTTP2_AUTO_FALLBACK=false.

Applied to both the sync (rest.py) and async (async_rest.py) clients;
added unit tests for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
src/conductor/client/http/async_rest.py 49.23% <100.00%> (+25.98%) ⬆️
src/conductor/client/http/rest.py 83.89% <100.00%> (-0.18%) ⬇️
