Auto-fall back to HTTP/1.1 on HTTP/2 protocol errors by manan164 · Pull Request #410 · conductor-oss/python-sdk

manan164 · 2026-06-03T11:09:19Z

Problem

Customers running workers behind proxies/load balancers that mishandle long-lived HTTP/2 connections (observed across GCP Cloud Run and AWS deployments) report workers that stop polling: queues back up and last-poll time drifts to tens of seconds, with no CPU or memory spike.

Root cause: HTTP/2 is enabled by default (CONDUCTOR_HTTP2_ENABLED=true). When a long-lived h2 connection produces a protocol-level error (GOAWAY storm, stale keep-alive reset), the existing self-healing reset rebuilds another HTTP/2 client that hits the same wall. The poll loop can cycle reset → fail → reset, and because the worker only polls when it has free capacity, stalled requests pin slots and polling effectively halts.
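
The capacity gate in the root cause can be illustrated with a toy model. This is not the SDK's TaskRunner: `poll_batch_size`, `total_slots`, and `in_flight` are hypothetical names, used only to show why pinned slots stop polling without any CPU or memory signal.

```python
def poll_batch_size(total_slots: int, in_flight: int) -> int:
    """Number of tasks a worker would ask for on this poll iteration.

    Hypothetical helper: the worker polls only for as many tasks as it has
    free slots, so zero free slots means no poll request is sent at all.
    """
    return max(total_slots - in_flight, 0)


total_slots = 4

# Healthy case: one task running, three slots free, so the worker polls for 3.
healthy = poll_batch_size(total_slots, in_flight=1)

# Degraded case: four requests are stalled on a broken HTTP/2 connection.
# Each one still occupies a slot, so the worker has nothing to poll for.
stalled = poll_batch_size(total_slots, in_flight=4)

print(healthy, stalled)
```

In the degraded case the process is idle rather than busy, which is consistent with the absence of a CPU or memory spike.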

Fix

On ProtocolError / ReadError / WriteError, if HTTP/2 is currently enabled, the connection reset now downgrades to HTTP/1.1 for the remainder of the process (sticky) instead of rebuilding HTTP/2. This breaks the failure cycle while keeping the worker self-healing — no process restart required.

Healthy environments are unaffected: HTTP/2 stays default-on and is only dropped after an actual protocol error.
New opt-out: CONDUCTOR_HTTP2_AUTO_FALLBACK=false keeps retrying on HTTP/2.
Applied to both the sync (rest.py, TaskRunner) and async (async_rest.py, AsyncTaskRunner) clients. A one-time WARNING is logged when the downgrade happens.
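
A minimal sketch of the sticky downgrade described above, using only the standard library. Everything except the two environment variable names is an assumption made for illustration: `FallbackClient`, `env_flag`, `fake_connection`, the local `ProtocolError` stand-in, the accepted truthy strings, and the single retry after a reset are hypothetical and are not taken from rest.py or async_rest.py.

```python
import logging
import os

logger = logging.getLogger(__name__)


class ProtocolError(Exception):
    """Stand-in for the transport library's HTTP/2 protocol error."""


def env_flag(name: str, default: bool) -> bool:
    """Read a boolean environment variable; unset falls back to the default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")


class FallbackClient:
    """Hypothetical wrapper illustrating a sticky HTTP/2 -> HTTP/1.1 downgrade."""

    def __init__(self, make_connection):
        # make_connection(http2: bool) returns a callable that sends a request.
        self._make_connection = make_connection
        self.http2 = env_flag("CONDUCTOR_HTTP2_ENABLED", True)
        self.auto_fallback = env_flag("CONDUCTOR_HTTP2_AUTO_FALLBACK", True)
        self._send = make_connection(self.http2)

    def reset_connection(self):
        if self.http2 and self.auto_fallback:
            # Sticky: nothing in this class ever sets the flag back to True,
            # so the warning below is logged at most once per client.
            self.http2 = False
            logger.warning("HTTP/2 protocol error; falling back to HTTP/1.1")
        self._send = self._make_connection(self.http2)

    def request(self, path):
        try:
            return self._send(path)
        except ProtocolError:
            self.reset_connection()
            # Single retry on the rebuilt connection (an assumption here).
            return self._send(path)


def fake_connection(http2):
    """Simulated transport: HTTP/2 always fails, HTTP/1.1 always succeeds."""
    def send(path):
        if http2:
            raise ProtocolError("GOAWAY")
        return f"HTTP/1.1 200 {path}"
    return send


client = FallbackClient(fake_connection)
print(client.request("/tasks/poll"), client.http2)
```

With `CONDUCTOR_HTTP2_AUTO_FALLBACK=false`, `reset_connection` in this sketch rebuilds an HTTP/2 connection instead, so the simulated error is raised again on the retry.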

Tests

tests/unit/api_client/test_rest_client.py — downgrade-on-protocol-error and fallback-disabled cases.
tests/unit/api_client/test_async_rest_client.py — new file, same coverage for the async client.

All 17 tests in the two files pass.

Notes / scope

This is a transport-resilience fix. A related, separate issue is that update-retry backoff (sleep(attempt*10)) runs while holding a worker slot, which can also starve polling regardless of transport — not addressed here.
The lease_extend / memory-leak fixes are already shipped (LeaseManager, 1.3.11); this PR is only the HTTP/2 piece.
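
For a sense of scale on the update-retry backoff mentioned in the first note, the arithmetic can be worked out under stated assumptions: attempts are numbered from 1, every failed attempt sleeps `attempt * 10` seconds, and the retry count is a parameter because the PR does not state it. `slot_hold_seconds` is a hypothetical helper, not an SDK function.

```python
def slot_hold_seconds(failed_attempts: int, step_seconds: int = 10) -> int:
    """Total time spent sleeping while a worker slot stays occupied.

    Assumes attempt numbers start at 1 and each failed attempt sleeps
    attempt * step_seconds before the next try.
    """
    return sum(attempt * step_seconds for attempt in range(1, failed_attempts + 1))


# Three consecutive failures: 10 + 20 + 30 seconds of sleep on one slot.
print(slot_hold_seconds(3))
```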

🤖 Generated with Claude Code

Long-lived HTTP/2 connections through some proxies/load balancers (e.g. GCP Cloud Run, AWS ALB) can produce protocol-level errors (GOAWAY storms, stale keep-alive resets). The existing self-healing reset rebuilt another HTTP/2 client that hit the same wall, so the poll loop could cycle reset->fail->reset and effectively stop polling with no CPU/memory spike.

On a ProtocolError/ReadError/WriteError, if HTTP/2 is enabled, the reset now downgrades to HTTP/1.1 for the remainder of the process (sticky) instead of rebuilding HTTP/2. Default-on HTTP/2 behavior is unchanged for healthy environments. Opt out of the fallback with CONDUCTOR_HTTP2_AUTO_FALLBACK=false.

Applied to both the sync (rest.py) and async (async_rest.py) clients; added unit tests for both.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-03T11:12:44Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/conductor/client/http/async_rest.py | `49.23% <100.00%> (+25.98%)` | ⬆️ |
| src/conductor/client/http/rest.py | `83.89% <100.00%> (-0.18%)` | ⬇️ |

manan164 wants to merge 1 commit into main from fix/http2-auto-fallback
