Skip to content

feat: add opt-in ListenV2 stream keepalive to prevent L7 proxy idle timeout#3281

Open
shaunpatterson wants to merge 3 commits intohatchet-dev:mainfrom
shaunpatterson:fix/listenV2-stream-keepalive
Open

feat: add opt-in ListenV2 stream keepalive to prevent L7 proxy idle timeout#3281
shaunpatterson wants to merge 3 commits intohatchet-dev:mainfrom
shaunpatterson:fix/listenV2-stream-keepalive

Conversation

@shaunpatterson
Copy link

Summary

Adds an opt-in server-side stream keepalive for ListenV2 to prevent L7 proxies from killing idle gRPC streams. When enabled, the engine sends periodic STREAM_KEEPALIVE messages (no-op AssignedAction) on idle streams, generating DATA frames that reset the proxy's stream idle timer.

Fixes #3280

Problem

L7 proxies like Envoy (Azure Container Apps, AWS ALB, etc.) enforce per-stream idle timeouts (~4 minutes on ACA). When no tasks are being dispatched, the ListenV2 stream has zero DATA frames and gets killed. gRPC keepalive PINGs don't help because they're connection-level and proxies respond to them locally (envoyproxy/envoy#5142, microsoft/azure-container-apps#113).

Why this doesn't reintroduce #308

PR #308 removed server-side heartbeats from the stream because proxies were keeping dead worker connections alive — the server heartbeats flowed through the proxy, masking worker death.

This change is different:

  • Worker liveness is still determined solely by the separate Heartbeat RPC (unchanged)
  • The STREAM_KEEPALIVE message only prevents the proxy from killing the stream
  • If a worker dies, the Heartbeat RPC stops and the engine detects it within 30 seconds regardless of stream state
  • The engine does NOT use stream write-success as a liveness signal

Configuration

SERVER_LISTENV2_STREAM_KEEPALIVE_INTERVAL=120s  # disabled by default (0s)

Client Impact

Old SDKs receive ActionType=3 but don't match any handler — they skip it. No client changes required.

Note

The dispatcher.pb.go was manually updated. Proper protobuf regeneration should be done before merging.

Copilot AI review requested due to automatic review settings March 15, 2026 01:17
@vercel
Copy link

vercel bot commented Mar 15, 2026

@shaunpatterson is attempting to deploy a commit to the Hatchet Team on Vercel.

A member of the Team first needs to authorize it.

@promptless-for-oss
Copy link

📝 Documentation updates detected!

New suggestion: Document SERVER_LISTENV2_STREAM_KEEPALIVE_INTERVAL for L7 proxy support


Tip: Whenever you leave a comment tagged @Promptless on a Promptless PR, Promptless will remember it for future suggestions 🧠

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in server-side keepalive mechanism for ListenV2 gRPC streams to avoid L7 proxy stream_idle_timeout disconnects by periodically sending a no-op AssignedAction (STREAM_KEEPALIVE) when configured.

Changes:

  • Adds ListenV2StreamKeepaliveInterval runtime config and env binding (SERVER_LISTENV2_STREAM_KEEPALIVE_INTERVAL).
  • Wires the new interval into the dispatcher and sends periodic STREAM_KEEPALIVE messages from ListenV2.
  • Extends dispatcher protobuf contracts with a new ActionType enum value (STREAM_KEEPALIVE).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/config/server/server.go Adds runtime config field + env var binding for ListenV2 keepalive interval.
internal/services/dispatcher/server.go Implements periodic keepalive sends within the ListenV2 server-stream loop.
internal/services/dispatcher/dispatcher.go Plumbs keepalive interval through dispatcher options into DispatcherImpl.
cmd/hatchet-engine/engine/run.go Passes runtime keepalive interval into dispatcher construction for v0 and v1 config paths.
internal/services/dispatcher/contracts/dispatcher.pb.go Manually adds new ActionType_STREAM_KEEPALIVE enum value to generated Go protobuf output.
api-contracts/dispatcher/dispatcher.proto Adds STREAM_KEEPALIVE to the public dispatcher ActionType enum with “clients should ignore” guidance.

Comment on lines +358 to +369
// Optionally send periodic keepalive messages on the stream to prevent
// L7 proxies (e.g., Envoy on Azure Container Apps) from killing the stream
// due to stream_idle_timeout. This does NOT affect worker liveness detection,
// which is handled by the separate Heartbeat RPC.
var keepaliveTicker *time.Ticker
var keepaliveC <-chan time.Time

if s.listenV2StreamKeepaliveInterval > 0 {
keepaliveTicker = time.NewTicker(s.listenV2StreamKeepaliveInterval)
keepaliveC = keepaliveTicker.C
defer keepaliveTicker.Stop()
}
Comment on lines 79 to 101
type ActionType int32

const (
ActionType_START_STEP_RUN ActionType = 0
ActionType_CANCEL_STEP_RUN ActionType = 1
ActionType_START_GET_GROUP_KEY ActionType = 2
ActionType_STREAM_KEEPALIVE ActionType = 3
)

// Enum value maps for ActionType.
var (
ActionType_name = map[int32]string{
0: "START_STEP_RUN",
1: "CANCEL_STEP_RUN",
2: "START_GET_GROUP_KEY",
3: "STREAM_KEEPALIVE",
}
ActionType_value = map[string]int32{
"START_STEP_RUN": 0,
"CANCEL_STEP_RUN": 1,
"START_GET_GROUP_KEY": 2,
"STREAM_KEEPALIVE": 3,
}
Comment on lines +125 to +129
// STREAM_KEEPALIVE is a no-op message sent periodically on the ListenV2
// stream to prevent L7 proxies (e.g., Envoy) from killing the stream due
// to idle timeout. Clients should ignore this action type. Worker liveness
// is still determined by the separate Heartbeat RPC.
STREAM_KEEPALIVE = 3;
Comment on lines +400 to +405
if err := stream.Send(&contracts.AssignedAction{
ActionType: contracts.ActionType_STREAM_KEEPALIVE,
}); err != nil {
s.l.Debug().Err(err).Msgf("failed to send stream keepalive for worker %s, closing stream", request.WorkerId)
return nil
}
shaunpatterson and others added 3 commits March 19, 2026 15:26
L7 proxies like Envoy (used by Azure Container Apps, AWS ALB, etc.)
enforce per-stream idle timeouts that kill the ListenV2 gRPC stream when
no tasks are being dispatched. gRPC keepalive PING frames don't help
because they're connection-level and proxies respond to them locally
without forwarding (envoyproxy/envoy#5142).

This adds an opt-in server-side stream keepalive: when enabled, the
engine sends a periodic STREAM_KEEPALIVE message (a no-op AssignedAction)
on idle ListenV2 streams. This resets the proxy's stream idle timer with
actual DATA frames.

This does NOT reintroduce the bug fixed in PR hatchet-dev#308 (server-side heartbeats
masking dead workers through proxies) because:
- Worker liveness is still determined solely by the separate Heartbeat RPC
- The stream keepalive only prevents the proxy from killing the stream
- If a worker dies, the Heartbeat RPC stops and the engine detects it
  within 30 seconds regardless of stream state

Configuration:
  SERVER_LISTENV2_STREAM_KEEPALIVE_INTERVAL=120s

Set to 0 (default) to disable. Recommended: 120s for Azure Container Apps
(4-minute idle timeout) or any environment behind an L7 proxy with
stream_idle_timeout.

Fixes hatchet-dev#3280

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ding

1. Python SDK: Add STREAM_KEEPALIVE to ActionType enum so the SDK
   recognizes the new value instead of crashing on unknown enum.

2. Python SDK: Make convert_proto_enum_to_python return None for
   unknown enum values instead of raising KeyError. This makes the
   SDK forward-compatible with future ActionType additions.

3. Python SDK: Skip STREAM_KEEPALIVE (and any unknown) actions early
   in the action listener generator before attempting to parse payload.

4. Engine: Use subscribedWorker.sendMu mutex for keepalive sends to
   avoid racing with real AssignedAction dispatches from other goroutines.
   gRPC server streams are not safe for concurrent sends.

5. Engine: Update comment to say "periodic" instead of "idle" — the
   keepalive fires unconditionally at the interval, not only when idle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…TimeoutLock API

- Revert accidental rename of defaultMaxWorkerLockAcquisitionTime
- Use sendLock.Acquire()/Release() instead of non-existent sendMu.Lock()/Unlock()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@shaunpatterson shaunpatterson force-pushed the fix/listenV2-stream-keepalive branch from 08d312b to c9ce946 Compare March 19, 2026 19:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ListenV2 streams killed by L7 proxies with stream_idle_timeout (Azure Container Apps, Envoy)

3 participants