feat: add opt-in ListenV2 stream keepalive to prevent L7 proxy idle timeout#3281
Open
shaunpatterson wants to merge 3 commits intohatchet-dev:mainfrom
Open
feat: add opt-in ListenV2 stream keepalive to prevent L7 proxy idle timeout#3281shaunpatterson wants to merge 3 commits intohatchet-dev:mainfrom
shaunpatterson wants to merge 3 commits intohatchet-dev:mainfrom
Conversation
|
@shaunpatterson is attempting to deploy a commit to the Hatchet Team on Vercel. A member of the Team first needs to authorize it. |
|
📝 Documentation updates detected! New suggestion: Document SERVER_LISTENV2_STREAM_KEEPALIVE_INTERVAL for L7 proxy support Tip: Whenever you leave a comment tagged |
Contributor
There was a problem hiding this comment.
Pull request overview
Adds an opt-in server-side keepalive mechanism for ListenV2 gRPC streams to avoid L7 proxy stream_idle_timeout disconnects by periodically sending a no-op AssignedAction (STREAM_KEEPALIVE) when configured.
Changes:
- Adds
ListenV2StreamKeepaliveIntervalruntime config and env binding (SERVER_LISTENV2_STREAM_KEEPALIVE_INTERVAL). - Wires the new interval into the dispatcher and sends periodic
STREAM_KEEPALIVEmessages fromListenV2. - Extends dispatcher protobuf contracts with a new
ActionTypeenum value (STREAM_KEEPALIVE).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/config/server/server.go | Adds runtime config field + env var binding for ListenV2 keepalive interval. |
| internal/services/dispatcher/server.go | Implements periodic keepalive sends within the ListenV2 server-stream loop. |
| internal/services/dispatcher/dispatcher.go | Plumbs keepalive interval through dispatcher options into DispatcherImpl. |
| cmd/hatchet-engine/engine/run.go | Passes runtime keepalive interval into dispatcher construction for v0 and v1 config paths. |
| internal/services/dispatcher/contracts/dispatcher.pb.go | Manually adds new ActionType_STREAM_KEEPALIVE enum value to generated Go protobuf output. |
| api-contracts/dispatcher/dispatcher.proto | Adds STREAM_KEEPALIVE to the public dispatcher ActionType enum with “clients should ignore” guidance. |
Comment on lines
+358
to
+369
| // Optionally send periodic keepalive messages on the stream to prevent | ||
| // L7 proxies (e.g., Envoy on Azure Container Apps) from killing the stream | ||
| // due to stream_idle_timeout. This does NOT affect worker liveness detection, | ||
| // which is handled by the separate Heartbeat RPC. | ||
| var keepaliveTicker *time.Ticker | ||
| var keepaliveC <-chan time.Time | ||
|
|
||
| if s.listenV2StreamKeepaliveInterval > 0 { | ||
| keepaliveTicker = time.NewTicker(s.listenV2StreamKeepaliveInterval) | ||
| keepaliveC = keepaliveTicker.C | ||
| defer keepaliveTicker.Stop() | ||
| } |
Comment on lines
79
to
101
| type ActionType int32 | ||
|
|
||
| const ( | ||
| ActionType_START_STEP_RUN ActionType = 0 | ||
| ActionType_CANCEL_STEP_RUN ActionType = 1 | ||
| ActionType_START_GET_GROUP_KEY ActionType = 2 | ||
| ActionType_STREAM_KEEPALIVE ActionType = 3 | ||
| ) | ||
|
|
||
| // Enum value maps for ActionType. | ||
| var ( | ||
| ActionType_name = map[int32]string{ | ||
| 0: "START_STEP_RUN", | ||
| 1: "CANCEL_STEP_RUN", | ||
| 2: "START_GET_GROUP_KEY", | ||
| 3: "STREAM_KEEPALIVE", | ||
| } | ||
| ActionType_value = map[string]int32{ | ||
| "START_STEP_RUN": 0, | ||
| "CANCEL_STEP_RUN": 1, | ||
| "START_GET_GROUP_KEY": 2, | ||
| "STREAM_KEEPALIVE": 3, | ||
| } |
Comment on lines
+125
to
+129
| // STREAM_KEEPALIVE is a no-op message sent periodically on the ListenV2 | ||
| // stream to prevent L7 proxies (e.g., Envoy) from killing the stream due | ||
| // to idle timeout. Clients should ignore this action type. Worker liveness | ||
| // is still determined by the separate Heartbeat RPC. | ||
| STREAM_KEEPALIVE = 3; |
Comment on lines
+400
to
+405
| if err := stream.Send(&contracts.AssignedAction{ | ||
| ActionType: contracts.ActionType_STREAM_KEEPALIVE, | ||
| }); err != nil { | ||
| s.l.Debug().Err(err).Msgf("failed to send stream keepalive for worker %s, closing stream", request.WorkerId) | ||
| return nil | ||
| } |
L7 proxies like Envoy (used by Azure Container Apps, AWS ALB, etc.) enforce per-stream idle timeouts that kill the ListenV2 gRPC stream when no tasks are being dispatched. gRPC keepalive PING frames don't help because they're connection-level and proxies respond to them locally without forwarding (envoyproxy/envoy#5142). This adds an opt-in server-side stream keepalive: when enabled, the engine sends a periodic STREAM_KEEPALIVE message (a no-op AssignedAction) on idle ListenV2 streams. This resets the proxy's stream idle timer with actual DATA frames. This does NOT reintroduce the bug fixed in PR hatchet-dev#308 (server-side heartbeats masking dead workers through proxies) because: - Worker liveness is still determined solely by the separate Heartbeat RPC - The stream keepalive only prevents the proxy from killing the stream - If a worker dies, the Heartbeat RPC stops and the engine detects it within 30 seconds regardless of stream state Configuration: SERVER_LISTENV2_STREAM_KEEPALIVE_INTERVAL=120s Set to 0 (default) to disable. Recommended: 120s for Azure Container Apps (4-minute idle timeout) or any environment behind an L7 proxy with stream_idle_timeout. Fixes hatchet-dev#3280 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ding 1. Python SDK: Add STREAM_KEEPALIVE to ActionType enum so the SDK recognizes the new value instead of crashing on unknown enum. 2. Python SDK: Make convert_proto_enum_to_python return None for unknown enum values instead of raising KeyError. This makes the SDK forward-compatible with future ActionType additions. 3. Python SDK: Skip STREAM_KEEPALIVE (and any unknown) actions early in the action listener generator before attempting to parse payload. 4. Engine: Use subscribedWorker.sendMu mutex for keepalive sends to avoid racing with real AssignedAction dispatches from other goroutines. gRPC server streams are not safe for concurrent sends. 5. Engine: Update comment to say "periodic" instead of "idle" — the keepalive fires unconditionally at the interval, not only when idle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…TimeoutLock API - Revert accidental rename of defaultMaxWorkerLockAcquisitionTime - Use sendLock.Acquire()/Release() instead of non-existent sendMu.Lock()/Unlock() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
08d312b to
c9ce946
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an opt-in server-side stream keepalive for
ListenV2to prevent L7 proxies from killing idle gRPC streams. When enabled, the engine sends periodicSTREAM_KEEPALIVEmessages (no-opAssignedAction) on idle streams, generating DATA frames that reset the proxy's stream idle timer.Fixes #3280
Problem
L7 proxies like Envoy (Azure Container Apps, AWS ALB, etc.) enforce per-stream idle timeouts (~4 minutes on ACA). When no tasks are being dispatched, the
ListenV2stream has zero DATA frames and gets killed. gRPC keepalive PINGs don't help because they're connection-level and proxies respond to them locally (envoyproxy/envoy#5142, microsoft/azure-container-apps#113).Why this doesn't reintroduce #308
PR #308 removed server-side heartbeats from the stream because proxies were keeping dead worker connections alive — the server heartbeats flowed through the proxy, masking worker death.
This change is different:
Configuration
SERVER_LISTENV2_STREAM_KEEPALIVE_INTERVAL=120s # disabled by default (0s)Client Impact
Old SDKs receive ActionType=3 but don't match any handler — they skip it. No client changes required.
Note
The dispatcher.pb.go was manually updated. Proper protobuf regeneration should be done before merging.