Skip to content

Commit cb79f02

Browse files
committed
release(v1.10.5): fix WSS supervisor wedge, debounce key-exchange storm, skip direct dial in compat mode
Three daemon-side bugs surfaced while diagnosing why service-agents appeared unresponsive from a v1.10.4 compat-mode daemon: the WSS supervisor exited permanently on the first failed redial, peers fired 4-5 "encrypted tunnel established" events per handshake because direct and relay copies of the same PILA were not coalesced, and every compat-mode dial wasted 25-78s on direct retries that physically cannot succeed. Fixes: - pkg/daemon/transport/wss: supervisor exits only on Close. A nil conn at loop top just skips drainReads and proceeds to backoff + redial. Regression test pins the failure mode. - pkg/daemon/keyexchange: same-pubkey PILA arriving within DuplicateHandshakeDebounce (250ms) of a freshly-installed Crypto skips the log line, bus event, and PostInstallHook. The asymmetric-recovery reply-on-stale path is intentionally outside the gate so the rc3 list-agents recovery scenario still works. - pkg/daemon: compat-mode dial pre-flips the peer to relay before the first SYN. directRetries=0 branch then skips the dead direct budget. Verified against beacon.pilotprotocol.network with a fresh v1.10.5 daemon: TCP/443-only, list-agents query round-trips in ~3s end-to-end.
1 parent 64244df commit cb79f02

7 files changed

Lines changed: 509 additions & 79 deletions

File tree

CHANGELOG.md

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,87 @@ project uses [Semantic Versioning](https://semver.org/).
77
Detailed per-release notes are on the
88
[GitHub Releases page](https://github.com/TeoSlayer/pilotprotocol/releases).
99

10+
## [1.10.5] - 2026-05-20
11+
12+
### Fixed
13+
- **WSS reconnect supervisor no longer wedges on a failed first redial.**
14+
v1.10.4 introduced the reconnect supervisor but contained an
15+
early-exit bug: when the supervisor cleared `t.conn` after a read
16+
failure and then the first redial attempt itself failed (network
17+
hiccup, beacon 5xx, nginx restart still in progress), the next
18+
loop iteration saw `conn == nil` at the top and `return`ed,
19+
killing the supervisor goroutine for the lifetime of the daemon.
20+
Operators saw a long stretch of `wss: reconnecting` Warn lines
21+
followed by silence; every subsequent `Send` then returned
22+
`ErrReconnecting` forever and only a daemon restart recovered.
23+
Observed in production today on a v1.10.4 daemon that ran ~10 h
24+
before the supervisor exited.
25+
26+
The supervisor now treats `t.closed.Load()` as the only exit
27+
condition. A nil `conn` at iteration start just skips
28+
`drainReads` and goes straight to backoff + redial — so a
29+
transient outage drains the backoff budget instead of killing
30+
the goroutine. Regression-pinned by
31+
`TestReconnect_SurvivesFailedRedialAttempts` (3 forced 503
32+
redial failures followed by recovery).
33+
34+
- **Duplicate key_exchange frames coalesced; no more "encrypted
35+
tunnel established" log storm.** Direct + relay copies of the
36+
same PILA frame plus peer-side retransmits caused the daemon to
37+
fire the full side-effect path (log, `tunnel.established` bus
38+
event, PostInstallHook with salvage replay + flushPending) 4–5×
39+
per peer within milliseconds. A new
40+
`DuplicateHandshakeDebounce = 250 ms` window coalesces
41+
same-X25519-pubkey frames arriving on top of a freshly-installed
42+
Crypto.
43+
44+
Crucially, the SendKeyExchangeToNode-when-stale path is NOT
45+
gated by the debounce: the asymmetric-recovery scenario (peer
46+
dropped crypto for us, retransmits PILA, our InboundDecryptStale
47+
is true) still elicits our PILA reply so the peer can re-derive
48+
the shared secret. Regression-pinned by
49+
`TestDuplicatePILACoalescedSuppressesLogAndHook`,
50+
`TestDuplicatePILAOutsideDebounceFiresHookAgain`, and
51+
`TestDuplicatePILAStillRepliesForAsymmetricRecovery`. The
52+
existing `TestAsymmetricRecoveryRepliesOnDuplicatePILAWhenStale`
53+
also still passes.
54+
55+
- **Compat-mode dial no longer burns 25–78 s on direct-retry
56+
attempts that can't succeed.** A compat-mode daemon has no
57+
public UDP socket — its data plane lives entirely on the WSS
58+
bridge to the beacon. Pre-v1.10.5, every outbound dial still
59+
ran the full 3-attempt direct phase before falling back to
60+
relay; each attempt consumed an ephemeral port and idled for
61+
the SYN-RTO budget. Under fan-out (e.g. the trusted-agents
62+
auto-handshake on first dial) this exhausted local ephemeral
63+
ports on macOS.
64+
65+
`DialConnectionContext` now pre-flips the peer's routing state
66+
to relay BEFORE sending the initial SYN when `-transport=compat`,
67+
and the existing `relayActive → directRetries=0` branch picks
68+
it up — first SYN goes out via the WSS bridge on the first try.
69+
70+
### Tests
71+
- `pkg/daemon/transport/wss/zz_wss_test.go::TestReconnect_SurvivesFailedRedialAttempts`
72+
`fakeBeacon` accepts an initial dial, kills the conn on
73+
the 2nd binary frame, then refuses the next 3 redial attempts
74+
with 503. Pre-fix: supervisor exits, test wedges out.
75+
Post-fix: supervisor exhausts the 503 streak with exponential
76+
backoff and the 4th attempt succeeds; a probe frame round-trips
77+
on the fresh conn.
78+
- `pkg/daemon/keyexchange/zz_duplicate_debounce_test.go` — three
79+
scenarios covering coalescing inside the window, re-firing
80+
after the window, and asymmetric-recovery preservation.
81+
82+
### Verified
83+
- Live smoke test against `beacon.pilotprotocol.network` from a
84+
fresh v1.10.5 daemon: `-transport compat` binds only TCP/443,
85+
WSS auth completes, trust handshake with `list-agents` lands
86+
within ~88 s of first dial, `send-message list-agents` returns
87+
2521 bytes of structured JSON in ~3 s end-to-end. No
88+
`wss: reconnecting` storm, no "tunnel established" log noise
89+
burst.
90+
1091
## [1.10.4] - 2026-05-19
1192

1293
### Fixed

pkg/daemon/daemon.go

Lines changed: 16 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2924,6 +2924,17 @@ func (d *Daemon) dialConnectionLocked(ctx context.Context, dstAddr protocol.Addr
29242924
return nil, err
29252925
}
29262926

2927+
// Compat mode: the daemon has no public UDP socket, so any direct
2928+
// SYN we send out can't reach the peer. Pre-flip the routing
2929+
// state to relay BEFORE the first SYN so it goes out via the
2930+
// beacon WSS bridge on the first try, not after a 25-78s direct
2931+
// retry budget burns. Same reasoning as the directRetries=0
2932+
// branch in the retransmit loop below — we just need it earlier
2933+
// so the initial SYN is correctly routed.
2934+
if d.config.TransportMode == "compat" && !d.tunnels.IsRelayPeer(dstAddr.Node) {
2935+
d.tunnels.SetRelayPeer(dstAddr.Node, true)
2936+
}
2937+
29272938
// Auto-initiate handshake toward known trusted agents when we have no
29282939
// local trust entry yet. Scoped to the trusted-agents list so we don't
29292940
// spray handshakes at arbitrary peers. Fires non-blocking so the
@@ -2993,9 +3004,12 @@ func (d *Daemon) dialConnectionLocked(ctx context.Context, dstAddr protocol.Addr
29933004
retries := 0
29943005
directRetries := DialDirectRetries
29953006
maxRetries := DialMaxRetries
2996-
relayActive := d.tunnels.IsRelayPeer(dstAddr.Node) // may already be relay from prior attempt
3007+
// relayActive is set when this peer is pinned to relay via either a
3008+
// prior attempt's fallback OR the compat-mode pre-flip above. Either
3009+
// way the direct retry budget would burn for nothing — skip it.
3010+
relayActive := d.tunnels.IsRelayPeer(dstAddr.Node)
29973011
if relayActive {
2998-
directRetries = 0 // skip direct phase, go straight to relay
3012+
directRetries = 0
29993013
}
30003014
rto := DialInitialRTO
30013015
timer := time.NewTimer(rto)

pkg/daemon/keyexchange/handle.go

Lines changed: 75 additions & 45 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ import (
77
"encoding/binary"
88
"log/slog"
99
"net"
10+
"time"
1011

1112
"github.com/TeoSlayer/pilotprotocol/internal/crypto"
1213
)
@@ -89,6 +90,12 @@ func (m *Manager) HandleAuthFrame(data []byte, from *net.UDPAddr, fromRelay bool
8990
oldPC := m.env.Get(peerNodeID)
9091
hadCrypto := oldPC != nil
9192
keyChanged := hadCrypto && oldPC.PeerX25519Key != pc.PeerX25519Key
93+
// duplicate := same-pubkey frame arriving inside the debounce
94+
// window of a freshly-installed Crypto. See
95+
// DuplicateHandshakeDebounce — the goal is to coalesce the direct+
96+
// relay arrival pair and peer-side retransmits without dropping
97+
// real recovery work.
98+
duplicate := hadCrypto && !keyChanged && time.Since(oldPC.CreatedAt) < DuplicateHandshakeDebounce
9299
// Only replace the installed envelope.Crypto when there is no
93100
// existing entry OR the peer's X25519 ephemeral key actually changed.
94101
// Replacing on a duplicate key_exchange (same pubkey — common under
@@ -103,31 +110,45 @@ func (m *Manager) HandleAuthFrame(data []byte, from *net.UDPAddr, fromRelay bool
103110
// Cache the verified peer pubkey.
104111
m.SetPeerPubKey(peerNodeID, peerEd25519PubKey)
105112

106-
if keyChanged {
107-
slog.Info("peer rekeyed (auth), re-establishing tunnel", "peer_node_id", peerNodeID)
113+
// Side-effect gate: the log line, tunnel.established bus event, and
114+
// PostInstallHook fire ONLY when this is not a coalesced duplicate.
115+
// A real rekey (keyChanged) or a first-time install always falls
116+
// through to the side effects. The recovery-reply path below (the
117+
// SendKeyExchangeToNode call) is NOT gated on `duplicate` — the
118+
// asymmetric-recovery case (B dropped crypto for A while A retains
119+
// it) requires A to reply on B's retransmit even though A sees it
120+
// as a duplicate. Pinned by
121+
// TestAsymmetricRecoveryRepliesOnDuplicatePILAWhenStale.
122+
if duplicate {
123+
slog.Debug("auth key exchange: duplicate frame coalesced",
124+
"peer_node_id", peerNodeID, "age", time.Since(oldPC.CreatedAt), "relay", fromRelay)
108125
} else {
109-
slog.Info("encrypted tunnel established", "auth", true,
110-
"peer_node_id", peerNodeID, "endpoint", from, "relay", fromRelay)
111-
}
112-
m.publish("tunnel.established", map[string]any{
113-
"peer_node_id": peerNodeID,
114-
"authenticated": true,
115-
"relay": fromRelay,
116-
"rekeyed": keyChanged,
117-
})
118-
119-
if m.postInstall != nil {
120-
m.postInstall(PostInstallEvent{
121-
PeerNodeID: peerNodeID,
122-
From: from,
123-
FromRelay: fromRelay,
124-
Authenticated: true,
125-
HadCrypto: hadCrypto,
126-
KeyChanged: keyChanged,
127-
OldCrypto: oldPC,
128-
NewCrypto: pc,
129-
PeerEd25519: peerEd25519PubKey,
126+
if keyChanged {
127+
slog.Info("peer rekeyed (auth), re-establishing tunnel", "peer_node_id", peerNodeID)
128+
} else {
129+
slog.Info("encrypted tunnel established", "auth", true,
130+
"peer_node_id", peerNodeID, "endpoint", from, "relay", fromRelay)
131+
}
132+
m.publish("tunnel.established", map[string]any{
133+
"peer_node_id": peerNodeID,
134+
"authenticated": true,
135+
"relay": fromRelay,
136+
"rekeyed": keyChanged,
130137
})
138+
139+
if m.postInstall != nil {
140+
m.postInstall(PostInstallEvent{
141+
PeerNodeID: peerNodeID,
142+
From: from,
143+
FromRelay: fromRelay,
144+
Authenticated: true,
145+
HadCrypto: hadCrypto,
146+
KeyChanged: keyChanged,
147+
OldCrypto: oldPC,
148+
NewCrypto: pc,
149+
PeerEd25519: peerEd25519PubKey,
150+
})
151+
}
131152
}
132153

133154
if !hadCrypto || keyChanged {
@@ -218,36 +239,45 @@ func (m *Manager) HandleUnauthFrame(data []byte, from *net.UDPAddr, fromRelay bo
218239
oldPC := m.env.Get(peerNodeID)
219240
hadCrypto := oldPC != nil
220241
keyChanged := hadCrypto && oldPC.PeerX25519Key != pc.PeerX25519Key
242+
// duplicate := same-pubkey frame arriving inside DuplicateHandshakeDebounce
243+
// of a freshly-installed Crypto. See HandleAuthFrame for the
244+
// rationale; same coalescing window applies to PILK as well.
245+
duplicate := hadCrypto && !keyChanged && time.Since(oldPC.CreatedAt) < DuplicateHandshakeDebounce
221246
// Same rationale as HandleAuthFrame: don't replace on a duplicate
222247
// key_exchange (same pubkey).
223248
if !hadCrypto || keyChanged {
224249
m.env.Install(peerNodeID, pc)
225250
}
226251

227-
if keyChanged {
228-
slog.Info("peer rekeyed, re-establishing tunnel", "peer_node_id", peerNodeID)
252+
if duplicate {
253+
slog.Debug("unauth key exchange: duplicate frame coalesced",
254+
"peer_node_id", peerNodeID, "age", time.Since(oldPC.CreatedAt), "relay", fromRelay)
229255
} else {
230-
slog.Info("encrypted tunnel established",
231-
"peer_node_id", peerNodeID, "endpoint", from, "relay", fromRelay)
232-
}
233-
m.publish("tunnel.established", map[string]any{
234-
"peer_node_id": peerNodeID,
235-
"authenticated": false,
236-
"relay": fromRelay,
237-
"rekeyed": keyChanged,
238-
})
239-
240-
if m.postInstall != nil {
241-
m.postInstall(PostInstallEvent{
242-
PeerNodeID: peerNodeID,
243-
From: from,
244-
FromRelay: fromRelay,
245-
Authenticated: false,
246-
HadCrypto: hadCrypto,
247-
KeyChanged: keyChanged,
248-
OldCrypto: oldPC,
249-
NewCrypto: pc,
256+
if keyChanged {
257+
slog.Info("peer rekeyed, re-establishing tunnel", "peer_node_id", peerNodeID)
258+
} else {
259+
slog.Info("encrypted tunnel established",
260+
"peer_node_id", peerNodeID, "endpoint", from, "relay", fromRelay)
261+
}
262+
m.publish("tunnel.established", map[string]any{
263+
"peer_node_id": peerNodeID,
264+
"authenticated": false,
265+
"relay": fromRelay,
266+
"rekeyed": keyChanged,
250267
})
268+
269+
if m.postInstall != nil {
270+
m.postInstall(PostInstallEvent{
271+
PeerNodeID: peerNodeID,
272+
From: from,
273+
FromRelay: fromRelay,
274+
Authenticated: false,
275+
HadCrypto: hadCrypto,
276+
KeyChanged: keyChanged,
277+
OldCrypto: oldPC,
278+
NewCrypto: pc,
279+
})
280+
}
251281
}
252282

253283
// Respond with our key if this is a new peer or the peer rekeyed.

pkg/daemon/keyexchange/keyexchange.go

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -93,6 +93,31 @@ const (
9393
// Prevents the "bounce" where a transient outbound packet immediately
9494
// restarts a cycle that just exhausted MaxRekeyAttempts.
9595
rekeyGaveUpCooldown = 60 * time.Second
96+
97+
// DuplicateHandshakeDebounce is the window during which a repeated
98+
// key_exchange frame from the same peer carrying the SAME X25519
99+
// ephemeral key is treated as a duplicate of an already-handled
100+
// handshake. Inside the window the frame is logged at Debug and the
101+
// observable side effects are skipped: no "encrypted tunnel
102+
// established" log line, no tunnel.established bus event, no
103+
// PostInstallHook invocation, and no reply key_exchange.
104+
//
105+
// Why this matters: peers retransmit their PILA via the
106+
// keyexchange.loop's RekeyRetransmitInterval (4 s) PLUS direct +
107+
// relay copies of the same frame can both land within ms when the
108+
// beacon relay is hot. Both legs of every duplicate used to fire
109+
// the postInstall hook (peer-endpoint update, salvage replay,
110+
// flushPending) and produce a noisy log burst that operators
111+
// mistook for a real handshake storm. Real rekey (peer's ephemeral
112+
// key actually changed) is NOT debounced — keyChanged forces the
113+
// full path so salvage + flushPending run.
114+
//
115+
// 250 ms is short enough that an actual second handshake (caused
116+
// by a real key rotation, not a duplicate frame) still progresses
117+
// promptly, and long enough to coalesce the direct+relay arrival
118+
// window plus most peer-side retransmits caused by an early ACK
119+
// being dropped.
120+
DuplicateHandshakeDebounce = 250 * time.Millisecond
96121
)
97122

98123
// PendingRekeyState tracks a key-exchange we sent and are waiting on.

0 commit comments

Comments
 (0)