fix(qwp): fix missed terminal send errors during Sender close#42
Open
jerrinot wants to merge 13 commits into
Open
fix(qwp): fix missed terminal send errors during Sender close#42jerrinot wants to merge 13 commits into
jerrinot wants to merge 13 commits into
Conversation
Fixes a race where QwpWebSocketSender.close() could return successfully even after a terminal QWP/WebSocket error was latched during close. **implementation details for reviewers** Track the exact `lastError` instance that `checkError()` has synchronously surfaced and make `QwpWebSocketSender.close()` suppress only that pre-owned instance. This avoids treating a freshly latched terminal error as already owned when it appears between separate `hasUnsurfacedError()` and `getLastError()` reads.
The latch leaked its invariants into callers: close() had to compose hasUnsurfacedError() + checkError() for the safety-net rethrow, and checkError() minted a fresh wrapper per call for raw causes, so drainOnClose() could rethrow an error the user had already caught. - recordFatal() wraps non-LineSenderException causes once, at latch time: every rethrow delivers the same instance, making close()'s identity suppression correct in every state. - checkUnsurfacedError() encapsulates the close() safety net; hasUnsurfacedError() becomes private, so the torn two-read ownership snapshot is no longer writable outside the class. - getLastTerminalServerError() derives the SenderError from the latched LineSenderServerException; the sibling field is gone.
No sender-level test asserted WHETHER close() throws: the existing close assertions (InitialConnectAsyncTest) install a handler and tolerate both outcomes, so deleting the safety net, inverting its handler gate, or always-suppressing the snapshot all survived the suite. CloseSafetyNetTest adds the strict pair, each awaiting the latched terminal deterministically before close(): - a config-string-only caller who never flushed must get the typed terminal thrown from close(); - a caller whose custom handler already received the error must get a silent close(). All three mutations above now fail the suite. The snapshot race itself stays covered by CloseOwnershipRaceTest at the accessor. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
lastError suggested most-recent-wins, but the field is a write-once first-writer-wins latch: transient failures reconnect upstream and never reach it; only the error that ends the loop is latched. The name now matches the documented invariant and the "latched terminal" language used throughout. recordFatal's javadoc also states the single-writer rule explicitly: the unsynchronized check-then-latch is only safe because every caller runs on the I/O thread. The retry-loop locals keep the lastError name — there it is accurate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pin the pre-existing close() double-signal: when an async custom error handler has already received the terminal, close_flush_timeout_millis > 0, and an unacked tail remains, close() -> drainOnClose() -> checkError() (QwpWebSocketSender.java:2657) re-throws the same terminal the handler already owns. The alreadyDeliveredToCustomHandler guard wraps only the step-2 checkUnsurfacedError() safety net, not the step-3 drain; the end-of-close self-suppression (terminalError == alreadyOwnedByUser) only covers SYNCHRONOUS ownership (getSynchronouslySurfacedError()), so the async-handler-only case is not suppressed. Contradicts the in-code comment: 'once the user has owned an error, close() should not double-signal it.' The server fixture gates the HALT rejection until after flush() returns so the error reaches the user only via the async handler (no synchronous surface), making the failure deterministic. Sibling green test CloseSafetyNetTest.testCloseStaysSilentWhenCustomHandlerAlreadyDelivered passes only because it uses close_flush_timeout_millis=0. Currently RED; turns green once the drain stops re-surfacing a terminal the custom handler already owns. Not introduced by this PR (out-of-diff).
…nal (M3) drainOnClose()'s loop called cursorSendLoop.checkError() unconditionally, which re-throws the latched terminal from close(). When an async custom error handler had already received the terminal, close_flush_timeout_millis > 0, and an unacked tail remained, close() re-surfaced an error the user already owned -- contradicting the in-code intent 'once the user has owned an error, close() should not double-signal it'. The checkError() call conflated two jobs: (1) breaking the drain loop when a terminal means acks will never reach target, and (2) surfacing the error. Job 1 is always needed; job 2 is a policy close() already gates. Fix: drainOnClose(boolean errorOwnedByCustomHandler) mirrors the step-2 safety-net gate. When the handler owns the terminal it stops silently on a latched error (getTerminalError() != null) instead of throwing. Otherwise it keeps checkError() -- which both surfaces the error for a config-string-only caller and breaks the loop; a synchronously-owned instance is still suppressed by close()'s terminalError == alreadyOwnedByUser check, so that path is not double-signalled either. Flips CloseDrainDoubleSignalTest green. Full close/drain/error/reconnect/ failover/durable-ack sweep: 162 tests, 0 failures.
bluestreak01
previously approved these changes
Jun 15, 2026
…ny error ever" (C1) close() gated its terminal safety net on hasDeliveredToCustomHandler() -- a lifetime-sticky "the custom handler saw ANY error, ever" flag. A routine DROP_AND_CONTINUE rejection (e.g. SCHEMA_MISMATCH / WRITE_ERROR) flipped that flag permanently, after which a later genuine HALT terminal was silently lost from every synchronous close() channel: the step-2 checkUnsurfacedError() was skipped and drainOnClose() took its silent branch. This regressed the base, whose unconditional drain checkError() was a loud backstop. Fix: track whether the dispatcher actually delivered THE latched terminal (the exact SenderError instance, identity-matched) to a non-default handler, and gate close() on that instead. - SenderErrorDispatcher: add markTerminal(SenderError) (write-once) + a terminal-specific deliveredTerminalToCustomHandler flag set only when the marked terminal instance is delivered to a custom handler; hasDeliveredTerminalToCustomHandler() accessor. Keep the old hasDeliveredToCustomHandler() as an ops/diagnostic signal. - CursorWebSocketSendLoop.recordFatal: mark the dispatcher's terminal under the write-once latch guard (the same SenderError dispatchError() delivers). - QwpWebSocketSender.close(): gate step-2 and drainOnClose() on hasDeliveredTerminalToCustomHandler(). Because the flag flips only on actual delivery of the terminal, all three loss paths now surface loudly: the DROP-then-HALT conflation, the setErrorHandler(null) revert (terminal reaches only the default handler), and the slow-handler case (terminal not yet delivered at close() time, so it is rethrown before the dispatcher's 100ms drain deadline abandons the queue). The legitimate single-HALT case still suppresses the double-signal (M3). Adds CloseTerminalConflationTest: fails on the old "any error ever" gate (reproduces C1 via the deterministic setErrorHandler(null) path), passes with the fix.
Contributor
[PR Coverage check]😍 pass : 42 / 46 (91.30%) file detail
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes a race where
QwpWebSocketSender.close()could return successfully even after a terminal QWP/WebSocket error was latched during close, and a related corner where close() could rethrow a wrapped terminal error the user had already caught fromflush().implementation details for reviewers
checkError()has synchronously surfaced and makeclose()suppress only that pre-owned instance, taken as the single readgetSynchronouslySurfacedError(). This avoids treating a freshly latched terminal as already owned when it appears between two separate latch reads. Guarded byCloseOwnershipRaceTest(probabilistic accessor race, near-certain catch per CI run, ~0.15s).recordFatalwraps non-LineSenderExceptioncauses once at latch time, so every rethrow delivers the same instance — close()'s identity suppression becomes correct in every state, which also fixes a pre-existing corner wheredrainOnClose()re-threw a fresh wrapper for an error the user already caught.checkUnsurfacedError()encapsulates the close() safety net andhasUnsurfacedError()becomes private, so the racy two-read snapshot is no longer expressible outside the class. The typedSenderErrorpayload is derived from the latchedLineSenderServerExceptioninstead of being tracked in a sibling field.lastErrortoterminalError(andgetLastError()togetTerminalError()): the field is a write-once, first-writer-wins latch — transient failures reconnect upstream and never reach it — so "last" was misleading.CloseSafetyNetTestpins both branches of close()'s safety net deterministically: a config-string-only caller who never flushed must see the latched terminal thrown from close(); a caller whose custom error handler already received it must get a silent close.