fix(qwp): fixes a rare store-and-forward sender race that could silently drop a batch by jerrinot · Pull Request #41 · questdb/java-questdb-client

jerrinot · 2026-06-09T12:14:11Z

Fixes a rare store-and-forward sender race that could silently drop a batch during high-throughput ingestion when a segment rotated while the sender was catching up.

implementation details for reviewers
Avoid skipping a frame published after trySendOne observes the old segment tip but before it observes that the producer rotated to a new active segment. Re-read the current segment's published offset before advancing so the I/O loop sends any newly published tail frame instead of moving to the next segment.

Avoid skipping a frame published after trySendOne observes the old segment tip but before it observes that the producer rotated to a new active segment. Re-read the current segment's published offset before advancing so the I/O loop sends any newly published tail frame instead of moving to the next segment.

mtopolnik · 2026-06-11T09:49:54Z

[PR Coverage check]

😍 pass : 7 / 7 (100.00%)

file detail

	path	covered line	new line	coverage
🔵	io/questdb/client/cutlass/qwp/client/sf/cursor/CursorWebSocketSendLoop.java	7	7	100.00%

jerrinot · 2026-06-11T09:54:42Z

Review: PR #41 — `fix(qwp): fixes a rare store-and-forward sender race that could silently drop a batch`

Level 3 (full mission-critical pass). This is a tightly-scoped 7-line concurrency fix in a single hot-path method plus a probabilistic regression test, so I performed the structured agent analysis (correctness, concurrency, resource, test, cross-context, adversarial) inline and verified every finding against source — including an empirical bisect.

What the change does

trySendOne() previously read the segment tip once (pub = publishedOffset()), and if it looked drained and the segment was no longer the active one, it advanced to the next segment. The producer publishes the segment's final frame and then rotates the active in separate, ordered steps, so the consumer could observe the rotation (sendingSegment != activeSegment()) without having observed the final frame — and advance past it, dropping it permanently. The fix re-reads publishedOffset() after confirming the segment is sealed; if the tail frame raced in, it falls through and sends it instead of advancing.

Correctness & concurrency — verified

The fix is correct, and the memory ordering holds under the JMM:

Producer (single thread, program order): tryAppend() publishes the tail frame via MmapSegment.publishedCursor (volatile, written last — MmapSegment.java:442), in an earlier appendOrFsn call than the rotation active = spare (volatile, SegmentRing.java:383). Rotation only happens when the old active is full, so the tail publish strictly precedes it.
Consumer (I/O thread): the re-read publishedOffset() (CursorWebSocketSendLoop.java:1275) is sequenced after the activeSegment() read (:1270). When that read observes the rotation, the active volatile establishes a synchronizes-with edge, so everything before active = spare — including the tail publish — happens-before the re-read. The re-read therefore must see the tail frame. Sound.

I also confirmed the control-flow rewrite drops no behavior:

The new else { return false; } exactly preserves the old "on active, nothing to send → wait" path. The fix correctly re-reads only in the sealed case — on the active segment a missed frame is picked up on the next iteration (the segment isn't going anywhere), so no re-read is needed there. The asymmetry is intentional and right.
All paths reaching the frame-decode block (:1288+) carry a pub that is either fresh-from-top (hot path, sendOffset < pub) or the fresh re-read — never a stale value that could drive an out-of-bounds read.
No double-send: after sending the raced-in tail, sendOffset advances and the next iteration advances the segment normally.

Trim-safety is unchanged: a segment holding an unsent frame cannot be trimmed (its frames aren't acked under a well-behaved server), so the re-read never touches freed memory.

Empirical verification

I proved the regression test is valid by bisecting:

On the fix: 17 runs, 0 failures.
On the reverted (buggy) code, properly recompiled: 2 failures in 12 runs (~17%) — rotation race reordered/dropped a frame — fsn 37912 on the wire was 37952 (a dropped batch) and a single-frame drop at fsn 42431. Matches the test's "reproduces a fair fraction of runs" claim and the PR's "drop a batch" wording.
Regression suite: the full cursor send-loop + ring + segment suite (97 tests) is green with the fix.

(One note: an early "passing" run on reverted code was a false negative because surefire:test doesn't recompile — once recompiled, the test reproduces the bug. The test itself is sound.)

Test review

Strong, faithful regression test. Uses assertMemoryLeak(); drives the real I/O thread against memory-mode segments (identical trySendOne() path to disk mode, so no disk-mode gap); payload-as-fsn makes drops show up as a contiguity gap; resource teardown order is correct (loop.close() joins the I/O thread before engine.close() frees the segments — verified loop.close() also closes the RecordingClient, so no native leak). The 30s drain cap is only hit on a failing run, which is the intended design.

Findings

Critical: none.

Moderate: none.

Minor (non-blocking):

Title mood — fix(qwp): fixes a rare … uses third-person "fixes"; Conventional Commits prefers imperative "fix a rare …". Also ~80 chars, over the ~72 guideline. Cosmetic; the squash title is otherwise fine.
RecordingClient member ordering — recorded / sentCount / recordedCount aren't grouped-by-visibility / alphabetized per the QuestDB style. Test-only, trivial.

Downgraded (considered, dismissed)

"Re-read could observe a stale tip and read freed bytes after trim" — dismissed: a segment with an unsent frame can't be acked→trimmed under a correct server, and reading the publishedCursor field is heap-safe regardless. The Byzantine-server angle (ack clamped at publishedFsn, not at "sent") is pre-existing defense-in-depth, not introduced here.
"advanceSegment rescan is O(sealed²)" — dismissed: pre-existing, unchanged by this PR, and bounded in production by ack-driven trim.

Summary

Approve. Correct, minimally-scoped concurrency fix with a regression test I empirically confirmed fails on the bug and passes on the fix; no behavioral regressions across 97 tests. 0 draft findings dropped as false positives during the in-diff pass; 2 minor nits, both cosmetic. In-diff: 2 minor / out-of-diff: 0 — appropriate here, since the only consumer of the changed private method is the I/O loop, which the new test exercises end-to-end.

jovfer

LGTM.
I can't say there are no other similar issues, but this particular change makes perfect sense.

jerrinot added the bug Something isn't working label Jun 9, 2026

jerrinot added 2 commits June 11, 2026 11:42

probabilistic test

463f4ce

Merge remote-tracking branch 'origin/main' into jh_send_one_recheck

e48eed8

jovfer approved these changes Jun 11, 2026

View reviewed changes

jerrinot merged commit ac1d4ba into main Jun 11, 2026
12 checks passed

jerrinot deleted the jh_send_one_recheck branch June 11, 2026 10:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(qwp): fixes a rare store-and-forward sender race that could silently drop a batch#41

fix(qwp): fixes a rare store-and-forward sender race that could silently drop a batch#41
jerrinot merged 3 commits into
mainfrom
jh_send_one_recheck

jerrinot commented Jun 9, 2026

Uh oh!

mtopolnik commented Jun 11, 2026

Uh oh!

jerrinot commented Jun 11, 2026

Uh oh!

jovfer left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jerrinot commented Jun 9, 2026

Uh oh!

mtopolnik commented Jun 11, 2026

[PR Coverage check]

file detail

Uh oh!

jerrinot commented Jun 11, 2026

Review: PR #41 — fix(qwp): fixes a rare store-and-forward sender race that could silently drop a batch

What the change does

Correctness & concurrency — verified

Empirical verification

Test review

Findings

Downgraded (considered, dismissed)

Summary

Uh oh!

jovfer left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Review: PR #41 — `fix(qwp): fixes a rare store-and-forward sender race that could silently drop a batch`