
feat: WAL-based RocksDB replication with HTTP streaming and failover #366

Open
JackGuslerGit wants to merge 12 commits into matrix-construct:main from JackGuslerGit:failover

Conversation

@JackGuslerGit
Contributor

This relates to #35.

Summary:

  • Adds a primary/secondary replication system using RocksDB's WAL (Write-Ahead Log) streamed over HTTP
  • Secondary bootstraps from a full checkpoint on startup, then streams incremental WAL frames
  • Failover is triggered via POST /_tuwunel/replication/promote — no process restart needed
  • All replication endpoints are protected by a shared secret token
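
The PR description doesn't spell out the wire format, but the idea of streaming incremental WAL frames can be pictured as length-prefixed records carrying a sequence number plus the serialized write batch. The layout below is an illustrative assumption, not the PR's actual framing (later commits mention switching the wire format to CBOR):

```rust
// Hypothetical WAL frame layout (an assumption for illustration,
// not this PR's actual wire format):
// [u64 sequence (BE)][u32 payload length (BE)][payload bytes]

fn encode_frame(seq: u64, payload: &[u8]) -> Vec<u8> {
    let mut frame = Vec::with_capacity(12 + payload.len());
    frame.extend_from_slice(&seq.to_be_bytes());
    frame.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    frame.extend_from_slice(payload);
    frame
}

fn decode_frame(buf: &[u8]) -> Option<(u64, &[u8])> {
    if buf.len() < 12 {
        return None; // incomplete header
    }
    let seq = u64::from_be_bytes(buf[0..8].try_into().ok()?);
    let len = u32::from_be_bytes(buf[8..12].try_into().ok()?) as usize;
    let payload = buf.get(12..12 + len)?; // None if payload truncated
    Some((seq, payload))
}

fn main() {
    let frame = encode_frame(281, b"put k v");
    let (seq, payload) = decode_frame(&frame).unwrap();
    assert_eq!(seq, 281);
    assert_eq!(payload, b"put k v");
}
```

Carrying the sequence number on every frame lets the secondary detect gaps and resume from its last applied sequence after a reconnect.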

Test plan:

  • Ran two Docker containers (primary on :8008, secondary on :8009)
  • Secondary bootstrapped from primary checkpoint at seq 281 and began streaming
  • Stopped primary with docker stop (graceful SIGTERM)
  • Promoted secondary via curl — responded {"status":"promoted"}
  • All messages from before the failover were present on the promoted instance
  • Measured RPO ≈ 0 on planned failover; RTO on the order of seconds

Relevant config options added:

  • rocksdb_primary_url — URL of primary for WAL streaming
  • rocksdb_replication_token — shared secret for endpoint auth
  • rocksdb_replication_interval_ms — heartbeat interval (default 250ms)
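
As a sketch, a secondary's config using the options above might look like this (key names from this PR; values illustrative, and note that later commits reorganize/rename the endpoints and service, so final names may differ):

```toml
# Secondary-side replication settings (illustrative values)
rocksdb_primary_url = "http://core1:8008"
rocksdb_replication_token = "long-random-shared-secret"
rocksdb_replication_interval_ms = 250
```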

@JackGuslerGit JackGuslerGit marked this pull request as ready for review March 12, 2026 14:54
@JackGuslerGit JackGuslerGit marked this pull request as draft March 12, 2026 14:59
@JackGuslerGit JackGuslerGit marked this pull request as ready for review March 13, 2026 15:04
@pschichtel

This implements async replication, so some data loss is to be expected after an unexpected failover (node failure, disk failure, process crash, ...), right?

@JackGuslerGit
Contributor Author

@pschichtel yes, that is correct. Under normal write load, RPO is determined just by network RTT.

@JackGuslerGit
Contributor Author

Hey @x86pup. It seems the CI failures on this PR are all runner-side cache issues. I am seeing two issues:

1: failed to create locked file '/opt/rust/cargo/debian/x86_64-linux-gnu/git/db/rust-rocksdb-eed8465c83fc7d81/config.lock': File exists; class=Os (2); code=Locked (-14)

2: could not open '/opt/rust/cargo/debian/x86_64-linux-gnu/git/checkouts/ruma-8006605ea0e2ea25/30d063c/crates/ruma-events/src/poll/unstable_start/unstable_poll_answers_serde.rs' for writing: No such file or directory; class=Os (2)

Can you clean these up and re-run?

@jevolk
Member

jevolk commented Mar 17, 2026

> Hey @x86pup. It seems the CI failures on this PR are all runner-side cache issues. I am seeing two issues:
>
> 1: failed to create locked file '/opt/rust/cargo/debian/x86_64-linux-gnu/git/db/rust-rocksdb-eed8465c83fc7d81/config.lock': File exists; class=Os (2); code=Locked (-14)
>
> 2: could not open '/opt/rust/cargo/debian/x86_64-linux-gnu/git/checkouts/ruma-8006605ea0e2ea25/30d063c/crates/ruma-events/src/poll/unstable_start/unstable_poll_answers_serde.rs' for writing: No such file or directory; class=Os (2)
>
> Can you clean these up and re-run?

These docker flakes sometimes occur when CI is really busy, apologies! We'll be happy to rerun as necessary.

@jevolk
Member

jevolk commented Mar 17, 2026

I haven't had a chance to thoroughly review this yet since I'm currently away, but a few things stand out as suspicious.

Foremost it's not clear why WAL streaming is necessary. RocksDB already has internal mechanisms to synchronize primary and secondary; all that's missing is the promotion signalling. What is the basis for concerning ourselves with binary framing of rocksdb inner-workings at the user level? Is the rocksdb synchronization API being invoked here? Perhaps I missed it...

@JackGuslerGit
Contributor Author

@jevolk Yes, you are correct, TryCatchUpWithPrimary() handles sync internally. Correct me if I'm wrong, but it requires the secondary instance to open the primary's database directory directly. This works when both instances share the same filesystem (same machine or NFS mount).

In our case, we have a cluster of physical servers where each server has its own local disk. We can't use NFS/shared storage in our infrastructure. So core2 has no direct filesystem access to core1's RocksDB directory.

That's the gap we're trying to fill by replicating the WAL and SST files over the network, so core2 can stay in sync with core1 without shared storage. Once core2 has a local copy of the data, TryCatchUpWithPrimary() could potentially still be used if we mirror the primary's files locally, or we apply the WAL batches ourselves.

Is there a mechanism in RocksDB you'd recommend for this case, or is shared storage assumed in your deployment model?
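
The "apply the WAL batches ourselves" option can be pictured as a simple replay loop: the secondary keeps a local store plus the last applied sequence number, and applies each streamed record at most once. A std-only sketch with stand-in types (not RocksDB's actual WriteBatch or column families):

```rust
use std::collections::BTreeMap;

// Stand-in for a streamed WAL record (illustrative, not RocksDB's format).
struct WalRecord {
    seq: u64,
    key: String,
    value: String,
}

// Stand-in for the secondary's local database state.
struct Secondary {
    store: BTreeMap<String, String>,
    last_seq: u64,
}

impl Secondary {
    fn apply(&mut self, rec: &WalRecord) {
        // Idempotent replay: skip records at or below the last applied
        // sequence, so frames resent after a reconnect cannot double-apply.
        if rec.seq <= self.last_seq {
            return;
        }
        self.store.insert(rec.key.clone(), rec.value.clone());
        self.last_seq = rec.seq;
    }
}

fn main() {
    let mut sec = Secondary { store: BTreeMap::new(), last_seq: 0 };
    let frames = [
        WalRecord { seq: 1, key: "a".into(), value: "1".into() },
        WalRecord { seq: 2, key: "b".into(), value: "2".into() },
        WalRecord { seq: 2, key: "b".into(), value: "2".into() }, // resent
    ];
    for rec in &frames {
        sec.apply(rec);
    }
    assert_eq!(sec.last_seq, 2);
    assert_eq!(sec.store.len(), 2);
}
```

Persisting `last_seq` alongside the data (the PR stores replication metadata in the database itself) is what makes resuming after a restart safe.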

@jevolk
Member

jevolk commented Mar 18, 2026

> @jevolk Yes, you are correct, TryCatchUpWithPrimary() handles sync internally. Correct me if I'm wrong, but it requires the secondary instance to open the primary's database directory directly. This works when both instances share the same filesystem (same machine or NFS mount).
>
> In our case, we have a cluster of physical servers where each server has its own local disk. We can't use NFS/shared storage in our infrastructure. So core2 has no direct filesystem access to core1's RocksDB directory.
>
> That's the gap we're trying to fill by replicating the WAL and SST files over the network, so core2 can stay in sync with core1 without shared storage. Once core2 has a local copy of the data, TryCatchUpWithPrimary() could potentially still be used if we mirror the primary's files locally, or we apply the WAL batches ourselves.
>
> Is there a mechanism in RocksDB you'd recommend for this case, or is shared storage assumed in your deployment model?

Alright so this is not limited to shared filesystem mounts, that's rather exciting actually. Keep up the good work 👍

@JackGuslerGit
Contributor Author

@jevolk Thanks! I see it has passed all checks, what's the next step here?

@x86pup
Member

x86pup commented Mar 18, 2026

It needs to be thoroughly reviewed here, especially since the usage of AI is apparent. Jason is on vacation and will get to it soon. Thank you for ensuring CI passes to help this along.

@JackGuslerGit
Contributor Author

Okay sounds good, thanks for letting me know!

@jevolk jevolk self-assigned this Mar 20, 2026
@jevolk
Member

jevolk commented Apr 2, 2026

Thank you for your patience 🙏 I'm right around the corner now...

@jevolk jevolk linked an issue Apr 4, 2026 that may be closed by this pull request
jevolk pushed a commit that referenced this pull request Apr 5, 2026
Add query and stream features; enhance replication routes and logic
jevolk added a commit that referenced this pull request Apr 5, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 5, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk pushed a commit that referenced this pull request Apr 5, 2026
Add query and stream features; enhance replication routes and logic
jevolk added a commit that referenced this pull request Apr 5, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 5, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
@JackGuslerGit
Contributor Author

Hey @jevolk. I ran into an issue with the checkpoint logic while testing. The original code swapped the RocksDB database directory while RocksDB was already running, causing file corruption because open file descriptors still pointed to the old files while new writes went to the checkpoint copy. The fix moves the checkpoint download and filesystem swap to before RocksDB opens, so the database always starts fresh from a clean checkpoint with no live files being touched.
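
The corrected ordering can be sketched with plain filesystem operations: stage the downloaded checkpoint, atomically rename it into the database path, and only then open the database. Function and path names here are illustrative, not the PR's actual code:

```rust
use std::fs;
use std::path::Path;

// Illustrative bootstrap ordering (not the PR's actual functions): the swap
// happens before anything holds open file descriptors on db_path, so there
// are no live files to corrupt.
fn bootstrap_from_checkpoint(staging: &Path, db_path: &Path) -> std::io::Result<()> {
    if db_path.exists() {
        // Discard the stale local copy entirely.
        fs::remove_dir_all(db_path)?;
    }
    // rename() is atomic within one filesystem, so the database directory is
    // never observed half-written.
    fs::rename(staging, db_path)?;
    Ok(())
    // ...only now would RocksDB open db_path.
}

fn main() -> std::io::Result<()> {
    let base = std::env::temp_dir().join("wal-bootstrap-demo");
    let _ = fs::remove_dir_all(&base);
    let staging = base.join("checkpoint.staging");
    let db = base.join("db");
    fs::create_dir_all(&staging)?;
    fs::write(staging.join("CURRENT"), b"MANIFEST-000001\n")?;
    bootstrap_from_checkpoint(&staging, &db)?;
    assert!(db.join("CURRENT").exists());
    assert!(!staging.exists());
    Ok(())
}
```

Keeping the swap strictly before the open is what removes the corruption window the comment describes.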

@jevolk
Member

jevolk commented Apr 8, 2026

> Hey @jevolk. I ran into an issue with the checkpoint logic while testing. The original code swapped the RocksDB database directory while RocksDB was already running, causing file corruption because open file descriptors still pointed to the old files while new writes went to the checkpoint copy. The fix moves the checkpoint download and filesystem swap to before RocksDB opens, so the database always starts fresh from a clean checkpoint with no live files being touched.

Thanks for finding this. I made an attempt at merging this but ran out of time before the 1.6 release, with a few loose ends remaining. The main open issue is switching to CBOR for the wire format, which makes more sense for several reasons.

@jevolk
Member

jevolk commented Apr 15, 2026

I'll be revisiting this again at the top of the 1.6.1 dev cycle (start of next week). I only have a small number of re-organizations left, plus applying CBOR (which is hugely simplifying), so this should go in pretty early on. Thank you again for your patience 🙏🏻

jevolk pushed a commit that referenced this pull request Apr 19, 2026
Add query and stream features; enhance replication routes and logic
jevolk added a commit that referenced this pull request Apr 19, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 19, 2026
Logically-agnostic refactor for patterns and conventions.

Fix additional lints.

Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 19, 2026
…ot. (#366)

Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 19, 2026
Use strong Url type.

Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 19, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 19, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 19, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 19, 2026
Reduce additional log+err repeated message patterns.

Compose with Url rather than format strings.

Additional renames; tracing instruments.

Reduce interval/heartbeat frequency.

Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 19, 2026
Reduce additional log+err repeated message patterns.

Compose with Url rather than format strings.

Additional renames; tracing instruments.

Reduce interval/heartbeat frequency.

Bump tar RUSTSEC-2026-0067.

Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 19, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk pushed a commit that referenced this pull request Apr 20, 2026
Add query and stream features; enhance replication routes and logic
jevolk added a commit that referenced this pull request Apr 20, 2026
Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 20, 2026
Logically-agnostic refactor for patterns and conventions.

Fix additional lints.

Signed-off-by: Jason Volk <jason@zemos.net>
jevolk added a commit that referenced this pull request Apr 20, 2026
Split WAL related functions; shuffle/reorganize out of database modroot.

Tuck maybe_bootstrap_checkpoint() back into replication service.

Use strong Url type.

Rename endpoints and service to cluster.

Split and rename run_stream and wal endpoint to sync.

Move backoff constants to config items.

Use 'global' column instead of 'replication_meta' cf.

Reduce additional log+err repeated message patterns.

Compose with Url rather than format strings.

Additional renames; tracing instruments.

Reduce interval/heartbeat frequency.

Bump tar RUSTSEC-2026-0067.

Signed-off-by: Jason Volk <jason@zemos.net>
@jevolk
Member

jevolk commented Apr 21, 2026

Hi Jack, I took a second swipe at this and unfortunately I just haven't gotten it to where it needs to be. I should have taken notes to provide some details since there's way too much nuance to summarize here.

The tldr is that I have to revisit this after some higher priority tasks- either next week for 1.6.1 or on the backside of that release. Overall I think this feature has promise and we're very close now.

@JackGuslerGit
Contributor Author

Hey, no worries! Take your time. Let me know if there's anything I can do to help.

@JackGuslerGit
Contributor Author

Hey @jevolk, do you think this should also handle replicating media files?

@jevolk
Member

jevolk commented Apr 25, 2026

> Hey @jevolk, do you think this should also handle replicating media files?

If we were to put media in RocksDB (and I have before) we would very likely be disabling the WAL for those columns and write-ops. We would need a different mode of transport. Now that we have S3 storage provider support, we have more possibilities for media backup. In fact I'm considering the ability to backup the database itself over an S3 connection- though that would be for "colder" storage and wouldn't replace this feature for "hot" failover of course 😅



Development

Successfully merging this pull request may close these issues.

Hot-failover with load-balanced spare