-
-
Notifications
You must be signed in to change notification settings - Fork 100
feat: Raft-based High Availability using Apache Ratis #3731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
c075a0e
2bb3266
857966b
de8ef5c
a0ae3a8
b4c084d
2912b8f
361f5b6
fe1e9e5
ba84f90
9ef44df
b113234
e824b02
6d12202
325a084
cb24155
aac9ea5
5dffb5f
7f99906
3024441
e85b119
1b6fb2a
1740017
e7a27b2
25ddf81
1e75dda
e2beb89
4ce1ee1
ef173f8
1866393
4934b1f
9c1017e
3525842
7d1dc3b
240abc3
8575dab
fbc06c7
521d83d
6e4486a
50cf01b
1c54c40
345428f
aea79f7
034f5a7
ce994df
c90adb7
bd5d0a7
a1be481
b6ee699
8097ddf
ac6ff48
174c65c
db26fd7
4bc518f
77f233b
6be298f
38bc95b
7b846be
21bf1a8
f6831ef
ba8cfab
82136bb
5db9048
6c7facb
d83ed3c
f26c395
9fda056
7510939
0df6043
1876509
deb8238
b787035
ff538f2
b925160
2726ef6
fd9d41e
8e7d513
0ab541d
6020da6
4fdc68a
3f2892c
da6389c
e7e59e3
b9edf5a
b0fc656
e95fcf5
19978c3
e631764
70eb116
10abeaf
6bd076a
d9cb4d7
4572eb9
5551280
85cfc49
789ce95
5f52985
0c9ee23
f6baabd
99cd26e
3f9a051
19b64bf
7d82c3b
eb67df9
c1e846c
305c1ec
fc1d419
dcc9f47
80c1c4a
5b95c38
cd5141d
c880ad2
2fb39c8
373c366
2de9a38
2a22c49
ffdf399
e6e9511
7af04ac
0490826
0102907
bb9c3cd
e5773ad
8538876
33ddd14
7e5bc55
23bbdcf
ee0626f
e0269b1
b8b9be1
d2a9258
fe4c8d0
43e5d91
7848140
893d728
d123995
94c89e5
4c6cad4
dce74dc
6c77830
d8a7249
478168d
9897b13
52a9601
794ca80
a127d2b
b9faa6d
4a30d3f
aa911e7
4a1b93c
d8107dd
f233d50
64d1f26
b6c5c21
b17330f
ea335fc
e16a6ae
9a559df
7ac2ae3
a813671
9c863b9
c37d9c7
bc19730
41e8e82
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| name: E2E HA Tests | ||
|
|
||
| on: | ||
| workflow_dispatch: | ||
| schedule: | ||
| - cron: "0 0 * * *" # Runs daily at midnight | ||
| pull_request: | ||
| branches: | ||
| - main | ||
|
|
||
|
|
||
| jobs: | ||
| setup: | ||
| runs-on: ubuntu-latest | ||
| permissions: | ||
| contents: write | ||
| packages: write | ||
| attestations: write | ||
| id-token: write | ||
|
|
||
| steps: | ||
| - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 | ||
| - name: Ensure SHA pinned actions | ||
| uses: zgosalvez/github-actions-ensure-sha-pinned-actions@471d5ace1f08e3c4df1c4c2f7e6341aa75da434a # v5.0.3 | ||
| - name: Run pre-commit | ||
| uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0 | ||
| with: | ||
| python-version: "3.13.0" | ||
| cache: "pip" | ||
| - uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1 | ||
|
|
||
| - name: Set up JDK 21 | ||
| uses: actions/setup-java@be666c2fcd27ec809703dec50e508c2fdc7f6654 # v5.2.0 | ||
| with: | ||
| distribution: "temurin" | ||
| java-version: 21 | ||
|
|
||
| - name: Cache local Maven repository | ||
| uses: actions/cache@668228422ae6a00e4ad889ee87cd7109ec5666a7 # v5.0.4 | ||
| with: | ||
| path: ~/.m2/repository | ||
| key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }} | ||
| restore-keys: | | ||
| ${{ runner.os }}-maven- | ||
|
|
||
| - name: Set up QEMU | ||
| uses: docker/setup-qemu-action@ce360397dd3f832beb865e1373c09c0e9f86d70a # v4.0.0 | ||
|
|
||
| - name: Set up Docker Buildx | ||
| uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0 | ||
|
|
||
| - name: Build and package with Maven Docker profile | ||
| run: ./mvnw clean install -Pdocker -DskipTests --batch-mode --errors --show-version | ||
| env: | ||
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
|
|
||
| - name: Run HA Tests | ||
| run: ./mvnw verify -DskipTests -Pintegration --batch-mode --errors --fail-never --show-version -pl e2e-ha | ||
| env: | ||
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
|
|
||
| - name: Tests Reporter | ||
| uses: dorny/test-reporter@a43b3a5f7366b97d083190328d2c652e1a8b6aa2 # v3.0.0 | ||
| if: success() || failure() | ||
| with: | ||
| name: IT Tests Report | ||
| path: "**/failsafe-reports/TEST*.xml" | ||
| list-tests: "failed" | ||
| list-suites: "failed" | ||
| reporter: java-junit | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -557,6 +557,8 @@ dist | |
| # Test database files | ||
| *.lsmvecidx | ||
| *.metadata.json | ||
|
|
||
| notes.txt | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The file |
||
| /.claude/worktrees | ||
| /server/profiler | ||
| /server/chats | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,272 @@ | ||
| # HA Branch Comparison: `ha-redesign` vs `apache-ratis` | ||
|
|
||
| **Date:** 2026-04-06 (updated after second port round) | ||
| **Compared against:** `main` branch | ||
|
|
||
| Both branches rewrite ArcadeDB's High Availability stack on top of Apache Ratis. They share the same goal but differ in architecture, scope, and maturity. | ||
|
|
||
| --- | ||
|
|
||
| ## 1. Module Structure | ||
|
|
||
| | | `ha-redesign` | `apache-ratis` | | ||
| |---|---------------------------------------------------------|---| | ||
| | Location | Separate top-level module `ha-raft/` | Inside `server/` module | | ||
| | Package | `com.arcadedb.server.ha.raft` | `com.arcadedb.server.ha.ratis` | | ||
| | Server dep scope | `provided` (plugin-style) | `compile` (direct) | | ||
| | Ratis version | 3.2.2 | 3.2.1 | | ||
| | Activation | `HA_IMPLEMENTATION=raft` toggle, `ServiceLoader` plugin | Wired directly into `ArcadeDBServer` startup | | ||
| | Distribution | Shade plugin configured, ready for modular distribution | Bundled with server | | ||
|
|
||
| `ha-redesign` isolates the Raft subsystem as a publishable Maven artifact with `provided` scope on the server. `apache-ratis` embeds it directly in the server module. | ||
|
|
||
| --- | ||
|
|
||
| ## 2. Source Files | ||
|
|
||
| ### ha-redesign (14 main classes) | ||
|
|
||
| | Class | Purpose | | ||
| |-------|---------| | ||
| | `RaftReplicatedDatabase` | `DatabaseInternal` wrapper, intercepts `commit()` for Raft consensus | | ||
| | `RaftHAServer` | Ratis `RaftServer`/`RaftClient` lifecycle, peer parsing, lag monitor | | ||
| | `ArcadeStateMachine` | Ratis state machine with `SimpleStateMachineStorage`, election metrics | | ||
| | `RaftLogEntryCodec` | Encode/decode Raft log entries with LZ4 compression | | ||
| | `RaftGroupCommitter` | Batched Raft submissions via pipelined async sends | | ||
| | `RaftHAPlugin` | `ServerPlugin` for ServiceLoader-based HA discovery | | ||
| | `SnapshotHttpHandler` | HTTP handler serving database ZIP snapshots | | ||
| | `GetClusterHandler` | HTTP endpoint returning cluster status JSON | | ||
| | `SnapshotManager` | CRC32 checksum and file-diff utilities | | ||
| | `ClusterMonitor` | Replication lag tracking per replica | | ||
| | `HALog` | Structured HA logging (BASIC/DETAILED/TRACE) with cached level | | ||
| | `Quorum` | Enum: MAJORITY, ALL | | ||
| | `RaftLogEntryType` | Enum: TX_ENTRY, SCHEMA_ENTRY, INSTALL_DATABASE_ENTRY | | ||
| | `package-info.java` | Package documentation | | ||
|
|
||
| ### apache-ratis (7 main classes) | ||
|
|
||
| | Class | Purpose | | ||
| |-------|---------| | ||
| | `ArcadeDBStateMachine` | State machine with schema apply, command forwarding via `query()` | | ||
| | `RaftLogEntry` | Integrated entry format + serialization with compression | | ||
| | `RaftHAServer` | Server lifecycle, Quorum inner enum, K8s auto-join, dynamic membership | | ||
| | `RaftGroupCommitter` | Batched Raft submissions (configurable batch size) | | ||
| | `SnapshotHttpHandler` | HTTP handler serving database ZIP snapshots | | ||
| | `ClusterMonitor` | Replication lag tracking | | ||
| | `HALog` | Structured HA logging with cached level | | ||
|
|
||
| --- | ||
|
|
||
| ## 3. Shared Features (Both Branches) | ||
|
|
||
| These features exist on both sides with equivalent implementations: | ||
|
|
||
| | Feature | Notes | | ||
| |---------|-------| | ||
| | Group Committer | Batched Raft writes via pipelined `async().send()`, configurable batch size | | ||
| | ALL Quorum correctness | Success only reported after ALL watch completes (race condition fixed) | | ||
| | LZ4 WAL Compression | WAL data in log entries compressed via `CompressionFactory` | | ||
| | Snapshot Install | `notifyInstallSnapshotFromLeader()` + HTTP-based ZIP download | | ||
| | Snapshot auth | `X-ArcadeDB-Cluster-Token` header with timing-safe `MessageDigest.isEqual` | | ||
| | HALog | 3 verbosity levels (BASIC/DETAILED/TRACE) with cached level, no config read on hot path | | ||
| | Quorum Enum | MAJORITY and ALL modes, ALL enforced via Ratis Watch API | | ||
| | SimpleStateMachineStorage | Replaces hand-rolled last-applied tracking | | ||
| | Election Metrics | `electionCount`, `lastElectionTime`, exposed via cluster status | | ||
| | PBKDF2 Cluster Token | 100K-iteration PBKDF2WithHmacSHA256 derivation from cluster name + root password | | ||
| | Leader Lease | `LINEARIZABLE` reads enabled with 0.9 timeout ratio | | ||
| | Configurable election timeouts | `HA_ELECTION_TIMEOUT_MIN/MAX` for WAN cluster tuning | | ||
| | Configurable Ratis tuning | Log segment size, append buffer size, write buffer all configurable | | ||
| | NIO zip-slip protection | `Path.normalize().toAbsolutePath().startsWith()` for snapshot extraction | | ||
| | WAL deletion logging | Warning logged when stale `.wal` file deletion fails | | ||
| | Dynamic membership API | `addPeer()`, `removePeer()`, `transferLeadership()`, `stepDown()`, `leaveCluster()` with REST endpoints | | ||
| | K8s auto-join | `tryAutoJoinCluster()` on startup via Ratis AdminApi, `leaveCluster()` on K8s shutdown | | ||
| | Read consistency modes | EVENTUAL, READ_YOUR_WRITES, LINEARIZABLE with wait-for-apply notification pattern | | ||
| | 3-phase commit | Phase 1 (lock: capture WAL) -> Replication (no lock) -> Phase 2 (lock: apply locally). Leader steps down on Phase 2 failure | | ||
|
|
||
| --- | ||
|
|
||
| ## 4. Remaining Implementation Differences | ||
|
|
||
| ### 4.1 Command Forwarding | ||
|
|
||
| | | `ha-redesign` | `apache-ratis` | | ||
| |---|---|---| | ||
| | Mechanism | HTTP POST to leader via `HttpClient` | Ratis `query()` path (state machine) | | ||
| | Auth | Cluster token via HTTP header | Cluster token via HTTP header | | ||
| | Constraint validation | Delegated to leader's normal commit path | Explicit index key changes in TRANSACTION_FORWARD | | ||
|
|
||
| `apache-ratis` also has a `TRANSACTION_FORWARD` Raft log entry type that forwards writes from replicas with index key changes for constraint validation. However, this is noted as having a page visibility issue and is currently unused in favor of HTTP proxy forwarding. | ||
|
|
||
| ### 4.2 Log Entry Format | ||
|
|
||
| | | `ha-redesign` | `apache-ratis` | | ||
| |---|---|---| | ||
| | Architecture | Separate `RaftLogEntryCodec` + `RaftLogEntryType` enum | Single `RaftLogEntry` class | | ||
| | Entry types | TX_ENTRY, SCHEMA_ENTRY, INSTALL_DATABASE_ENTRY | TRANSACTION, TRANSACTION_FORWARD, COMMAND_FORWARD | | ||
| | Serialization | `DataInputStream`/`DataOutputStream` | `Binary` class (ArcadeDB native) | | ||
|
|
||
| ### 4.3 Wait-for-Apply Notification (applyNotifier) | ||
|
|
||
| `apache-ratis` replaced polling loops (`Thread.sleep(10)`) with a proper `Object` monitor. The state machine calls `raftHAServer.notifyApplied()` after each apply, waking up blocked readers. This eliminates polling overhead for READ_YOUR_WRITES consistency. | ||
|
|
||
| `ha-redesign` does not have `waitForAppliedIndex` / `waitForLocalApply` methods (different forwarding approach). Worth noting if read-after-write consistency is added. | ||
|
|
||
| ### 4.4 Ratis Configuration Defaults | ||
|
|
||
| | Setting | `ha-redesign` | `apache-ratis` | | ||
| |---------|---------------|----------------| | ||
| | Election timeout min (default) | 2000ms | 1500ms | | ||
| | Election timeout max (default) | 5000ms | 3000ms | | ||
| | Snapshot threshold (default) | 10,000 | 100,000 | | ||
|
|
||
| `ha-redesign` uses more conservative election timeouts (less likely to trigger false elections under load) and a lower snapshot threshold (more frequent log compaction). | ||
|
|
||
| --- | ||
|
|
||
| ## 5. Features Unique to Each Branch | ||
|
|
||
| ### Only in `ha-redesign` | ||
|
|
||
| | Feature | Description | | ||
| |---------|-------------| | ||
| | Modular plugin architecture | `RaftHAPlugin` via `ServiceLoader`, `HA_IMPLEMENTATION` toggle | | ||
| | `GetClusterHandler` | REST endpoint at `/api/v1/cluster` with election metrics, uptime | | ||
| | `INSTALL_DATABASE_ENTRY` | Raft log entry type for replicating `createDatabase()` | | ||
| | `SnapshotManager` utilities | CRC32 checksums and file-diff helpers for delta sync | | ||
| | `HA_RAFT_PERSIST_STORAGE` | Preserves Raft storage across restarts in tests | | ||
| | Enhanced Studio cluster UI | Topology visualization, election count, uptime | | ||
| | Comprehensive test suite | 40 test files with split-brain, chaos, read consistency, benchmarks | | ||
| | E2E chaos tests | 9 Toxiproxy-based ITs in `e2e-ha/` module | | ||
|
|
||
| ### Only in `apache-ratis` | ||
|
|
||
| | Feature | Description | | ||
| |---------|-------------| | ||
| | `TRANSACTION_FORWARD` entry type | Raft-native write forwarding with index key changes for constraint validation (currently unused due to page visibility issue) | | ||
| | Command forwarding via `query()` | Forwarded commands execute on leader's state machine (currently unused in favor of HTTP proxy) | | ||
| | BOLT + TLS support | `BOLT_SSL` config (DISABLED/OPTIONAL/REQUIRED) | | ||
|
|
||
| --- | ||
|
|
||
| ## 6. Test Coverage | ||
|
|
||
| | Category | `ha-redesign` | `apache-ratis` | | ||
| |----------|---------------|----------------| | ||
| | Unit tests | 13 classes | 3 classes | | ||
| | Integration tests | 27 classes | 3 classes | | ||
| | Test lines | ~5,800 | ~1,500 (est.) | | ||
| | Split-brain | 3-node and 5-node | None | | ||
| | Chaos/crash | Random crash, leader/replica recovery | Comprehensive IT only | | ||
| | Read consistency | Dedicated IT | None | | ||
| | Schema replication | 2 dedicated ITs | Covered in comprehensive IT | | ||
| | Snapshot resync | `RaftFullSnapshotResyncIT` | None | | ||
| | Benchmark | `RaftHAInsertBenchmark` | `HAInsertBenchmark` | | ||
| | E2E (Toxiproxy) | 9 ITs in `e2e-ha/` | Referenced | | ||
|
|
||
| --- | ||
|
|
||
| ## 7. Commit Activity | ||
|
|
||
| | | `ha-redesign` | `apache-ratis` | | ||
| |---|---|---| | ||
| | Commits ahead of main | 90 | 21 | | ||
| | Files changed | 111 (+23,988 / -380) | ~94 (+8,677 / -6,711) | | ||
|
|
||
| --- | ||
|
|
||
| ## 8. Future Consideration | ||
|
|
||
| Features from `apache-ratis` that could be added to `ha-redesign` in future iterations: | ||
|
|
||
| | Item | Effort | Reason | | ||
| |------|--------|--------| | ||
| | TRANSACTION_FORWARD | Large | More efficient follower writes (noted as having page visibility issues, currently unused on apache-ratis) | | ||
|
|
||
| --- | ||
|
|
||
| ## 9. Summary | ||
|
|
||
| After three rounds of porting, `ha-redesign` now includes all production-relevant features from `apache-ratis`: | ||
|
|
||
| - **Performance:** Group committer with batched Raft writes, LZ4 WAL compression, configurable Ratis tuning, 3-phase commit (lock released during Raft replication for concurrent write throughput) | ||
| - **Correctness:** ALL quorum race fix, snapshot-based resync for lagging replicas, NIO zip-slip protection | ||
| - **Security:** PBKDF2 cluster token derivation, timing-safe token comparison, cluster token header auth for snapshots | ||
| - **Operability:** HALog with cached verbosity levels, configurable election timeouts, WAL deletion logging, Studio cluster UI | ||
| - **Cluster Management:** Dynamic membership API (addPeer/removePeer/transferLeadership/stepDown/leaveCluster with REST endpoints), K8s auto-join discovery, multiple read consistency modes (EVENTUAL, READ_YOUR_WRITES, LINEARIZABLE) | ||
|
|
||
| The only remaining `apache-ratis`-exclusive features are an experimental write-forwarding mechanism (`TRANSACTION_FORWARD`) that is currently unused due to a page visibility issue, and BOLT with TLS support. | ||
|
|
||
| `ha-redesign` is the production-ready choice: modular architecture, 40-file test suite with chaos engineering, safe rollout via `HA_IMPLEMENTATION` toggle, and now feature-complete with all security, performance, and cluster management features from `apache-ratis`. | ||
|
|
||
|
|
||
| ## 10. Benchmark Results | ||
|
|
||
| ArcadeDB Raft HA Insert Benchmark | ||
|
|
||
| Sync: 5,000 records (batch 100/tx) | Async: 100,000 records (8 threads) | ||
|
|
||
| 1 server (no HA) - embedded | ||
| ------------------------------------------------------- | ||
| Ops: 50 operations (1 thread) | ||
| Throughput: 1,073 ops/sec | ||
| Avg: 932 us | Median: 840 us | ||
| Min: 614 us | P95: 1,417 us | ||
| P99: 2,764 us | Max: 2,764 us | ||
|
|
||
| 3 servers (Raft HA) - embedded on leader | ||
| ------------------------------------------------------- | ||
| Ops: 50 operations (1 thread) | ||
| Throughput: 67 ops/sec | ||
| Avg: 15,033 us | Median: 15,010 us | ||
| Min: 10,974 us | P95: 21,761 us | ||
| P99: 23,951 us | Max: 23,951 us | ||
|
|
||
| 5 servers (Raft HA) - embedded on leader | ||
| ------------------------------------------------------- | ||
| Ops: 50 operations (1 thread) | ||
| Throughput: 69 ops/sec | ||
| Avg: 14,539 us | Median: 14,820 us | ||
| Min: 9,411 us | P95: 20,014 us | ||
| P99: 22,106 us | Max: 22,106 us | ||
|
|
||
| 3 servers (Raft HA) - remote via follower proxy | ||
| ------------------------------------------------------- | ||
| Ops: 5,000 operations (1 thread) | ||
| Throughput: 87 ops/sec | ||
| Avg: 11,555 us | Median: 11,430 us | ||
| Min: 3,716 us | P95: 15,791 us | ||
| P99: 19,943 us | Max: 38,777 us | ||
|
|
||
| 3 servers (Raft HA) - concurrent (3 threads) | ||
| ------------------------------------------------------- | ||
| Ops: 4,998 operations (3 threads) | ||
| Throughput: 96 ops/sec | ||
| Avg: 31,321 us | Median: 30,516 us | ||
| Min: 11,922 us | P95: 42,913 us | ||
| P99: 52,618 us | Max: 144,822 us | ||
|
|
||
| 5 servers (Raft HA) - concurrent (5 threads) | ||
| ------------------------------------------------------- | ||
| Ops: 5,000 operations (5 threads) | ||
| Throughput: 117 ops/sec | ||
| Avg: 42,356 us | Median: 42,212 us | ||
| Min: 10,499 us | P95: 57,886 us | ||
| P99: 67,147 us | Max: 141,147 us | ||
|
|
||
| 1 server (no HA) - async | ||
| ------------------------------------------------------- | ||
| Ops: 100,000 records (8 async threads, commitEvery=5000) | ||
| Throughput: 423,500 inserts/sec | ||
| Elapsed: 0.2 seconds | ||
|
|
||
| 3 servers (Raft HA) - async on leader | ||
| ------------------------------------------------------- | ||
| Ops: 100,000 records (8 async threads, commitEvery=5000) | ||
| Throughput: 278,048 inserts/sec | ||
| Elapsed: 0.4 seconds | ||
|
|
||
| 5 servers (Raft HA) - async on leader | ||
| ------------------------------------------------------- | ||
| Ops: 100,000 records (8 async threads, commitEvery=5000) | ||
| Throughput: 313,998 inserts/sec | ||
| Elapsed: 0.3 seconds |
Uh oh!
There was an error while loading. Please reload this page.