164 commits
c075a0e
feat(ha-raft): add Maven module skeleton with Ratis dependencies
robfrank Feb 15, 2026
2bb3266
feat(ha-raft): add HA_IMPLEMENTATION, HA_RAFT_PORT, HA_REPLICATION_LA…
robfrank Feb 15, 2026
857966b
feat(ha-raft): add RaftLogEntryCodec for TX and SCHEMA entry serializ…
robfrank Feb 15, 2026
de8ef5c
feat(ha-raft): add ArcadeStateMachine with WAL and schema entry appli…
robfrank Feb 15, 2026
a0ae3a8
feat(ha-raft): add ClusterMonitor for replication lag tracking
robfrank Feb 15, 2026
b4c084d
feat(ha-raft): add RaftHAServer wrapping Ratis RaftServer with peer l…
robfrank Feb 15, 2026
2912b8f
feat(ha-raft): add SnapshotManager with checksum-based incremental sync
robfrank Feb 15, 2026
361f5b6
feat(ha-raft): add RaftReplicatedDatabase wrapping LocalDatabase with…
robfrank Feb 15, 2026
fe1e9e5
feat(ha-raft): add RaftHAPlugin with ServiceLoader registration and c…
robfrank Feb 15, 2026
ba84f90
docs: add HA Raft redesign design doc and implementation plan
robfrank Feb 15, 2026
9ef44df
feat(ha-raft): wire Raft HA plugin into server startup with database …
robfrank Feb 15, 2026
b113234
feat(ha-raft): wire end-to-end Raft pipeline, 2-node integration test…
robfrank Feb 15, 2026
e824b02
feat(ha-raft): add 3-node majority quorum integration test
robfrank Feb 15, 2026
6d12202
test(ha-raft): add failure scenario, schema replication, and quorum l…
robfrank Feb 16, 2026
325a084
add notex to ignore
robfrank Feb 16, 2026
cb24155
wip
robfrank Feb 20, 2026
aac9ea5
feat(ha-raft): make Raft gRPC port configurable via HA_RAFT_PORT
robfrank Feb 20, 2026
5dffb5f
fixed pom
robfrank Feb 20, 2026
7f99906
wip
robfrank Feb 23, 2026
3024441
feat(ha-raft): replica-to-leader auth forwarding with cluster token
robfrank Feb 24, 2026
e85b119
test(ha-raft): add ratis-test dependency for MiniRaftCluster support
robfrank Feb 24, 2026
1b6fb2a
feat(ha-raft): add HA_RAFT_PERSIST_STORAGE and HA_RAFT_SNAPSHOT_THRES…
robfrank Feb 24, 2026
1740017
test(ha-raft): add persistentRaftStorage() + restartServer() to BaseR…
robfrank Feb 24, 2026
e7a27b2
feat(ha-raft): replica crash-and-recover test with Raft log replay
robfrank Feb 25, 2026
25ddf81
test(ha-raft): add RaftLeaderCrashAndRecoverIT with leader rejoin and…
robfrank Feb 25, 2026
1e75dda
feat(ha-raft): add takeSnapshot() for log compaction + RaftFullSnapsh…
robfrank Feb 25, 2026
e2beb89
fix(ha-raft): code quality fixes from review
robfrank Feb 25, 2026
4ce1ee1
test(ha-raft): add BaseMiniRaftTest using MiniRaftClusterWithGrpc
robfrank Feb 25, 2026
ef173f8
test(ha-raft): implement split-brain tests via MiniRaftClusterWithGrpc
robfrank Feb 25, 2026
1866393
test params increased
robfrank Feb 25, 2026
4934b1f
wip
robfrank Feb 28, 2026
9c1017e
version fix
robfrank Mar 10, 2026
3525842
feat(ha-raft): implement leader command forwarding via HTTP and fix p…
robfrank Mar 24, 2026
7d1dc3b
feat(ha-raft): improve Raft log readability with clean cluster event …
robfrank Mar 25, 2026
240abc3
test parameters
robfrank Mar 25, 2026
8575dab
feat(e2e-ha): add TestContainers + Toxiproxy e2e tests for Raft HA
robfrank Mar 25, 2026
fbc06c7
fix(e2e-ha): use Docker network disconnect for partition tests
robfrank Mar 25, 2026
521d83d
fix(e2e-ha): use DNS-valid container hostnames and improve test resil…
robfrank Mar 26, 2026
6e4486a
add workflow for e2e-ha tests
robfrank Mar 26, 2026
50cf01b
fix sha
robfrank Mar 26, 2026
1c54c40
remove duplicated server dependency
robfrank Mar 26, 2026
345428f
try to have the wf visible on pr
robfrank Mar 26, 2026
aea79f7
add logback conf to ha-tests
robfrank Mar 26, 2026
034f5a7
fix(ha-raft): suppress Ratis DEBUG noise and fix missing root passwor…
robfrank Mar 26, 2026
ce994df
fix npe
robfrank Mar 26, 2026
c90adb7
fix npe
robfrank Mar 26, 2026
bd5d0a7
fix tentative
robfrank Mar 26, 2026
a1be481
fix tentative
robfrank Mar 27, 2026
b6ee699
fix(ha-raft): refresh RaftClient gRPC channels on leader change to re…
robfrank Mar 28, 2026
8097ddf
fix(e2e-ha): wait for Raft leader election before writing during roll…
robfrank Mar 28, 2026
ac6ff48
fix(e2e-ha): retry assertThatUserCountIs to tolerate Raft replication…
robfrank Apr 1, 2026
174c65c
rebasewd
robfrank Apr 3, 2026
db26fd7
build the image in workflow
robfrank Apr 3, 2026
4bc518f
fix(e2e-ha): fix flaky ConditionTimeoutException in NetworkPartitionI…
robfrank Apr 3, 2026
77f233b
docs: add HA-raft test porting design spec
robfrank Apr 4, 2026
6be298f
docs: add ha-raft test porting implementation plan
robfrank Apr 4, 2026
38bc95b
test(ha-raft): port HTTP layer IT tests from server/ha
robfrank Apr 4, 2026
7b846be
fix(ha-raft-tests): fix AssertJ withFailMessage ordering and add cros…
robfrank Apr 4, 2026
21bf1a8
fix(ha-raft-tests): fix thread exit on null, port literal, replicatio…
robfrank Apr 4, 2026
f6831ef
test(ha-raft): port write forwarding and schema/view IT tests from se…
robfrank Apr 4, 2026
ba8cfab
fix(ha-raft): throw ServerIsNotTheLeaderException for schema changes …
robfrank Apr 4, 2026
82136bb
fix(ha-raft-tests): replace ArrayIndexOutOfBoundsException assertion …
robfrank Apr 4, 2026
5db9048
test(ha-raft): port index compaction and operations IT tests from ser…
robfrank Apr 4, 2026
6c7facb
test(ha-raft): re-enable index tests that don't require compaction re…
robfrank Apr 5, 2026
d83ed3c
refactor(ha-raft-tests): promote findLeaderIndex to BaseRaftHATest, i…
robfrank Apr 5, 2026
f26c395
test(ha-raft): port database utilities, config validation, and chaos …
robfrank Apr 5, 2026
9fda056
fix(ha-raft-tests): address code quality issues in Task 4 tests and p…
robfrank Apr 5, 2026
7510939
fix(ha-raft): restore arcadedb-integration test dependency for backup…
robfrank Apr 5, 2026
0df6043
style(ha-raft): replace em dashes with regular dashes in Task 4 files
robfrank Apr 5, 2026
1876509
test(ha-raft): port HAInsertBenchmark from apache-ratis branch as Raf…
robfrank Apr 5, 2026
deb8238
fix(benchmark): add 1s port-release delay between benchmark scenarios
robfrank Apr 5, 2026
b787035
fix(benchmark): move to port range 3480/3434 to avoid Oracle port con…
robfrank Apr 5, 2026
ff538f2
fix(test): write only via leader in RaftServerDatabaseSqlScriptIT
robfrank Apr 5, 2026
b925160
docs: add design spec for porting apache-ratis improvements to ha-red…
robfrank Apr 5, 2026
2726ef6
docs: add implementation plan for porting apache-ratis improvements
robfrank Apr 5, 2026
fd9d41e
feat(ha-raft): add HALog verbose logging utility with configurable le…
robfrank Apr 5, 2026
8e7d513
feat(ha-raft): add LZ4 compression for WAL data in Raft log entries
robfrank Apr 5, 2026
0ab541d
feat(ha-raft): add Quorum enum replacing string-based HA_QUORUM config
robfrank Apr 5, 2026
6020da6
perf(ha-raft): capture group-commit baseline benchmark numbers
robfrank Apr 5, 2026
4fdc68a
feat(ha-raft): upgrade ArcadeStateMachine to SimpleStateMachineStorag…
robfrank Apr 5, 2026
3f2892c
feat(ha-raft): add RaftGroupCommitter for batched Raft submissions
robfrank Apr 5, 2026
da6389c
fix(ha-raft): update ArcadeStateMachineTest for BaseStateMachine INIT…
robfrank Apr 6, 2026
e7e59e3
feat(studio): port enhanced cluster monitor UI from apache-ratis branch
robfrank Apr 6, 2026
b9edf5a
feat(ha-raft): wire snapshot install to Ratis + migrate debug logging…
robfrank Apr 6, 2026
b0fc656
test(ha-raft): add RaftReadConsistencyIT for linearizable read verifi…
robfrank Apr 6, 2026
e95fcf5
docs: add ha-redesign vs apache-ratis branch comparison report
robfrank Apr 6, 2026
19978c3
docs: update ha-redesign vs apache-ratis comparison with latest changes
robfrank Apr 6, 2026
e631764
fix(ha-raft): fix ALL quorum race in group committer, add HALog level…
robfrank Apr 6, 2026
70eb116
fix(ha-raft): add cluster token auth for snapshots, NIO zip-slip, WAL…
robfrank Apr 6, 2026
10abeaf
fix(ha-raft): PBKDF2 cluster token, configurable election timeouts an…
robfrank Apr 6, 2026
6bd076a
docs: update branch comparison after porting security and config fixes
robfrank Apr 6, 2026
d9cb4d7
fix(ha-raft): guard against null raftHAServer during server restart
robfrank Apr 6, 2026
4572eb9
fix(ha-raft): propagate ArcadeDB exceptions from Raft commit instead …
robfrank Apr 6, 2026
5551280
- docs: Ratis version bump in comparison doc U
robfrank Apr 6, 2026
85cfc49
fix(ha-raft): fix stale raftHAServer reference after server restart
robfrank Apr 6, 2026
789ce95
fix(ha-raft): wait for Raft replication before database comparison in…
robfrank Apr 6, 2026
5f52985
fix(ha-raft): use 3 nodes in snapshot resync test so writes succeed w…
robfrank Apr 6, 2026
0c9ee23
docs(ha-raft): update @Disabled message for lsmVectorReplication test
robfrank Apr 6, 2026
f6baabd
fix(ha-raft): prevent Raft log purging in crash test that causes infi…
robfrank Apr 6, 2026
99cd26e
fix(ha-raft): force leadership transfer after crash restart to fix st…
robfrank Apr 6, 2026
3f9a051
fix(ha-raft): disable persistent Raft storage in crash test to avoid …
robfrank Apr 6, 2026
19b64bf
test diaabled
robfrank Apr 6, 2026
7d82c3b
refactor(server): relocate ServerSocketFactory to com.arcadedb.server…
robfrank Apr 6, 2026
eb67df9
feat(ha): remove legacy HA implementation
robfrank Apr 6, 2026
c1e846c
chore(config): remove legacy-only HA configuration entries
robfrank Apr 7, 2026
305c1ec
feat(ha-raft): add dynamic membership, K8s auto-join, and read consis…
robfrank Apr 7, 2026
fc1d419
fix(ha-raft): bounded Raft client retry and robust membership/leaders…
robfrank Apr 7, 2026
dcc9f47
fix(ha-raft): fix verify database checksum comparison for cluster con…
robfrank Apr 7, 2026
80c1c4a
fix(ha): forward server write commands to leader and make client clus…
robfrank Apr 7, 2026
5b95c38
fix(e2e-ha): force leadership transfer after partition heals to refre…
robfrank Apr 7, 2026
cd5141d
fix(e2e-ha): restore Docker network aliases on reconnect after partition
robfrank Apr 7, 2026
c880ad2
fix(e2e-ha): replace unsupported 'none' quorum with 'majority' in Raf…
robfrank Apr 7, 2026
2fb39c8
fix(network): remove redundant Authenticator from HttpClient to fix 4…
robfrank Apr 7, 2026
373c366
fix(e2e-ha): increase PacketLossIT convergence timeouts to 180s
robfrank Apr 7, 2026
2de9a38
fix(server): fix RaftHAPlugin auto-discovery when not in server.plugi…
robfrank Apr 7, 2026
2a22c49
fix(server): include X-ArcadeDB-Forwarded-User header when forwarding…
robfrank Apr 7, 2026
ffdf399
fix(server): forward Basic auth as-is instead of using cluster token …
robfrank Apr 8, 2026
e6e9511
docs: add 3-phase commit port design spec
robfrank Apr 8, 2026
7af04ac
docs: add 3-phase commit port implementation plan
robfrank Apr 8, 2026
0490826
refactor(ha-raft): add ReplicationPayload record for 3-phase commit
robfrank Apr 8, 2026
0102907
docs: update branch comparison after 3-phase commit port
robfrank Apr 8, 2026
bb9c3cd
fix(ha-raft): persist deferred schema save for read-only leader trans…
robfrank Apr 8, 2026
e5773ad
fix(ha-raft): fix schema save under lock and exception propagation in…
robfrank Apr 8, 2026
8538876
add benchmark results
robfrank Apr 8, 2026
33ddd14
feat(ha-raft): port logging improvements from apache-ratis branch
robfrank Apr 9, 2026
7e5bc55
fix(e2e-ha): fix schema comparison and test timeouts in HA e2e tests
robfrank Apr 9, 2026
23bbdcf
fix(e2e-ha): restart isolated nodes after partition to fix gRPC chann…
robfrank Apr 9, 2026
ee0626f
feat(ha-raft): save benchmark results to target/reports/RaftHAInsertB…
robfrank Apr 9, 2026
e0269b1
fix(e2e-ha): remove compareAllDatabases from HA test tearDown
robfrank Apr 9, 2026
b8b9be1
feat(ha-raft): add DROP_DATABASE_ENTRY and SECURITY_USERS_ENTRY log e…
robfrank Apr 10, 2026
d2a9258
refactor(ha-raft): extend DecodedEntry with usersJson and forceSnapsh…
robfrank Apr 10, 2026
fe4c8d0
feat(ha-raft): encode/decode DROP_DATABASE_ENTRY
robfrank Apr 10, 2026
43e5d91
feat(ha-raft): add forceSnapshot flag to INSTALL_DATABASE_ENTRY codec
robfrank Apr 10, 2026
7848140
feat(ha-raft): encode/decode SECURITY_USERS_ENTRY
robfrank Apr 10, 2026
893d728
feat(server): add HAReplicatedDatabase.dropInReplicas() interface method
robfrank Apr 10, 2026
d123995
feat(ha-raft): implement dropInReplicas and createInReplicas(forceSna…
robfrank Apr 10, 2026
94c89e5
feat(ha-raft): apply DROP_DATABASE_ENTRY in state machine
robfrank Apr 10, 2026
4c6cad4
feat(server): route drop database through Raft when HA is enabled
robfrank Apr 10, 2026
dce74dc
test(ha-raft): integration test for drop database propagation across …
robfrank Apr 10, 2026
6c77830
feat(ha-raft): honour forceSnapshot flag in applyInstallDatabaseEntry
robfrank Apr 10, 2026
d8a7249
feat(server): replicate restored database to replicas via forceSnapsh…
robfrank Apr 10, 2026
478168d
fix(ha-raft): fix snapshot install for drop+restore and concurrent re…
robfrank Apr 10, 2026
9897b13
test(ha-raft): integration test for restore database propagation
robfrank Apr 10, 2026
52a9601
feat(server): create imported database via Raft before running importer
robfrank Apr 10, 2026
794ca80
fix(engine): unwrap database in ImportDatabaseStatement for HA replic…
robfrank Apr 10, 2026
a127d2b
test(ha-raft): integration test for import database propagation
robfrank Apr 10, 2026
b9faa6d
feat(security): expose JSON payload helpers for HA user replication
robfrank Apr 10, 2026
4a30d3f
feat(server): add HAServerPlugin.replicateSecurityUsers default method
robfrank Apr 10, 2026
aa911e7
feat(ha-raft): implement RaftHAPlugin.replicateSecurityUsers
robfrank Apr 10, 2026
4a1b93c
feat(ha-raft): apply SECURITY_USERS_ENTRY in state machine
robfrank Apr 10, 2026
d8107dd
feat(server): route create user through Raft when HA is enabled
robfrank Apr 10, 2026
f233d50
feat(server): route drop user through Raft when HA is enabled
robfrank Apr 10, 2026
64d1f26
test(ha-raft): integration test for user replication across 3 nodes
robfrank Apr 11, 2026
b6c5c21
fix(security): drop synchronized on ServerSecurity users hooks to unb…
robfrank Apr 11, 2026
b17330f
feat(ha-raft): seed new peer with current users after peer-add
robfrank Apr 11, 2026
ea335fc
test(ha-raft): integration test for peer-add user seed
robfrank Apr 11, 2026
e16a6ae
test(e2e-ha): container scenario for drop database propagation
robfrank Apr 11, 2026
9a559df
test(e2e-ha): container scenario for restore database propagation
robfrank Apr 11, 2026
7ac2ae3
test(e2e-ha): container scenario for import database propagation
robfrank Apr 11, 2026
a813671
test(e2e-ha): container scenario for user management propagation
robfrank Apr 11, 2026
9c863b9
test(e2e-ha): container smoke test for peer-add seed endpoint wiring
robfrank Apr 11, 2026
c37d9c7
feat(ha): port production-resilience features from apache-ratis branch
robfrank Apr 12, 2026
bc19730
fix(server): share single LeaderProxy per HttpServer instead of per h…
robfrank Apr 12, 2026
41e8e82
fix(ha-raft): disable HealthMonitor in tests to prevent thread exhaus…
robfrank Apr 12, 2026
70 changes: 70 additions & 0 deletions .github/workflows/e2e-ha.yml
@@ -0,0 +1,70 @@
name: E2E HA Tests

on:
  workflow_dispatch:
  schedule:
    - cron: "0 0 * * *" # Runs daily at midnight
  pull_request:
    branches:
      - main


jobs:
  setup:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      packages: write
      attestations: write
      id-token: write

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
      - name: Ensure SHA pinned actions
        uses: zgosalvez/github-actions-ensure-sha-pinned-actions@471d5ace1f08e3c4df1c4c2f7e6341aa75da434a # v5.0.3
      - name: Run pre-commit
        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
        with:
          python-version: "3.13.0"
          cache: "pip"
      - uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1

      - name: Set up JDK 21
        uses: actions/setup-java@be666c2fcd27ec809703dec50e508c2fdc7f6654 # v5.2.0
        with:
          distribution: "temurin"
          java-version: 21

      - name: Cache local Maven repository
        uses: actions/cache@668228422ae6a00e4ad889ee87cd7109ec5666a7 # v5.0.4
        with:
          path: ~/.m2/repository
          key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
          restore-keys: |
            ${{ runner.os }}-maven-

      - name: Set up QEMU
        uses: docker/setup-qemu-action@ce360397dd3f832beb865e1373c09c0e9f86d70a # v4.0.0

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0

      - name: Build and package with Maven Docker profile
        run: ./mvnw clean install -Pdocker -DskipTests --batch-mode --errors --show-version
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Run HA Tests
        run: ./mvnw verify -DskipTests -Pintegration --batch-mode --errors --fail-never --show-version -pl e2e-ha
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Tests Reporter
        uses: dorny/test-reporter@a43b3a5f7366b97d083190328d2c652e1a8b6aa2 # v3.0.0
        if: success() || failure()
        with:
          name: IT Tests Report
          path: "**/failsafe-reports/TEST*.xml"
          list-tests: "failed"
          list-suites: "failed"
          reporter: java-junit
11 changes: 9 additions & 2 deletions .github/workflows/mvn-test.yml
@@ -40,7 +40,14 @@ jobs:
         with:
           distribution: "temurin"
           java-version: 21
-          cache: "maven"

+      - name: Cache local Maven repository
+        uses: actions/cache@668228422ae6a00e4ad889ee87cd7109ec5666a7 # v5.0.4
+        with:
+          path: ~/.m2/repository
+          key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
+          restore-keys: |
+            ${{ runner.os }}-maven-
+
       - name: Set up QEMU
         uses: docker/setup-qemu-action@ce360397dd3f832beb865e1373c09c0e9f86d70a # v4.0.0
@@ -223,7 +230,7 @@ jobs:
           key: maven-repo-${{ github.run_id }}-${{ github.run_attempt }}

       - name: Run Integration Tests with Coverage
-        run: ./mvnw verify -DskipTests -Pintegration -Pcoverage --batch-mode --errors --fail-never --show-version -pl !e2e,!load-tests
+        run: ./mvnw verify -DskipTests -Pintegration -Pcoverage --batch-mode --errors --fail-never --show-version -pl !e2e,!load-tests,!e2e-ha
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

2 changes: 2 additions & 0 deletions .gitignore
@@ -557,6 +557,8 @@ dist
# Test database files
*.lsmvecidx
*.metadata.json

notes.txt

medium

The file notes.txt appears to be a personal note file. It's generally better to add such user-specific files to your global .gitignore file (e.g., ~/.config/git/ignore) or the repository's local exclude file (.git/info/exclude) rather than the project's shared .gitignore. This helps keep the project's ignore list clean and focused on project-specific generated files and artifacts.

/.claude/worktrees
/server/profiler
/server/chats
@@ -23,7 +23,7 @@
 import com.arcadedb.log.LogManager;
 import com.arcadedb.server.ArcadeDBServer;
 import com.arcadedb.server.ServerException;
-import com.arcadedb.server.ha.network.ServerSocketFactory;
+import com.arcadedb.server.network.ServerSocketFactory;

 import java.io.IOException;
 import java.io.InputStream;
@@ -22,7 +22,7 @@
 import com.arcadedb.GlobalConfiguration;
 import com.arcadedb.server.ArcadeDBServer;
 import com.arcadedb.server.ServerPlugin;
-import com.arcadedb.server.ha.network.DefaultServerSocketFactory;
+import com.arcadedb.server.network.DefaultServerSocketFactory;

 /**
  * Server plugin that enables Neo4j BOLT protocol support.
272 changes: 272 additions & 0 deletions docs/ha-branch-comparison.md
@@ -0,0 +1,272 @@
# HA Branch Comparison: `ha-redesign` vs `apache-ratis`

**Date:** 2026-04-06 (updated after second port round)
**Compared against:** `main` branch

Both branches rewrite ArcadeDB's High Availability stack on top of Apache Ratis. They share the same goal but differ in architecture, scope, and maturity.

---

## 1. Module Structure

| | `ha-redesign` | `apache-ratis` |
|---|---------------------------------------------------------|---|
| Location | Separate top-level module `ha-raft/` | Inside `server/` module |
| Package | `com.arcadedb.server.ha.raft` | `com.arcadedb.server.ha.ratis` |
| Server dep scope | `provided` (plugin-style) | `compile` (direct) |
| Ratis version | 3.2.2 | 3.2.1 |
| Activation | `HA_IMPLEMENTATION=raft` toggle, `ServiceLoader` plugin | Wired directly into `ArcadeDBServer` startup |
| Distribution | Shade plugin configured, ready for modular distribution | Bundled with server |

`ha-redesign` isolates the Raft subsystem as a publishable Maven artifact with `provided` scope on the server. `apache-ratis` embeds it directly in the server module.

---

## 2. Source Files

### ha-redesign (14 main classes)

| Class | Purpose |
|-------|---------|
| `RaftReplicatedDatabase` | `DatabaseInternal` wrapper, intercepts `commit()` for Raft consensus |
| `RaftHAServer` | Ratis `RaftServer`/`RaftClient` lifecycle, peer parsing, lag monitor |
| `ArcadeStateMachine` | Ratis state machine with `SimpleStateMachineStorage`, election metrics |
| `RaftLogEntryCodec` | Encode/decode Raft log entries with LZ4 compression |
| `RaftGroupCommitter` | Batched Raft submissions via pipelined async sends |
| `RaftHAPlugin` | `ServerPlugin` for ServiceLoader-based HA discovery |
| `SnapshotHttpHandler` | HTTP handler serving database ZIP snapshots |
| `GetClusterHandler` | HTTP endpoint returning cluster status JSON |
| `SnapshotManager` | CRC32 checksum and file-diff utilities |
| `ClusterMonitor` | Replication lag tracking per replica |
| `HALog` | Structured HA logging (BASIC/DETAILED/TRACE) with cached level |
| `Quorum` | Enum: MAJORITY, ALL |
| `RaftLogEntryType` | Enum: TX_ENTRY, SCHEMA_ENTRY, INSTALL_DATABASE_ENTRY |
| `package-info.java` | Package documentation |

### apache-ratis (7 main classes)

| Class | Purpose |
|-------|---------|
| `ArcadeDBStateMachine` | State machine with schema apply, command forwarding via `query()` |
| `RaftLogEntry` | Integrated entry format + serialization with compression |
| `RaftHAServer` | Server lifecycle, Quorum inner enum, K8s auto-join, dynamic membership |
| `RaftGroupCommitter` | Batched Raft submissions (configurable batch size) |
| `SnapshotHttpHandler` | HTTP handler serving database ZIP snapshots |
| `ClusterMonitor` | Replication lag tracking |
| `HALog` | Structured HA logging with cached level |

---

## 3. Shared Features (Both Branches)

These features exist on both sides with equivalent implementations:

| Feature | Notes |
|---------|-------|
| Group Committer | Batched Raft writes via pipelined `async().send()`, configurable batch size |
| ALL Quorum correctness | Success only reported after ALL watch completes (race condition fixed) |
| LZ4 WAL Compression | WAL data in log entries compressed via `CompressionFactory` |
| Snapshot Install | `notifyInstallSnapshotFromLeader()` + HTTP-based ZIP download |
| Snapshot auth | `X-ArcadeDB-Cluster-Token` header with timing-safe `MessageDigest.isEqual` |
| HALog | 3 verbosity levels (BASIC/DETAILED/TRACE) with cached level, no config read on hot path |
| Quorum Enum | MAJORITY and ALL modes, ALL enforced via Ratis Watch API |
| SimpleStateMachineStorage | Replaces hand-rolled last-applied tracking |
| Election Metrics | `electionCount`, `lastElectionTime`, exposed via cluster status |
| PBKDF2 Cluster Token | 100K-iteration PBKDF2WithHmacSHA256 derivation from cluster name + root password |
| Leader Lease | `LINEARIZABLE` reads enabled with 0.9 timeout ratio |
| Configurable election timeouts | `HA_ELECTION_TIMEOUT_MIN/MAX` for WAN cluster tuning |
| Configurable Ratis tuning | Log segment size, append buffer size, write buffer all configurable |
| NIO zip-slip protection | `Path.normalize().toAbsolutePath().startsWith()` for snapshot extraction |
| WAL deletion logging | Warning logged when stale `.wal` file deletion fails |
| Dynamic membership API | `addPeer()`, `removePeer()`, `transferLeadership()`, `stepDown()`, `leaveCluster()` with REST endpoints |
| K8s auto-join | `tryAutoJoinCluster()` on startup via Ratis AdminApi, `leaveCluster()` on K8s shutdown |
| Read consistency modes | EVENTUAL, READ_YOUR_WRITES, LINEARIZABLE with wait-for-apply notification pattern |
| 3-phase commit | Phase 1 (lock: capture WAL) -> Replication (no lock) -> Phase 2 (lock: apply locally). Leader steps down on Phase 2 failure |
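
The 3-phase commit row can be read as a lock / no-lock / lock sequence. The following is a minimal sketch of that ordering only, not the code from either branch: the class name, the `Predicate` stand-ins for Raft submission and local apply, and the choice to return `false` when replication fails are all assumptions made for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Predicate;

// Hypothetical sketch of the Phase 1 -> Replication -> Phase 2 ordering.
final class ThreePhaseCommitSketch {
  private final ReentrantLock dbLock = new ReentrantLock();
  private final Predicate<byte[]> replicate;    // stand-in for the Raft submission; true = quorum reached
  private final Predicate<byte[]> applyLocally; // stand-in for the local apply; true = success
  private boolean steppedDown;

  ThreePhaseCommitSketch(Predicate<byte[]> replicate, Predicate<byte[]> applyLocally) {
    this.replicate = replicate;
    this.applyLocally = applyLocally;
  }

  boolean commit(byte[] pendingChanges) {
    final byte[] wal;
    dbLock.lock(); // Phase 1: capture the WAL bytes while holding the lock
    try {
      wal = pendingChanges.clone();
    } finally {
      dbLock.unlock();
    }

    // Replication: no lock is held here, so other writers can run Phase 1 concurrently.
    if (!replicate.test(wal))
      return false; // assumption of this sketch: a failed replication aborts the commit

    dbLock.lock(); // Phase 2: apply locally while holding the lock
    try {
      if (!applyLocally.test(wal)) {
        steppedDown = true; // per the table: the leader steps down on Phase 2 failure
        return false;
      }
      return true;
    } finally {
      dbLock.unlock();
    }
  }

  boolean isLockHeld() {
    return dbLock.isLocked();
  }

  boolean hasSteppedDown() {
    return steppedDown;
  }
}
```

Releasing the lock during replication is what lets several transactions be in flight to the Raft log at once; the cost is that Phase 2 can fail after the entry is already committed to the log, which is why the step-down path exists.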

---

## 4. Remaining Implementation Differences

### 4.1 Command Forwarding

| | `ha-redesign` | `apache-ratis` |
|---|---|---|
| Mechanism | HTTP POST to leader via `HttpClient` | Ratis `query()` path (state machine) |
| Auth | Cluster token via HTTP header | Cluster token via HTTP header |
| Constraint validation | Delegated to leader's normal commit path | Explicit index key changes in TRANSACTION_FORWARD |

`apache-ratis` also has a `TRANSACTION_FORWARD` Raft log entry type that forwards writes from replicas together with their index key changes for constraint validation. This path has a noted page visibility issue and is currently unused; HTTP proxy forwarding is used instead.
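
Both branches authenticate forwarded requests with a cluster token, which Section 3 describes as derived with 100K-iteration PBKDF2WithHmacSHA256 from the cluster name and root password and compared with `MessageDigest.isEqual`. A minimal sketch of that combination follows. Using the cluster name as the salt, the 256-bit key length, the hex encoding, and all class and method names are assumptions of this sketch, not details confirmed by the document.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hypothetical sketch: derive a cluster token and compare it in a timing-safe way.
final class ClusterTokenSketch {
  private static final int ITERATIONS = 100_000; // iteration count stated in Section 3
  private static final int KEY_BITS = 256;       // assumption: key length is not stated in the document

  // Derives a hex token from the root password, salted with the cluster name (salt choice is an assumption).
  static String deriveToken(String clusterName, String rootPassword) {
    try {
      PBEKeySpec spec = new PBEKeySpec(rootPassword.toCharArray(),
          clusterName.getBytes(StandardCharsets.UTF_8), ITERATIONS, KEY_BITS);
      byte[] key = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256").generateSecret(spec).getEncoded();
      return HexFormat.of().formatHex(key);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException("PBKDF2WithHmacSHA256 is not available", e);
    }
  }

  // Compares the presented header value with the expected token using MessageDigest.isEqual.
  static boolean tokensMatch(String expected, String presented) {
    if (presented == null)
      return false;
    return MessageDigest.isEqual(expected.getBytes(StandardCharsets.UTF_8),
        presented.getBytes(StandardCharsets.UTF_8));
  }
}
```

Because every node can derive the same token from configuration it already has, no extra secret needs to be distributed; the trade-off is that changing the root password changes the token on every node.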

### 4.2 Log Entry Format

| | `ha-redesign` | `apache-ratis` |
|---|---|---|
| Architecture | Separate `RaftLogEntryCodec` + `RaftLogEntryType` enum | Single `RaftLogEntry` class |
| Entry types | TX_ENTRY, SCHEMA_ENTRY, INSTALL_DATABASE_ENTRY | TRANSACTION, TRANSACTION_FORWARD, COMMAND_FORWARD |
| Serialization | `DataInputStream`/`DataOutputStream` | `Binary` class (ArcadeDB native) |
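
To make the `DataInputStream`/`DataOutputStream` approach concrete, here is a minimal round-trip sketch. The field layout (type byte, database name, length-prefixed payload), the `Entry` record, and the class name are hypothetical and chosen only for illustration; they are not the actual wire format of `RaftLogEntryCodec`, and the LZ4 compression step is omitted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical layout: [type: 1 byte][database: UTF][payload length: int][payload bytes]
final class LogEntryCodecSketch {
  enum EntryType { TX_ENTRY, SCHEMA_ENTRY, INSTALL_DATABASE_ENTRY }

  record Entry(EntryType type, String database, byte[] payload) {}

  static byte[] encode(Entry entry) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeByte(entry.type().ordinal());
      out.writeUTF(entry.database());
      out.writeInt(entry.payload().length);
      out.write(entry.payload());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  static Entry decode(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      EntryType type = EntryType.values()[in.readUnsignedByte()];
      String database = in.readUTF();
      byte[] payload = new byte[in.readInt()];
      in.readFully(payload);
      return new Entry(type, database, payload);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Keeping the type enum separate from the codec, as `ha-redesign` does, means new entry types can be added without touching the framing code, whereas a single-class design keeps the format and its types in one place.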

### 4.3 Wait-for-Apply Notification (applyNotifier)

`apache-ratis` replaced polling loops (`Thread.sleep(10)`) with a proper `Object` monitor. The state machine calls `raftHAServer.notifyApplied()` after each apply, waking up blocked readers. This eliminates polling overhead for READ_YOUR_WRITES consistency.

`ha-redesign` does not have `waitForAppliedIndex` / `waitForLocalApply` methods because it uses a different forwarding approach. This is worth revisiting if read-after-write consistency is added to `ha-redesign`.
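
The monitor-based replacement for `Thread.sleep(10)` polling can be sketched as below. The method names `notifyApplied` and `waitForAppliedIndex` come from the text above; the class name, field names, signatures, and timeout handling are assumptions of this sketch rather than the branch's code.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: readers block on a monitor until the state machine reports the index as applied.
final class ApplyNotifierSketch {
  private final Object applyNotifier = new Object();
  private long lastAppliedIndex = -1;

  // Called by the state machine after each apply; wakes every blocked reader.
  void notifyApplied(long index) {
    synchronized (applyNotifier) {
      if (index > lastAppliedIndex)
        lastAppliedIndex = index;
      applyNotifier.notifyAll();
    }
  }

  // Blocks until the given index has been applied or the timeout elapses. Returns true if applied.
  boolean waitForAppliedIndex(long index, long timeoutMs) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    synchronized (applyNotifier) {
      while (lastAppliedIndex < index) { // loop guards against spurious wakeups
        long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
        if (remainingMs <= 0)
          return false; // also avoids wait(0), which would block indefinitely
        try {
          applyNotifier.wait(remainingMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
      return true;
    }
  }
}
```

A reader that needs READ_YOUR_WRITES semantics would call `waitForAppliedIndex` with the log index of its last write before serving the read, so it wakes as soon as the apply happens instead of on the next polling tick.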

### 4.4 Ratis Configuration Defaults

| Setting | `ha-redesign` | `apache-ratis` |
|---------|---------------|----------------|
| Election timeout min (default) | 2000ms | 1500ms |
| Election timeout max (default) | 5000ms | 3000ms |
| Snapshot threshold (default) | 10,000 | 100,000 |

`ha-redesign` uses more conservative election timeouts (less likely to trigger false elections under load) and a lower snapshot threshold (more frequent log compaction).

---

## 5. Features Unique to Each Branch

### Only in `ha-redesign`

| Feature | Description |
|---------|-------------|
| Modular plugin architecture | `RaftHAPlugin` via `ServiceLoader`, `HA_IMPLEMENTATION` toggle |
| `GetClusterHandler` | REST endpoint at `/api/v1/cluster` with election metrics, uptime |
| `INSTALL_DATABASE_ENTRY` | Raft log entry type for replicating `createDatabase()` |
| `SnapshotManager` utilities | CRC32 checksums and file-diff helpers for delta sync |
| `HA_RAFT_PERSIST_STORAGE` | Preserves Raft storage across restarts in tests |
| Enhanced Studio cluster UI | Topology visualization, election count, uptime |
| Comprehensive test suite | 40 test files with split-brain, chaos, read consistency, benchmarks |
| E2E chaos tests | 9 Toxiproxy-based ITs in `e2e-ha/` module |

### Only in `apache-ratis`

| Feature | Description |
|---------|-------------|
| `TRANSACTION_FORWARD` entry type | Raft-native write forwarding with index key changes for constraint validation (currently unused due to page visibility issue) |
| Command forwarding via `query()` | Forwarded commands execute on leader's state machine (currently unused in favor of HTTP proxy) |
| BOLT + TLS support | `BOLT_SSL` config (DISABLED/OPTIONAL/REQUIRED) |

---

## 6. Test Coverage

| Category | `ha-redesign` | `apache-ratis` |
|----------|---------------|----------------|
| Unit tests | 13 classes | 3 classes |
| Integration tests | 27 classes | 3 classes |
| Test lines | ~5,800 | ~1,500 (est.) |
| Split-brain | 3-node and 5-node | None |
| Chaos/crash | Random crash, leader/replica recovery | Comprehensive IT only |
| Read consistency | Dedicated IT | None |
| Schema replication | 2 dedicated ITs | Covered in comprehensive IT |
| Snapshot resync | `RaftFullSnapshotResyncIT` | None |
| Benchmark | `RaftHAInsertBenchmark` | `HAInsertBenchmark` |
| E2E (Toxiproxy) | 9 ITs in `e2e-ha/` | Referenced |

---

## 7. Commit Activity

| | `ha-redesign` | `apache-ratis` |
|---|---|---|
| Commits ahead of main | 90 | 21 |
| Files changed | 111 (+23,988 / -380) | ~94 (+8,677 / -6,711) |

---

## 8. Future Consideration

Features from `apache-ratis` that could be added to `ha-redesign` in future iterations:

| Item | Effort | Reason |
|------|--------|--------|
| TRANSACTION_FORWARD | Large | More efficient follower writes (noted as having page visibility issues, currently unused on apache-ratis) |

---

## 9. Summary

After three rounds of porting, `ha-redesign` now includes all production-relevant features from `apache-ratis`:

- **Performance:** Group committer with batched Raft writes, LZ4 WAL compression, configurable Ratis tuning, 3-phase commit (lock released during Raft replication for concurrent write throughput)
- **Correctness:** ALL quorum race fix, snapshot-based resync for lagging replicas, NIO zip-slip protection
- **Security:** PBKDF2 cluster token derivation, timing-safe token comparison, cluster token header auth for snapshots
- **Operability:** HALog with cached verbosity levels, configurable election timeouts, WAL deletion logging, Studio cluster UI
- **Cluster Management:** Dynamic membership API (addPeer/removePeer/transferLeadership/stepDown/leaveCluster with REST endpoints), K8s auto-join discovery, multiple read consistency modes (EVENTUAL, READ_YOUR_WRITES, LINEARIZABLE)
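
The zip-slip protection listed under Correctness is described in Section 3 as `Path.normalize().toAbsolutePath().startsWith()`. A minimal sketch of such a guard is shown here; the class name, method name, and exception type are hypothetical, and the real snapshot extraction code may differ.

```java
import java.nio.file.Path;

// Hypothetical sketch: reject archive entries that would resolve outside the extraction directory.
final class ZipSlipGuardSketch {
  static Path resolveEntry(Path targetDir, String entryName) {
    Path root = targetDir.toAbsolutePath().normalize();
    Path resolved = root.resolve(entryName).normalize();
    // Path.startsWith compares whole path components, so "/tmp/db-evil" does not start with "/tmp/db".
    if (!resolved.startsWith(root))
      throw new IllegalArgumentException("Zip entry escapes target directory: " + entryName);
    return resolved;
  }
}
```

Comparing `Path` objects rather than strings matters here: a plain string prefix check would accept a sibling directory whose name merely begins with the target directory's name.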

The only remaining `apache-ratis`-exclusive features are an experimental write-forwarding mechanism (`TRANSACTION_FORWARD`) that is currently unused due to a page visibility issue, and BOLT with TLS support.

`ha-redesign` is the production-ready choice: modular architecture, 40-file test suite with chaos engineering, safe rollout via `HA_IMPLEMENTATION` toggle, and now feature-complete with all security, performance, and cluster management features from `apache-ratis`.


## 10. Benchmark Results

ArcadeDB Raft HA Insert Benchmark

Sync: 5,000 records (batch 100/tx) | Async: 100,000 records (8 threads)

1 server (no HA) - embedded
-------------------------------------------------------
Ops: 50 operations (1 thread)
Throughput: 1,073 ops/sec
Avg: 932 us | Median: 840 us
Min: 614 us | P95: 1,417 us
P99: 2,764 us | Max: 2,764 us

3 servers (Raft HA) - embedded on leader
-------------------------------------------------------
Ops: 50 operations (1 thread)
Throughput: 67 ops/sec
Avg: 15,033 us | Median: 15,010 us
Min: 10,974 us | P95: 21,761 us
P99: 23,951 us | Max: 23,951 us

5 servers (Raft HA) - embedded on leader
-------------------------------------------------------
Ops: 50 operations (1 thread)
Throughput: 69 ops/sec
Avg: 14,539 us | Median: 14,820 us
Min: 9,411 us | P95: 20,014 us
P99: 22,106 us | Max: 22,106 us

3 servers (Raft HA) - remote via follower proxy
-------------------------------------------------------
Ops: 5,000 operations (1 thread)
Throughput: 87 ops/sec
Avg: 11,555 us | Median: 11,430 us
Min: 3,716 us | P95: 15,791 us
P99: 19,943 us | Max: 38,777 us

3 servers (Raft HA) - concurrent (3 threads)
-------------------------------------------------------
Ops: 4,998 operations (3 threads)
Throughput: 96 ops/sec
Avg: 31,321 us | Median: 30,516 us
Min: 11,922 us | P95: 42,913 us
P99: 52,618 us | Max: 144,822 us

5 servers (Raft HA) - concurrent (5 threads)
-------------------------------------------------------
Ops: 5,000 operations (5 threads)
Throughput: 117 ops/sec
Avg: 42,356 us | Median: 42,212 us
Min: 10,499 us | P95: 57,886 us
P99: 67,147 us | Max: 141,147 us

1 server (no HA) - async
-------------------------------------------------------
Ops: 100,000 records (8 async threads, commitEvery=5000)
Throughput: 423,500 inserts/sec
Elapsed: 0.2 seconds

3 servers (Raft HA) - async on leader
-------------------------------------------------------
Ops: 100,000 records (8 async threads, commitEvery=5000)
Throughput: 278,048 inserts/sec
Elapsed: 0.4 seconds

5 servers (Raft HA) - async on leader
-------------------------------------------------------
Ops: 100,000 records (8 async threads, commitEvery=5000)
Throughput: 313,998 inserts/sec
Elapsed: 0.3 seconds
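
As a sanity check on the synchronous figures above, throughput and average latency can be related by a simple closed-loop estimate: if each thread issues its next operation only after the previous one returns, throughput is roughly `threads / average latency`. That closed-loop behaviour is an assumption about the benchmark, and the class below is a hypothetical helper, not part of `RaftHAInsertBenchmark`.

```java
// Hypothetical helper: closed-loop throughput estimate from thread count and average latency.
final class ThroughputCheckSketch {
  static double opsPerSecond(int threads, double avgLatencyMicros) {
    return threads * 1_000_000.0 / avgLatencyMicros;
  }

  public static void main(String[] args) {
    // Inputs are the Avg values reported above; compare the printed estimates with the reported Throughput lines.
    System.out.printf("1 server, 1 thread:   %.1f ops/sec%n", opsPerSecond(1, 932));
    System.out.printf("3 servers, 1 thread:  %.1f ops/sec%n", opsPerSecond(1, 15_033));
    System.out.printf("3 servers, 3 threads: %.1f ops/sec%n", opsPerSecond(3, 31_321));
    System.out.printf("5 servers, 5 threads: %.1f ops/sec%n", opsPerSecond(5, 42_356));
  }
}
```

Under this reading, the concurrent runs gain throughput mainly by overlapping operations, since per-operation latency rises as threads are added.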