Conversation
⛔ Not up to standards

🔴 Issues

| Category | Results |
|---|---|
| BestPractice | 6 medium, 2 minor |
| CodeStyle | 24 minor |

🟢 Metrics

| Metric | Results |
|---|---|
| Complexity | 572 |

TIP: This summary will be updated as you push new changes.
Code Review

This is a substantial and well-motivated change - replacing ~6,000 lines of ad-hoc Raft-like code with Apache Ratis 3.2.1 is a significant improvement in correctness guarantees. The new test suite (36 tests including 17 comprehensive ones) is thorough. Below are issues worth addressing before merge.

Bugs

1. Debug println left in production code

// HALog.java - remove this line:
System.out.println("[HA-" + level + "] " + formatted);

2. …

3. No-op format-string mutation

final String formatted = String.format(message.replace("%s", "%s").replace("%d", "%s"), args);

The .replace("%s", "%s") call is a no-op and should be removed.

4. Spurious anonymous block in the exception handler

{ // <-- unnecessary anonymous block
  try {
    proxyToLeader(exchange, leaderAddr);
    return;
  } catch ...
}
sendErrorResponse(...)

The extra block is a no-op and should be removed.

Data Durability Concern

5. Leader commits locally before achieving quorum

Standard Raft commits only after quorum acknowledgement. This design is explicitly documented as intentional, but it weakens the durability guarantees. At a minimum, this should be prominently documented as a trade-off.

Security

6. Timing-attack-vulnerable cluster token comparison

// validateClusterForwardedAuth - replace:
!expectedToken.equals(providedToken)
// with:
!MessageDigest.isEqual(expectedToken.getBytes(StandardCharsets.UTF_8), providedToken.getBytes(StandardCharsets.UTF_8))

7. User enumeration via error message

sendErrorResponse(exchange, 401, "Unknown forwarded user: " + forwardedUserValues.getFirst(), null, null);

Exposing the username in the response body of an auth failure enables user enumeration. Return a generic error message instead.

Design / Maintainability

8. Silent override of the caller's retry count

// RemoteHttpComponent:
if (maxRetry < 3)
  maxRetry = 3;

This overrides the caller's intent without signalling it.

9. ATTRIBUTIONS.md not updated

10. …

Overall this is a high-quality implementation of a very complex change. The comprehensive test coverage and the transparent design documentation are appreciated. Addressing the data durability concern (#5) and the security issues (#6, #7) are the highest priority before merge.
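The fix suggested in item 6 can be sketched as a small self-contained check. Class and method names below are illustrative, not ArcadeDB's actual code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of a timing-safe cluster-token comparison (hypothetical names).
public class ClusterTokenCheck {
  // MessageDigest.isEqual examines every byte regardless of where the first
  // mismatch occurs, so response time does not leak the matching prefix length.
  public static boolean tokensMatch(final String expected, final String provided) {
    if (expected == null || provided == null)
      return false;
    return MessageDigest.isEqual(
        expected.getBytes(StandardCharsets.UTF_8),
        provided.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(final String[] args) {
    System.out.println(tokensMatch("secret-token", "secret-token")); // true
    System.out.println(tokensMatch("secret-token", "secret-tokeN")); // false
  }
}
```

Unlike `String.equals`, which returns at the first differing character, this comparison takes the same time whether the provided token is wrong in the first byte or the last.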
Code Review
This pull request replaces the custom HA replication protocol with Apache Ratis, providing a formally verified Raft consensus implementation. The changes include a new Ratis state machine, snapshot management, and improved cluster monitoring. I have identified a flaw in the HTTP proxy's session-based authentication forwarding, a reliance on unreliable Thread.sleep in tests, and some unused code that should be cleaned up. Please address these issues to ensure robust authentication and reliable test execution.
final String basicAuth = exchange.getAttachment(BASIC_AUTH_KEY);
if (basicAuth != null) {
  // Extract username from stored Basic auth
  final String decoded = new String(Base64.getDecoder().decode(
      basicAuth.substring(AUTHORIZATION_BASIC.length() + 1)), java.nio.charset.StandardCharsets.UTF_8);
  conn.setRequestProperty(HEADER_FORWARDED_USER, decoded.split(":")[0]);
}
There appears to be a flaw in the logic for forwarding session-based authentication. This code attempts to get the original Basic auth from an attachment on the current exchange. However, if the request was authenticated with a session Bearer token, this attachment will not be present, and the forwarded user will not be set.
To fix this, you should retrieve the session object from the session token and get the user's name from there. This would be more robust and would not require attaching the basic auth string to the exchange.
Here's a suggested implementation:
if (auth != null && auth.startsWith("Bearer AU-")) {
// Session token: use cluster-internal auth headers instead
conn.setRequestProperty(HEADER_CLUSTER_TOKEN, raftHA.getClusterToken());
final String sessionToken = auth.substring(AUTHORIZATION_BEARER.length()).trim();
final HttpAuthSession session = httpServer.getAuthSessionManager().getSession(sessionToken);
if (session != null) {
conn.setRequestProperty(HEADER_FORWARDED_USER, session.getUser().getName());
} else {
// Handle case where session is invalid or expired
}
} else if (auth != null) {
  // ...

final JSONObject cluster = getClusterInfo(container);
if (cluster.has("isLeader") && cluster.getBoolean("isLeader"))
  return true;
} catch (final Exception ignored) {}
Empty catch blocks should be avoided, even in tests, as they can swallow exceptions and hide underlying problems. It would be better to at least log the exception at a DEBUG or TRACE level. This would make debugging test failures easier. This applies to other similar empty catch blocks in this file and other new test files.
} catch (final Exception e) {
// Ignored during polling for leader
}

// Toxiproxy "slow_close" + "limit_data" simulate packet loss behavior.
// The "timeout" toxic with a probability achieves drop behavior.
// Using bandwidth toxic with very low rate to simulate packet-level loss.
The comment here is a bit misleading. It mentions slow_close, limit_data, and timeout toxics to simulate packet loss, but the implementation uses the bandwidth toxic. While limiting bandwidth can simulate some effects of packet loss (like timeouts), it's not the same as randomly dropping packets. The comment should be updated to accurately reflect the implementation, for example, by stating that it simulates packet loss effects by severely limiting bandwidth.
private ResultSet forwardCommandToLeader(final String language, final String query, final Map<String, Object> namedParams,
    final Object[] positionalParams) {
  HALog.log(this, HALog.DETAILED, "Forwarding command to leader: %s %s (db=%s)", language, query, getName());

  // Rollback the local transaction started by DatabaseAbstractHandler.transaction() wrapper.
  // The command executes on the leader, so no local changes should be committed.
  if (isTransactionActive())
    rollback();

  final RaftHAServer raftHA = server.getRaftHA();
  final byte[] resultBytes = raftHA.forwardCommand(getName(), language, query, namedParams, positionalParams);
  HALog.log(this, HALog.TRACE, "Command forwarded successfully: %d bytes result", resultBytes.length);

  // Wait for the leader's WAL changes to be applied locally on this follower.
  // Without this, a subsequent read on this server may not see the changes yet.
  raftHA.waitForLocalApply();

  // Check for error response
  if (resultBytes.length > 0 && resultBytes[0] == 'E') {
    final String error = new String(resultBytes, 1, resultBytes.length - 1);
    throw new com.arcadedb.exception.CommandExecutionException(error);
  }

  // Deserialize binary result into ResultSet
  final java.util.List<Map<String, Object>> rows =
      com.arcadedb.server.ha.ratis.RaftLogEntry.deserializeCommandResult(resultBytes);
  final com.arcadedb.query.sql.executor.InternalResultSet rs = new com.arcadedb.query.sql.executor.InternalResultSet();
  for (final Map<String, Object> row : rows) {
    final com.arcadedb.query.sql.executor.ResultInternal result = new com.arcadedb.query.sql.executor.ResultInternal(proxied);
    result.setPropertiesFromMap(row);
    rs.add(result);
  }
  return rs;
}
The private method forwardCommandToLeader seems to be unused. The logic in command() now throws ServerIsNotTheLeaderException, and the forwarding is handled by the HTTP proxy layer. This method appears to be a remnant of a different implementation strategy. To improve maintainability and avoid confusion, it should be removed if it's no longer needed.
.atMost(5, TimeUnit.MINUTES)
.pollInterval(1, TimeUnit.SECONDS)
.until(() -> getServer(serverNumber).getHA().getMessagesInQueue() == 0);
// With Ratis, replication is handled internally. Wait for state machine application.
Replacing the explicit wait for the replication queue to be empty with a fixed Thread.sleep() is unreliable and can lead to flaky tests. A better approach would be to implement a more deterministic wait. For example, you could get the leader's commit index and then poll the followers until their applied index matches the leader's commit index. This would ensure that replication is actually complete before proceeding with test assertions.
// With Ratis, replication is handled internally. Wait for state machine application on followers.
final ArcadeDBServer leader = getLeaderServer();
if (leader == null)
return; // NO LEADER, CANNOT WAIT
final long leaderCommitIndex = leader.getHA().getCommitIndex();
for (int i = 0; i < getServerCount(); ++i) {
final ArcadeDBServer server = getServer(i);
if (server != leader)
Awaitility.await().atMost(1, TimeUnit.MINUTES).until(() -> server.getHA().getLastAppliedIndex() >= leaderCommitIndex);
}
📜 License Compliance Check — ✅ License check passed. See artifacts for full report.
server/src/main/java/com/arcadedb/server/ha/ratis/ArcadeDBStateMachine.java — Fixed
Created the first HA benchmark with multiple servers, testing an insert, and the results are VERY POOR. Key findings:

The benchmark is tagged @Tag("benchmark") so it's excluded from normal test runs. Run with: mvn test -pl server -Dtest=HAInsertBenchmark -Dgroups=benchmark

We need to analyze this before merging.
Code review posted - see full review details in the next comment
test4 ignore

Deleting test - please ignore this comment
Code Review - Replace custom Raft protocol with Apache Ratis

This is a major and well-structured change. Replacing a custom, ad-hoc Raft implementation with Apache Ratis is the right call for long-term reliability. The overall architecture is sound, the test coverage is impressive (36 tests + Docker e2e), and the documentation is thorough. Below are issues found during review.

CRITICAL

1. System.out.println in production code (HALog.java)

HALog.log() unconditionally calls System.out.println in addition to LogManager. Per project conventions (CLAUDE.md): remove any System.out you used for debug when you have finished. This will spam stdout in production when arcadedb.ha.logVerbose is nonzero. Also, the .replace in the format call is a no-op, and building a String.format() string just for the println while also calling LogManager is redundant. The println should simply be removed.

2. Platform-default charset in UUID.nameUUIDFromBytes(clusterName.getBytes()) in RaftHAServer.java

All nodes must agree on the same groupId or they will never form a cluster. Using .getBytes() without a charset argument is platform-dependent. This should use StandardCharsets.UTF_8 explicitly, as already done for the token derivation two lines later:

UUID.nameUUIDFromBytes(clusterName.getBytes(StandardCharsets.UTF_8))

HIGH

3. Busy-wait polling in waitForAppliedIndex() and waitForLocalApply()

Both methods spin with Thread.sleep(10) up to the quorum timeout. This adds up to 10ms latency on every READ_YOUR_WRITES and LINEARIZABLE read and holds a thread. Since ArcadeDBStateMachine already has lastAppliedIndex as an AtomicLong, a LockSupport-based condition or a CompletableFuture completed when the index advances would be lower-latency.

4. Dead code: forwardCommandToLeader() private method in ReplicatedDatabase.java

The method is defined but never called. All command paths now throw ServerIsNotTheLeaderException instead, relying on the HTTP proxy. The method should be removed to avoid confusion.

5. No connection/read timeout on the HTTP proxy in AbstractServerHttpHandler.proxyToLeader()

The call to new java.net.URI(targetUrl).toURL().openConnection() has no setConnectTimeout or setReadTimeout. If the leader is slow or unreachable, the follower thread blocks indefinitely. Explicit timeouts matching quorumTimeout should be set.

6. Fully qualified names in production code

CLAUDE.md says do not use fully qualified names if possible, always import the class. There are many FQN usages in production code in RaftHAServer.java and ReplicatedDatabase.java. For example: peerHttpAddresses is declared as new java.util.concurrent.ConcurrentHashMap, lagMonitorExecutor as java.util.concurrent.ScheduledExecutorService, and getFollowerStates returns java.util.List of java.util.Map. These should be imported at the top of the file.

MEDIUM

7. isCurrentNodeLeader() race in ArcadeDBStateMachine.applyTransaction()

The state machine skips apply on the leader (returns early if isCurrentNodeLeader()). During a leadership transition, a node that was leader when commit2ndPhase() ran could lose leadership before applyTransaction() is called, causing double-application of WAL page changes. Worth adding a comment explaining why this is safe in the Ratis model, or verifying it actually cannot happen.

8. Breaking change: connectCluster / disconnectCluster throw UnsupportedOperationException

Existing users calling POST /api/v1/server with command connect or disconnect will get a 500 response instead of a helpful message. These should return a proper HTTP 400 with a migration hint rather than an uncaught exception.

9. Spurious double-block in exception handler in AbstractServerHttpHandler.java

There is an extra unnecessary block wrapping the proxyToLeader try block in the ServerIsNotTheLeaderException catch handler. It is a no-op but visually confusing and should be removed.

LOW / NITS

10. Cluster token written back to GlobalConfiguration

initClusterToken() calls configuration.setValue(GlobalConfiguration.HA_CLUSTER_TOKEN, this.clusterToken). Since GlobalConfiguration values can appear in server info dumps and logs, the derived token (which is a function of the root password) could be inadvertently exposed. Consider keeping it only in the clusterToken field without writing it back to the global config.

11. getFollowerStates() index correlation assumption

The method assumes matchIndices[i] corresponds to the i-th non-self peer in raftGroup.getPeers(). This assumption depends on Ratis internal ordering that may not be guaranteed across versions. Should be validated against the Ratis API contract.

12. System.out.println in test infrastructure

RaftClusterStarter System.out usage is fine as CLI output, but actual test classes such as RaftHAComprehensiveIT should not emit to stdout to avoid cluttering CI output.

Summary

The architectural direction is excellent and the test suite is thorough. The System.out.println in HALog and the platform-default charset in the group ID derivation are the two that should be addressed before merge.
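For item 3, one possible shape for replacing the Thread.sleep(10) loop is a plain monitor wait on the applied index (the review also suggests LockSupport or a CompletableFuture as alternatives). This is a minimal sketch with hypothetical names, not the actual RaftHAServer/ArcadeDBStateMachine API:

```java
// Illustrative sketch: readers block until the applied index reaches a target,
// woken immediately when the state machine advances, instead of polling every 10ms.
public class AppliedIndexTracker {
  private long lastAppliedIndex = -1;

  // Called by the state machine after each log entry is applied.
  public synchronized void advanceTo(final long index) {
    if (index > lastAppliedIndex) {
      lastAppliedIndex = index;
      notifyAll(); // wake readers blocked in waitForIndex()
    }
  }

  // Blocks until the applied index reaches `target` or the timeout expires.
  // Returns true if the index was reached in time.
  public synchronized boolean waitForIndex(final long target, final long timeoutMs)
      throws InterruptedException {
    final long deadline = System.currentTimeMillis() + timeoutMs;
    while (lastAppliedIndex < target) {
      final long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0)
        return false;
      wait(remaining); // released and re-acquired atomically; no busy-spin
    }
    return true;
  }
}
```

Compared with the sleep loop, a waiting reader is woken as soon as `advanceTo()` fires rather than on the next 10ms tick, and no thread burns CPU while waiting.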
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

@@ Coverage Diff @@
## main #3798 +/- ##
==========================================
- Coverage 64.88% 64.22% -0.66%
==========================================
Files 1581 1557 -24
Lines 116573 116457 -116
Branches 24722 24797 +75
==========================================
- Hits 75636 74799 -837
- Misses 30593 31336 +743
+ Partials    10344   10322    -22

☔ View full report in Codecov by Sentry.
Code Review - PR 3798: Replace Custom Raft with Apache Ratis

This is a substantial, well-motivated architectural replacement. Replacing a hand-rolled Raft protocol with Apache Ratis is the right call - the PR description is honest about the limitations of the old implementation. The new code is well-structured and comes with an impressive test suite (17 comprehensive tests + Docker e2e tests with network fault injection). The documentation in docs/arcadedb-ha-26.4.1.md is thorough. Below are the issues found, grouped by severity.

SECURITY - High

1. Weak cluster token derivation (RaftHAServer.initClusterToken())

Recommendation: use HMAC-SHA256:

Mac mac = Mac.getInstance("HmacSHA256");
mac.init(new SecretKeySpec(rootPassword.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
this.clusterToken = Base64.getUrlEncoder().withoutPadding().encodeToString(mac.doFinal(clusterName.getBytes(StandardCharsets.UTF_8)));

2. SnapshotHttpHandler - unauthenticated database download surface

The snapshot endpoint is registered on …

SECURITY - Medium

3. validateClusterForwardedAuth() - username blindly trusted from header

After validating the cluster token, the username from the forwarded-user header is trusted as-is.

4. proxyToLeader() - silent privilege escalation to root

conn.setRequestProperty(HEADER_CLUSTER_TOKEN, raftHA.getClusterToken());
conn.setRequestProperty(HEADER_FORWARDED_USER, "root");

If a request arrives at a follower with no Authorization header, it is forwarded as root.

5. Snapshot downloaded over plain HTTP with no integrity verification

BUGS - High

6. System.out.println in production HALog.java

Additionally the format string mutation is a no-op.

7. DDL replication inside the database write lock (ReplicatedDatabase.recordFileChanges())

BUGS - Medium

8. Files.list() stream not closed in RaftHAServer.startService()

final boolean storageExists = java.nio.file.Files.exists(storagePath)
    && java.nio.file.Files.list(storagePath).findAny().isPresent();

9. Potential double-apply for TRANSACTION_FORWARD entries

10. getFollowerStates() - array index alignment not guaranteed

The code assumes matchIndices[i] corresponds to the i-th non-self peer in raftGroup.getPeers().

PERFORMANCE

11. proxyToLeader() uses HttpURLConnection with no connection pool or read timeout

Every proxied follower-to-leader request opens a new TCP connection. The default read timeout is infinite (0). Under load this creates many TIME_WAIT sockets and risks blocking proxy threads indefinitely.

12. Busy-poll in waitForLocalApply() under LINEARIZABLE reads

The 10ms poll loop for applied index creates unnecessary scheduling overhead.

CODE QUALITY

13. Fully-qualified class names in production code

14. Duplicate Javadoc block on waitForLocalApply()

15. forwardCommandToLeader() in ReplicatedDatabase appears to be dead code

Writes now throw ServerIsNotTheLeaderException and forwarding is handled by the HTTP proxy layer.

16. Magic byte 'C' for state machine routing

Summary

Must fix before merge:

The overall design and test coverage are excellent. The Ratis integration is architecturally sound. These are fixable issues that should be addressed before merging to main.
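For item 8, the usual fix is try-with-resources around Files.list(), which otherwise holds an open directory handle until the stream is garbage-collected. A minimal sketch (the helper name is illustrative, not ArcadeDB's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Sketch of the try-with-resources fix for the unclosed Files.list() stream.
public class StorageCheck {
  public static boolean storageExists(final Path storagePath) throws IOException {
    if (!Files.exists(storagePath))
      return false;
    // Files.list() keeps an open directory handle until the stream is closed;
    // try-with-resources guarantees the close even on early return or exception.
    try (Stream<Path> entries = Files.list(storagePath)) {
      return entries.findAny().isPresent();
    }
  }

  public static void main(final String[] args) throws IOException {
    final Path tmp = Files.createTempDirectory("storage-check");
    System.out.println(storageExists(tmp)); // empty dir -> false
    Files.createFile(tmp.resolve("data.bin"));
    System.out.println(storageExists(tmp)); // one entry -> true
  }
}
```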
Test pseudo-code block:
PR Review: Replace custom Raft with Apache Ratis

This is a significant, well-architected change. Replacing a custom Raft implementation with Apache Ratis is the right call - Ratis is battle-tested and formally correct. The new code is generally clean and well-structured. Below are observations ranging from correctness issues to minor suggestions.

Correctness Issues

1. Snapshot download has no automatic retry after full failure - ArcadeDBStateMachine.java line 853-863

If … Consider resetting the flag on failure so the next leader notification triggers a fresh attempt. In the catch block, add …

2. TOCTOU race in SnapshotHttpHandler - line 138 vs 166

Client-visible Behavior Worth Documenting

3. Double-timeout in RaftGroupCommitter - line 84-100

When a transaction is dispatched to Raft but the initial …

Minor Issues

4. arcadedb.ha.replicationLagWarning units are ambiguous

The description says "Raft log gap threshold for lag warnings (0=disabled)" but the unit is Raft log indices, not milliseconds. Operators familiar with typical monitoring thresholds will likely misread this. Suggest appending: "A value of N means the replica is N committed log entries behind the leader."

5. RaftLogEntry hardcoded limits - lines 138-140

6. Missing input validation for ha add peer

The command parses peerId and address from user input without validating host:port format or port bounds. Ratis produces opaque errors for invalid addresses; an early validation check would give operators a cleaner error message.

7. ha leave with exactly 2 nodes

Positive Observations

Overall this is production-quality work. Items 1 and 2 are worth addressing before merge, particularly the stuck-follower case after snapshot download failure.
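Item 6's early validation could be a small parse-and-check step run before the address is handed to Ratis. The class name and error messages below are hypothetical:

```java
// Illustrative validation sketch for an `ha add peer` host:port argument,
// so operators get a clear error instead of an opaque Ratis failure.
public class PeerAddressValidator {
  // Returns the host part if the address is a valid host:port, else throws.
  public static String validate(final String address) {
    if (address == null || address.isEmpty())
      throw new IllegalArgumentException("Peer address is empty");
    final int colon = address.lastIndexOf(':');
    if (colon <= 0 || colon == address.length() - 1)
      throw new IllegalArgumentException("Peer address must be host:port, got: " + address);
    final String portStr = address.substring(colon + 1);
    final int port;
    try {
      port = Integer.parseInt(portStr);
    } catch (final NumberFormatException e) {
      throw new IllegalArgumentException("Invalid port '" + portStr + "' in: " + address);
    }
    if (port < 1 || port > 65535)
      throw new IllegalArgumentException("Port out of range (1-65535): " + port);
    return address.substring(0, colon);
  }
}
```

Using `lastIndexOf(':')` rather than `indexOf` keeps the check tolerant of a future IPv6-style `host:...:port` form, though bracketed IPv6 literals would need additional handling.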
Code Review - Apache Ratis HA Integration

This is an impressive, well-structured replacement of the custom Raft stack. The core design - Ratis for consensus, group commit for throughput, HTTP-based snapshot transfer, and a clear three-phase commit protocol - is solid. Below are observations grouped by severity.

HIGH - Potential Bugs

- When the submission queue is full, the code throws …
- In …
- The join retry jitter is derived from …

MEDIUM - Design and Correctness Concerns

- The threshold at which a follower is forced to download a full snapshot instead of catching up via log replay is hardcoded at 10 entries. For write-heavy workloads this is very aggressive (a brief GC pause could trigger a full snapshot). Expose this as a configuration setting.
- The method called during log replay to create new database files does not check whether the file already exists before creating it. On a retry (e.g., leader re-sends after a partial commit), this could overwrite a partially-written file. Add a guard: skip creation if the file already exists and has non-zero size.
- In …

LOW - Minor / Style

- Values like …
- The lag warning debounce interval is hardcoded at 60 seconds. For operators tuning alerting pipelines it would be useful as a configuration option alongside the lag threshold setting.
- The cap of 2 concurrent snapshot downloads is a reasonable NIC protection measure but unexplained in the code. A comment noting the rationale (and ideally making it configurable) would help future maintainers.

Strengths Worth Noting

Overall this is production-quality work. The main items before merge: fix the misleading exception on queue full, and log (not swallow) the …
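The replay guard suggested above ("skip creation if the file already exists and has non-zero size") could be sketched as follows; the helper name is illustrative, not the actual ArcadeDB method:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of an idempotent file-creation guard for log replay.
public class ReplayFileGuard {
  // Returns true if the file was (re)created, false if an existing
  // non-empty file was left untouched.
  public static boolean createIfAbsentOrEmpty(final Path file) throws IOException {
    if (Files.exists(file) && Files.size(file) > 0)
      return false; // a retry of an already-applied entry: do not overwrite
    Files.deleteIfExists(file); // drop an empty partial file, if any
    Files.createFile(file);
    return true;
  }
}
```

The non-zero-size check makes the create operation safe to replay: a duplicate CREATE entry becomes a no-op once the file has content.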
⛔ Not up to standards

🔴 Issues

| Category | Results |
|---|---|
| BestPractice | 9 medium |
| ErrorProne | 1 high |
| CodeStyle | 17 minor |

🟢 Metrics

| Metric | Results |
|---|---|
| Complexity | 834 |

TIP: This summary will be updated as you push new changes.
Code Review - Follow-up

This is a follow-up to earlier Claude review comments. Prior reviews flagged: System.out.println in HALog, missing proxy timeouts, PBKDF2 charset, timing-safe token comparison, busy-wait polling. This comment adds additional observations.

Correction to prior review

ATTRIBUTIONS.md IS updated - The first review incorrectly flagged this as missing. The diff shows +15 additions covering the new Ratis dependencies.

New observations

1. Ratis version discrepancy in documentation

The PR description says Apache Ratis 3.2.1 throughout but server/pom.xml adds version 3.2.2. The docs should reference 3.2.2 consistently.

2. Group commit phantom-commit ambiguity

RaftGroupCommitter marks timed-out entries as cancelled. However, if an entry was already submitted to the Ratis client when the timeout fires, the entry may still be durable in the Raft log even though the local thread received a timeout exception. The client gets an error but the write succeeded. This needs documentation and a test asserting the server does not diverge after such an event.

3. K8s auto-join security

tryAutoJoinCluster discovers peers via DNS. Any pod resolving the expected hostname can join with the cluster token as the only gate. Consider documenting that K8s NetworkPolicy should restrict gRPC port access to only pods in the ArcadeDB StatefulSet.

4. Origin-node skip logic safety

ArcadeDBStateMachine skips WAL application on the origin node. This assumes the leader committed locally in phase 2 before applyTransaction is called. If Ratis replays entries on restart, double-application would corrupt the database. A comment explaining why Ratis will not replay already-applied entries on the origin node would prevent future regressions.

5. Snapshot ZIP path traversal edge case

SnapshotHttpHandler checks for symlinks and path traversal, but ZIP entry names come from actual filenames. If a database filename itself contains .., the ZIP entry could embed a traversal path. Verify ZIP entry names are normalized via Paths.get(filename).getFileName().toString() when added to the stream.

6. connectCluster / disconnectCluster breaking change

These throw UnsupportedOperationException, giving migrating users a confusing 500. An HTTP 400 with 'Use ha add peer/ha remove peer instead' would improve the migration experience.

7. TODO section in docs

docs/arcadedb-ha-26.4.1.md contains a TODO section with planned features. Remove it or convert to GitHub issues before merging.

Priority summary

The overall architecture is solid and the test coverage is impressive. The Docker e2e tests covering network partitions and rolling restarts are especially valuable for a consensus system.
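Observation 5's normalization can be isolated into a tiny helper; the class name here is hypothetical, and the `Paths.get(...).getFileName()` call is exactly the normalization the review suggests:

```java
import java.nio.file.Paths;

// Sketch: strip any directory components from a ZIP entry name so a filename
// containing ".." cannot embed a traversal path into the archive.
public class ZipEntryNames {
  public static String normalize(final String filename) {
    final String name = Paths.get(filename).getFileName().toString();
    if (name.isEmpty() || name.equals("..") || name.equals("."))
      throw new IllegalArgumentException("Unsafe ZIP entry name: " + filename);
    return name;
  }
}
```

For example, `normalize("../../etc/passwd")` yields `passwd`, while a bare `".."` is rejected outright rather than silently becoming a relative entry.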
PR Review: Replace custom Raft protocol with Apache Ratis

This is a major, well-scoped architectural improvement. Replacing the hand-rolled Raft implementation with Apache Ratis 3.2.2 is the right call - battle-tested consensus via a library used by Apache Ozone and IoTDB is strictly better than maintaining a custom protocol. The breadth of test coverage (unit, integration, Docker e2e, network-partition, packet-loss) is excellent and gives confidence in the correctness of the new implementation. Below are findings grouped by severity.

Security

[Concern] Weak cluster token when root password is empty

Suggestion: In addition to logging the warning, refuse to start HA mode with an auto-derived token when the root password is empty and no explicit token is configured - or at minimum escalate the log severity.

[Minor] Plain HTTP for inter-node snapshot download

[Positive] Constant-time token comparison

Using SHA-256 + …

[Positive] Path traversal prevention in snapshot handler

Correctness

[Minor] The health monitor checks every 3 seconds and calls …

[Positive] Origin-node skip logic

[Positive] Schema change ordering

Files are created before WAL apply and removed after - correct ordering that prevents orphan pages or missing targets during follower replay.

Performance

[Positive] Group commit

Batching multiple pending Raft entries into a single pipelined async submission is a sound optimization that amortizes gRPC round-trip latency under concurrent write load.

[Minor] Snapshot ZIP created in-memory before sending

[Minor] Runtime log-level handling

This is a reasonable trade-off for runtime log-level changes without restart. No issue here, just worth noting.

Code Style / Minor Issues

[Minor] The description says "number of uncommitted entries" - this is a Raft log index gap, not wall-clock milliseconds - but the configuration key name could mislead operators.

[Positive] All logging goes through the ArcadeDB logger.

[Positive] Apache Ratis (Apache 2.0) is properly attributed. Required NOTICE content is incorporated. License compliance is clean.

Test Coverage

The test suite is comprehensive.

One gap worth noting: there are no tests for the "weak token on empty root password" scenario described above.

Summary

This is a high-quality replacement of a complex internal system. The architecture is sound, the security practices are solid for the threat model (trusted private network), and the test coverage is exceptional for a PR of this size. The items above are mostly hardening suggestions rather than blocking issues. The snapshot streaming concern is the most operationally impactful one for large datasets.
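The in-memory snapshot ZIP concern could be addressed by writing entries straight to the output stream. A minimal sketch under the assumption of a flat directory of database files; class and method names are hypothetical, and `out` stands in for the HTTP response body stream:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch: stream a snapshot ZIP directly to the response stream instead of
// buffering the whole archive in heap first.
public class SnapshotStreamer {
  public static void streamDirectory(final Path dir, final OutputStream out) throws IOException {
    final List<Path> files;
    try (Stream<Path> s = Files.list(dir)) {
      files = s.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
    }
    try (ZipOutputStream zip = new ZipOutputStream(out)) {
      for (final Path file : files) {
        // Entry name is just the filename: no directory components, no traversal.
        zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
        Files.copy(file, zip); // streams file bytes into the entry, no full in-memory buffer
        zip.closeEntry();
      }
    }
  }
}
```

Peak heap usage becomes bounded by the copy buffer rather than by the database size; with chunked transfer encoding the Content-Length header is simply omitted.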
Code Review - Apache Ratis HA Integration

Overall this is a well-engineered, high-quality PR that replaces a custom Raft implementation with battle-tested Apache Ratis 3.2.2. The architecture is clean, the concurrency handling is careful, and the test coverage is comprehensive.

Positive Observations

Issues

Critical / Should Fix Before Merge

1. Snapshot gap detection race - …
Suggested fix: In the state machine, if …

2. …
Suggested fix: Only set …

3. Leadership transfer notification race - The leadership transfer loop waits on …
Suggested fix: Use a …

4. Log entry serialization format has no version byte - Entry type codes (1, 3) are hardcoded with no format version. During rolling upgrades, if the serialization format changes, followers with the new code receiving old-format entries (or vice versa) may silently deserialize corrupted data.
Suggested fix: Add a version byte after the entry type code and validate at deserialization time.

High - Should Address Soon

5. Negative replication lag possible - …
Suggested fix: …

6. Actual worst-case quorum timeout can exceed documented 2x - The Javadoc states effective timeout is "up to 2x quorumTimeout", but the second wait creates a fresh deadline rather than using the remaining time from the first wait.
Suggested fix: Compute the second wait from the time remaining on the original deadline.

7. Empty cluster token silently degrades auth - The PBKDF2 derivation accepts an empty input.
Suggested fix: Validate non-empty at startup and log a warning if shorter than 32 characters.

8. Health monitor detects stalled server but does not recover - The health monitor loop logs when the server is stalled but takes no recovery action.

9. Snapshot completion not logged at INFO - Snapshot start errors are logged at SEVERE; successful completion has no corresponding INFO log, making operational tracing asymmetric.

Minor Style Notes

The batching, lock-free state tracking, and origin-node skip logic in the state machine are all well done. Issues #1-4 are worth addressing before merge given this is consensus/replication code where subtle races can manifest only under production load.
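Item 4's versioned framing might look like the following sketch. The type codes and payload layout are illustrative, not RaftLogEntry's actual format; the key point is the version byte right after the entry type, validated on read so mismatched nodes fail fast instead of deserializing garbage:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of a versioned log-entry frame: [type][version][length][payload].
public class VersionedEntry {
  static final byte FORMAT_VERSION = 1;

  public static byte[] serialize(final byte entryType, final byte[] payload) throws IOException {
    final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    final DataOutputStream out = new DataOutputStream(buf);
    out.writeByte(entryType);
    out.writeByte(FORMAT_VERSION); // lets old/new nodes detect a format mismatch
    out.writeInt(payload.length);
    out.write(payload);
    return buf.toByteArray();
  }

  public static byte[] deserialize(final byte[] bytes, final byte expectedType) throws IOException {
    final DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
    final byte type = in.readByte();
    if (type != expectedType)
      throw new IOException("Unexpected entry type: " + type);
    final byte version = in.readByte();
    if (version != FORMAT_VERSION)
      throw new IOException("Unsupported entry format version: " + version);
    final byte[] payload = new byte[in.readInt()];
    in.readFully(payload);
    return payload;
  }
}
```

During a rolling upgrade, a node that cannot parse version N+1 surfaces a clear "Unsupported entry format version" error rather than silently corrupted state.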
Code Review: Replace custom Raft protocol with Apache Ratis

This is a substantial and well-motivated architectural change. Replacing a home-grown Raft-like implementation with Apache Ratis (battle-tested in Apache Ozone, IoTDB, and Alluxio) is sound engineering. The PR has clearly gone through several review iterations with many issues already resolved. The test coverage - unit, integration, and Docker e2e - is thorough. Below are the remaining issues I found, grouped by severity.

Critical - Should Fix Before Merge

1. Snapshot gap never resolved if leader does not change - …
Fix: After …

2. In the leader catch-up path, …
Fix: Only set …

3. Lost-wakeup race in leadership transfer wait - The leadership transfer loop checks a condition and then calls …
Fix: Replace with a …

4. No serialization format version - Entry type codes are hardcoded with no format version byte. During rolling upgrades, old and new nodes may disagree on deserialization, silently producing corrupted data rather than a clear error.
Fix: Add a 1-byte version field after the entry type byte. Validate at deserialization and fail fast on unrecognized versions.

High - Should Address Soon

5. Snapshot ZIP fully buffered in heap - The entire database ZIP is accumulated in memory before sending.
Fix: Stream the ZIP to the response instead.

6. Effective timeout exceeds the documented 2x - The Javadoc documents the effective timeout as "up to 2x quorumTimeout". However, the second wait creates a fresh deadline rather than using the remaining time from the first wait. Under network delay, operators who dimension their system based on the documented 2x will be surprised by the actual behavior.
Fix: Compute the second wait from the time remaining on the original deadline.

7. Negative replication lag possible - …
Fix: …

8. …
Fix: Check …

9. Health monitor restart loop - If the cause of the Ratis server closing is persistent (port conflict, bad storage path, full disk), the health monitor keeps restarting it.
Fix: Add a consecutive-failure counter with exponential backoff. After N failures, log at SEVERE and stop retrying (require manual intervention).

Security

10. Weak cluster token when root password is empty - …
Fix: Refuse to start HA mode with an auto-derived token when …

11. …
Fix: Use …

12. Snapshot transferred over plain HTTP

Code Quality (per CLAUDE.md conventions)

13. The setting measures a Raft log index gap (number of entries), not time. The word "lag" will mislead operators into thinking it is milliseconds. Consider renaming it.

What Was Already Fixed (well done)

Summary

The PR is architecturally solid and the test coverage is impressive. Issues 1-4 should be addressed before merge. Issues 5-9 can be tracked as follow-up. The rest are polish.
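Issue 9's consecutive-failure counter with exponential backoff could be sketched as below; the class name and the base/cap values are illustrative, not the actual health-monitor code:

```java
// Sketch of a restart policy: exponential backoff with a retry budget,
// instead of unconditionally restarting on every 3-second health check.
public class RestartBackoff {
  private final int maxAttempts;
  private final long baseDelayMs;
  private int failures;

  public RestartBackoff(final int maxAttempts, final long baseDelayMs) {
    this.maxAttempts = maxAttempts;
    this.baseDelayMs = baseDelayMs;
  }

  // Returns the delay before the next restart attempt, or -1 when the
  // failure budget is exhausted and manual intervention is required.
  public long nextDelayMs() {
    if (failures >= maxAttempts)
      return -1;
    final long delay = baseDelayMs << failures; // base * 2^failures
    failures++;
    return delay;
  }

  public void recordSuccess() {
    failures = 0; // a healthy restart resets the budget
  }
}
```

With `maxAttempts = 3` and a 1-second base, attempts come at 1s, 2s, 4s, then the monitor gives up and escalates instead of looping forever on a persistent fault like a port conflict.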
This comes from @robfrank after working on improving HA for many months with poor results. The main reason was the RAFT implementation itself, which I created years ago (my fault) and which was quite limited and poorly tested. Reaching the level of rock-solidity we want might have taken 1 year or more. So @robfrank had the idea to replace our internal RAFT system with Apache Ratis 3.2.2 - a battle-tested, formally correct implementation of the Raft consensus protocol used in production by Apache Ozone, IoTDB, and Alluxio.
After reviewing Apache Ratis' internal architecture, I started the integration from scratch, in parallel with @robfrank's branch ha-redesign. I've taken many tests and some components from that branch.

This change is transparent to users - the HTTP API, database API, query languages, and client libraries remain unchanged. The internal HA protocol now uses gRPC (shaded by Ratis) instead of custom TCP binary messages.
92 files changed, +7,917 / -6,639 lines (net +1,278)
What was removed (34 files)
The entire custom HA stack: HAServer, Leader2ReplicaNetworkExecutor, Replica2LeaderNetworkExecutor, LeaderNetworkListener, ReplicationLogFile, ReplicationProtocol, 21 message classes (TxRequest, CommandForwardRequest, etc.), and related infrastructure.
What was added (22 new files)
Core HA (5 production files in server/ha/ratis/):
Tests (10 files):
Docker e2e tests (7 files, tagged e2e-ha, require Docker):
Key features
New configuration settings
Test plan