Issue:
I’m reporting an issue after my XDC mainnet node suffered what appeared to be a database corruption / bad state condition.
At the moment, I do not have a clean fatal panic line from the logs, but the node behavior became abnormal and appeared stuck in a loop rather than recovering normally. Logs are attached for review.
Environment
- Network: XDC mainnet
- Client version: XDC/v2.6.8-stable/linux-amd64/go1.23.12
- Chain ID: 50
- Consensus: XDPoS
- Observed local header on startup: 90,885,901
- Peers later observed around: 101,033,xxx to 101,036,xxx
What I observed
The node appeared to be far behind network peers and did not recover cleanly. Instead, it repeatedly:
- discarded propagated votes / blocks as “too far away”
- looped on get snapshot from gap block
- looped on promoteExecutables
- continued state download / import activity for a long period
- logged parentHeader is nil, wait block to be writen in disk
The dashboard reported the node's blocks rolling back, and I observed the same behavior on several nodes that had fallen behind the network's current block. At that point I stopped the node and started the recovery process.
Examples:
- Local node loaded header at 90,885,901, while peers were advertising blocks in the 101,034,xxx+ range.
- Repeated messages such as:
  Discarded propagated vote, too far away
  Discarded propagated block, too far away
- Node found a common ancestor and started downloading headers / bodies / receipts / state:
  Found common ancestor
  Downloading block bodies
  Downloading transaction receipts
  Imported new state entries ...
- Repeated loop-like activity:
  get snapshot from gap block number=101,032,650 ...
  start promoteExecutables
  end promoteExecutables
- Repeated message:
  [V2 Hook Penalty] parentHeader is nil, wait block to be writen in disk parentNumber=101,033,352
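For scale, here is a quick calculation of how far the local head was from the block referenced in the last message. It uses only two numbers quoted in this report; the variable names are just for illustration.

```python
# Both values are copied from the log lines quoted in this report.
local_head = 90_885_901       # "Loaded most recent local header"
parent_number = 101_033_352   # parentNumber in the [V2 Hook Penalty] message

# Number of blocks between the local head and the parent the hook was waiting for.
gap = parent_number - local_head
print(f"{gap:,} blocks behind")
```

That is a gap of roughly 10.1 million blocks, which is consistent with propagated votes and blocks being discarded as "too far away".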
Additional observations
There were many inbound/outbound peers from other networks/testnets, but they appear to have been rejected correctly due to genesis mismatch, for example:
Genesis block mismatch - d9b046... (!= 4a9d748...)
Genesis block mismatch - bdea512... (!= 4a9d748...)
Because of that, I do not think those mismatched peers were the root cause, but I’m including it in case it matters.
Why I’m opening this
From the logs, the node does not look like it recovered cleanly from the bad state. It looks more like it became stuck in a repeated snapshot / promoteExecutables / far-behind-sync condition.
The part that concerns me most is the combination of:
- being permanently far behind current peers
- repeated parentHeader is nil, wait block to be writen in disk
- repeated get snapshot from gap block
- repeated promoteExecutables
- no obvious forward recovery to current head from the captured logs
Questions
- Does this pattern indicate known database corruption or state inconsistency behavior in v2.6.8-stable?
- Is the repeated parentHeader is nil, wait block to be writen in disk message expected during recovery, or is it a sign of a broken local chain/state DB?
- Is there a recommended recovery path beyond full resync / snapshot restore?
- Are there known issues around get snapshot from gap block / promoteExecutables loops when a node is recovering from corrupted data?
Relevant log excerpts
Startup / local head:
Starting peer-to-peer node instance=XDC/v2.6.8-stable/linux-amd64/go1.23.12
Loaded most recent local header number=90,885,901
Loaded most recent local fast block number=90,885,901
Far-behind / discarded blocks:
Discarded propagated vote, too far away
Discarded propagated block, too far away
Recovery attempt:
Found common ancestor peer=...
Downloading block bodies origin=101,033,160
Downloading transaction receipts origin=101,033,160
Imported new state entries ...
Suspicious repeated message:
[V2 Hook Penalty] parentHeader is nil, wait block to be writen in disk parentNumber=101,033,352
Loop-like behavior:
get snapshot from gap block number=101,032,650 ...
start promoteExecutables
end promoteExecutables
Wrong-network peers being rejected:
Ethereum handshake failed ... err="Genesis block mismatch ..."
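To put numbers on the "repeated" messages above, a small script along these lines could tally how often each one appears in a saved log. This is only a sketch: the substrings are copied from the excerpts in this report, while the function name `tally` and the log path passed on the command line are hypothetical and not part of the XDC client.

```python
import sys
from collections import Counter
from typing import Iterable

# Substrings copied verbatim from the log excerpts in this report
# (including the client's own spelling of "writen").
PATTERNS = [
    "Discarded propagated vote, too far away",
    "Discarded propagated block, too far away",
    "get snapshot from gap block",
    "start promoteExecutables",
    "parentHeader is nil, wait block to be writen in disk",
    "Genesis block mismatch",
]


def tally(lines: Iterable[str]) -> Counter:
    """Count how many lines contain each known message substring."""
    counts: Counter = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python tally_logs.py <path-to-node-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for pattern, n in tally(fh).most_common():
            print(f"{n:8d}  {pattern}")
```

Running it over the attached logs would show whether the snapshot / promoteExecutables / parentHeader messages dominate the output, which would support the "stuck in a loop" reading.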
Node full recovery:
The node had to be recovered by downloading the snapshot and letting it sync up to the network. As of April 5th, the node has fully recovered and is back to Masternode status; it had been slashed for the entire time the server was out of sync. I have also verified that unused ports are closed and that only authorized IP addresses can reach the server for management, over specific ports.