Possible stalled sync / database inconsistency after corruption event on XDC mainnet node (v2.6.8-stable) #258

@MrBlockchain22

Issue:

I’m reporting an issue after my XDC mainnet node suffered what appeared to be a database corruption / bad state condition.

At the moment, I do not have a clean fatal panic line from the logs, but the node behavior became abnormal and appeared stuck in a loop rather than recovering normally. Logs are attached for review.

Environment

  • Network: XDC mainnet
  • Client version: XDC/v2.6.8-stable/linux-amd64/go1.23.12
  • Chain ID: 50
  • Consensus: XDPoS
  • Observed local header on startup: 90,885,901
  • Peers later observed around: 101,033,xxx to 101,036,xxx range
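To put the gap in perspective, the distance between the local header and the peer range can be worked out directly. This is a minimal sketch in Python. Because the peer heights are only known as 101,033,xxx, it uses 101,033,000 as a conservative lower bound, which is an assumption and not a value taken from the logs. The function names are illustrative.

```python
def parse_height(text: str) -> int:
    """Parse a block height as printed in the logs, e.g. '90,885,901'."""
    return int(text.replace(",", ""))


def blocks_behind(local_head: str, peer_head: str) -> int:
    """Return how many blocks the local head trails the peer head."""
    return parse_height(peer_head) - parse_height(local_head)


# Local header as loaded on startup. The peer height is a lower bound
# standing in for "101,033,xxx" (assumption, not a logged value).
gap = blocks_behind("90,885,901", "101,033,000")
print(f"at least {gap:,} blocks behind")
```

With these inputs the node is at least 10,147,099 blocks behind its peers, which fits the "too far away" discards described below.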

What I observed

The node appeared to be far behind network peers and did not recover cleanly. Instead, it repeatedly:

  • discarded propagated votes / blocks as “too far away”
  • looped on get snapshot from gap block
  • looped on promoteExecutables
  • continued state download / import activity for a long period
  • repeatedly logged parentHeader is nil, wait block to be writen in disk

The Dashboard reported the node's blocks rolling back, and I observed the same behavior on several nodes that had fallen behind the current block on the network. At this point, I stopped the node and started the recovery process.

Examples:

  • Local node loaded header at 90,885,901, while peers were advertising blocks in the 101,034,xxx+ range.

  • Repeated messages such as:

    • Discarded propagated vote, too far away
    • Discarded propagated block, too far away
  • Node found a common ancestor and started downloading headers / bodies / receipts / state:

    • Found common ancestor
    • Downloading block bodies
    • Downloading transaction receipts
    • Imported new state entries ...
  • Repeated loop-like activity:

    • get snapshot from gap block number=101,032,650 ...
    • start promoteExecutables
    • end promoteExecutables
  • Repeated message:

    • [V2 Hook Penalty] parentHeader is nil, wait block to be writen in disk parentNumber=101,033,352

Additional observations

There were many inbound/outbound peers from other networks/testnets, but they appear to have been rejected correctly due to genesis mismatch, for example:

  • Genesis block mismatch - d9b046... (!= 4a9d748...)
  • Genesis block mismatch - bdea512... (!= 4a9d748...)

Because of that, I do not think those mismatched peers were the root cause, but I’m including it in case it matters.
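To check how much of the peer traffic came from other networks, the mismatch lines can be tallied by remote genesis prefix. This is a rough Python sketch. It assumes the log lines contain the text in the form quoted above, and the repeated sample line is only there to illustrate the counting, not taken from the logs.

```python
import re
from collections import Counter

# Matches the form quoted above: "Genesis block mismatch - d9b046... (!= 4a9d748...)"
# and captures the remote peer's genesis hash prefix.
MISMATCH = re.compile(r"Genesis block mismatch - ([0-9a-f]+)")


def tally_remote_genesis(lines):
    """Count handshake rejections per remote genesis hash prefix."""
    counts = Counter()
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample; the third line is a made-up repeat.
sample = [
    "Genesis block mismatch - d9b046... (!= 4a9d748...)",
    "Genesis block mismatch - bdea512... (!= 4a9d748...)",
    "Genesis block mismatch - d9b046... (!= 4a9d748...)",
]
genesis_counts = tally_remote_genesis(sample)
for prefix, count in genesis_counts.most_common():
    print(prefix, count)
```

A small number of distinct prefixes, each rejected at handshake, would support the reading that these peers are noise and not the cause.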

Why I’m opening this

From the logs, the node does not look like it recovered cleanly from the bad state. It looks more like it became stuck in a repeated snapshot / promoteExecutables / far-behind-sync condition.

The part that concerns me most is the combination of:

  • being permanently far behind current peers
  • repeated parentHeader is nil, wait block to be writen in disk
  • repeated get snapshot from gap block
  • repeated promoteExecutables
  • no obvious forward recovery to current head from the captured logs
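To quantify how often each of these messages recurs, a simple counter over the log lines is enough. This is a hedged Python sketch. The message strings are copied from the excerpts in this report (including the client's own spelling "writen"), and the helper name is illustrative.

```python
from collections import Counter

# Message fragments copied from the log excerpts in this report.
PATTERNS = [
    "Discarded propagated vote, too far away",
    "Discarded propagated block, too far away",
    "get snapshot from gap block",
    "start promoteExecutables",
    "parentHeader is nil, wait block to be writen in disk",
]


def count_patterns(lines, patterns=PATTERNS):
    """Count how many lines contain each known message fragment."""
    counts = Counter({p: 0 for p in patterns})
    for line in lines:
        for p in patterns:
            if p in line:
                counts[p] += 1
    return counts


# Illustrative input; in practice, pass an open log file instead.
sample = [
    "Discarded propagated block, too far away",
    "start promoteExecutables",
    "end promoteExecutables",
    "start promoteExecutables",
]
pattern_counts = count_patterns(sample)
for message, count in pattern_counts.items():
    print(f"{count:6d}  {message}")
```

Passing an open file handle for the node's log (the path depends on the local setup) gives per-message totals that can be compared between a healthy node and the affected one.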

Questions

  1. Does this pattern indicate known database corruption or state inconsistency behavior in v2.6.8-stable?
  2. Is the repeated parentHeader is nil, wait block to be writen in disk expected during recovery, or is it a sign of a broken local chain/state DB?
  3. Is there a recommended recovery path beyond full resync / snapshot restore?
  4. Are there known issues around get snapshot from gap block / promoteExecutables loops when a node is recovering from corrupted data?

Relevant log excerpts

Startup / local head:

Starting peer-to-peer node instance=XDC/v2.6.8-stable/linux-amd64/go1.23.12
Loaded most recent local header number=90,885,901
Loaded most recent local fast block number=90,885,901

Far-behind / discarded blocks:

Discarded propagated vote, too far away
Discarded propagated block, too far away

Recovery attempt:

Found common ancestor peer=...
Downloading block bodies origin=101,033,160
Downloading transaction receipts origin=101,033,160
Imported new state entries ...

Suspicious repeated message:

[V2 Hook Penalty] parentHeader is nil, wait block to be writen in disk parentNumber=101,033,352

Loop-like behavior:

get snapshot from gap block number=101,032,650 ...
start promoteExecutables
end promoteExecutables
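One way to tell a loop from slow progress is to check whether the block numbers in these repeated messages ever advance. This is a minimal Python sketch. It assumes the fields appear as key=value with comma-grouped digits, as in the excerpts. A repeated gap block number is not proof of a stall on its own, since the same snapshot may legitimately be read many times, so treat the result as a hint only.

```python
import re


def field_values(lines, message, field):
    """Collect integer values of `field=` from lines containing `message`."""
    pattern = re.compile(rf"\b{re.escape(field)}=([\d,]+)")
    values = []
    for line in lines:
        if message in line:
            match = pattern.search(line)
            if match:
                values.append(int(match.group(1).replace(",", "")))
    return values


def is_stalled(values):
    """True if a height is reported more than once and never changes."""
    return len(values) > 1 and len(set(values)) == 1


# Illustrative input built from the excerpts in this report.
sample = [
    "get snapshot from gap block number=101,032,650 ...",
    "start promoteExecutables",
    "get snapshot from gap block number=101,032,650 ...",
    "[V2 Hook Penalty] parentHeader is nil, wait block to be writen in disk parentNumber=101,033,352",
]
gap_numbers = field_values(sample, "get snapshot from gap block", "number")
parent_numbers = field_values(sample, "parentHeader is nil", "parentNumber")
print(gap_numbers, is_stalled(gap_numbers))
print(parent_numbers, is_stalled(parent_numbers))
```

If parentNumber keeps climbing over time, the node is probably still importing; if it stays fixed across many lines, that points toward the stuck condition described here.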

Wrong-network peers being rejected:

Ethereum handshake failed ... err="Genesis block mismatch ..."

Node full recovery:

The node had to be recovered by downloading the snapshot and letting the node sync up to the network. As of April 5th, the node has fully recovered and is back to Masternode status; it had been slashed for the entire time the server was out of sync. I have also checked that unused ports are closed and that only authorized IP addresses can access the server for management over specific ports.
