Possible stalled sync / database inconsistency after corruption event on XDC mainnet node (v2.6.8-stable)

**Issue:**

I’m reporting an issue after my XDC mainnet node suffered what appeared to be a database corruption / bad state condition.

At the moment, I do **not** have a clean fatal panic line from the logs, but the node behavior became abnormal and appeared stuck in a loop rather than recovering normally. Logs are attached for review. 

## Environment

* **Network:** XDC mainnet
* **Client version:** `XDC/v2.6.8-stable/linux-amd64/go1.23.12`
* **Chain ID:** `50`
* **Consensus:** XDPoS
* **Observed local header on startup:** `90,885,901`
* **Peer heads observed later:** roughly `101,033,xxx` to `101,036,xxx`
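To put a number on how far behind the node was, here is a minimal sketch of the arithmetic using only the heights listed above. The helper names are hypothetical. The masked `xxx` digits are not filled in; they are bounded by `000` and `999`, so the result is a range rather than an exact gap.

```python
from typing import Tuple


def height_bounds(text: str) -> Tuple[int, int]:
    """Return the (lowest, highest) height consistent with a value such as
    '101,033,xxx', where each 'x' is an unknown digit."""
    digits = text.replace(",", "").strip().lower()
    low = int(digits.replace("x", "0"))
    high = int(digits.replace("x", "9"))
    return low, high


def gap_range(local_head: str, peer_low: str, peer_high: str) -> Tuple[int, int]:
    """Smallest and largest possible distance between the local head
    and the range of peer heads."""
    local, _ = height_bounds(local_head)
    lowest_peer, _ = height_bounds(peer_low)
    _, highest_peer = height_bounds(peer_high)
    return lowest_peer - local, highest_peer - local


# Heights taken from the Environment list in this report.
print(gap_range("90,885,901", "101,033,xxx", "101,036,xxx"))
# (10147099, 10151098)
```

So the local header was roughly 10.1 million blocks behind the peers that were observed.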

## What I observed

The node appeared to be far behind network peers and did not recover cleanly. Instead, it repeatedly:

* discarded propagated votes / blocks as “too far away”
* looped on `get snapshot from gap block`
* looped on `promoteExecutables`
* continued state download / import activity for a long period
* repeatedly logged `parentHeader is nil, wait block to be writen in disk`

The Dashboard reported the node's blocks rolling back, and I saw the same behavior on several nodes that had fallen behind the network's current block. At that point I stopped the node and started the recovery process.

Examples:

* Local node loaded header at `90,885,901`, while peers were advertising blocks in the `101,034,xxx+` range.  
* Repeated messages such as:

  * `Discarded propagated vote, too far away`
  * `Discarded propagated block, too far away`  
* Node found a common ancestor and started downloading headers / bodies / receipts / state:

  * `Found common ancestor`
  * `Downloading block bodies`
  * `Downloading transaction receipts`
  * `Imported new state entries ...`  
* Repeated loop-like activity:

  * `get snapshot from gap block number=101,032,650 ...`
  * `start promoteExecutables`
  * `end promoteExecutables`   
* Repeated message:

  * `[V2 Hook Penalty] parentHeader is nil, wait block to be writen in disk parentNumber=101,033,352` 
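To quantify how often each of these messages repeats, a small script along the following lines could be run over the attached logs. It is only a sketch: the substrings are copied from the messages quoted above, while the labels, function names, and log path are hypothetical and not part of the XDC client.

```python
from collections import Counter
from typing import Iterable

# Substrings copied verbatim from the log messages quoted in this issue.
# "writen" is spelled the way the client logs it.
PATTERNS = {
    "vote_too_far": "Discarded propagated vote, too far away",
    "block_too_far": "Discarded propagated block, too far away",
    "gap_snapshot": "get snapshot from gap block",
    "promote_start": "start promoteExecutables",
    "parent_nil": "parentHeader is nil, wait block to be writen in disk",
    "state_import": "Imported new state entries",
}


def count_patterns(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each known message."""
    counts: Counter = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return counts


def summarize_log(path: str) -> Counter:
    """Count the known messages in a log file.

    `path` is wherever the node's log was saved (hypothetical location).
    """
    with open(path, encoding="utf-8", errors="replace") as fh:
        return count_patterns(fh)
```

For example, `summarize_log("xdc-node.log").most_common()` would list the messages from most to least frequent. A `parent_nil` or `gap_snapshot` count that keeps growing while the head does not move would support the loop hypothesis.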

## Additional observations

There were many inbound/outbound peers from other networks/testnets, but they appear to have been rejected correctly due to genesis mismatch, for example:

* `Genesis block mismatch - d9b046... (!= 4a9d748...)`
* `Genesis block mismatch - bdea512... (!= 4a9d748...)`   

Because of that, I do **not** think those mismatched peers were the root cause, but I’m including it in case it matters.

## Why I’m opening this

From the logs, the node does not look like it recovered cleanly from the bad state. It looks more like it became stuck in a repeated snapshot / promoteExecutables / far-behind-sync condition.

The part that concerns me most is the combination of:

* being permanently far behind current peers
* repeated `parentHeader is nil, wait block to be writen in disk`
* repeated `get snapshot from gap block`
* repeated `promoteExecutables`
* no obvious forward recovery to current head from the captured logs   
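One way to check for forward recovery independently of the logs is to sample the node's head over JSON-RPC and see whether it rises without rolling back. The sketch below uses the standard Ethereum `eth_blockNumber` method. It assumes the node's HTTP RPC endpoint is enabled, and the URL, function names, and sampling interval are placeholders rather than XDC defaults.

```python
import json
import time
import urllib.request
from typing import List


def parse_block_number(hex_result: str) -> int:
    """Convert a JSON-RPC hex quantity such as '0x10' to an int."""
    return int(hex_result, 16)


def is_advancing(samples: List[int], min_gain: int = 1) -> bool:
    """True if the head never decreased between consecutive samples and
    gained at least `min_gain` blocks from the first to the last sample."""
    if len(samples) < 2:
        return False
    never_rolled_back = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_rolled_back and samples[-1] - samples[0] >= min_gain


def fetch_block_number(url: str) -> int:
    """Ask the node for its current head via eth_blockNumber."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_block_number(json.load(resp)["result"])


def watch(url: str, samples: int = 5, interval: float = 30.0) -> bool:
    """Sample the head `samples` times, `interval` seconds apart."""
    heights = []
    for i in range(samples):
        heights.append(fetch_block_number(url))
        if i < samples - 1:
            time.sleep(interval)
    return is_advancing(heights)


# Example (placeholder URL; adjust to the node's actual RPC address):
# print(watch("http://127.0.0.1:8545"))
```

A `False` result caused by a decreasing sample would match the rollback the Dashboard reported. Note that the reported head may stay flat for a while during a state download, so a short sampling window alone is not conclusive.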

## Questions

1. Does this pattern indicate known database corruption or state inconsistency behavior in `v2.6.8-stable`?
2. Is the repeated `parentHeader is nil, wait block to be writen in disk` expected during recovery, or is it a sign of a broken local chain/state DB?
3. Is there a recommended recovery path beyond full resync / snapshot restore?
4. Are there known issues around `get snapshot from gap block` / `promoteExecutables` loops when a node is recovering from corrupted data?

## Relevant log excerpts

Startup / local head:

```text
Starting peer-to-peer node instance=XDC/v2.6.8-stable/linux-amd64/go1.23.12
Loaded most recent local header number=90,885,901
Loaded most recent local fast block number=90,885,901
```

Far-behind / discarded blocks:

```text
Discarded propagated vote, too far away
Discarded propagated block, too far away
```

Recovery attempt:

```text
Found common ancestor peer=...
Downloading block bodies origin=101,033,160
Downloading transaction receipts origin=101,033,160
Imported new state entries ...
```

Suspicious repeated message:

```text
[V2 Hook Penalty] parentHeader is nil, wait block to be writen in disk parentNumber=101,033,352
```

Loop-like behavior:

```text
get snapshot from gap block number=101,032,650 ...
start promoteExecutables
end promoteExecutables
```

Wrong-network peers being rejected:

```text
Ethereum handshake failed ... err="Genesis block mismatch ..."
```

## Recovery

The node had to be recovered by downloading the snapshot and letting it sync up to the network. As of April 5th, the node has fully recovered and is back to Masternode status; it had been slashed for the entire time the server was out of sync. I have also verified that unused ports are closed and that only authorized IP addresses can access the server for management over specific ports.
