
avoid blocking main thread on reconstruction #8346

Draft

arnetheduck wants to merge 4 commits into unstable from async-recon

Conversation

@arnetheduck
Member

When running column reconstruction, the current approach blocks until timeout or until the reconstruction is finished, delaying other processing that should be happening at the same time.

With a TSP, we can allow async processing to continue while the reconstruction is done in the background, while also taking care not to stuff the thread pool with tasks (since it's shared) - a sketch of the pattern follows the list below.

Further work is needed to enable reconstruction, in particular:

  • Reconstruction should likely not run in onSlotEnd, ie it should probably start somewhere close to the arrival of 50%+ of the columns
  • As reconstruction is running, we might receive columns from the network - we should top up the known columns before creating new tasks so as to avoid unnecessary recomputations
  • We should be distributing the computed columns incrementally on the network instead of just saving them to the DB - this of course relies on computing them earlier - care must be taken to not start this process "too" early, ie before a new head has been established by sufficient attestations.
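For illustration only, here is a minimal sketch of the non-blocking pattern in Python/asyncio (the PR itself is Nim; `shared_pool`, `reconstruct_columns`, and the concurrency cap are hypothetical stand-ins, not the PR's actual code): the CPU-heavy work is offloaded to a shared pool while the in-flight count is capped so other users of the pool keep making progress.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for illustration; the actual PR is Nim (chronos/taskpools).
shared_pool = ThreadPoolExecutor(max_workers=4)  # shared with other subsystems
recon_slots = asyncio.Semaphore(2)               # cap in-flight reconstructions

def reconstruct_columns(known: dict[int, bytes]) -> dict[int, bytes]:
    # Placeholder for the CPU-heavy erasure-coding recovery of missing columns.
    return known

async def reconstruct_async(known: dict[int, bytes]) -> dict[int, bytes]:
    # Limit how many reconstruction tasks occupy the shared pool at once.
    async with recon_slots:
        loop = asyncio.get_running_loop()
        # The event loop keeps serving gossip, attestations, etc. meanwhile.
        return await loop.run_in_executor(shared_pool, reconstruct_columns, known)
```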

@arnetheduck arnetheduck requested a review from agnxsh April 27, 2026 09:41
@github-actions

github-actions Bot commented Apr 27, 2026

Unit Test Results

|          | this PR                                 | vs base                          |
|----------|-----------------------------------------|----------------------------------|
| files    | 9                                       | -3                               |
| suites   | 2 124                                   | -708                             |
| duration | 1h 0m 9s ⏱️                             | -29m 3s                          |
| tests    | 15 921 (14 356 ✔️, 1 565 💤, 0 failed)  | ±0                               |
| runs     | 57 518 (55 856 ✔️, 1 662 💤, 0 failed)  | -19 174 (-19 096 ✔️, -78 💤, ±0) |

Results for commit b55fd07. ± Comparison against base commit 1acfee6.

♻️ This comment has been updated with latest results.

@agnxsh
Contributor

agnxsh commented Apr 28, 2026

The changes look reasonable to me. However, following some discussions over the last month or so, we have decided to use reconstruction in a different way:

  1. With some optimizations such as signature caching and batched KZG verification at gossip time, we have decided to expand the column quarantine to 128 columns, thereby removing the need for slot-by-slot reconstruction of columns.

  2. During syncing, to ease the process, we have come up with a few stability modes where we downshift our custody columns to as low as the minimum (especially during a turbulent sync), change what we can serve in the ENR and metadata accordingly, and update the earliest available slot from which we can reliably serve data. This lets us sync part of history as a lightweight full node and then upshift to a supernode (or as per validator balance) once we have synced.

  3. As per current data, once we are stable most of the 128 columns are provided by getBlobsV2 - almost 9/10 slots on Hoodi and 7/10 slots on Mainnet.

Coming to real-world use cases, one would need reconstruction to be able to serve blob data to L2s. For that we always need the first half of the 128 columns, which should enable us to serve blob data without having to reconstruct. We currently do this with our lightsupernode flag, which does custody the first-half columns.

With all of these things in place, we've decided to make reconstruction a sort of lazy backfill mechanism where we maintain a slot-keyed recovery matrix in the backfilling service (see the sketch below). This would take multiple slots to slowly reconstruct data column sidecars for every slot, going back as far as the blob retention period. Until then we adjust our earliest_available_slot to the last stability period we achieved (i.e. the slot from which we could reliably use getBlobs / gossip validation / short-range sync for 128 columns, in case we actually do custody all 128).
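A rough sketch of what such a slot-keyed recovery matrix could look like (hypothetical Python names, not the nimbus-eth2 code; the 64-of-128 threshold reflects the erasure-coding recovery property discussed in this thread):

```python
from dataclasses import dataclass, field

NUM_COLUMNS = 128
RECOVERY_THRESHOLD = NUM_COLUMNS // 2  # any 64 of 128 columns recover the rest

@dataclass
class SlotRecovery:
    have: set[int] = field(default_factory=set)  # column indices already held
    done: bool = False                           # all sidecars reconstructed?

# slot -> recovery state; the backfiller walks this backwards through
# the blob retention period, reviving one slot at a time.
recovery_matrix: dict[int, SlotRecovery] = {}

def next_backfill_slot(head_slot: int, retention_slots: int) -> int | None:
    # Most recent incomplete slot that already has enough columns to recover.
    for slot in range(head_slot, head_slot - retention_slots, -1):
        entry = recovery_matrix.get(slot)
        if entry and not entry.done and len(entry.have) >= RECOVERY_THRESHOLD:
            return slot
    return None
```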

@agnxsh
Contributor

agnxsh commented Apr 28, 2026

> lazy backfill mechanism where we maintain a slot-keyed recovery matrix in the backfilling service; this would take multiple slots to slowly reconstruct data column sidecars for every slot

So instead of spawning a task for every blob index (based on availability and other factors, of course), we can actually do one task for one blob index per slot and keep filling the recovery matrix buffer (and I think we can take the main thread; it won't be too harmful). That way, every 20-odd seconds we are gradually able to revive one non-reconstructed slot (see the sketch below).

With the onset of slot availability advertisements, reconstruction has turned out to be almost altruistic in today's network: on mainnet, 3/10 slots are worth reconstructing if getBlobs is well utilised, so we might as well take decently long to reconstruct for these 3 slots?
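Building on the matrix sketch above, the pacing itself could be a trivial loop (again hypothetical Python; `current_head_slot`, `retention_slots`, and `reconstruct_and_store` are placeholders for whatever the backfilling service exposes):

```python
import asyncio

async def lazy_backfill_loop(pace_seconds: float = 20.0) -> None:
    # One reconstruction step per interval: steady, negligible background load.
    while True:
        slot = next_backfill_slot(current_head_slot(), retention_slots())
        if slot is not None:
            await reconstruct_and_store(slot)  # hypothetical; offloads as sketched earlier
            recovery_matrix[slot].done = True
        await asyncio.sleep(pace_seconds)
```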

@arnetheduck
Member Author

> expand the column quarantine to 128 columns

If this is about waiting for the columns to arrive from the network, I'd assume we need to run a reconstruction anyway in case some columns are not available. More importantly, we must start attesting as soon as we have 64 columns and not wait for all 128, since the full set may take time to arrive, if it arrives at all.

> and I think we can take the main thread

I don't think we should be running any computation of this nature on the main thread, ie it's perfectly runnable on a separate thread with negligible overhead -> perfect candidate.

> so we might as well take decently long to reconstruct for these 3 slots?

From a network health perspective, if we're going to be reconstructing we might as well do it right and publish the results on the network - the additional effort is negligible.

This indeed implies running the reconstruction incrementally and continuously, but there's no reason not to start sooner rather than later, using the technique in this PR, which interleaves reconstruction work with other taskpool work (see the sketch below). Importantly, we should never block the main thread with these kinds of loads.
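To make the top-up-then-publish flow concrete, a hedged sketch reusing the hypothetical names from the earlier snippets (`quarantine_columns`, `fetch_columns`, and `publish_column` are likewise placeholders, not real nimbus-eth2 APIs):

```python
async def reconstruct_and_publish(slot: int) -> None:
    entry = recovery_matrix[slot]
    # Top up with columns that arrived from gossip while we were queued,
    # so we don't recompute what the network already delivered.
    entry.have |= quarantine_columns(slot)           # hypothetical lookup
    if len(entry.have) < RECOVERY_THRESHOLD:
        return                                       # not enough to recover yet
    recovered = await reconstruct_async(fetch_columns(slot, entry.have))
    for index, column in recovered.items():
        if index not in entry.have:
            # Distribute incrementally on the network, not just save to the DB.
            await publish_column(slot, index, column)  # hypothetical gossip publish
    entry.done = True
```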

@arnetheduck arnetheduck marked this pull request as draft May 1, 2026 04:55
