avoid blocking main thread on reconstruction #8346
arnetheduck wants to merge 4 commits into unstable
When running column reconstruction, the current approach blocks until timeout or until the reconstruction is finished, delaying other processing that should be happening at the same time.

With a TSP, we can allow async processing to continue while the reconstruction is done in the background, also taking care not to stuff the thread pool with tasks (since it's shared).

Further work is needed to enable reconstruction, in particular:

* Reconstruction should likely not run in onSlotEnd, i.e. it should probably start somewhere close to the arrival of 50+% of the columns
* As reconstruction is running, we might receive columns from the network - we should top up the known columns before creating new tasks so as to avoid unnecessary recomputations
* We should be distributing the computed columns incrementally on the network instead of just saving them to the DB - this of course relies on computing them earlier - care must be taken to not start this process "too" early, i.e. before a new head has been established by sufficient attestations.
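The pattern described above (keep the async loop running, push the heavy work to a shared pool, and cap how many tasks are queued at once) can be sketched roughly as follows. This is an illustrative Python analogue, not the code in this PR: `reconstruct_row`, `MAX_IN_FLIGHT` and the pool size are hypothetical, and `time.sleep` stands in for the real CPU-bound recovery.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 2  # hypothetical cap so reconstruction never fills the shared pool


def reconstruct_row(row: int) -> int:
    """Stand-in for the CPU-heavy recovery of one row (hypothetical)."""
    time.sleep(0.05)  # placeholder for the real work
    return row


async def reconstruct_in_background(pool, rows):
    """Run the recovery on the pool while the event loop keeps serving other tasks."""
    loop = asyncio.get_running_loop()
    gate = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def one(row):
        # Only MAX_IN_FLIGHT tasks are handed to the shared pool at any time.
        async with gate:
            return await loop.run_in_executor(pool, reconstruct_row, row)

    return await asyncio.gather(*(one(r) for r in rows))


async def heartbeat(stop: asyncio.Event) -> int:
    """Other async processing that must not be delayed by reconstruction."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def main():
    stop = asyncio.Event()
    with ThreadPoolExecutor(max_workers=4) as pool:
        hb = asyncio.create_task(heartbeat(stop))
        rows = await reconstruct_in_background(pool, range(6))
        stop.set()
        ticks = await hb
    return rows, ticks
```

Running `asyncio.run(main())` returns the reconstructed rows in order together with the number of heartbeat ticks that ran in the meantime; a blocking implementation would let the heartbeat tick at most once.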
Unit Test Results: 9 files (-3), 2 124 suites (-708), 1h 0m 9s ⏱️ (-29m 3s). Results for commit b55fd07. ± Comparison against base commit 1acfee6.
The changes look reasonable to me. However, after some discussions over the last month or so, we have decided to use reconstruction in a different way.
Coming to real-world use cases: one would need reconstruction to be able to serve blob data to L2s. For that we always need the first half of the 128 columns, which should let us serve blob data without having to reconstruct. We currently do this with our lightsupernode flag, which custodies the first half of the columns. With all of these things in place, we've decided to make reconstruction a sort of lazy backfill mechanism, where we maintain a slot-keyed recovery matrix in the backfilling service. This would essentially take multiple slots to slowly reconstruct the data column sidecars for every slot, working backwards up to the blob retention period. Until then, we adjust our earliest_available_slot to the last stability period that we achieved (i.e., the slot from which we could reliably use getBlobs / gossip validate / short-range sync 128 columns, in case we actually do custody all 128).
So instead of spawning a task for every blob idx (based on availability and other factors, of course), we can do 1 task for 1 blob idx per slot and keep filling the recovery matrix buffer (and I think we can use the main thread for this, it won't be too harmful), so that every 20-ish seconds we are gradually able to revive 1 non-reconstructed slot. With the onset of slot availability advertisements, reconstruction has turned out to be almost altruistic in today's network: on mainnet, 3 in 10 slots are worth reconstructing if getBlobs is well utilised, so we might as well take a decently long time to reconstruct those 3 slots?
If this is about waiting for the columns to arrive from the network, I'd assume we need to run a reconstruction anyway in case some columns are not available. More importantly, we must start attesting as soon as we have 64 and not wait for all to arrive, since the full 128 may take time to arrive, if at all (and attestations can go out as soon as we have 64).
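The 64-of-128 rule in the paragraph above can be written down as a small decision function. This is a simplified sketch: the names `next_action` and `Action` are hypothetical, and a real client's availability check also depends on its custody requirements.

```python
from enum import Enum

NUM_COLUMNS = 128
RECOVERY_THRESHOLD = NUM_COLUMNS // 2  # 64 columns, per the comment above


class Action(Enum):
    WAIT = "wait for more columns"
    ATTEST_AND_RECONSTRUCT = "attest now, reconstruct the rest in the background"
    ATTEST = "attest now, nothing left to reconstruct"


def next_action(columns_held: int) -> Action:
    """Decide what to do given how many distinct columns have arrived so far."""
    if columns_held < RECOVERY_THRESHOLD:
        return Action.WAIT
    if columns_held < NUM_COLUMNS:
        # Enough to attest; the missing columns are recovered off the main thread.
        return Action.ATTEST_AND_RECONSTRUCT
    return Action.ATTEST
```

The point of separating the two outcomes is that attesting never waits on reconstruction: once the threshold is reached, the remaining work is purely background.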
I don't think we should be running any computation of this nature on the main thread, ie it's perfectly runnable on a separate thread with negligible overhead -> perfect candidate.
From a network health perspective, if we're going to be reconstructing, we might as well do it right and publish the results on the network - the additional effort is negligible. This indeed implies running the reconstruction incrementally and continuously, but there's no reason not to start sooner rather than later, using the technique in this PR, which interleaves reconstruction work with other taskpool work. Importantly, we should never block the main thread with these kinds of loads.
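As a rough illustration of running the reconstruction incrementally, the sketch below works in small batches and tops up the set of known columns with anything that arrived from the network before scheduling the next batch, so those columns are not recomputed. All names (`reconstruct_incrementally`, `arrivals`, `batch_size`) are hypothetical, and appending to `recomputed` stands in for the actual recovery and publishing.

```python
NUM_COLUMNS = 128


def next_batch(known: set[int], batch_size: int) -> list[int]:
    """Pick the next missing column indices given everything known right now."""
    missing = [c for c in range(NUM_COLUMNS) if c not in known]
    return missing[:batch_size]


def reconstruct_incrementally(known, arrivals, batch_size=16):
    """Recover missing columns batch by batch.

    `arrivals` is an iterator yielding the set of column indices received
    from the network since the previous batch (hypothetical interface).
    Assumes `known` already holds enough columns for recovery to be possible.
    """
    known = set(known)
    recomputed = []
    while len(known) < NUM_COLUMNS:
        known |= next(arrivals, set())  # top up before creating new work
        batch = next_batch(known, batch_size)
        recomputed.extend(batch)  # stand-in for recovery and publishing
        known.update(batch)
    return recomputed
```

For example, starting from columns 0..63 with columns 64..79 and then 120..121 arriving between batches, only the 46 columns that never arrived from the network are recomputed.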