fix: optimize the signing of attestations #834
petarjuki7 wants to merge 8 commits into sigp:unstable
Conversation
Claude finished @petarjuki7's task in 2m 30s

Code Review Complete

This PR implements a solid optimization by batching attestation signing at the SSV committee level. The refactoring correctly identifies that validators in the same committee can share a single QBFT consensus decision, reducing redundant consensus rounds. The code is well-structured, and the parallel processing approach is appropriate. However, several issues need to be addressed before merging.

Critical Issues
1. Panic Risk: Unsafe Array Indexing
anchor/validator_store/src/lib.rs (outdated)

```rust
        signing_root,
        slot,
    )
    .await?;
```
Attestation validation is very time-sensitive. In this loop over attestations, we `.await` each signature before moving on to the next, i.e. sequential processing, whereas it was previously done in parallel via `join_all`. We could turn this back into parallel futures with `join_all` as a fix.
I think at that point, though, we'd just be recreating the pre-PR code, but with some extra steps and complexity.
Other thoughts:
The gains from committee grouping are fewer calls to `get_validator_and_cluster`, `get_voting_context`, and similar, but these are in-memory lookups. And `decide_instance` is already engineered to handle multiple calls to the same instance gracefully.
Given that, I'm not sure the added complexity (`sign_committee_attestations`, the committee grouping `HashMap`, the `AttestationData` type alias) justifies itself. What do you think about keeping the per-validator approach? And definitely let me know if I'm overlooking some solid gains we'd get from this PR.
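The sequential-vs-parallel point above can be sketched with a thread-based stand-in (a sketch only; the real code is async, where `futures::future::join_all` plays the role of spawning all the work before joining; `sign_one` and the validator IDs here are hypothetical):

```rust
use std::thread;

// Hypothetical stand-in for a per-validator signing call; the real code
// awaits `collect_signature` for each validator.
fn sign_one(validator_id: u64) -> u64 {
    validator_id * 2 // dummy "signature"
}

// Sequential: each signature is produced before the next one starts,
// mirroring `.await`-ing inside the loop.
fn sign_sequential(validators: &[u64]) -> Vec<u64> {
    validators.iter().map(|v| sign_one(*v)).collect()
}

// Parallel: all work is spawned up front and joined at the end,
// the thread-based analogue of `futures::future::join_all`.
fn sign_parallel(validators: &[u64]) -> Vec<u64> {
    let handles: Vec<_> = validators
        .iter()
        .map(|v| {
            let v = *v;
            thread::spawn(move || sign_one(v))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let validators = [1, 2, 3];
    // Both orderings produce the same results; only the latency differs.
    assert_eq!(sign_sequential(&validators), sign_parallel(&validators));
}
```

With N validators and per-signature latency t, the sequential loop costs roughly N*t of wall time, while the spawn-all-then-join shape costs roughly t.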
Thanks for the comment! The original idea was that, since operations like deciding on a vote and collecting the signatures are performed on a per-committee basis, it would make sense to group them by committee. Also, you wouldn't need a QBFT instance per validator, as it's the same for everyone in the committee. But I agree that `decide_instance` is engineered to handle multiple calls to the same instance.
Some overhead is avoided, like fewer `DashMap` lookups, fewer oneshot channels, etc., but these are probably cheap operations.
Would like to hear @dknopik's input; if we agree the issue wasn't that significant, I am okay with closing the PR.
Actually, this made me realize that the pre-PR code is also flawed: if even one involved SSV committee is not able to sign (e.g. due to offline operators), the other signing futures will complete, but the results will not be processed until those other signings time out. :/
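The flaw described here is essentially "wait for all" versus "process as they complete". A minimal std-only sketch, using threads and a channel as a stand-in for a stream / `FuturesUnordered` (the committee IDs and `sign_committee` below are hypothetical):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical committee signing task: committee 2 is "slow"
// (e.g. offline operators) and eventually fails.
fn sign_committee(id: u64) -> Result<u64, String> {
    if id == 2 {
        thread::sleep(Duration::from_millis(200)); // simulated timeout
        Err(format!("committee {id} timed out"))
    } else {
        Ok(id * 10) // dummy aggregated signature
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for id in [1, 2, 3] {
        let tx = tx.clone();
        thread::spawn(move || {
            let _ = tx.send((id, sign_committee(id)));
        });
    }
    drop(tx);

    // Results are handled as they arrive (the async analogue would be a
    // stream / `FuturesUnordered`), so fast committees are processed
    // without waiting for the slow one to time out first.
    let mut processed = Vec::new();
    for (id, res) in rx {
        if let Ok(sig) = res {
            processed.push((id, sig));
        }
    }
    assert_eq!(processed.len(), 2);
}
```

A `join_all`-style gather would instead hold all three results back until the 200ms failure resolved; the channel loop hands over committees 1 and 3 as soon as they finish.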
Do you know if it has been resolved?
anchor/validator_store/src/lib.rs (outdated)

```rust
let signature = match self
    .collect_signature(
        PartialSignatureKind::PostConsensus,
        Role::Committee,
        CollectionMode::Committee {
            num_signatures_to_collect,
            base_hash: data_hash,
        },
        &validator,
        &cluster,
        signing_root,
        slot,
    )
    .await
```
This will block, as it waits for the `collect_signature` calls for the other validators, which would only happen in the next iterations, so this deadlocks.
Therefore, we need a variant that can register for multiple validators.
Likely, this is why the testnet fails.
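The register-for-multiple-validators idea can be sketched with a toy quorum collector: register every validator's waiter first, then await, assuming a collector that only resolves once all expected partials have arrived (all names here are hypothetical, not the real `collect_signature` API):

```rust
use std::sync::mpsc;

// Toy collector: completes only once partials from all expected
// validators are present (a stand-in for Committee-mode collection).
struct Collector {
    expected: usize,
    partials: Vec<u64>,
    waiters: Vec<mpsc::Sender<u64>>,
}

impl Collector {
    fn new(expected: usize) -> Self {
        Self { expected, partials: Vec::new(), waiters: Vec::new() }
    }

    // Register a waiter and submit a partial; every waiter is notified
    // once the quorum is reached. Blocking on the returned receiver
    // before the other validators have registered would deadlock.
    fn register(&mut self, partial: u64) -> mpsc::Receiver<u64> {
        let (tx, rx) = mpsc::channel();
        self.partials.push(partial);
        self.waiters.push(tx);
        if self.partials.len() == self.expected {
            let combined: u64 = self.partials.iter().sum();
            for w in &self.waiters {
                let _ = w.send(combined);
            }
        }
        rx
    }
}

fn main() {
    let mut collector = Collector::new(3);
    // Batch variant: register every validator up front...
    let receivers: Vec<_> = [1, 2, 3].iter().map(|p| collector.register(*p)).collect();
    // ...then await the results. Receiving inside the registration loop
    // would block forever on the first validator.
    for rx in receivers {
        assert_eq!(rx.recv().unwrap(), 6);
    }
}
```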
Do you know if it has been resolved?
anchor/validator_store/src/lib.rs (outdated)

```diff
@@ -2561,46 +2721,55 @@ impl<T: SlotClock, E: EthSpec> ValidatorStore for AnchorValidatorStore<T, E> {

     async fn sign_attestations(
```
Could we add a regression test for committee batch failure isolation?
Suggested scenario:
- one committee with 2+ validators
- inject one validator prep failure (e.g. missing index/share/decrypt failure)
- assert healthy validator(s) in the same committee are still signed and returned
This would lock in the intended behavior for the new batch path.
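A self-contained skeleton of that scenario, with a toy `sign_committee_batch` standing in for the real batch path (the names and return shapes below are hypothetical):

```rust
use std::collections::HashMap;

// Toy batch signer: prep can fail per validator; a failure must not
// prevent healthy validators in the same committee from being signed.
fn sign_committee_batch(
    validators: &[(&str, bool)],
) -> HashMap<String, Result<u64, String>> {
    validators
        .iter()
        .map(|(pk, healthy)| {
            let res = if *healthy {
                Ok(42) // dummy signature
            } else {
                Err("missing share".to_string()) // injected prep failure
            };
            (pk.to_string(), res)
        })
        .collect()
}

fn main() {
    // One committee with two validators, one of which fails prep.
    let out = sign_committee_batch(&[("v1", true), ("v2", false)]);
    // Failure isolation: the healthy validator is still signed and
    // returned alongside the per-validator error.
    assert!(out["v1"].is_ok());
    assert!(out["v2"].is_err());
}
```

The real regression test would inject the failure at the prep stage (missing index/share/decrypt failure) and assert on the stream of returned signatures, but the shape of the assertion is the same.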
Non-blocking: suggested follow-up split. I'd prefer this as a separate refactor PR so this optimization PR stays easy to review.
We should perhaps mark this PR as draft until the changes from sigp/lighthouse#8880 are incorporated. To be unblocked, I would recommend updating the lighthouse packages in
@shane-moore thanks for the heads-up. Marked it as a draft. @petarjuki7, could you please address @shane-moore's suggestion?
@diegomrsantos @shane-moore So, I changed the lighthouse version to point to the newest commit on this new PR (sigp/lighthouse#8880). The main change is that I incorporated streams into the
CI is failing right now. Can you get the workflow jobs passing first? Let's agree that PRs are ready for review only when all CI checks are green.
@claude can you re-review this?
Claude finished @petarjuki7's task in 2m 59s

Code Review: PR #834 — Optimize Attestation Signing

The PR has improved substantially since the first review. The helper extraction (

Remaining issues:
1. Redundant
Prepares `unstable` for PR #834 (attestation batching optimization) by migrating to the updated lighthouse `ValidatorStore` trait.

- Update the Lighthouse dependency to the `shane-moore/lighthouse` fork [PR](sigp/lighthouse#8880) with streaming `ValidatorStore` trait fn return types.
- **Note** that [PR](sigp/lighthouse#8880) must be merged on the lh side, and anchor updated to use `sigp/lighthouse` for dependencies, before we merge this code to our `stable` branch.
- Implement four streaming trait methods to replace the existing fns: `sign_aggregate_and_proofs`, `sign_sync_committee_signatures`, `sign_sync_committee_contributions`, and `sign_attestations`.

The impls now have naive stream wrappers, and the functionality is equivalent to the previous functions in `unstable`. The idea is that we can have separate PRs built on top of this base that will refactor these trait methods to stream committee-based batched signing.

Co-Authored-By: shane-moore <skm1790@gmail.com>
4c1ea3b to 9f7eb54
I'm trying to think of what we would need to extend, but this is getting rather close to re-implementing the same functionality as already done in
I think, similar to the concerns in this issue, we want to avoid refactors for performance without metrics supporting that they are legit bottlenecks. It would be good to add a GH issue to add metrics around this signing flow so we can do proper profiling. Then, if justified, we can consider a

Just my thoughts. Open to other suggestions as well; perhaps I'm missing some reason why
```rust
) {
    Ok(msg) => msg,
    Err(err) => {
        error!(%err, "Failed to create batch partial signature message");
```
If `create_message` fails here, we return without notifying the registered oneshot waiters. That later surfaces as per-validator "Failed to receive signature" warnings, which obscures the root cause. Could we propagate this batch-creation error to each waiter (or at least add explicit log linkage)?
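One way to sketch the suggested fix: hand each registered waiter the batch-creation error instead of silently dropping the channels (std `mpsc` stands in for the oneshot channels; `run_batch` and its shape are hypothetical):

```rust
use std::sync::mpsc;

// Toy batch flow: waiters are registered first; if building the batch
// message fails, each waiter receives the error instead of being
// dropped silently (which would surface later as opaque per-validator
// receive failures).
fn run_batch(fail_create: bool, waiters: Vec<mpsc::Sender<Result<u64, String>>>) {
    if fail_create {
        // Propagate the batch-creation error to every registered waiter.
        for w in &waiters {
            let _ = w.send(Err("failed to create batch message".to_string()));
        }
        return;
    }
    // Happy path: every waiter gets its (dummy) signature.
    for w in &waiters {
        let _ = w.send(Ok(7));
    }
}

fn main() {
    let (tx1, rx1) = mpsc::channel();
    let (tx2, rx2) = mpsc::channel();
    run_batch(true, vec![tx1, tx2]);
    // Each waiter sees the root cause rather than a dead channel.
    assert!(rx1.recv().unwrap().is_err());
    assert!(rx2.recv().unwrap().is_err());
}
```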
I am proposing in #834 (comment) that we don't modify signature collector for reasons described in the comment. If I'm missing something and signature collector mods are necessary, I'm certainly open to it
Tbh, I reviewed it before fully understanding your comment. It's hard to keep the full context in mind when someone isn't too familiar with this part of the code. But I'll do it.
There was a problem hiding this comment.
I get that! I spent some hours myself building context in order to formulate that comment, because I was less familiar with the signature collector. In a nutshell, the validator store changes are something we definitely want to get in for the boole fork, because they will enable us to send committee-based batches of attestations to the beacon node. Currently, we don't have the batches, and a slow committee's post-consensus sig collection will block faster committees'. I've convinced myself we can accomplish the batching we need without changes to the sig collector logic, but I could of course be wrong. On your end, no rush, and feel free to wait for the next round of changes before reviewing further.
```rust
match self.get_validator_and_cluster(pubkey) {
    Ok((_, cluster)) => {
        committee_mapping
            .entry(cluster.committee_id())
```
Safety check: the grouping key is only `committee_id`, while `sign_committee_attestations` derives the slot/data from the first item. The current caller appears to be single-slot, but should we either key by `(committee_id, slot)` or assert slot consistency within each group to prevent future misuse?
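The suggested `(committee_id, slot)` key could look like this minimal sketch (the `Attestation` struct and its fields are hypothetical stand-ins for the real types):

```rust
use std::collections::HashMap;

// Hypothetical attestation record; the real code derives the slot/data
// from the first item in each group.
#[derive(Debug, Clone, PartialEq)]
struct Attestation {
    committee_id: u64,
    slot: u64,
    validator: &'static str,
}

// Key by (committee_id, slot) so a future multi-slot caller cannot
// silently mix attestations for different slots into one batch.
fn group_attestations(
    atts: Vec<Attestation>,
) -> HashMap<(u64, u64), Vec<Attestation>> {
    let mut groups: HashMap<(u64, u64), Vec<Attestation>> = HashMap::new();
    for att in atts {
        groups.entry((att.committee_id, att.slot)).or_default().push(att);
    }
    groups
}

fn main() {
    let atts = vec![
        Attestation { committee_id: 1, slot: 100, validator: "a" },
        Attestation { committee_id: 1, slot: 101, validator: "b" },
        Attestation { committee_id: 1, slot: 100, validator: "c" },
    ];
    let groups = group_attestations(atts);
    // Same committee, different slot => different batch.
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[&(1, 100)].len(), 2);
}
```

The alternative (keying by `committee_id` alone and asserting slot equality within each group) catches the same misuse, but at runtime rather than by construction.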
```rust
/// This is more efficient than calling sign_and_collect multiple times because:
/// 1. All partial signatures are sent in a single network message
/// 2. No risk of deadlock from sequential processing in Committee mode
pub async fn sign_and_collect_batch(
```
This path adds a large new production flow. Can we add dedicated tests for: (1) multi-validator batch success and (2) failure isolation where one validator prep failure does not block other validators in the same committee?
```rust
/// This is more efficient than calling sign_and_collect multiple times because:
/// 1. All partial signatures are sent in a single network message
/// 2. No risk of deadlock from sequential processing in Committee mode
pub async fn sign_and_collect_batch(
```
This method has become quite dense (register waiters, sign/build/send, then await/aggregate). Could we split it into small helpers (e.g. `register_batch_notifiers`, `send_batch_partials`, `collect_batch_results`)? It would make the control flow and error handling (especially the early returns in the signing closure) much easier to reason about.
I'm a bit confused: is this PR a necessary change for the boole fork, or is it an optimization?
Thanks for the detailed write-up, this helped clarify the tradeoffs. After re-reading the flow, I think we’re aligned on direction:
Where I share the concern is
I’d favor the lower-risk path for now:
This keeps behavior aligned with the current collector contract while still delivering the resilience improvement from streaming.
This PR is a fix to our existing blocking flow. If an operator's committee-based attempt to acquire attestation partial signatures fails for any number of reasons, it blocks all other committees' signed attestations from being submitted to the BN, hence the stream-of-committee-batches approach introduced here. We want to fix it and test it in the boole devnet.
Issue Addressed
Addresses issue #819
Proposed Changes
Additional Info
Follow-up to #813. The `sign_attestations` batch signing was delegating to single `sign_attestation` calls, which is inefficient since validators in the same committee share the same QBFT decision.
Blockers / Dependencies
- Depends on PR #855 (currently in review, passing Anchor CI)
- PR #855 adds the changes from sigp/lighthouse#8880 (currently in review, passing LH CI)
- Merge #855 before this
- CI passes for this PR