
fix: optimize the signing of attestations #834

Open
petarjuki7 wants to merge 8 commits into sigp:unstable from petarjuki7:sign_attestations_optimization

Conversation

petarjuki7 (Member) commented Feb 17, 2026

Issue Addressed

Addresses issue #819

Proposed Changes

  • Batch attestation signing by SSV committee - runs QBFT consensus once per committee instead of once per validator
  • Process committees in parallel with join_all
  • Added AttestationData type alias for cleaner code

Additional Info

Follow-up to #813. The sign_attestations batch signing was delegating to single sign_attestation calls which is inefficient since validators in the same committee share the same QBFT decision.
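For illustration only, the committee-grouping step could look roughly like the sketch below; `SignRequest`, its fields, and `group_by_committee` are hypothetical stand-ins for the PR's actual types, not its real code:

```rust
use std::collections::HashMap;

type CommitteeId = u64;

// Hypothetical simplified request type; the real code carries full
// Attestation and pubkey data.
#[derive(Debug, Clone)]
struct SignRequest {
    validator_index: u64,
    committee_id: CommitteeId,
}

// Group per-validator requests by committee so QBFT consensus can run
// once per committee instead of once per validator.
fn group_by_committee(requests: Vec<SignRequest>) -> HashMap<CommitteeId, Vec<SignRequest>> {
    let mut groups: HashMap<CommitteeId, Vec<SignRequest>> = HashMap::new();
    for req in requests {
        groups.entry(req.committee_id).or_default().push(req);
    }
    groups
}

fn main() {
    let reqs = vec![
        SignRequest { validator_index: 1, committee_id: 7 },
        SignRequest { validator_index: 2, committee_id: 7 },
        SignRequest { validator_index: 3, committee_id: 9 },
    ];
    let groups = group_by_committee(reqs);
    // Two committees -> two QBFT rounds instead of three.
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[&7].len(), 2);
}
```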

Blockers / Dependencies

  • Depends on PR #855, currently in review and passing Anchor CI
  • PR #855 adds changes from sigp/lighthouse#8880, which is in review and passing LH CI
  • Merge #855 before this PR
  • CI must pass for this PR

claude-code-actions-sigp bot commented Feb 17, 2026

Claude finished @petarjuki7's task in 2m 30s


Code Review Complete

This PR implements a solid optimization by batching attestation signing at the SSV committee level. The refactoring correctly identifies that validators in the same committee can share a single QBFT consensus decision, reducing redundant consensus rounds. The code is well-structured and the parallel processing approach is appropriate.

However, there are several issues that need to be addressed before merging.


Critical Issues

1. Panic Risk: Unsafe Array Indexing ⚠️

anchor/validator_store/src/lib.rs:1301-1302 — The function assumes attestations is non-empty and directly indexes [0] without validation. If called with an empty vector, this will panic.

Fix: Add empty vector check at the start of sign_committee_attestations.
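A minimal sketch of the suggested guard; `SignError::EmptyBatch` and the `&[u64]` parameter are hypothetical stand-ins (the real function takes attestations and returns the crate's own error type):

```rust
#[derive(Debug, PartialEq)]
enum SignError {
    // Hypothetical variant illustrating the suggested fix.
    EmptyBatch,
}

fn sign_committee_attestations(attestations: &[u64]) -> Result<(), SignError> {
    // Guard before any `attestations[0]` access: reject an empty batch
    // instead of panicking on out-of-bounds indexing.
    if attestations.is_empty() {
        return Err(SignError::EmptyBatch);
    }
    let _first = &attestations[0]; // now provably safe
    // ... run QBFT once for the committee, then collect signatures ...
    Ok(())
}

fn main() {
    assert_eq!(sign_committee_attestations(&[]), Err(SignError::EmptyBatch));
    assert!(sign_committee_attestations(&[42]).is_ok());
}
```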

2. Redundant Lookups Inside Loop

anchor/validator_store/src/lib.rs:1357 — The function looks up validator and cluster metadata inside the loop for each validator, but all validators in the same committee share the same cluster (retrieved at line 1305).

Impact: Unnecessary O(n) database lookups when O(1) is sufficient for cluster.

3. Missing Error Context for Committee Failures

anchor/validator_store/src/lib.rs:2612-2614 — When a committee fails, the error log doesn't include the committee_id, making debugging difficult when multiple committees are processed in parallel.


Design Concerns

4. Loss of Specific Error Information

The refactoring changes error handling from per-validator to per-committee. Previously, if a validator pubkey was unknown, it logged the specific pubkey. Now, an entire committee batch fails silently with a generic error message, losing diagnostic information.

Suggestion: Consider collecting partial successes within a committee batch.

5. Assumption: All Validators in Same Committee Share Cluster

anchor/validator_store/src/lib.rs:1305 — The code assumes all validators grouped by committee_id belong to the same SSV cluster. This assumption should be validated or documented as an invariant.

Question: Is it guaranteed that all validators with the same committee_id have the same cluster_id?


Performance

6. Performance Improvement Validated

The optimization is sound:

  • Before: N validators → N QBFT consensus rounds
  • After: N validators in M committees → M QBFT consensus rounds (M << N typically)

The use of join_all for parallel committee processing is appropriate.


Testing

7. Missing Tests for New Code Path ⚠️

No tests were added to verify:

  • Committee grouping logic works correctly
  • Empty attestation vector handling
  • Partial failure scenarios (one committee succeeds, another fails)
  • Validators from multiple committees are processed correctly
  • The optimization actually reduces QBFT calls as intended

Recommendation: Add integration tests for the new batching behavior.


Documentation

8. Good Documentation

The doc comment on sign_committee_attestations is clear and the AttestationData<E> type alias improves readability.


Minor Issues

9. Inconsistent Return Type

sign_committee_attestations returns Vec<(u64, Attestation<E>, PublicKeyBytes)>, but the PublicKeyBytes is immediately discarded. Consider simplifying the return type.

10. Non-Deterministic Ordering

The HashMap iteration doesn't guarantee deterministic ordering of committee processing. Consider using BTreeMap if deterministic ordering aids debugging.
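A small illustration of the suggestion, assuming the processing order over committee IDs is what needs to be deterministic:

```rust
use std::collections::BTreeMap;

// BTreeMap iterates keys in sorted order, so committee processing (and
// therefore log output) is reproducible across runs, unlike HashMap's
// randomized iteration order.
fn processing_order(committee_ids: &[u64]) -> Vec<u64> {
    let mut committees: BTreeMap<u64, ()> = BTreeMap::new();
    for id in committee_ids {
        committees.insert(*id, ());
    }
    committees.keys().copied().collect()
}

fn main() {
    // The same set of committees yields the same order regardless of
    // insertion order.
    assert_eq!(processing_order(&[9, 7, 8]), vec![7, 8, 9]);
    assert_eq!(processing_order(&[8, 9, 7]), vec![7, 8, 9]);
}
```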


Recommendations

Must Fix Before Merge:

  1. Add empty vector check in sign_committee_attestations
  2. Add committee_id to error logs
  3. Add tests for the new code path

Should Consider:

  4. Optimize redundant get_validator_and_cluster calls inside the loop
  5. Document or validate the cluster-sharing assumption
  6. Improve partial failure handling

Nice to Have:

  7. Use BTreeMap for deterministic ordering
  8. Simplify the return type to match the expected output


I've added inline comments on the specific lines that need attention.

signing_root,
slot,
)
.await?;
Member

Attestation validation is very time sensitive. In this loop over attestations, we .await each signature before moving on to the next, i.e. sequential processing, whereas it was being done in parallel via join_all before. We could turn this back into parallel futures with join_all as a fix.

I think at that point though, we'd just be recreating the pre-PR code, but with some extra steps and complexity.

other thoughts:
The gains from committee grouping are fewer calls to get_validator_and_cluster, get_voting_context, and similar, but these are in-memory lookups. And decide_instance is already engineered to handle multiple calls to the same instance gracefully.

Given that, I'm not sure the added complexity
(sign_committee_attestations, committee grouping HashMap, AttestationData type alias) justifies itself. What do you think about keeping the per-validator approach? And def lmk if i'm overlooking some solid gains we'd get from this PR

Member Author

Thanks for the comment! The original idea was that, since operations like deciding on a vote and collecting the signatures are performed on a per-committee basis, it would make sense to group them by committee. Also, you wouldn't need a QBFT instance per validator, as it's the same for everyone in the committee. But I agree that decide_instance is engineered to handle multiple calls to the same instance.
Some overhead is avoided (fewer DashMap lookups, fewer oneshot channels, etc.), but these are probably cheap operations.

Would like to hear @dknopik's input, if we agree the issue wasn't that significant I am okay with closing the PR.

Member

Actually this made me realize that the Pre-PR code is also flawed: if even one involved SSV committee is not able to sign (e.g. due to offline operators), the other signing futures will complete but the results will not be processed until those other signings time out. :/

Member

Do you know if it has been resolved?

Comment on lines +1374 to +1387
let signature = match self
.collect_signature(
PartialSignatureKind::PostConsensus,
Role::Committee,
CollectionMode::Committee {
num_signatures_to_collect,
base_hash: data_hash,
},
&validator,
&cluster,
signing_root,
slot,
)
.await
Member

@dknopik commented Feb 17, 2026

This will block as it will wait for collect_signature calls for the other validators - which will happen in the next iterations, so this deadlocks.

Therefore, we need a variant that can register for multiple validators.

Likely, this is why the testnet fails.

Member

Do you know if it has been resolved?

@@ -2561,46 +2721,55 @@ impl<T: SlotClock, E: EthSpec> ValidatorStore for AnchorValidatorStore<T, E> {

async fn sign_attestations(
Member

Could we add a regression test for committee batch failure isolation?

Suggested scenario:

  • one committee with 2+ validators
  • inject one validator prep failure (e.g. missing index/share/decrypt failure)
  • assert healthy validator(s) in the same committee are still signed and returned

This would lock in the intended behavior for the new batch path.

@diegomrsantos (Member)

Non-blocking: validator_store/src/lib.rs is carrying too many responsibilities now (attestation batching, QBFT orchestration, assignment counting, slashing glue, error mapping, etc.). Splitting into focused modules would improve readability and testability.

Suggested follow-up split:

  • attestation_signing.rs (sign_attestations, sign_committee_attestations, collect_committee_signatures)
  • consensus.rs (QBFT helpers/validators)
  • assignments.rs (VotingContext, VotingAssignments, counting helpers)
  • errors.rs (SpecificError + conversions)

I’d prefer this as a separate refactor PR so this optimization PR stays easy to review.

shane-moore (Member) commented Feb 24, 2026

We should perhaps mark this PR as draft until changes from sigp/lighthouse#8880 are incorporated. To be unblocked, I would recommend updating the lighthouse packages in Cargo.toml in this PR to point at the last commit in my branch in sigp/lighthouse#8880 and then refactoring accordingly, since I don't anticipate too many requests for changes on the LH PR.

@diegomrsantos marked this pull request as draft on February 25, 2026 at 14:26
@diegomrsantos (Member)

@shane-moore thanks for the heads-up. Marked it as a draft. @petarjuki7 could you please address @shane-moore's suggestion?

@petarjuki7 (Member Author)

@diegomrsantos @shane-moore So, I changed the lighthouse version to point to the newest commit on this new PR (sigp/lighthouse#8880). The main change is that I incorporated streams into the sign_attestations() function. The trait change for the ValidatorStore in the LH pr also introduced some other changes, which I'm not going to do in this PR to not mix responsibilities. So this PR will stay as a draft and will not compile for now, but it can be reviewed. I'm going to open another PR to include all of the changes from the new ValidatorStore trait, and after that, this can be merged.

@diegomrsantos (Member)

> @diegomrsantos @shane-moore So, I changed the lighthouse version to point to the newest commit on this new PR (sigp/lighthouse#8880). The main change is that I incorporated streams into the sign_attestations() function. The trait change for the ValidatorStore in the LH pr also introduced some other changes, which I'm not going to do in this PR to not mix responsibilities. So this PR will stay as a draft and will not compile for now, but it can be reviewed. I'm going to open another PR to include all of the changes from the new ValidatorStore trait, and after that, this can be merged.

CI is failing right now. Can you get the workflow jobs passing first? Let’s align that PRs are ready for review only when all CI checks are green.

@petarjuki7 (Member Author)

@claude can you re-review this?

claude-code-actions-sigp bot commented Mar 3, 2026

Claude finished @petarjuki7's task in 2m 59s


Code Review: PR #834 — Optimize Attestation Signing

  • Gather context and read changed files
  • Review validator_store changes
  • Review signature_collector changes
  • Review qbft_manager changes
  • Review other changes
  • Post review feedback

The PR has improved substantially since the first review. The helper extraction (resolve_decrypted_share, prepare_validator_signing_data), best-effort per-validator error handling, FuturesUnordered streaming, and Either::Left/Either::Right for the not-synced path all address previous feedback well. The core optimization is sound — grouping by committee to avoid redundant QBFT rounds.

Remaining issues:


Issues

1. Redundant get_validator_and_cluster calls — validators looked up twice

validator_store/src/lib.rs:2820-2835 — The sign_attestations grouping loop calls get_validator_and_cluster for every attestation to determine the committee_id. Then sign_committee_attestations at line 1476 calls get_validator_and_cluster again for every attestation in the committee loop. This means every validator gets two full database lookups (metadata + cluster).

Consider passing the already-looked-up ValidatorMetadata and Cluster from the grouping phase into sign_committee_attestations, or restructuring so the grouping phase captures the necessary data in one pass.

2. cluster_for_signing captures only the first cluster, ignoring potential divergence

validator_store/src/lib.rs:1488-1489 — cluster_for_signing is set from the first validator's cluster and never verified against subsequent validators. If validators within the same committee_id belong to different cluster_ids (different fee recipients, different liquidation states), the wrong cluster would be used for collect_committee_signatures.

This relates to the broader question of whether committee_id uniquely identifies a cluster — if it does, document/assert it. If it doesn't, this is a correctness bug.

3. cluster_members fetched separately from cluster, creating a consistency gap

validator_store/src/lib.rs:1411-1417 fetches cluster_members via get_cluster_members(&committee_id), but line 1510 gets the full Cluster from get_validator_and_cluster. These are two separate database reads. If a cluster is updated between these two reads, cluster_members (used for QBFT) and cluster (used for collect_committee_signatures threshold via get_f()) could be inconsistent.

Consider using a single data source. For example, since you already fetch the full Cluster from get_validator_and_cluster, use cluster.cluster_members for QBFT consensus instead of the separate get_cluster_members call.

4. Sequential notifier awaiting in sign_and_collect_batch

signature_collector/src/lib.rs:391-400 — The notifiers are awaited sequentially in a for loop. If one validator's signature reconstruction is slow (e.g., waiting for threshold shares from the network), all subsequent validators block behind it, even if their signatures are already available. This negates some of the batching benefit.

Consider using join_all or FuturesUnordered to await all notifiers concurrently.

5. Unused dependencies added

  • tokio-stream is added in anchor/validator_store/Cargo.toml but never imported or used
  • async-stream is added in the workspace Cargo.toml but never used anywhere

These should be removed.


Minor

6. Large inline tuple type in prepared vec

validator_store/src/lib.rs:1459-1466 — The 6-element tuple (u64, PublicKeyBytes, usize, Attestation<E>, ValidatorMetadata, Hash256) is hard to read and error-prone when destructuring. Consider a small struct (e.g., PreparedAttestation) to improve readability.
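A hypothetical shape for such a struct; the field names and simplified types here are illustrative, not the PR's actual ones:

```rust
// Named struct replacing the 6-element tuple flagged in the review.
// `signing_root` is shown as a raw 32-byte array standing in for Hash256;
// the pubkey, attestation, and metadata fields are omitted for brevity.
#[derive(Debug)]
struct PreparedAttestation {
    validator_index: u64,
    committee_position: usize,
    signing_root: [u8; 32],
}

fn main() {
    let prepared = PreparedAttestation {
        validator_index: 42,
        committee_position: 3,
        signing_root: [0u8; 32],
    };
    // Named fields document intent at each use site, where a tuple's
    // `.3` or positional destructuring pattern would not.
    assert_eq!(prepared.validator_index, 42);
    assert_eq!(prepared.committee_position, 3);
}
```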

7. Missing metrics timer for batch signing path

The old sign_attestation had validator_metrics::start_timer_vec(&validator_metrics::SIGNING_TIMES, &["ssv"]) for the signing step. The new collect_committee_signatures path doesn't have an equivalent timer, making it harder to monitor signing latency per the new batch path.


Positive Changes Since Last Review

  • Helper extraction (resolve_decrypted_share, prepare_validator_signing_data) follows @diegomrsantos's suggestion well — clean separation of concerns
  • Best-effort per-validator error handling with warn!/error! + continue instead of fail-fast ? — addresses failure isolation concern
  • Either::Left/Either::Right for the not-synced path instead of async_stream — follows @shane-moore's suggestion
  • FuturesUnordered for committee-level streaming — each committee completes independently
  • decide_instance now takes &IndexSet<OperatorId> instead of &Cluster — cleaner interface
  • Removal of redundant .with_quorum_size() in qbft_manager — the default already computes the same value

mergify bot pushed a commit that referenced this pull request Mar 4, 2026
Prepares `unstable` for PR #834 (attestation batching optimization) by migrating to the updated lighthouse `ValidatorStore` trait.

  • Update the Lighthouse dependency to the `shane-moore/lighthouse` fork [PR](sigp/lighthouse#8880) with streaming `ValidatorStore` trait fn return types.
  • **Note** that [PR](sigp/lighthouse#8880) must be merged on the LH side, and Anchor updated to use `sigp/lighthouse` for dependencies, before we merge this code to our `stable` branch.
  • Implement four streaming trait methods to replace the existing fns: `sign_aggregate_and_proofs`, `sign_sync_committee_signatures`, `sign_sync_committee_contributions`, and `sign_attestations`.

The impls now have naive stream wrappers, and the functionality is equivalent to the previous functions in `unstable`. The idea is that we can have separate PRs built on top of this base that refactor these trait methods to stream committee-based batched signing.

Co-Authored-By: shane-moore <skm1790@gmail.com>
@petarjuki7 force-pushed the sign_attestations_optimization branch from 4c1ea3b to 9f7eb54 on March 4, 2026 at 17:32
@petarjuki7 requested a review from shane-moore on March 4, 2026 at 17:34
@shane-moore (Member)

I'm trying to think of what sign_and_collect_batch would look like for sign_attestations and sign_sync_committee_signatures to be able to combine signatures on a per-committee basis, like they do today, using num_signatures_to_collect to indicate to the committee_signatures dashmap when it's time to fire the single partial-sig message to the network. Since both the attestation and sync messages have the same base_hash, they get keyed to the same bucket in the committee_signatures dashmap.

We would need to extend sign_and_collect_batch to add its validators to the committee_signatures dashmap and, once num_signatures_to_collect is hit, proceed to sending the network message.

But this is getting rather close to re-implementing the same functionality already present in sign_and_collect. I'd favor removing sign_and_collect_batch and reverting to having collect_committee_signatures use join_all(collect_signature(Committee)) with the correct num_signatures_to_collect (which avoids the sequential deadlock that motivated the sign_and_collect_batch approach). Reasons:

  • less code to maintain
  • join_all solves the deadlock: the original concern was that sequential collect_signature calls deadlock because each .await waits before the next validator is registered. join_all enqueues all signing closures concurrently, so all validators are registered before any result is awaited
  • the dashMap accumulation path already handles the attestation + sync combination correctly

I think, similar to the concerns in this issue, we want to avoid performance refactors without metrics supporting that they are legitimate bottlenecks. It would be good to add a GH issue to add metrics around this signing flow so we can do proper profiling. Then, if justified, we can consider a sign_and_collect_batch.

Just my thoughts, and I'm open to other suggestions as well. Perhaps I'm missing some reason why sign_and_collect_batch is required to prevent slow committees blocking faster ones, but I think the stream approach resolves that.

) {
Ok(msg) => msg,
Err(err) => {
error!(%err, "Failed to create batch partial signature message");
Member

If create_message fails here, we return without notifying the registered oneshot waiters. That later surfaces as per-validator Failed to receive signature warnings, which obscures the root cause. Could we propagate this batch-creation error to each waiter (or at least add explicit log linkage)?

Member

I am proposing in #834 (comment) that we don't modify signature collector for reasons described in the comment. If I'm missing something and signature collector mods are necessary, I'm certainly open to it

Member

@diegomrsantos commented Mar 7, 2026

> I am proposing in #834 (comment) that we don't modify signature collector for reasons described in the comment. If I'm missing something and signature collector mods are necessary, I'm certainly open to it

Tbh, I reviewed it before fully understanding your comment. It's hard to keep the full context in mind when someone isn't too familiar with this part of the code. But I'll do it.

Member

I get that! I spent some hours myself building context in order to formulate that comment, because I was less familiar with the signature collector. In a nutshell, the validator store changes are something we definitely want to get in for the boole fork, because they will enable us to send committee-based batches of attestations to the beacon node. Currently we don't have the batches, and a slow committee's post-consensus sig collection will block faster committees'. I've convinced myself we can accomplish the batching we need without changes to the sig collector logic, but I could of course be wrong. On your end, no rush, and feel free to wait for the next round of changes before reviewing further.

match self.get_validator_and_cluster(pubkey) {
Ok((_, cluster)) => {
committee_mapping
.entry(cluster.committee_id())
Member

Safety check: grouping key is only committee_id, while sign_committee_attestations derives slot/data from the first item. Current caller appears single-slot, but should we either key by (committee_id, slot) or assert slot consistency within each group to prevent future misuse?
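A sketch of the (committee_id, slot) keying suggested above, with simplified stand-in types (the real code groups full attestation data, not bare indices):

```rust
use std::collections::HashMap;

// Group (validator, committee_id, slot) triples by (committee_id, slot)
// rather than by committee_id alone.
fn group_by_committee_and_slot(
    items: &[(u64, u64, u64)],
) -> HashMap<(u64, u64), Vec<u64>> {
    let mut groups: HashMap<(u64, u64), Vec<u64>> = HashMap::new();
    for &(validator, committee_id, slot) in items {
        groups.entry((committee_id, slot)).or_default().push(validator);
    }
    groups
}

fn main() {
    let groups = group_by_committee_and_slot(&[(1, 7, 100), (2, 7, 100), (3, 7, 101)]);
    // The same committee at a different slot lands in a different group,
    // so a future multi-slot caller cannot silently mix slots in a batch.
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[&(7, 100)], vec![1, 2]);
    assert_eq!(groups[&(7, 101)], vec![3]);
}
```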

/// This is more efficient than calling sign_and_collect multiple times because:
/// 1. All partial signatures are sent in a single network message
/// 2. No risk of deadlock from sequential processing in Committee mode
pub async fn sign_and_collect_batch(
Member

This path adds a large new production flow. Can we add dedicated tests for: (1) multi-validator batch success and (2) failure isolation where one validator prep failure does not block other validators in the same committee?

Member

same here as #834 (comment)

/// This is more efficient than calling sign_and_collect multiple times because:
/// 1. All partial signatures are sent in a single network message
/// 2. No risk of deadlock from sequential processing in Committee mode
pub async fn sign_and_collect_batch(
Member

This method has become quite dense (register waiters, sign/build/send, then await/aggregate). Could we split it into small helpers (e.g. register_batch_notifiers, send_batch_partials, collect_batch_results)? It would make the control flow and error handling (especially early returns in the signing closure) much easier to reason about.

Member

same here as #834 (comment)

@diegomrsantos (Member)

I'm a bit confused, is this PR a necessary change for the boole fork, or is it an optimization?

@diegomrsantos (Member)

> I'd favor removing sign_and_collect_batch and reverting to having collect_committee_signatures use join_all(collect_signature(Committee)) with correct num_signatures_to_collect (which avoids the sequential deadlock that motivated the sign_and_collect_batch approach).

Thanks for the detailed write-up, this helped clarify the tradeoffs.

After re-reading the flow, I think we’re aligned on direction:

  • This PR is not Boole-fork blocking.
  • Current pre-PR behavior is already correct:
    • sign_attestations fans out per-validator concurrently (join_all)
    • QBFT dedups by committee/slot instance internally
  • So “one consensus per committee” is already happening implicitly today.

Where I share the concern is sign_and_collect_batch:

  • it introduces a second batching path in signature_collector
  • it does not go through the existing CollectionMode::Committee / committee_signatures accumulation contract
  • that increases risk of semantic drift from existing committee counting behavior (num_signatures_to_collect, base_hash) used by current committee collection flows

I’d favor the lower-risk path for now:

  1. keep committee-level streaming in sign_attestations (FuturesUnordered) so slow committees don’t block fast ones
  2. remove sign_and_collect_batch
  3. inside each committee, use concurrent join_all(collect_signature(CollectionMode::Committee)) with correct num_signatures_to_collect and existing base_hash semantics

This keeps behavior aligned with the current collector contract while still delivering the resilience improvement from streaming.

@shane-moore (Member)

> I'm a bit confused, is this PR a necessary change for the boole fork, or is it an optimization?

This PR is a fix to our existing blocking flow. If an operator's committee-based attempt to acquire attestation partial signatures fails for any number of reasons, it blocks all other committees' signed attestations from being submitted to the BN; hence the stream-of-committee-batches approach introduced here. We want to fix it and test it in the boole devnet.
