fix: optimize the signing of attestations #834
petarjuki7 wants to merge 8 commits into sigp:unstable
Conversation
Claude finished @petarjuki7's task in 2m 30s

Code Review Complete

This PR implements a solid optimization by batching attestation signing at the SSV committee level. The refactoring correctly identifies that validators in the same committee can share a single QBFT consensus decision, reducing redundant consensus rounds. The code is well-structured, and the parallel processing approach is appropriate. However, several issues need to be addressed before merging.

Critical Issues
1. Panic Risk: Unsafe Array Indexing
anchor/validator_store/src/lib.rs (outdated)

```rust
        signing_root,
        slot,
    )
    .await?;
```
Attestation validation is very time-sensitive. In this loop over attestations, we `.await` each signature before moving on to the next, i.e. sequential processing, whereas it was previously done in parallel via `join_all`. We could turn this back into parallel futures with `join_all` as a fix.
I think at that point, though, we'd just be recreating the pre-PR code, but with some extra steps and complexity.
Other thoughts:
The gains from committee grouping are fewer calls to `get_validator_and_cluster`, `get_voting_context`, and similar, but these are in-memory lookups. And `decide_instance` is already engineered to handle multiple calls to the same instance gracefully.
Given that, I'm not sure the added complexity (`sign_committee_attestations`, the committee grouping `HashMap`, the `AttestationData` type alias) justifies itself. What do you think about keeping the per-validator approach? And definitely let me know if I'm overlooking some solid gains we'd get from this PR.
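The sequential-vs-parallel point above can be sketched with a thread-based stand-in (a sketch only; the real code is async, where `futures::future::join_all` plays the role of spawning all the work before joining; `sign_one` and the validator IDs here are hypothetical):

```rust
use std::thread;

// Hypothetical stand-in for a per-validator signing call; the real code
// awaits `collect_signature` for each validator.
fn sign_one(validator_id: u64) -> u64 {
    validator_id * 2 // dummy "signature"
}

// Sequential: each signature is produced before the next one starts,
// mirroring `.await`-ing inside the loop.
fn sign_sequential(validators: &[u64]) -> Vec<u64> {
    validators.iter().map(|v| sign_one(*v)).collect()
}

// Parallel: all work is spawned up front and joined at the end,
// the thread-based analogue of `futures::future::join_all`.
fn sign_parallel(validators: &[u64]) -> Vec<u64> {
    let handles: Vec<_> = validators
        .iter()
        .map(|v| {
            let v = *v;
            thread::spawn(move || sign_one(v))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let validators = [1, 2, 3];
    // Both orderings produce the same results; only the latency differs.
    assert_eq!(sign_sequential(&validators), sign_parallel(&validators));
}
```

With N validators and per-signature latency t, the sequential loop costs roughly N*t of wall time, while the spawn-all-then-join shape costs roughly t.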
Thanks for the comment! The original idea was that, since operations like deciding on a vote and collecting the signatures are performed on a per-committee basis, it would make sense to group them by committee. Also, you wouldn't need a QBFT instance per validator, as it's the same for everyone in the committee. But I agree that `decide_instance` is engineered to handle multiple calls to the same instance.
Some overhead is avoided, like fewer `DashMap` lookups, fewer oneshot channels, etc., but these are probably cheap operations.
Would like to hear @dknopik's input; if we agree the issue wasn't that significant, I am okay with closing the PR.
Actually, this made me realize that the pre-PR code is also flawed: if even one involved SSV committee is not able to sign (e.g. due to offline operators), the other signing futures will complete, but the results will not be processed until those other signings time out. :/
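The flaw described here is essentially "wait for all" versus "process as they complete". A minimal std-only sketch, using threads and a channel as a stand-in for a stream / `FuturesUnordered` (the committee IDs and `sign_committee` below are hypothetical):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical committee signing task: committee 2 is "slow"
// (e.g. offline operators) and eventually fails.
fn sign_committee(id: u64) -> Result<u64, String> {
    if id == 2 {
        thread::sleep(Duration::from_millis(200)); // simulated timeout
        Err(format!("committee {id} timed out"))
    } else {
        Ok(id * 10) // dummy aggregated signature
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for id in [1, 2, 3] {
        let tx = tx.clone();
        thread::spawn(move || {
            let _ = tx.send((id, sign_committee(id)));
        });
    }
    drop(tx);

    // Results are handled as they arrive (the async analogue would be a
    // stream / `FuturesUnordered`), so fast committees are processed
    // without waiting for the slow one to time out first.
    let mut processed = Vec::new();
    for (id, res) in rx {
        if let Ok(sig) = res {
            processed.push((id, sig));
        }
    }
    assert_eq!(processed.len(), 2);
}
```

A `join_all`-style gather would instead hold all three results back until the 200ms failure resolved; the channel loop hands over committees 1 and 3 as soon as they finish.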
Do you know if it has been resolved?
anchor/validator_store/src/lib.rs (outdated)

```rust
let signature = match self
    .collect_signature(
        PartialSignatureKind::PostConsensus,
        Role::Committee,
        CollectionMode::Committee {
            num_signatures_to_collect,
            base_hash: data_hash,
        },
        &validator,
        &cluster,
        signing_root,
        slot,
    )
    .await
```
This will block, as it waits for the `collect_signature` calls for the other validators, which would only happen in the next iterations, so this deadlocks.
Therefore, we need a variant that can register for multiple validators.
Likely, this is why the testnet fails.
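The register-for-multiple-validators idea can be sketched with a toy quorum collector: register every validator's waiter first, then await, assuming a collector that only resolves once all expected partials have arrived (all names here are hypothetical, not the real `collect_signature` API):

```rust
use std::sync::mpsc;

// Toy collector: completes only once partials from all expected
// validators are present (a stand-in for Committee-mode collection).
struct Collector {
    expected: usize,
    partials: Vec<u64>,
    waiters: Vec<mpsc::Sender<u64>>,
}

impl Collector {
    fn new(expected: usize) -> Self {
        Self { expected, partials: Vec::new(), waiters: Vec::new() }
    }

    // Register a waiter and submit a partial; every waiter is notified
    // once the quorum is reached. Blocking on the returned receiver
    // before the other validators have registered would deadlock.
    fn register(&mut self, partial: u64) -> mpsc::Receiver<u64> {
        let (tx, rx) = mpsc::channel();
        self.partials.push(partial);
        self.waiters.push(tx);
        if self.partials.len() == self.expected {
            let combined: u64 = self.partials.iter().sum();
            for w in &self.waiters {
                let _ = w.send(combined);
            }
        }
        rx
    }
}

fn main() {
    let mut collector = Collector::new(3);
    // Batch variant: register every validator up front...
    let receivers: Vec<_> = [1, 2, 3].iter().map(|p| collector.register(*p)).collect();
    // ...then await the results. Receiving inside the registration loop
    // would block forever on the first validator.
    for rx in receivers {
        assert_eq!(rx.recv().unwrap(), 6);
    }
}
```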
Do you know if it has been resolved?
anchor/validator_store/src/lib.rs (outdated)

```diff
@@ -2561,46 +2721,55 @@ impl<T: SlotClock, E: EthSpec> ValidatorStore for AnchorValidatorStore<T, E> {

     async fn sign_attestations(
```
Could we add a regression test for committee batch failure isolation?
Suggested scenario:
- one committee with 2+ validators
- inject one validator prep failure (e.g. missing index/share/decrypt failure)
- assert healthy validator(s) in the same committee are still signed and returned
This would lock in the intended behavior for the new batch path.
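A self-contained skeleton of that scenario, with a toy `sign_committee_batch` standing in for the real batch path (the names and return shapes below are hypothetical):

```rust
use std::collections::HashMap;

// Toy batch signer: prep can fail per validator; a failure must not
// prevent healthy validators in the same committee from being signed.
fn sign_committee_batch(
    validators: &[(&str, bool)],
) -> HashMap<String, Result<u64, String>> {
    validators
        .iter()
        .map(|(pk, healthy)| {
            let res = if *healthy {
                Ok(42) // dummy signature
            } else {
                Err("missing share".to_string()) // injected prep failure
            };
            (pk.to_string(), res)
        })
        .collect()
}

fn main() {
    // One committee with two validators, one of which fails prep.
    let out = sign_committee_batch(&[("v1", true), ("v2", false)]);
    // Failure isolation: the healthy validator is still signed and
    // returned alongside the per-validator error.
    assert!(out["v1"].is_ok());
    assert!(out["v2"].is_err());
}
```

The real regression test would inject the failure at the prep stage (missing index/share/decrypt failure) and assert on the stream of returned signatures, but the shape of the assertion is the same.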
Non-blocking: suggested follow-up split. I'd prefer this as a separate refactor PR so this optimization PR stays easy to review.
We should perhaps mark this PR as draft until the changes from sigp/lighthouse#8880 are incorporated. To be unblocked, I would recommend updating the lighthouse packages in
@shane-moore thanks for the heads-up. Marked it as a draft. @petarjuki7, could you please address @shane-moore's suggestion?
@diegomrsantos @shane-moore So, I changed the lighthouse version to point to the newest commit on this new PR (sigp/lighthouse#8880). The main change is that I incorporated streams into the
CI is failing right now. Can you get the workflow jobs passing first? Let's agree that PRs are ready for review only when all CI checks are green.
@claude can you re-review this?
Claude finished @petarjuki7's task in 2m 59s

Code Review: PR #834 — Optimize Attestation Signing

The PR has improved substantially since the first review. The helper extraction (

Remaining issues:
1. Redundant
Prepares `unstable` for PR #834 (attestation batching optimization) by migrating to the updated lighthouse `ValidatorStore` trait.

- Update the Lighthouse dependency to the `shane-moore/lighthouse` fork [PR](sigp/lighthouse#8880) with streaming `ValidatorStore` trait fn return types.
- **Note** that [PR](sigp/lighthouse#8880) must be merged on the lh side, and anchor updated to use `sigp/lighthouse` for dependencies, before we merge this code to our `stable` branch.
- Implement four streaming trait methods to replace the existing fns: `sign_aggregate_and_proofs`, `sign_sync_committee_signatures`, `sign_sync_committee_contributions`, and `sign_attestations`.

The impls now have naive stream wrappers, and the functionality is equivalent to the previous functions in `unstable`. The idea is that we can have separate PRs built on top of this base that will refactor these trait methods to stream committee-based batched signing.

Co-Authored-By: shane-moore <skm1790@gmail.com>
4c1ea3b to 9f7eb54
I'm trying to think of what we would need to extend, but this is getting rather close to re-implementing the same functionality as already done in
I think, similar to the concerns in this issue, we want to avoid refactors for performance without metrics supporting that they are legit bottlenecks. It would be good to add a GH issue to add metrics around this signing flow so we can do proper profiling. Then, if justified, we can consider a

Just my thoughts. Open to other suggestions as well; perhaps I'm missing some reason why
```rust
) {
    Ok(msg) => msg,
    Err(err) => {
        error!(%err, "Failed to create batch partial signature message");
```
If `create_message` fails here, we return without notifying the registered oneshot waiters. That later surfaces as per-validator "Failed to receive signature" warnings, which obscures the root cause. Could we propagate this batch-creation error to each waiter (or at least add explicit log linkage)?
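One way to sketch the suggested fix: hand each registered waiter the batch-creation error instead of silently dropping the channels (std `mpsc` stands in for the oneshot channels; `run_batch` and its shape are hypothetical):

```rust
use std::sync::mpsc;

// Toy batch flow: waiters are registered first; if building the batch
// message fails, each waiter receives the error instead of being
// dropped silently (which would surface later as opaque per-validator
// receive failures).
fn run_batch(fail_create: bool, waiters: Vec<mpsc::Sender<Result<u64, String>>>) {
    if fail_create {
        // Propagate the batch-creation error to every registered waiter.
        for w in &waiters {
            let _ = w.send(Err("failed to create batch message".to_string()));
        }
        return;
    }
    // Happy path: every waiter gets its (dummy) signature.
    for w in &waiters {
        let _ = w.send(Ok(7));
    }
}

fn main() {
    let (tx1, rx1) = mpsc::channel();
    let (tx2, rx2) = mpsc::channel();
    run_batch(true, vec![tx1, tx2]);
    // Each waiter sees the root cause rather than a dead channel.
    assert!(rx1.recv().unwrap().is_err());
    assert!(rx2.recv().unwrap().is_err());
}
```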
I am proposing in #834 (comment) that we don't modify signature collector for reasons described in the comment. If I'm missing something and signature collector mods are necessary, I'm certainly open to it
Tbh, I reviewed it before fully understanding your comment. It's hard to keep the full context in mind when someone isn't too familiar with this part of the code. But I'll do it.
There was a problem hiding this comment.
I get that! I spent some hours myself building context in order to formulate that comment, because I was less familiar with the signature collector. In a nutshell, the validator store changes are something we definitely want to get in for the boole fork, because they will enable us to send committee-based batches of attestations to the beacon node. Currently, we don't have the batches, and a slow committee's post-consensus sig collection will block faster committees'. I've convinced myself we can accomplish the batching we need without changes to the sig collector logic, but I could of course be wrong. On your end, no rush, and feel free to wait for the next round of changes before reviewing further.
```rust
match self.get_validator_and_cluster(pubkey) {
    Ok((_, cluster)) => {
        committee_mapping
            .entry(cluster.committee_id())
```
Safety check: the grouping key is only `committee_id`, while `sign_committee_attestations` derives the slot/data from the first item. The current caller appears to be single-slot, but should we either key by `(committee_id, slot)` or assert slot consistency within each group to prevent future misuse?
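The suggested `(committee_id, slot)` key could look like this minimal sketch (the `Attestation` struct and its fields are hypothetical stand-ins for the real types):

```rust
use std::collections::HashMap;

// Hypothetical attestation record; the real code derives the slot/data
// from the first item in each group.
#[derive(Debug, Clone, PartialEq)]
struct Attestation {
    committee_id: u64,
    slot: u64,
    validator: &'static str,
}

// Key by (committee_id, slot) so a future multi-slot caller cannot
// silently mix attestations for different slots into one batch.
fn group_attestations(
    atts: Vec<Attestation>,
) -> HashMap<(u64, u64), Vec<Attestation>> {
    let mut groups: HashMap<(u64, u64), Vec<Attestation>> = HashMap::new();
    for att in atts {
        groups.entry((att.committee_id, att.slot)).or_default().push(att);
    }
    groups
}

fn main() {
    let atts = vec![
        Attestation { committee_id: 1, slot: 100, validator: "a" },
        Attestation { committee_id: 1, slot: 101, validator: "b" },
        Attestation { committee_id: 1, slot: 100, validator: "c" },
    ];
    let groups = group_attestations(atts);
    // Same committee, different slot => different batch.
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[&(1, 100)].len(), 2);
}
```

The alternative (keying by `committee_id` alone and asserting slot equality within each group) catches the same misuse, but at runtime rather than by construction.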
```rust
/// This is more efficient than calling sign_and_collect multiple times because:
/// 1. All partial signatures are sent in a single network message
/// 2. No risk of deadlock from sequential processing in Committee mode
pub async fn sign_and_collect_batch(
```
This path adds a large new production flow. Can we add dedicated tests for: (1) multi-validator batch success and (2) failure isolation where one validator prep failure does not block other validators in the same committee?
```rust
/// This is more efficient than calling sign_and_collect multiple times because:
/// 1. All partial signatures are sent in a single network message
/// 2. No risk of deadlock from sequential processing in Committee mode
pub async fn sign_and_collect_batch(
```
This method has become quite dense (register waiters, sign/build/send, then await/aggregate). Could we split it into small helpers (e.g. `register_batch_notifiers`, `send_batch_partials`, `collect_batch_results`)? It would make the control flow and error handling (especially the early returns in the signing closure) much easier to reason about.
I'm a bit confused: is this PR a necessary change for the boole fork, or is it an optimization?
Thanks for the detailed write-up, this helped clarify the tradeoffs. After re-reading the flow, I think we’re aligned on direction:
Where I share the concern is
I’d favor the lower-risk path for now:
This keeps behavior aligned with the current collector contract while still delivering the resilience improvement from streaming.
This PR is a fix to our existing blocking flow. If an operator's committee-based attempt to acquire attestation partial signatures fails for any number of reasons, it blocks all other committees' signed attestations from being submitted to the BN, hence the stream-of-committee-batches approach introduced here. We want to fix it and test it in the boole devnet.
Issue Addressed
Addresses issue #819
Proposed Changes
Additional Info
Follow-up to #813. The `sign_attestations` batch signing was delegating to single `sign_attestation` calls, which is inefficient since validators in the same committee share the same QBFT decision.
Blockers / Dependencies
- Depends on PR #855 (currently in review, passing Anchor CI)
- PR #855 adds the changes from sigp/lighthouse#8880 (currently in review, passing LH CI)
- Merge #855 before this
- CI passes for this PR