fix: reconstruct ISLOCK locally when recsig arrived first #7291
knst wants to merge 1 commit into dashpay:develop from
Conversation
PR dashpay#6994 stopped sending ISDLOCK invs to peers requesting recsigs, on the premise that they can reconstruct the islock from the recsig. That reconstruction has a pre-existing race that has now become critical: it causes intermittent failures in interface_zmq_dash.py, with wait_for_instantlock timing out.

The previous reconstruction path checked HasRecoveredSigForId before emplacing into creatingInstantSendLocks, which leaves a window:

- thread A: ProcessRecoveredSig writes the recsig to the db; the listener fires HandleNewRecoveredSig, finds creatingInstantSendLocks empty, and no-ops
- thread B: TrySignInstantSendLock saw HasRecoveredSigForId=false, now emplaces and calls AsyncSignIfMember

The result is a recsig sitting in the db while no one ever builds the islock:

- db.HasRecoveredSigForHash blocks any subsequent listener firing for that recsig
- TrySignInstantSendLock has already passed its check

Before dashpay#6994 this was masked because the peer would still receive the ISDLOCK inv from another node, but on develop the peer no longer receives an ISDLOCK inv from peers as a fallback.

Reorder to emplace first, then check (a simplified model follows below):

1. Emplace into creatingInstantSendLocks. Any listener that fires after this point sees the entry and reconstructs via the existing HandleNewInstantSendLockRecoveredSig path.
2. Then check HasRecoveredSigForId. If the recsig is already in the db, find and erase the entry; if the find misses, the listener already handled it.
3. Only call AsyncSignIfMember when no recsig exists yet.
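To make the interleaving concrete, here is a minimal, self-contained C++ model of the reordering. It is a sketch, not the actual Dash Core code: `Model`, `creating`, `recsig_in_db`, and the string payloads are simplified stand-ins for the signer class, creatingInstantSendLocks, the recovered-sig db, and real islocks.

```cpp
#include <atomic>
#include <map>
#include <mutex>
#include <optional>
#include <string>

struct Model {
    std::mutex cs;                          // stands in for cs_creating
    std::map<int, std::string> creating;    // stands in for creatingInstantSendLocks
    std::atomic<bool> recsig_in_db{false};  // stands in for the recovered-sig db

    // Listener path (thread A): fires after a recsig has been written to the db.
    std::optional<std::string> HandleNewRecoveredSig(int id) {
        std::lock_guard<std::mutex> l(cs);
        auto it = creating.find(id);
        if (it == creating.end()) return std::nullopt; // no pending entry: nothing to reconstruct
        std::string lock = std::move(it->second);
        creating.erase(it);
        return lock; // reconstructed islock
    }

    // Signer path (thread B), fixed ordering: emplace first, then consult the db.
    std::optional<std::string> TrySign(int id) {
        {
            std::lock_guard<std::mutex> l(cs);
            creating.emplace(id, "islock-" + std::to_string(id));
        }
        // A listener may fire between the emplace and this check, but it now
        // finds the entry and reconstructs; our find below then simply misses.
        if (recsig_in_db.load()) {
            std::lock_guard<std::mutex> l(cs);
            auto it = creating.find(id);
            if (it == creating.end()) return std::nullopt; // listener already handled it
            std::string lock = std::move(it->second);
            creating.erase(it);
            return lock; // reconstruct locally
        }
        return std::nullopt; // no recsig yet: this is where AsyncSignIfMember would run
    }
};
```

With the old check-then-emplace ordering, a listener firing between the db check and the emplace finds an empty map while the signer has already observed recsig_in_db == false, so neither side builds the lock; emplacing first closes that window.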
✅ Review complete (commit 6587677)
✅ No Merge Conflicts Detected: This PR currently has no conflicts with other open PRs.
Walkthrough
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
🧹 Nitpick comments (1)
src/instantsend/signing.cpp (1)
417-438: Reduce duplication: delegate to HandleNewInstantSendLockRecoveredSig.

Lines 418–437 reimplement, nearly verbatim, the find/move/erase + msgHash-equality + sig assignment + TryEmplacePendingLock sequence already present in HandleNewInstantSendLockRecoveredSig (lines 249–273). Two issues come from that:

- DRY: any future change to the reconstruction path (e.g., logging, additional validation, metrics) has to be applied in two places.
- Log divergence: the conflict-mismatch branch logs different text in each path ("islock conflicts with %s, dropping own version" vs. "existing recovered islock sig has conflicting msgHash %s"), which makes operational grepping harder.

Since HandleNewInstantSendLockRecoveredSig already handles the "entry not found" case (i.e., the listener already reconstructed it) by early-returning, calling it here is semantically equivalent.

♻️ Proposed refactor
```diff
 // After emplacing, check whether the recsig is already in db. If yes, the listener
 // either already ran while creating was empty (so we must reconstruct here), or it
 // ran after our emplace and already removed the entry (so the find below misses).
 // Either case is handled correctly by the find/erase pattern.
 if (llmq::CRecoveredSig recSig; m_sigman.GetRecoveredSigForId(llmqType, id, recSig)) {
-    InstantSendLockPtr islock_ptr;
-    {
-        LOCK(cs_creating);
-        auto it = creatingInstantSendLocks.find(id);
-        if (it == creatingInstantSendLocks.end()) {
-            // The listener already fired after our emplace and reconstructed; nothing to do.
-            return;
-        }
-        islock_ptr = std::make_shared<InstantSendLock>(std::move(it->second));
-        txToCreatingInstantSendLocks.erase(islock_ptr->txid);
-        creatingInstantSendLocks.erase(it);
-    }
-    if (recSig.getMsgHash() != tx.GetHash()) {
-        LogPrintf("%s -- txid=%s: existing recovered islock sig has conflicting msgHash %s\n",
-                  __func__, tx.GetHash().ToString(), recSig.getMsgHash().ToString());
-        return;
-    }
-    islock_ptr->sig = recSig.sig;
-    m_isman.TryEmplacePendingLock(::SerializeHash(*islock_ptr), /*id=*/-1, islock_ptr);
+    HandleNewInstantSendLockRecoveredSig(recSig);
     return;
 }
```
📒 Files selected for processing (1)
src/instantsend/signing.cpp
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6587677f64
```cpp
// either already ran while creating was empty (so we must reconstruct here), or it
// ran after our emplace and already removed the entry (so the find below misses).
// Either case is handled correctly by the find/erase pattern.
if (llmq::CRecoveredSig recSig; m_sigman.GetRecoveredSigForId(llmqType, id, recSig)) {
```
Recompute cycle hash from recovered signature quorum
When GetRecoveredSigForId returns true, this branch reuses the cycleHash computed earlier from SelectQuorumForSigning at the current tip and only swaps in recSig.sig. If the recovered signature was created in an older active set (for example, tx seen after a quorum rotation), the reconstructed ISLOCK can carry a cycleHash that doesn't match recSig's quorum, so peers that verify via cycleHash (see net_instantsend.cpp quorum selection path) may reject the relayed lock. This path should derive/validate cycleHash from recSig.getQuorumHash() before enqueuing.
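A tiny sketch of the invariant Codex is pointing at (simplified types; quorum_to_cycle and CycleHashForRecSig are hypothetical stand-ins, not the real Dash Core quorum-manager API, and uint64_t stands in for uint256):

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>

struct RecSigView { uint64_t quorum_hash; };

// Derive the cycle hash from the quorum that actually produced the recovered
// sig, rather than reusing the cycle hash selected at the current chain tip,
// which may belong to a newer rotation.
uint64_t CycleHashForRecSig(const RecSigView& recsig,
                            const std::map<uint64_t, uint64_t>& quorum_to_cycle) {
    auto it = quorum_to_cycle.find(recsig.quorum_hash);
    if (it == quorum_to_cycle.end()) throw std::runtime_error("unknown quorum");
    return it->second;
}
```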
thepastaclaw left a comment
Code Review
The patch in src/instantsend/signing.cpp does close the specific TOCTOU window it targets by registering the pending ISLOCK before consulting the recovered-signature store, and I did not find a correctness regression in that new locking sequence. The remaining issue is test coverage: this timing-sensitive reconstruction path changed without any new regression coverage that proves the "recovered sig arrives before local ISLOCK scheduling" case stays fixed.
Reviewed commit: 6587677
🟡 1 suggestion(s)
```diff
@@ -411,6 +410,33 @@ void InstantSendSigner::TrySignInstantSendLock(const CTransaction& tx)
         txToCreatingInstantSendLocks.emplace(tx.GetHash(), &e.first->second);
     }
 
+    // After emplacing, check whether the recsig is already in db. If yes, the listener
+    // either already ran while creating was empty (so we must reconstruct here), or it
+    // ran after our emplace and already removed the entry (so the find below misses).
+    // Either case is handled correctly by the find/erase pattern.
+    if (llmq::CRecoveredSig recSig; m_sigman.GetRecoveredSigForId(llmqType, id, recSig)) {
+        InstantSendLockPtr islock_ptr;
+        {
+            LOCK(cs_creating);
+            auto it = creatingInstantSendLocks.find(id);
+            if (it == creatingInstantSendLocks.end()) {
+                // The listener already fired after our emplace and reconstructed; nothing to do.
+                return;
+            }
+            islock_ptr = std::make_shared<InstantSendLock>(std::move(it->second));
+            txToCreatingInstantSendLocks.erase(islock_ptr->txid);
+            creatingInstantSendLocks.erase(it);
+        }
+        if (recSig.getMsgHash() != tx.GetHash()) {
+            LogPrintf("%s -- txid=%s: existing recovered islock sig has conflicting msgHash %s\n",
+                      __func__, tx.GetHash().ToString(), recSig.getMsgHash().ToString());
+            return;
+        }
+        islock_ptr->sig = recSig.sig;
+        m_isman.TryEmplacePendingLock(::SerializeHash(*islock_ptr), /*id=*/-1, islock_ptr);
+        return;
+    }
+
     m_shareman.AsyncSignIfMember(llmqType, m_sigman, id, tx.GetHash(), quorum->m_quorum_base_block_index->GetBlockHash());
```
🟡 Suggestion: The race fix changed ISLOCK reconstruction without adding regression coverage for the reordered path
This change only touches TrySignInstantSendLock() and no unit or functional test was added or updated alongside it. The new behavior depends on a narrow interleaving between creatingInstantSendLocks, HandleNewRecoveredSig(), and GetRecoveredSigForId(): the code is now correct only if the "recovered sig already exists or arrives concurrently" path keeps routing reconstruction through the new find/erase flow. Because that path is timing-dependent and not exercised explicitly anywhere in the checked-in tests, CI still has no deterministic signal that this race is actually closed or that a later refactor will not reopen it.
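As an illustration of the kind of deterministic signal being asked for, here is a stress loop over the simplified Model struct from the sketch near the top of this page. It is a hedged, model-level check of the ordering, not a real Dash Core unit or functional test:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Requires the Model struct from the earlier sketch.
int main() {
    for (int i = 0; i < 100000; ++i) {
        Model m;
        std::atomic<int> built{0};
        std::thread a([&] {
            m.recsig_in_db.store(true);               // recsig hits the db...
            if (m.HandleNewRecoveredSig(i)) built++;  // ...then the listener fires
        });
        std::thread b([&] {
            if (m.TrySign(i)) built++;                // the signer races the listener
        });
        a.join();
        b.join();
        // With emplace-then-check, exactly one path reconstructs the lock in
        // every interleaving; with the old check-then-emplace ordering the
        // stranded-recsig case makes this assertion fail intermittently.
        assert(built.load() == 1);
    }
}
```

Under the old ordering the assertion trips whenever the listener fires between the signer's db check and its emplace; under the new ordering it holds for all interleavings.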
This fix is a workaround and does not address the underlying issue; the failure rate dropped from 50% to 15%, but the race hasn't actually been fixed. The real fix is #7293.
Issue being fixed or feature implemented
PR #6994 stopped sending ISDLOCK invs to peers requesting recsigs, exposing a pre-existing race in islock reconstruction that causes intermittent wait_for_instantlock timeouts in interface_zmq_dash.py (see the full description at the top of this page).
What was done?
Reordered TrySignInstantSendLock to emplace into creatingInstantSendLocks first, then check HasRecoveredSigForId, and only call AsyncSignIfMember when no recsig exists yet (see the numbered steps above).
How Has This Been Tested?
Performance improvements from this PR should still be relevant [not tested], but the failure rate for interface_zmq_dash.py in my environment dropped from 50% to ~15%.
Breaking Changes
N/A
Checklist: