Skip to content

fix: reconstruct ISLOCK locally when recsig arrived first#7291

Closed
knst wants to merge 1 commit intodashpay:developfrom
knst:fix-6994
Closed

fix: reconstruct ISLOCK locally when recsig arrived first#7291
knst wants to merge 1 commit intodashpay:developfrom
knst:fix-6994

Conversation

@knst
Copy link
Copy Markdown
Collaborator

@knst knst commented Apr 26, 2026

Issue being fixed or feature implemented

PR #6994 stopped sending ISDLOCK invs to peers requesting recsigs on the premise that they can reconstruct the islock from the recsig. That reconstruction has a pre-existing race that become critical now.
It causes intermittent failure in interface_zmq_dash.py - wait_for_instantlock timing out.

The previous reconstruction path checked HasRecoveredSigForId before
emplacing into creatingInstantSendLocks, which leaves a window:

thread A: ProcessRecoveredSig writes recsig to db, listener fires
HandleNewRecoveredSig, finds creating empty, no-ops
thread B: TrySignInstantSendLock saw HasRecoveredSigForId=false,
now emplaces and calls AsyncSignIfMember

It causes recsig to be in in db but no one ever builds the islock:

  • db.HasRecoveredSigForHash blocks any subsequent listener firing for
    that recsig
  • TrySignInstantSendLock has already passed its check

Before #6994 this was masked because the peer would still receive the ISDLOCK inv from another node,
but on develop the peer no longer receives an ISDLOCK inv from peers as a fallback.

What was done?

Reorder to emplace first, then check:

  1. Emplace into creatingInstantSendLocks. Any listener that fires
    after this point sees the entry and reconstructs via the existing
    HandleNewInstantSendLockRecoveredSig path.
  2. Then check HasRecoveredSigForId. If already in db, find+erase the
    entry; if find misses, the listener already handled it.
  3. Only call AsyncSignIfMember when no recsig exists yet.

How Has This Been Tested?

Performance improvements after this PR should be still relevant [not tested], but failure rate on my environment reduced from 50% to ~15% for interface_zmq_dash.py

Breaking Changes

N/A

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have made corresponding changes to the documentation
  • I have assigned this pull request to a milestone

PR dashpay#6994 stopped sending ISDLOCK invs to peers requesting recsigs on the premise
that they can reconstruct the islock from the recsig. That reconstruction has
a pre-existing race that become critical now.
It causes intermittent failure in interface_zmq_dash.py - wait_for_instantlock
timing out.

The previous reconstruction path checked HasRecoveredSigForId before
emplacing into creatingInstantSendLocks, which leaves a window:

thread A: ProcessRecoveredSig writes recsig to db, listener fires
          HandleNewRecoveredSig, finds creating empty, no-ops
thread B: TrySignInstantSendLock saw HasRecoveredSigForId=false,
          now emplaces and calls AsyncSignIfMember

It causes recsig to be in in db but no one ever builds the islock:
 - db.HasRecoveredSigForHash blocks any subsequent listener firing for
that recsig
 - TrySignInstantSendLock has already passed its check

Before dashpay#6994 this was masked because the peer would still receive the ISDLOCK inv from another node,
but on develop the peer no longer receives an ISDLOCK inv from peers as a fallback.

Reorder to emplace first, then check:

1. Emplace into creatingInstantSendLocks. Any listener that fires
 after this point sees the entry and reconstructs via the existing
 HandleNewInstantSendLockRecoveredSig path.
2. Then check HasRecoveredSigForId. If already in db, find+erase the
 entry; if find misses, the listener already handled it.
3. Only call AsyncSignIfMember when no recsig exists yet.
@knst knst added this to the 24 milestone Apr 26, 2026
@thepastaclaw
Copy link
Copy Markdown

thepastaclaw commented Apr 26, 2026

✅ Review complete (commit 6587677)

@github-actions
Copy link
Copy Markdown

✅ No Merge Conflicts Detected

This PR currently has no conflicts with other open PRs.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 26, 2026

Walkthrough

The TrySignInstantSendLock function in src/instantsend/signing.cpp was modified to change its operation sequence. Previously, the function returned early if a recovered signature already existed for the given ID. Now, the function always creates and emplaces an InstantSendLock entry in creatingInstantSendLocks first, before interacting with the recovered-sig listener. After emplacement, the function attempts to retrieve the recovered signature from m_sigman. If a recovered signature is found, the function validates that its msgHash matches the current transaction hash. On a successful match, the signature is applied to the lock and the pending lock is emplaced into m_isman with an early return. On a hash mismatch, the recovered signature is discarded.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: reordering operations to reconstruct ISLOCK when a recovered signature arrives first, which is the core fix for the race condition introduced in PR #6994.
Description check ✅ Passed The description comprehensively explains the issue (race condition in ISLOCK reconstruction), root cause (TOCTOU window), solution (reorder emplace before check), and provides context about testing impact.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/instantsend/signing.cpp (1)

417-438: Reduce duplication: delegate to HandleNewInstantSendLockRecoveredSig.

Lines 418–437 reimplement, nearly verbatim, the find/move/erase + msgHash-equality + sig assignment + TryEmplacePendingLock sequence already present in HandleNewInstantSendLockRecoveredSig (lines 249–273). Two issues come from that:

  • DRY: any future change to the reconstruction path (e.g., logging, additional validation, metrics) has to be applied in two places.
  • Log divergence: the conflict-mismatch branch logs different text in each path ("islock conflicts with %s, dropping own version" vs. "existing recovered islock sig has conflicting msgHash %s"), which makes operational grepping harder.

Since HandleNewInstantSendLockRecoveredSig already handles the “entry not found” case (i.e., listener already reconstructed it) by early-returning, calling it here is semantically equivalent.

♻️ Proposed refactor
     // After emplacing, check whether the recsig is already in db. If yes, the listener
     // either already ran while creating was empty (so we must reconstruct here), or it
     // ran after our emplace and already removed the entry (so the find below misses).
     // Either case is handled correctly by the find/erase pattern.
     if (llmq::CRecoveredSig recSig; m_sigman.GetRecoveredSigForId(llmqType, id, recSig)) {
-        InstantSendLockPtr islock_ptr;
-        {
-            LOCK(cs_creating);
-            auto it = creatingInstantSendLocks.find(id);
-            if (it == creatingInstantSendLocks.end()) {
-                // The listener already fired after our emplace and reconstructed; nothing to do.
-                return;
-            }
-            islock_ptr = std::make_shared<InstantSendLock>(std::move(it->second));
-            txToCreatingInstantSendLocks.erase(islock_ptr->txid);
-            creatingInstantSendLocks.erase(it);
-        }
-        if (recSig.getMsgHash() != tx.GetHash()) {
-            LogPrintf("%s -- txid=%s: existing recovered islock sig has conflicting msgHash %s\n",
-                      __func__, tx.GetHash().ToString(), recSig.getMsgHash().ToString());
-            return;
-        }
-        islock_ptr->sig = recSig.sig;
-        m_isman.TryEmplacePendingLock(::SerializeHash(*islock_ptr), /*id=*/-1, islock_ptr);
+        HandleNewInstantSendLockRecoveredSig(recSig);
         return;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instantsend/signing.cpp` around lines 417 - 438, The duplicated
reconstruction + validation + emplace logic in the block that checks
m_sigman.GetRecoveredSigForId (using llmq::CRecoveredSig recSig) should be
replaced by a call to the existing helper HandleNewInstantSendLockRecoveredSig
to avoid DRY and logging divergence; specifically, instead of manually locking
cs_creating, moving/erasing from creatingInstantSendLocks and
txToCreatingInstantSendLocks, comparing recSig.getMsgHash() to tx.GetHash(),
assigning islock_ptr->sig and calling m_isman.TryEmplacePendingLock, delegate to
HandleNewInstantSendLockRecoveredSig(id, recSig, tx) (or the correct parameter
ordering/signature of that function) so all reconstruction, conflict logging and
TryEmplacePendingLock behavior is centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/instantsend/signing.cpp`:
- Around line 417-438: The duplicated reconstruction + validation + emplace
logic in the block that checks m_sigman.GetRecoveredSigForId (using
llmq::CRecoveredSig recSig) should be replaced by a call to the existing helper
HandleNewInstantSendLockRecoveredSig to avoid DRY and logging divergence;
specifically, instead of manually locking cs_creating, moving/erasing from
creatingInstantSendLocks and txToCreatingInstantSendLocks, comparing
recSig.getMsgHash() to tx.GetHash(), assigning islock_ptr->sig and calling
m_isman.TryEmplacePendingLock, delegate to
HandleNewInstantSendLockRecoveredSig(id, recSig, tx) (or the correct parameter
ordering/signature of that function) so all reconstruction, conflict logging and
TryEmplacePendingLock behavior is centralized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b1809ad3-02dc-4236-9df9-f00c3c781fdb

📥 Commits

Reviewing files that changed from the base of the PR and between cfad414 and 6587677.

📒 Files selected for processing (1)
  • src/instantsend/signing.cpp

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6587677f64

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// either already ran while creating was empty (so we must reconstruct here), or it
// ran after our emplace and already removed the entry (so the find below misses).
// Either case is handled correctly by the find/erase pattern.
if (llmq::CRecoveredSig recSig; m_sigman.GetRecoveredSigForId(llmqType, id, recSig)) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Recompute cycle hash from recovered signature quorum

When GetRecoveredSigForId returns true, this branch reuses the cycleHash computed earlier from SelectQuorumForSigning at the current tip and only swaps in recSig.sig. If the recovered signature was created in an older active set (for example, tx seen after a quorum rotation), the reconstructed ISLOCK can carry a cycleHash that doesn't match recSig's quorum, so peers that verify via cycleHash (see net_instantsend.cpp quorum selection path) may reject the relayed lock. This path should derive/validate cycleHash from recSig.getQuorumHash() before enqueuing.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@thepastaclaw thepastaclaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The patch in src/instantsend/signing.cpp does close the specific TOCTOU window it targets by registering the pending ISLOCK before consulting the recovered-signature store, and I did not find a correctness regression in that new locking sequence. The remaining issue is test coverage: this timing-sensitive reconstruction path changed without any new regression coverage that proves the "recovered sig arrives before local ISLOCK scheduling" case stays fixed.

Reviewed commit: 6587677

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/instantsend/signing.cpp`:
- [SUGGESTION] lines 401-440: The race fix changed ISLOCK reconstruction without adding regression coverage for the reordered path
  This change only touches `TrySignInstantSendLock()` and no unit or functional test was added or updated alongside it. The new behavior depends on a narrow interleaving between `creatingInstantSendLocks`, `HandleNewRecoveredSig()`, and `GetRecoveredSigForId()`: the code is now correct only if the "recovered sig already exists or arrives concurrently" path keeps routing reconstruction through the new find/erase flow. Because that path is timing-dependent and not exercised explicitly anywhere in the checked-in tests, CI still has no deterministic signal that this race is actually closed or that a later refactor will not reopen it.

Comment on lines 401 to 440
@@ -411,6 +410,33 @@ void InstantSendSigner::TrySignInstantSendLock(const CTransaction& tx)
txToCreatingInstantSendLocks.emplace(tx.GetHash(), &e.first->second);
}

// After emplacing, check whether the recsig is already in db. If yes, the listener
// either already ran while creating was empty (so we must reconstruct here), or it
// ran after our emplace and already removed the entry (so the find below misses).
// Either case is handled correctly by the find/erase pattern.
if (llmq::CRecoveredSig recSig; m_sigman.GetRecoveredSigForId(llmqType, id, recSig)) {
InstantSendLockPtr islock_ptr;
{
LOCK(cs_creating);
auto it = creatingInstantSendLocks.find(id);
if (it == creatingInstantSendLocks.end()) {
// The listener already fired after our emplace and reconstructed; nothing to do.
return;
}
islock_ptr = std::make_shared<InstantSendLock>(std::move(it->second));
txToCreatingInstantSendLocks.erase(islock_ptr->txid);
creatingInstantSendLocks.erase(it);
}
if (recSig.getMsgHash() != tx.GetHash()) {
LogPrintf("%s -- txid=%s: existing recovered islock sig has conflicting msgHash %s\n",
__func__, tx.GetHash().ToString(), recSig.getMsgHash().ToString());
return;
}
islock_ptr->sig = recSig.sig;
m_isman.TryEmplacePendingLock(::SerializeHash(*islock_ptr), /*id=*/-1, islock_ptr);
return;
}

m_shareman.AsyncSignIfMember(llmqType, m_sigman, id, tx.GetHash(), quorum->m_quorum_base_block_index->GetBlockHash());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: The race fix changed ISLOCK reconstruction without adding regression coverage for the reordered path

This change only touches TrySignInstantSendLock() and no unit or functional test was added or updated alongside it. The new behavior depends on a narrow interleaving between creatingInstantSendLocks, HandleNewRecoveredSig(), and GetRecoveredSigForId(): the code is now correct only if the "recovered sig already exists or arrives concurrently" path keeps routing reconstruction through the new find/erase flow. Because that path is timing-dependent and not exercised explicitly anywhere in the checked-in tests, CI still has no deterministic signal that this race is actually closed or that a later refactor will not reopen it.

source: ['codex']

🤖 Fix this with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `src/instantsend/signing.cpp`:
- [SUGGESTION] lines 401-440: The race fix changed ISLOCK reconstruction without adding regression coverage for the reordered path
  This change only touches `TrySignInstantSendLock()` and no unit or functional test was added or updated alongside it. The new behavior depends on a narrow interleaving between `creatingInstantSendLocks`, `HandleNewRecoveredSig()`, and `GetRecoveredSigForId()`: the code is now correct only if the "recovered sig already exists or arrives concurrently" path keeps routing reconstruction through the new find/erase flow. Because that path is timing-dependent and not exercised explicitly anywhere in the checked-in tests, CI still has no deterministic signal that this race is actually closed or that a later refactor will not reopen it.

@knst knst marked this pull request as draft April 26, 2026 19:21
@knst
Copy link
Copy Markdown
Collaborator Author

knst commented Apr 27, 2026

This fix is a workaround and not fixing underlying issue; failure rate is dropped from 50% to 15% but it hasn't been actually fixed. Fix is #7293

@knst knst closed this Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants