fix: reconstruct ISLOCK locally when recsig arrived first #7291
knst wants to merge 1 commit into dashpay:develop from
Conversation
PR dashpay#6994 stopped sending ISDLOCK invs to peers requesting recsigs, on the premise that they can reconstruct the islock from the recsig. That reconstruction has a pre-existing race that has now become critical: it causes intermittent failures in interface_zmq_dash.py, with wait_for_instantlock timing out.

The previous reconstruction path checked HasRecoveredSigForId before emplacing into creatingInstantSendLocks, which leaves a window:

- thread A: ProcessRecoveredSig writes the recsig to the db; the listener fires HandleNewRecoveredSig, finds creatingInstantSendLocks empty, and no-ops
- thread B: TrySignInstantSendLock saw HasRecoveredSigForId=false, now emplaces and calls AsyncSignIfMember

The result is a recsig sitting in the db while no one ever builds the islock:

- db.HasRecoveredSigForHash blocks any subsequent listener firing for that recsig
- TrySignInstantSendLock has already passed its check

Before dashpay#6994 this was masked because the peer would still receive the ISDLOCK inv from another node, but on develop the peer no longer receives an ISDLOCK inv from peers as a fallback.

Reorder to emplace first, then check (a simplified model follows below):

1. Emplace into creatingInstantSendLocks. Any listener that fires after this point sees the entry and reconstructs via the existing HandleNewInstantSendLockRecoveredSig path.
2. Then check HasRecoveredSigForId. If the recsig is already in the db, find and erase the entry; if the find misses, the listener already handled it.
3. Only call AsyncSignIfMember when no recsig exists yet.
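To make the interleaving concrete, here is a minimal, self-contained C++ model of the reordering. It is a sketch, not the actual Dash Core code: `Model`, `creating`, `recsig_in_db`, and the string payloads are simplified stand-ins for the signer class, creatingInstantSendLocks, the recovered-sig db, and real islocks.

```cpp
#include <atomic>
#include <map>
#include <mutex>
#include <optional>
#include <string>

struct Model {
    std::mutex cs;                          // stands in for cs_creating
    std::map<int, std::string> creating;    // stands in for creatingInstantSendLocks
    std::atomic<bool> recsig_in_db{false};  // stands in for the recovered-sig db

    // Listener path (thread A): fires after a recsig has been written to the db.
    std::optional<std::string> HandleNewRecoveredSig(int id) {
        std::lock_guard<std::mutex> l(cs);
        auto it = creating.find(id);
        if (it == creating.end()) return std::nullopt; // no pending entry: nothing to reconstruct
        std::string lock = std::move(it->second);
        creating.erase(it);
        return lock; // reconstructed islock
    }

    // Signer path (thread B), fixed ordering: emplace first, then consult the db.
    std::optional<std::string> TrySign(int id) {
        {
            std::lock_guard<std::mutex> l(cs);
            creating.emplace(id, "islock-" + std::to_string(id));
        }
        // A listener may fire between the emplace and this check, but it now
        // finds the entry and reconstructs; our find below then simply misses.
        if (recsig_in_db.load()) {
            std::lock_guard<std::mutex> l(cs);
            auto it = creating.find(id);
            if (it == creating.end()) return std::nullopt; // listener already handled it
            std::string lock = std::move(it->second);
            creating.erase(it);
            return lock; // reconstruct locally
        }
        return std::nullopt; // no recsig yet: this is where AsyncSignIfMember would run
    }
};
```

With the old check-then-emplace ordering, a listener firing between the db check and the emplace finds an empty map while the signer has already observed recsig_in_db == false, so neither side builds the lock; emplacing first closes that window.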
✅ Review complete (commit 6587677)
✅ No Merge Conflicts Detected: This PR currently has no conflicts with other open PRs.
Walkthrough
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
🧹 Nitpick comments (1)
src/instantsend/signing.cpp (1)
417-438: Reduce duplication: delegate to HandleNewInstantSendLockRecoveredSig.

Lines 418–437 reimplement, nearly verbatim, the find/move/erase + msgHash-equality + sig assignment + TryEmplacePendingLock sequence already present in HandleNewInstantSendLockRecoveredSig (lines 249–273). Two issues come from that:

- DRY: any future change to the reconstruction path (e.g., logging, additional validation, metrics) has to be applied in two places.
- Log divergence: the conflict-mismatch branch logs different text in each path ("islock conflicts with %s, dropping own version" vs. "existing recovered islock sig has conflicting msgHash %s"), which makes operational grepping harder.

Since HandleNewInstantSendLockRecoveredSig already handles the "entry not found" case (i.e., the listener already reconstructed it) by early-returning, calling it here is semantically equivalent.

♻️ Proposed refactor
```diff
 // After emplacing, check whether the recsig is already in db. If yes, the listener
 // either already ran while creating was empty (so we must reconstruct here), or it
 // ran after our emplace and already removed the entry (so the find below misses).
 // Either case is handled correctly by the find/erase pattern.
 if (llmq::CRecoveredSig recSig; m_sigman.GetRecoveredSigForId(llmqType, id, recSig)) {
-    InstantSendLockPtr islock_ptr;
-    {
-        LOCK(cs_creating);
-        auto it = creatingInstantSendLocks.find(id);
-        if (it == creatingInstantSendLocks.end()) {
-            // The listener already fired after our emplace and reconstructed; nothing to do.
-            return;
-        }
-        islock_ptr = std::make_shared<InstantSendLock>(std::move(it->second));
-        txToCreatingInstantSendLocks.erase(islock_ptr->txid);
-        creatingInstantSendLocks.erase(it);
-    }
-    if (recSig.getMsgHash() != tx.GetHash()) {
-        LogPrintf("%s -- txid=%s: existing recovered islock sig has conflicting msgHash %s\n",
-                  __func__, tx.GetHash().ToString(), recSig.getMsgHash().ToString());
-        return;
-    }
-    islock_ptr->sig = recSig.sig;
-    m_isman.TryEmplacePendingLock(::SerializeHash(*islock_ptr), /*id=*/-1, islock_ptr);
+    HandleNewInstantSendLockRecoveredSig(recSig);
     return;
 }
```
📒 Files selected for processing (1)
src/instantsend/signing.cpp
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6587677f64
```cpp
// either already ran while creating was empty (so we must reconstruct here), or it
// ran after our emplace and already removed the entry (so the find below misses).
// Either case is handled correctly by the find/erase pattern.
if (llmq::CRecoveredSig recSig; m_sigman.GetRecoveredSigForId(llmqType, id, recSig)) {
```
Recompute cycle hash from recovered signature quorum
When GetRecoveredSigForId returns true, this branch reuses the cycleHash computed earlier from SelectQuorumForSigning at the current tip and only swaps in recSig.sig. If the recovered signature was created in an older active set (for example, tx seen after a quorum rotation), the reconstructed ISLOCK can carry a cycleHash that doesn't match recSig's quorum, so peers that verify via cycleHash (see net_instantsend.cpp quorum selection path) may reject the relayed lock. This path should derive/validate cycleHash from recSig.getQuorumHash() before enqueuing.
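A tiny sketch of the invariant Codex is pointing at (simplified types; quorum_to_cycle and CycleHashForRecSig are hypothetical stand-ins, not the real Dash Core quorum-manager API, and uint64_t stands in for uint256):

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>

struct RecSigView { uint64_t quorum_hash; };

// Derive the cycle hash from the quorum that actually produced the recovered
// sig, rather than reusing the cycle hash selected at the current chain tip,
// which may belong to a newer rotation.
uint64_t CycleHashForRecSig(const RecSigView& recsig,
                            const std::map<uint64_t, uint64_t>& quorum_to_cycle) {
    auto it = quorum_to_cycle.find(recsig.quorum_hash);
    if (it == quorum_to_cycle.end()) throw std::runtime_error("unknown quorum");
    return it->second;
}
```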
thepastaclaw left a comment
Code Review
The patch in src/instantsend/signing.cpp does close the specific TOCTOU window it targets by registering the pending ISLOCK before consulting the recovered-signature store, and I did not find a correctness regression in that new locking sequence. The remaining issue is test coverage: this timing-sensitive reconstruction path changed without any new regression coverage that proves the "recovered sig arrives before local ISLOCK scheduling" case stays fixed.
Reviewed commit: 6587677
🟡 1 suggestion(s)
```diff
@@ -411,6 +410,33 @@ void InstantSendSigner::TrySignInstantSendLock(const CTransaction& tx)
         txToCreatingInstantSendLocks.emplace(tx.GetHash(), &e.first->second);
     }
 
+    // After emplacing, check whether the recsig is already in db. If yes, the listener
+    // either already ran while creating was empty (so we must reconstruct here), or it
+    // ran after our emplace and already removed the entry (so the find below misses).
+    // Either case is handled correctly by the find/erase pattern.
+    if (llmq::CRecoveredSig recSig; m_sigman.GetRecoveredSigForId(llmqType, id, recSig)) {
+        InstantSendLockPtr islock_ptr;
+        {
+            LOCK(cs_creating);
+            auto it = creatingInstantSendLocks.find(id);
+            if (it == creatingInstantSendLocks.end()) {
+                // The listener already fired after our emplace and reconstructed; nothing to do.
+                return;
+            }
+            islock_ptr = std::make_shared<InstantSendLock>(std::move(it->second));
+            txToCreatingInstantSendLocks.erase(islock_ptr->txid);
+            creatingInstantSendLocks.erase(it);
+        }
+        if (recSig.getMsgHash() != tx.GetHash()) {
+            LogPrintf("%s -- txid=%s: existing recovered islock sig has conflicting msgHash %s\n",
+                      __func__, tx.GetHash().ToString(), recSig.getMsgHash().ToString());
+            return;
+        }
+        islock_ptr->sig = recSig.sig;
+        m_isman.TryEmplacePendingLock(::SerializeHash(*islock_ptr), /*id=*/-1, islock_ptr);
+        return;
+    }
+
     m_shareman.AsyncSignIfMember(llmqType, m_sigman, id, tx.GetHash(), quorum->m_quorum_base_block_index->GetBlockHash());
```
🟡 Suggestion: The race fix changed ISLOCK reconstruction without adding regression coverage for the reordered path
This change only touches TrySignInstantSendLock() and no unit or functional test was added or updated alongside it. The new behavior depends on a narrow interleaving between creatingInstantSendLocks, HandleNewRecoveredSig(), and GetRecoveredSigForId(): the code is now correct only if the "recovered sig already exists or arrives concurrently" path keeps routing reconstruction through the new find/erase flow. Because that path is timing-dependent and not exercised explicitly anywhere in the checked-in tests, CI still has no deterministic signal that this race is actually closed or that a later refactor will not reopen it.
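As an illustration of the kind of deterministic signal being asked for, here is a stress loop over the simplified Model struct from the sketch near the top of this page. It is a hedged, model-level check of the ordering, not a real Dash Core unit or functional test:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Requires the Model struct from the earlier sketch.
int main() {
    for (int i = 0; i < 100000; ++i) {
        Model m;
        std::atomic<int> built{0};
        std::thread a([&] {
            m.recsig_in_db.store(true);               // recsig hits the db...
            if (m.HandleNewRecoveredSig(i)) built++;  // ...then the listener fires
        });
        std::thread b([&] {
            if (m.TrySign(i)) built++;                // the signer races the listener
        });
        a.join();
        b.join();
        // With emplace-then-check, exactly one path reconstructs the lock in
        // every interleaving; with the old check-then-emplace ordering the
        // stranded-recsig case makes this assertion fail intermittently.
        assert(built.load() == 1);
    }
}
```

Under the old ordering the assertion trips whenever the listener fires between the signer's db check and its emplace; under the new ordering it holds for all interleavings.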
This fix is a workaround and does not address the underlying issue; the failure rate dropped from 50% to 15%, but the race hasn't actually been fixed. The real fix is #7293.
Issue being fixed or feature implemented
PR #6994 stopped sending ISDLOCK invs to peers requesting recsigs, exposing a pre-existing race in islock reconstruction that causes intermittent wait_for_instantlock timeouts in interface_zmq_dash.py (see the full description at the top of this page).
What was done?
Reordered TrySignInstantSendLock to emplace into creatingInstantSendLocks first, then check HasRecoveredSigForId, and only call AsyncSignIfMember when no recsig exists yet (see the numbered steps above).
How Has This Been Tested?
Performance improvements from this PR should still be relevant [not tested], but the failure rate for interface_zmq_dash.py in my environment dropped from 50% to ~15%.
Breaking Changes
N/A
Checklist: