Fix peer store management on channel closure#895
Fix peer store management on channel closure#895Jolah1 wants to merge 2 commits intolightningdevkit:mainfrom
Conversation
|
I've assigned @tnull as a reviewer! |
Camillarhi
left a comment
There was a problem hiding this comment.
Thanks for working on this. I left some comments. Also, can you squash the commits?
| user_channel_id: UserChannelId(user_channel_id), | ||
| counterparty_node_id, | ||
| reason: Some(reason), | ||
| reason: Some(reason.clone()), |
There was a problem hiding this comment.
reason.clone() is unnecessary as reason was not moved by the match
There was a problem hiding this comment.
Good catch, removed.
Thanks!
| if let Some(counterparty_node_id) = counterparty_node_id { | ||
| let all_channels = | ||
| self.channel_manager.list_channels_with_counterparty(&counterparty_node_id); | ||
|
|
||
| let has_other_channels = | ||
| all_channels.iter().any(|c| c.channel_id != channel_id); | ||
|
|
||
| let counterparty_initiated = matches!( | ||
| reason, | ||
| ClosureReason::CounterpartyForceClosed { .. } | ||
| | ClosureReason::CounterpartyInitiatedCooperativeClosure | ||
| ); |
There was a problem hiding this comment.
We should flip the order here and check the closure reason first, so the read on the channel manager only happens when it's actually relevant
There was a problem hiding this comment.
That's true, no point querying the channel manager if the closure reason rules out action.
It has been reordered.
| // Check if this was the last open channel, if so, forget the peer. | ||
| if open_channels.len() == 1 { | ||
| //For force-closes we keep the peer | ||
| // in the store so the background reconnection task can fire and | ||
| // complete the channel_reestablish recovery flow. This is especially | ||
| // important for LND peers | ||
| // The peer will be removed by the ChannelClosed event handler once | ||
| // the counterparty reconnects and confirms closure. |
There was a problem hiding this comment.
This comment should flow as one paragraph rather than fragmented lines, and the last sentence is misleading, the ChannelClosed event handler only removes the peer on counterparty-initiated closures, so after a local force-close the peer isn't actually cleaned up there. It's retained indefinitely until the user calls disconnect.
Something like:
// If this was the last open channel and we're closing cooperatively, forget // the peer, since we have no further reason to reconnect. For force-closes we // intentionally keep the peer in the store so the background reconnection // task keeps firing and can drive the channel_reestablish recovery flow. // This is especially important against LND peers, which don't always handle // force-closure error messages correctly. Note that this means a force-closed // peer is retained until the user explicitly calls Node::disconnect.
There was a problem hiding this comment.
You're right, the last sentence was inaccurate.
The handler only removes on counterparty-initiated closures, not local force-closes.
Replaced with your suggested comment.
| } | ||
|
|
||
| #[tokio::test(flavor = "multi_thread", worker_threads = 1)] | ||
| async fn test_peer_retained_on_local_force_close() { |
There was a problem hiding this comment.
This test can be folded into do_channel_full_cycle as it already runs both force_close == true and false paths, so the assertion can be done there
There was a problem hiding this comment.
Thanks for pointing this out. I have folded the assertions into do_channel_full_cycle alongside the existing force-close paths rather than keeping them as separate tests.
|
🔔 1st Reminder Hey @tnull! This PR has been waiting for your review. |
c1bd96c to
7609c26
Compare
Thanks for the thorough review. All four points have been addressed . Will squash the commits once the changes are in. |
| // If the counterparty initiated closure of their last remaining channel | ||
| // with us, remove them from the peer store so we stop trying to reconnect. | ||
| // | ||
| // If we initiated the closure, keep them in the peer store so the | ||
| // background reconnection task fires and we can complete the | ||
| // channel_reestablish recovery flow. This matters especially for LND | ||
| // peers, which need us to reconnect to recover from force-closures. | ||
| // | ||
| // We exclude `channel_id` from the remaining-channel check because LDK | ||
| // fires ChannelClosed before removing the channel from its internal list, | ||
| // so list_channels_with_counterparty still includes the closing channel. |
There was a problem hiding this comment.
This comment partially duplicates the rationale in the lib.rs change. I suggest trimming it to just what this block does, and let the lib.rs comment carry the LND recovery rationale
There was a problem hiding this comment.
Thank you!
I have trimmed the comment here to focus on the behavior of this block.
Summary
This PR resolves both issues described in #745 related to peer store management on channel closure.
Fixed behaviors
channel_reestablishrecoveryProblems
1. Local force-close removes peer too early
When we force-closed a channel in
close_channel_internal, we immediately removed the peer from the store regardless of closure type:This broke recovery. The background reconnection task only reconnects to peers in the store. Removing the peer immediately guaranteed we would never reconnect, so
channel_reestablishcould never complete. This is especially problematic for LND peers, which may not handle force-closure error messages correctly and rely on us reconnecting to recover.2. Counterparty force-close never cleans up peer
When a counterparty force-closed their last channel with us, the
ChannelClosedevent handler did nothing with the peer store. The peer remained persisted indefinitely and the node kept attempting to reconnect on every background tick. wasted resources and misleading peer state.Solution
1. Retain peer on local force-close (
src/lib.rs)In
close_channel_internal, peers are now only removed for cooperativecloses:
For force-closes, the peer is retained in the store so the background reconnection task can fire,
channel_reestablishcan complete, and peer cleanup is deferred to theChannelClosedevent handler.2. Remove peer on counterparty-initiated closure (
src/event.rs)In the
LdkEvent::ChannelClosedhandler, we now check:If both conditions are true, the peer is removed from the store.
Implementation Details
LDK timing behavior
LDK emits
ChannelClosedbefore removing the channel from its internal state. A naive call tolist_channels_with_counterparty()inside the handler would still return the closing channel, making the "last channel" check always wrong. We fix this by excluding the closing channel explicitly:Ordering to avoid race conditions
Peer removal is performed before calling
add_event. This ensures consumers ofnext_event_async()always observe a consistent peer store when they react toChannelClosed— avoiding a race where the event is visible before the removal completes.Tests
Added two integration tests in
tests/integration_tests_rust.rs:test_peer_removed_on_counterparty_force_closeVerifies that after the counterparty force-closes the last channel, the peer is no longer persisted in the local peer store.
test_peer_retained_on_local_force_closeVerifies that after a locally initiated force-close, the peer remains persisted in the local peer store to allow background reconnection and
channel_reestablishrecovery.Both tests assert on
is_persistedrather than peer presence to correctly distinguish stored peers from transient TCP connections that may still appear inlist_peers()during teardown.Results
test test_peer_removed_on_counterparty_force_close ... ok
test test_peer_retained_on_local_force_close ... ok
test result: ok. 41 passed; 0 failed; 0 ignored; 0 measured
cargo build → no errors
cargo clippy → no new warnings
cargo fmt → applied