
fix(kad-dht): getClosestPeers timeout #3425

Open
dozyio wants to merge 16 commits into libp2p:main from dozyio:fix/kad-getClosestPeers

Conversation


@dozyio dozyio commented Mar 31, 2026

Description

  • Move routing-table updates off the getClosestPeers hot path so query progress/event streaming is not blocked.
  • Centralise update scheduling in RoutingTable and use a dedicated PeerQueue with per-peer dedupe (find/join).
  • Apply the same queued update path for topology-triggered updates (onPeerConnect).
  • Add configurable queue concurrency via routingTableUpdateQueueConcurrency.
  • Expose queue depth metrics through the queue's metrics wiring so they appear in Grafana.
  • Update getClosestPeers to yield query results directly (no peer filtering/mutation).
  • Wire missing networkDialTimeout (thanks @paschal533).
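The queued, deduplicated update path described above can be sketched as follows. This is an illustrative, self-contained version only - the actual PR uses PeerQueue from @libp2p/utils, and the class and method names here are stand-ins (concurrency limiting is omitted for brevity):

```typescript
// Illustrative sketch of a per-peer deduplicating update queue.
// The real PR uses PeerQueue from @libp2p/utils; the names below are
// stand-ins and concurrency limiting is omitted for brevity.

type PeerId = string

interface UpdateJob {
  // Resolves when the queued update for this peer completes
  join(): Promise<void>
}

class RoutingTableUpdateQueue {
  private readonly jobs = new Map<PeerId, UpdateJob>()

  // Schedule an update for a peer. If an update for the same peer is
  // already in flight, join the existing job instead of enqueueing a
  // duplicate - this is the find/join dedupe pattern.
  add (peerId: PeerId, update: () => Promise<void>): Promise<void> {
    const existing = this.find(peerId)

    if (existing != null) {
      return existing.join()
    }

    const promise = update().finally(() => {
      this.jobs.delete(peerId)
    })

    this.jobs.set(peerId, { join: async () => promise })

    return promise
  }

  find (peerId: PeerId): UpdateJob | undefined {
    return this.jobs.get(peerId)
  }

  get size (): number {
    return this.jobs.size
  }
}
```

The point of the dedupe is that a burst of query results mentioning the same peer schedules at most one routing-table update for it.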

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation if necessary (this includes comments as well)
  • I have added tests that prove my fix is effective or that my feature works

@dozyio dozyio changed the title fix: kad getClosestPeers timeout fix(kad-dht): getClosestPeers timeout Apr 3, 2026
@dozyio dozyio marked this pull request as ready for review April 3, 2026 00:57
@dozyio dozyio requested a review from a team as a code owner April 3, 2026 00:57

dozyio commented Apr 7, 2026

@achingbrain any chance you could take a look at this one please?


@achingbrain achingbrain left a comment


A few notes:

  • Maybe make the routing table .add method synchronous and encapsulate the update queuing logic inside the routing table class instead. The query manager seems like the wrong place to me.
  • If you use the PeerQueue class from @libp2p/utils instead of the generic Queue it will deduplicate jobs based on the passed PeerId for you (call .find(peerId) on the queue and then .join(opts) on the job if present).
  • Pass a metrics instance and metricName option to the Queue/PeerQueue constructor to have it show up in Grafana (essential for tuning parameters)
  • You can continue to use a queue after .abort() is called on it, no need to have the queue field be Queue | undefined in the containing class - just create it in the constructor as normal
  • Most of the routing table stats would be available to Grafana via the built-in metrics; it seems redundant to hold them in memory
  • The routing table stats seem to be most useful for tests - consider using Sinon spies instead
  • There seems to be some duplicate logic around when to perform routing table updates - this is defined by the spec and the routing table already has an interval during which a peer will not be re-pinged
  • "Routing" is an overloaded term in libp2p so the new config options should be fully qualified (e.g. routingTableUpdateQueueConcurrency)

Comment on lines +270 to +283
}))) {
continue
}


@achingbrain achingbrain Apr 8, 2026


These conditionals probably all need to go and any results found should just be yielded.

A peer can be returned from a query without multiaddrs. It's not invalid, it just means that their multiaddrs have expired in the peerstore of the node returning the peer.

From memory, I think multiaddr expiration was added in js after this code was written, though it was present in go - which is why peers were returned without multiaddrs and that looked like an error.

If the calling code wants to dial the peer, it should execute a findPeer query to get updated multiaddrs before dialling (the dial queue already does this here).
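The dial-side pattern the reviewer describes - yield query results unmodified, and only run findPeer when a dial actually needs fresh multiaddrs - might look roughly like this. The interfaces below are simplified stand-ins, not the real libp2p types:

```typescript
// Simplified stand-ins for the real libp2p types - illustrative only.
interface PeerInfo {
  id: string
  multiaddrs: string[]
}

interface Dht {
  findPeer(id: string): Promise<PeerInfo>
}

// A peer yielded by getClosestPeers may legitimately have no
// multiaddrs - its addresses may simply have expired in the remote
// node's peerstore. Refresh them with a findPeer query only when a
// dial is actually needed.
async function resolveForDial (peer: PeerInfo, dht: Dht): Promise<PeerInfo> {
  if (peer.multiaddrs.length > 0) {
    return peer
  }

  return dht.findPeer(peer.id)
}
```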

yield finalPeerEvent({
from: this.components.peerId,
peer: await self.components.peerStore.getInfo(peer.id, options),
peer,

@achingbrain achingbrain Apr 8, 2026


This is better, we should just yield the query result without modifying it.


dozyio commented Apr 12, 2026

Thanks @achingbrain for the feedback - hopefully I've addressed all the issues, although I didn't make the routing table .add method synchronous, as that would require a fair amount of change - many of the internals are async.

Also reduced the scope a bit by removing the peer TTL stuff and custom queue metrics.


@achingbrain achingbrain left a comment


LGTM - tested locally and it's much faster.

A couple of small comments inline.

Comment thread packages/kad-dht/src/routing-table/index.ts Outdated
}
}, {
peerId,
signal: options.signal


The composite signal needs to be passed to the job.

Suggested change
- signal: options.signal
+ signal

Comment on lines +125 to +129
this.routingTableUpdateQueue = new PeerQueue<void>({
concurrency: init.routingTableUpdateQueueConcurrency ?? ROUTING_TABLE_UPDATE_QUEUE_CONCURRENCY,
metricName: `${init.metricsPrefix}_routing_table_update_queue`,
metrics: this.components.metrics
})


Should also allow configuring the maxSize parameter otherwise this can grow in an unbounded fashion.
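A bounded variant could look roughly like the sketch below, where new jobs are rejected once the backlog reaches maxSize. The names and the throw-when-full behaviour are illustrative assumptions - the real PeerQueue in @libp2p/utils has its own semantics:

```typescript
interface BoundedQueueInit {
  concurrency?: number
  maxSize?: number
}

// Sketch: reject new jobs once the backlog reaches maxSize so the
// queue cannot grow in an unbounded fashion under churn.
class BoundedQueue {
  private readonly pending: Array<() => Promise<void>> = []
  private running = 0
  private readonly concurrency: number
  private readonly maxSize: number

  constructor (init: BoundedQueueInit = {}) {
    this.concurrency = init.concurrency ?? 1
    this.maxSize = init.maxSize ?? Infinity
  }

  add (job: () => Promise<void>): void {
    if (this.pending.length >= this.maxSize) {
      throw new Error('queue is full')
    }

    this.pending.push(job)
    this.drain()
  }

  private drain (): void {
    while (this.running < this.concurrency && this.pending.length > 0) {
      const job = this.pending.shift()

      if (job == null) {
        break
      }

      this.running++
      job().finally(() => {
        this.running--
        this.drain()
      })
    }
  }

  get size (): number {
    return this.pending.length + this.running
  }
}
```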
