fix: nat lost in some p2p apps #2216
Conversation
What does "losing NAT" mean?
Not sure why I can't comment. This PR has a few serious bugs, but my comments won't go through.
Looking at it purely from the QUIC side, reusing the existing connection is indeed the right approach; I wasn't familiar enough with QUIC when I originally wrote this.
|
As for transport_config, the QUIC tunnel and the QUIC proxy probably need different parameters; that still needs testing.
Inside the container, traffic is forwarded through gost (relay protocol, traffic encapsulated as TCP, similar to vless); EasyTier only acts as the network relay and only handles TCP requests.
Force-pushed from 8dee0fc to b16fec3
Force-pushed from 73c1356 to a8ab9ab
Also, does the problem still occur when only enable_kcp_proxy is enabled and enable_quic_proxy is not?
I'm no longer sure whether KCP has the problem; I need to observe it a bit more.
Force-pushed from 70fee73 to bf8e376
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR changes the QUIC proxy behavior to reuse a single QUIC connection per destination peer (keyed by dst_peer_id) to mitigate NAT-loss issues observed in some P2P app scenarios over a transparent proxy/tun setup.
Changes:
- Introduce a per-peer connection cache (moka::future::Cache) to reuse quinn::Connection by PeerId.
- Replace the multi-attempt concurrent connect logic with a simpler retry loop that reuses/invalidates cached connections.
- Adjust stream receive task handling to await transfer completion and log transfer errors.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| easytier/src/gateway/quic_proxy.rs | Adds per-peer QUIC connection caching + retry/invalidate logic; updates stream task execution to await transfer result. |
| easytier/Cargo.toml | Adds moka dependency (future cache) to support connection reuse. |
Code excerpts referenced by the inline review comments:

    let conn_map = Cache::builder()
        .max_capacity(u8::MAX.into()) // same with max_concurrent_bidi_streams, can be increased

    let mut connect_tasks = JoinSet::<Result<QuicStream, Error>>::new();
    let connect = |tasks: &mut JoinSet<_>| {

    for attempt in 0..2 {

        if attempt == 0 {
            self.conn_map.invalidate(&dst_peer_id).await;
            tokio::time::sleep(Duration::from_millis(300)).await;
        }
    }

    Err(anyhow!("quic connect: failed to establish stream after retry").into())

    let conn_map = Cache::builder()
        .max_capacity(u8::MAX.into()) // same with max_concurrent_bidi_streams, can be increased
        .time_to_idle(Duration::from_secs(600))

        .await;

    match stream {
        Ok(stream) => return Ok(stream),
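For orientation, here is a minimal sketch of the pattern these excerpts describe: one cached quinn::Connection per dst_peer_id obtained through moka's try_get_with, invalidated and retried once when open_bi fails. The names QuicConnector, PeerId = u32, the "localhost" server name, and the plain (SendStream, RecvStream) return type are illustrative assumptions, not the PR's actual types.

```rust
// Minimal sketch, not the PR's exact code: reuse one quinn::Connection per peer.
use std::time::Duration;

use anyhow::{anyhow, Result};
use moka::future::Cache;
use quinn::{Connection, Endpoint, RecvStream, SendStream};

type PeerId = u32; // assumption for this sketch

struct QuicConnector {
    endpoint: Endpoint,
    conn_map: Cache<PeerId, Connection>,
}

impl QuicConnector {
    fn new(endpoint: Endpoint) -> Self {
        let conn_map = Cache::builder()
            .max_capacity(u8::MAX.into())          // mirrors the excerpt above
            .time_to_idle(Duration::from_secs(600)) // evict entries idle for 10 min
            .build();
        Self { endpoint, conn_map }
    }

    async fn open_stream(
        &self,
        dst_peer_id: PeerId,
        remote: std::net::SocketAddr,
    ) -> Result<(SendStream, RecvStream)> {
        for attempt in 0..2 {
            // try_get_with only dials on a cache miss; otherwise the cached
            // Connection for this peer is reused.
            let conn = self
                .conn_map
                .try_get_with(dst_peer_id, async {
                    // "localhost" is a placeholder server name for the sketch.
                    let conn = self.endpoint.connect(remote, "localhost")?.await?;
                    Ok::<_, anyhow::Error>(conn)
                })
                .await
                .map_err(|e| anyhow!("quic connect failed: {e}"))?;

            match conn.open_bi().await {
                Ok(stream) => return Ok(stream),
                Err(_) if attempt == 0 => {
                    // The cached connection is likely dead: drop it and retry once.
                    self.conn_map.invalidate(&dst_peer_id).await;
                    tokio::time::sleep(Duration::from_millis(300)).await;
                }
                Err(e) => return Err(e.into()),
            }
        }
        Err(anyhow!("quic connect: failed to establish stream after retry"))
    }
}
```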
If a stream on a connection is still alive, but no new connections are being created, will that conn also be evicted once it exceeds the TTL?
Eviction is triggered by the next try_get_with: only when the TTL has been exceeded does it re-init a new one. Or you invalidate manually, and the next try_get_with triggers a new init.
What I mean is: if two nodes have a live long-lived connection between them, and no new connections are ever established, will that long-lived connection be dropped by the cache TTL?
Eviction is lazy; moka does not clean up proactively (there is no background async task). Cleanup only happens on the next call of the connect fn, and the old quinn::Connection is then replaced by a new one, but the active SendStream/RecvStream hold a connection reference internally → the QUIC connection stays alive.
If cleanup is lazy, then after a burst of connections, if no new connections arrive afterwards, won't that memory stay occupied forever?
Later code adds a periodic task that calls run_pending_tasks every 60s to clean up.
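For reference, a sketch of such a maintenance task (assuming a tokio runtime and a clone of the cache passed in; function name and the u32 key type are illustrative):

```rust
use std::time::Duration;

use moka::future::Cache;

// Sketch only: moka evicts lazily, so periodically run its pending maintenance
// work even when no new connect call arrives.
fn spawn_cache_maintenance(conn_map: Cache<u32, quinn::Connection>) {
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(60));
        loop {
            interval.tick().await;
            // Drops entries whose time_to_idle has expired.
            conn_map.run_pending_tasks().await;
        }
    });
}
```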
Why is manual cleanup needed? Just let the Cache evict on its own based on TTI; otherwise what is the point of using a Cache?
A stream holds a reference to the Connection, so it doesn't matter if the Connection inside the Cache gets released; that's how I did it before. If you're not confident, add a unit test, and if there really is a problem, just Clone a Connection handle and bind it to the stream's lifetime.
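A minimal sketch of that fallback, in case it ever turns out to be needed (ProxiedStream and open_proxied_stream are illustrative names, not EasyTier APIs):

```rust
// Clone the quinn::Connection handle out of the cache and keep it alongside the
// stream, so evicting the cache entry can never be the last handle to drop.
struct ProxiedStream {
    _conn: quinn::Connection, // kept only to pin the connection's lifetime
    send: quinn::SendStream,
    recv: quinn::RecvStream,
}

async fn open_proxied_stream(conn: &quinn::Connection) -> anyhow::Result<ProxiedStream> {
    let (send, recv) = conn.open_bi().await?;
    Ok(ProxiedStream {
        _conn: conn.clone(),
        send,
        recv,
    })
}
```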
I'll verify locally how a single connection behaves after cleanup.
After cleanup, does the problem above still exist, i.e. could a long-lived connection be evicted by mistake?
I tested it: it is not evicted by mistake; the long-lived connection is preserved (even after the conn is kicked out of the cache).
Reproduction steps:
    .try_get_with(dst_peer_id, async move {
        debug!("quic connect begin {}", dst_peer_id); // add a log line

    .time_to_idle(Duration::from_secs(30)) // shortened for the test

    let mut interval = tokio::time::interval(Duration::from_secs(10)); // shortened for the test
    loop {
        interval.tick().await;
        debug!("quic conn_map_bg run_pending_tasks"); // add a log line
        conn_map_bg.run_pending_tasks().await;
    }
Then SSH to the remote SSH server (termux dropbear ssh):
ssh -p 8022 -t u0_a94@10.126.126.8 "while true; do echo $(date) keepalive; sleep 1; done"
Filtered log output:
May 09 19:35:21 nixos12700 easytier-core[267418]: 2026-05-09T19:35:21.056038686+08:00 DEBUG easytier::gateway::quic_proxy: quic connect begin 790591911
May 09 19:35:29 nixos12700 easytier-core[267418]: 2026-05-09T19:35:29.538870845+08:00 DEBUG easytier::gateway::quic_proxy: quic conn_map_bg run_pending_tasks
May 09 19:35:39 nixos12700 easytier-core[267418]: 2026-05-09T19:35:39.539525309+08:00 DEBUG easytier::gateway::quic_proxy: quic conn_map_bg run_pending_tasks
Then open another SSH connection:
ssh -p 8022 -t u0_a94@10.126.126.8 "while true; do echo $(date) keepalive; sleep 1; done"
The log then shows:
May 09 19:37:01 nixos12700 easytier-core[267418]: 2026-05-09T19:37:01.316130293+08:00 DEBUG easytier::gateway::quic_proxy: quic connect begin 790591911
which indicates the conn was re-created.
After waiting a while, both SSH sessions keep printing the time, and
easytier-cli proxy shows 2 Connected entries; after one SSH session is disconnected, one becomes Closed and disappears after a while.
…ix nat lost problem
reuse conn by dst_peer_id; every peer uses only 1 quic conn, to fix the nat lost problem
I ran into the NAT-loss problem. The scenario is a transparent proxy: all data is sent through a tun device (built into gost) to the remote end (gost's own relay protocol, based on TCP).
Running a P2P application (erigon) there leads to a "0 caplin peers" problem.
After debugging, handling all connections (open_bi) over a single quic conn fixes the problem; that is roughly what this PR does.
If your scenario has a very high number of connections, you can locally change max_concurrent_bidi_streams in easytier/src/tunnel/quic.rs to 2000 (default 256), as sketched below.
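As a hedged sketch of where that knob lives in quinn (the function name is illustrative; the concrete wiring inside easytier/src/tunnel/quic.rs may differ):

```rust
use std::sync::Arc;

use quinn::{TransportConfig, VarInt};

// Illustrative only: build a TransportConfig with a higher per-connection
// bidirectional stream limit than the default of 256 mentioned above.
fn high_concurrency_transport_config() -> Arc<TransportConfig> {
    let mut cfg = TransportConfig::default();
    cfg.max_concurrent_bidi_streams(VarInt::from_u32(2000));
    Arc::new(cfg)
}
```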