No More Hop Limits: What if Every Hop Cost Just 1 TX Instead of n? #9936

ClemensSimon · 2026-03-18T05:45:16Z

ClemensSimon
Mar 18, 2026

Context: What Meshtastic Already Does Well

Meshtastic's routing (v2.6/2.7) is already substantially better than naive flooding:

Managed Flooding suppresses ~40-50% of rebroadcasts via SNR-based contention windows
Next-Hop Routing (v2.6) learns relay nodes for direct messages, reducing DM cost to ~hops after the first flood
ROUTER/ROUTER_LATE roles ensure backbone coverage
Congestion Scaling automatically stretches intervals for 40+ node networks

This proposal doesn't replace these — it builds on the same principles and asks: can we extend directed routing to all message types, not just DMs?

The Remaining Bottleneck

Both managed flooding and next-hop still scale as O(n) per broadcast message. The hop limit (3-7) remains necessary because each hop multiplies transmissions proportional to network size. This caps effective range.

Proposal: System 5 — O(hops) for Everything

A routing approach that achieves ~1 TX per hop for all traffic types:

Geo-Clustering — Nodes self-organize by GPS geohash prefix. Full topology within clusters, border nodes between.
Multi-Path Routing — 2-3 cached paths per destination with instant failover (no rediscovery flood).
Weighted Load Balancing — W(r) = α·Q + β·(1-Load) + γ·Batt distributes traffic proportionally.
Adaptive QoS — Network Health Score per cluster throttles low-priority traffic. SOS always passes.
Fallback — Scoped cluster flooding (not full network) when all routes fail.

Simulation: System 5 vs. Managed Flooding

Python simulator with EU868 LoRa model, tested on identical networks with 4 routers (Naive Flood, Managed Flood, Next-Hop, System 5):

Scenario	Managed Flood TX	System 5 TX	S5 Delivery	Savings vs Managed
Small (20 nodes, 1km)	17,045	112	100%	99.3%
City (100 nodes, 5km)	155,774	196	100%	99.9%
Regional (500 nodes, 20km)	708,720	497	100%	99.9%
Dense Urban (200, 3km)	1,239,692	136	100%	100%
1000 nodes (40km)	1,462,489	10,635	99%	99.3%
1500 nodes (50km)	2,119,189	38,850	97%	98.2%
30% degraded links	170,214	4,701	100%	97.2%
50% degraded links	183,019	16,055	73%	91.2%
20% nodes killed	108,413	2,853	85%	97.4%

The key metric: max load on the busiest node drops from 4,500-19,900 (managed flood) to 6-80 (System 5).

Biggest Practical Consequence

The hop limit becomes irrelevant. Each hop costs ~1 TX regardless of network size. 20 hops cost less than managed flooding costs for 1. This means:

No artificial range cap
SHORT_FAST with more hops works as well as LONG_SLOW with fewer
Battery nodes far from the path don't transmit at all

Try It

Live interactive demo: https://clemenssimon.github.io/MeshRoute/
GitHub (simulator + docs): https://github.com/ClemensSimon/MeshRoute
MIT licensed

The demo shows side-by-side animations of all four routing approaches on identical topology, simulation results with interactive charts, and resilience testing.

Questions for the Community

Does this approach address real pain points you're seeing in larger meshes?
What scenarios should the simulator test that I'm missing?
Is a firmware prototype (as optional routing module alongside managed flooding) worth pursuing?
The GPS requirement for geo-clustering — dealbreaker or acceptable tradeoff?

shalberd · 2026-03-18T11:28:27Z

shalberd
Mar 18, 2026

Congestion Scaling automatically stretches intervals for 40+ node networks

not anymore ... I do not agree with it, but in any case, it seems that in firmware 2.7.20, scaling does not apply anymore for ROUTER_LATE and other device roles #9818

I don't know why that is ... my only explanation is: comfortable US frequency slot / number of slots situation and disregard for situation in other areas of the world in terms of LoRa. It'd be nice if there were feature gates for stuff like 0-cost routing (was suggested once by @GUVWAF) and exemptions to scaling .. gating those features to short presets only or to any region except EU_868 ... but it is as it is.

2 replies

thebentern Mar 18, 2026
Maintainer

ROUTER_LATE should be taking on the course-grained interval miniums ROUTER had previously. I will double check that though, because if not, this is a missed spot.

shalberd Mar 18, 2026

cool, thank you very much, @thebentern do you mean those defaults for IF_ROUTER https://github.com/meshtastic/firmware/blob/master/src/mesh/Default.h#L17

and those applying, or not yet applying, to ROUTER_LATE? I think @h3lix1 fixed this in https://github.com/meshtastic/firmware/pull/9815/changes

It is good to see that, despite not scaling up the intervals even more, at least those two roles have very course default intervals of a day / half a day.

It's be good if, for certain regions like EU_868 with mostly one frequency slot https://meshtastic.org/docs/overview/radio-settings/#frequency-slot-calculator only and for non-short presets, there were feature gates for the exemptions, even for ROUTER and ROUTER_LATE, but especially so for certain sensor / tracker roles.
Why feature-gating the exemption from scaling and 0-cost-hops for all except / but EU_868 and non-short presets? Cause the traffic situation in EU_868 is very, very congested, so anything that adds to congestion in large networks or does not take it into account, in our case on MEDIUM_FAST preset, is bad. All frequency slots except LONG_MODERATE and LONG_SLOW use the same frequency; Thanks, regulator ;-)

Lastly, thank you for your great work in the project, we appreciate it very much. As mentioned ... we've just got different regulatory and frequency slot situation contexts 'round here. Most of our EU_868 presets work on frequency 869.525 Mhz and to add to it, _TURBO presets with 500 kHz bandwidth are not allowed, either. Still .. we like Meshtastic and the spirit of your project and do our best here to work within our constraints. Cheerio / have a good day.

ClemensSimon · 2026-03-19T05:12:39Z

ClemensSimon
Mar 19, 2026
Author

Suggested labels: mesh, enhancement

Keywords for discoverability: range extension, performance, scalability, bandwidth efficiency, hop limit

0 replies

ClemensSimon · 2026-03-19T06:10:17Z

ClemensSimon
Mar 19, 2026
Author

Update: Realistic Hop Limits Reveal Delivery Collapse

The original simulation used TTL=30, which masked a critical problem. With Meshtastic's actual hop limits (3, 5, 7), managed flooding's delivery rate collapses at scale:

Scenario	Nodes	3-hop	5-hop	7-hop	System 5
Small Local	20	100%	100%	100%	100%
Medium City	100	92%	100%	100%	100%
Large Regional	500	14%	31%	51%	76%
1000 Nodes	1000	2%	6%	6%	45%
1500 Nodes	1500	2%	4%	5%	51%
Rural Long Range	50	77%	91%	88%	100%
Maritime	30	69%	82%	84%	100%
Disaster Relief	80	62%	87%	84%	78%

Key Insight

The hop limit is not just a range cap — it is a delivery ceiling. At 1000+ nodes, managed flooding delivers fewer than 1 in 10 messages regardless of hop limit setting. System 5 delivers 7.5x more messages with fewer total transmissions.

This means the current routing doesn't just waste bandwidth — it fails to deliver in the exact scenarios where mesh networking matters most (large, spread-out, disaster relief).

Updated demo and results: https://clemenssimon.github.io/MeshRoute/

0 replies

shalberd · 2026-03-20T11:52:00Z

shalberd
Mar 20, 2026

can we extend directed routing to all message types, not just DMs

that'd be awesome and a real differentiator from MeshCore

The GPS requirement for geo-clustering — dealbreaker or acceptable tradeoff?

Would not call it a GPS requirement. More a position info requirement. At least for major mountain role CLIENT, ROUTER_LATE nodes, we already use coordinates / position info. A little fuzzied out in terms of precision, and at times also just manually input as fixed coordinates when no GPS module is present or when activating it would take too much energy consumption. Personally, I'be be fine with taking into account position information of nodes.

0 replies

h3lix1 · 2026-03-21T05:42:36Z

h3lix1
Mar 21, 2026

My question with System 5 is how does it work with asyncronous paths?

It looks clean in a simulated setup like this, but as paths change there is an increase of out of order messaging. (i.e. messages sent A B C can be received C B A if sent via 3 different paths), and meshtastic doesn't really do a good job of keeping sequence in the app like TCP does.

The second part is async routing - the path back can vary greatly from the path sent. The path to get to a mountain top router might take 3 hops, but the path back is a single hop. Trying to route the traffic back through the same 3 hops to get to the destination really isn't efficient.

For bay mesh, the best way to describe it is that the flood routing mesh all happens above 2000 feet. The challenge we see today with chutil is that for 5% actual utilization, the router mountain nodes hear about 10 different roof nodes sending packets at the same time, 4 more other long distance router nodes, etc. The result is a mess of collisions that result in 5% utilization turning into 50% utilization.

Let's take for example, SUNL - my node in the bay area.

The challenge is all the clients repeating packets cause a mess at high altitudes. We try our best to remove this by having enough routers to make sure most clients hear each message at least twice (supressing responses), but that itself also causes higher utilization.

It would be interesting to see more simulations where a single node can send to 100 nodes, but only hears 10. Where the circles for mountain top nodes can send far and wide (30+ miles) to other mountain top nodes and be heard by those mountain top nodes, but can also hear hliltop nodes.. and then those hilltop/roof nodes hear both valley nodes and in-building nodes. The fun part is the mountain nodes also have a good chance of sending traffic directly to the valley and in-building nodes without needing the hilltop or roof nodes at all.

Keep in mind, while the hilltop and roof nodes are sending, the mountain nodes are blocked from sending. Somehow this also needs to be added to the simulation to get past the real crux of the issue, which is half-duplex communication means the inability to send due to listening to a bunch of other nodes sending.

It also doesn't really solve the problem of how a client knows it missed a message. For example, in your example "How the Network Builds Itself", when it gets to the "load balancing" part, there is only one path that works, A-C-F-L-M-K-O. It's not very load balanced, but it also means there are 3 nodes that all make a path of failure. If any of those 3 nodes fails due to intermittent issues, the message will not be received.

Lastly, routing tables require memory. Some vendors are moving to using nrf52 devices for 1 watt nodes, and many love using solar nodes (also, nrf52) as routers. Many of these nodes can only hold 80-100 nodes in the nodedb. What is the memory expectation for something like this with 1000-10000 nodes?

0 replies

ClemensSimon · 2026-03-21T10:05:18Z

ClemensSimon
Mar 21, 2026
Author

Hey @h3lix1,

Thank you for the detailed feedback! Your questions about half-duplex blocking at SUNL, out-of-order messaging, and
nRF52 memory directly led to 5 new features being implemented. Here's what changed and what the numbers look like now.

What Your Feedback Built

Half-Duplex Model — You said: "mountaintop nodes are blocked from sending."
Done. Per-node radio state machine (IDLE/TX/RX) in simulator. Nodes can't TX while receiving.

Node Silencing — You said: "clients repeating packets at high elevations cause a mess."
Done. Redundant valley nodes are muted — they still listen, but don't rebroadcast. Battery-fair rotation every 10 min.
128 of 193 valley nodes silenced, TX cost halved.

Sequence Numbers — You said: "messages A B C can be received C B A."
Done. 2-byte per-(src,dst) sequence counter in packet header. App can detect gaps and reorder. Zero extra TX.

Emergency Re-Route — You said: "only one path works, 3 single points of failure."
Done. Fresh BFS excluding failed nodes before corridor flooding. 7 failover layers total.

Bay Area 3-Tier Topology — You said: "mountaintop routers hear 10 rooftop nodes simultaneously."
Done. 7 mountain (45km range) + 35 hill (10km) + 193 valley (2.5km) nodes with asymmetric links.

For the technical deep-dive, see https://clemenssimon.github.io/MeshRoute/how-it-works.html — sections on
https://clemenssimon.github.io/MeshRoute/how-it-works.html#silencing,
https://clemenssimon.github.io/MeshRoute/how-it-works.html#halfduplex, and
https://clemenssimon.github.io/MeshRoute/how-it-works.html#seqnums.

Bay Area Results (235 nodes, half-duplex)

                  Managed Flood    System 5    S5 + Silencing

Delivery Rate 6.0% 77.5% 74.5%
Total TX 6,752 540,780 267,927
Under Stress 4.0% 52.0% 51.0%
Nodes Silenced 0 0 134 (57%)

Key finding: Half-duplex collapses managed flooding from 87.5% to 6% delivery — your SUNL problem exactly. Mountaintop
stuck in RX from 10+ simultaneous rebroadcasts. System 5 holds at 77.5% because directed routing sends 1 packet
instead of 14. Node Silencing halves TX by muting 128 valley nodes. All 7 mountain nodes stay active.

Your Questions — Quick Answers

Async paths / out-of-order: 2-byte sequence counter, zero extra TX. App detects gaps.
Asymmetric return paths (3 up, 1 down): Already works — per-direction link qualities.
SUNL collision cascade: 3 layers — directed routing + node silencing + backpressure.
Missing messages: Sequence numbers for gap detection. Full ACKs too expensive for LoRa.
Single point of failure: 5 cached routes + emergency BFS + corridor flood = 7 layers.
nRF52 memory at 10K nodes: ~30KB with clustering, ~15KB with reduced params. Seq counters: 128 bytes via LRU.

Try It

https://clemenssimon.github.io/MeshRoute/simulator.html — select "Bay Area Mesh" or "Bay Area + Silencing"
https://clemenssimon.github.io/MeshRoute/how-it-works.html — step-by-step technical deep dive
https://clemenssimon.github.io/MeshRoute/ — all 26 scenarios with category filters.
https://github.com/ClemensSimon/MeshRoute — MIT license

Your feedback genuinely made this better. The half-duplex insight alone was worth the entire conversation — it
revealed that the real problem isn't routing efficiency but radio physics at elevated nodes.es.
Clemens

2 replies

h3lix1 Mar 21, 2026

This seems good for unicast, and potentially could be accomplished with bloom filters, but meshtastic is currently 98% broadcast traffic for us - all nodes should receive the same message. (For example, position packets, nodeinfo packets, telemetry packets)

for the simulators I can only see examples for unicast messaging. Is there an example of messaging that is broadcast to all nodes?

shalberd Mar 22, 2026

Hi, thank you @h3lix1 for also having documented this, bloom filter concept, back then on Github. @ClemensSimon FYI #8592

Also, have a look at this comment back then on how to improve things by @fifieldt #6199 (reply in thread)

He said in answer to someone from Seattle WA Puget sound area (similar in terms of lateral valleys, high elevation mountains) regarding regional aspects / regions clustering and a mix of Autobahn style directed routing from key location to key location, e.g. mountain to mountain vs. normal flood routing for regional distribution:

Purely theoretical at the moment and only in my head. Just an idea from the old days of community wireless networks. For large meshes, separate the mesh into "interior" and "exterior" routing. "Interior" routing works just like right now. "Exterior" routing would be used to join disparate clusters of "interior" routed mesh that are separated by geography. I'll have to get out a diagram and more words than I can type on a cellphone to explain properly. Haven't even thought through the concept fully yet :)

Big thanks from me, too, to @ClemensSimon for the new ideas and trains of thought.

ClemensSimon · 2026-03-21T23:54:31Z

ClemensSimon
Mar 21, 2026
Author

▎ Hey @h3lix1,

▎ Good question — but I want to make sure I understand the requirement correctly.

▎ You say 98% of traffic is broadcast. But broadcast to whom exactly? If there's no hop limit anymore, does that mean
every position packet from every node should reach all nodes in the entire network? For a 1000+ node mesh, that seems
like it would be the root cause of the collision problem you described, not the solution.

▎ Could you help me understand the intent behind these broadcasts?

▎ - Position/Telemetry: Does every node really need to know the position of every other node? Or is it more like
"nodes within my region" or "nodes I've recently communicated with"?
▎ - NodeInfo: Same question — is this needed network-wide or just locally?
▎ - Groups: If the use case is "send to a defined group of nodes" (e.g., all nodes in the Bay Area, or all members of
a channel), that's something I could implement as a group routing feature — not full broadcast, but targeted
multicast.

▎ Right now the simulator handles unicast (point-to-point) and managed flood (full broadcast). If the real need is
something in between — like group-based multicast — I'd need to build that. Happy to do so if that's the actual use
case.

1 reply

h3lix1 Mar 22, 2026

Hi @ClemensSimon -

Most of the state in Meshtastic (nodedb entries, etc) are push-based braodcast messages. This includes position packets, nodeinfo packets, telemetry packets, and channel messages. The only things unicast today are DMs and traceroute packets. There are also a few on-demand information gathering that a node can do, but the majority of the time things are flood routed.

There could be a push in future to instead do pull-based information gathering instead of push-based as it is today, but it almost all messages are network-wide. The channels aren't split into locality today (i..e. "Bay Area") but just the default channel "MediumFast" for example. I could see a world where channels are more localized, but it's not configured that way today.

I can't seem to find how system 5 will handle the flood option (if there is an option to do flood routing with a system 5 configuration)

I do appreciate the effort here though - if we can crack how to make flood routing with as little airtime as possible, we might be on to something.

ClemensSimon · 2026-03-23T09:30:28Z

ClemensSimon
Mar 23, 2026
Author

Draft Response to @h3lix1 and @shalberd — Broadcast Routing in System 5

STATUS: DRAFT — Zur Überprüfung vor dem Posten

Hey @h3lix1, @shalberd,

Thank you for the clarity on the 98% broadcast reality — that's the critical piece I needed. You're right that System 5 as demonstrated primarily optimizes unicast. Let me lay out how the architecture can handle broadcast traffic, and where @h3lix1's Bloom Filter idea fits in.

The Broadcast Problem, Precisely

In a 235-node Bay Area mesh with managed flooding:

1 position packet → 235+ TX (every node rebroadcasts)
100 nodes sending position every 15 min → ~94,000 TX/hour
With half-duplex collisions, most of those TX are wasted

The question isn't "should every node hear every position?" — it's "can we deliver the same broadcast reach with fewer TX?"

Three Approaches That Could Work Together

1. Cluster-Scoped Broadcast (System 5 native)

System 5 already has geo-clusters. Use them for broadcast scope:

Intra-cluster: Flood normally within your cluster (small, ~10-30 nodes, manageable)
Inter-cluster: Only border nodes relay to adjacent clusters — 1 TX per cluster boundary instead of N
Result: Broadcast cost goes from O(n) to O(clusters × cluster_size), roughly O(√n)

For Bay Area (7 mountain + 35 hill + 193 valley):

Valley nodes broadcast only within their cluster (~15-20 nodes each)
Hill/mountain border nodes relay between clusters
Estimated: ~15-25 TX per broadcast instead of ~235

2. Bloom Filter Hybrid (@h3lix1's RBF from #8592)

Your Bloom Filter approach and System 5 are complementary:

System 5 knows the cluster topology and border nodes → where to route
Bloom Filters track which nodes have already seen a packet → who to skip

Combined: Border nodes carry a Bloom filter in the broadcast packet. When relaying to the next cluster, nodes already in the filter don't rebroadcast. This handles the overlap zones where clusters share radio range.

The 11-35 byte filter cost is negligible vs. saving dozens of redundant TX at cluster boundaries.

3. @fifieldt's Interior/Exterior Split — Already Built

@shalberd, great catch. System 5's geo-clustering is the interior/exterior split that @fifieldt described:

Interior = intra-cluster routing (flood within cluster, small scope)
Exterior = inter-cluster routing via border nodes (directed, 1 TX per hop)

The only missing piece is applying this to broadcast traffic, not just unicast. The cluster infrastructure is already there.

What I'll Build Next

Cluster-scoped broadcast mode in the simulator — measure TX savings vs. delivery rate
Bloom filter integration at cluster boundaries for overlap deduplication
Broadcast scenario benchmarks — 100 nodes all sending position packets, System 5 cluster-broadcast vs. managed flooding

Honest Limitations

Latency: Cluster-scoped broadcast adds relay hops → slightly higher latency than direct flooding for nearby nodes
Consistency: Not all nodes will have the same view at the same time (but that's already true with hop limits)
OGM overhead: Neighbor discovery still needs some flooding — can't route what you haven't discovered

Would this address your use case? Specifically: if position/telemetry packets reached all nodes within ~5-10 seconds instead of ~1-3 seconds, but used 90% less airtime — would that tradeoff work for Bay Mesh?

— Clemens

0 replies

ClemensSimon · 2026-03-23T10:29:26Z

ClemensSimon
Mar 23, 2026
Author

Hey @h3lix1, @shalberd,

Your feedback on broadcast traffic being 98% of Meshtastic's workload was the key insight I was missing. I've now implemented and benchmarked a broadcast-specific routing mode that directly addresses this.

The Problem You Identified

System 5 optimized unicast brilliantly (1 TX per hop), but had no answer for broadcast packets (position, nodeinfo, telemetry, channel messages). Managed flooding costs O(n) per broadcast. For Bay Area's 235 nodes, that's 4,301 TX per single position packet.

Solution: Cluster-Distributor Broadcast

Instead of flooding the entire network, broadcast propagates as a wave through clusters:

Elect a Distributor per cluster -- a valley node with high local reach but low signal leakage to other clusters
Source's cluster: distributor does a scoped mini-flood (only within cluster)
Border nodes relay to the next cluster's distributor (1 directed TX to cross boundary)
Next distributor floods its cluster, border nodes relay further
Repeat until all clusters covered

This is essentially @fifieldt's interior/exterior routing concept -- interior = flood within cluster, exterior = directed relay between clusters.

Key Design Decisions

Valley nodes as distributors, not mountain nodes. A mountaintop node broadcasting reaches 10+ clusters simultaneously, causing a collision storm (your SUNL problem exactly). A valley node broadcasting stays contained by terrain -- only its cluster hears it. The distributor election scores:

Score = coverage * (0.3 * containment + 0.4 * elevation_bonus + 0.3 * tier_bonus)

Where valley nodes score ~1.0 and mountain nodes score ~0.1.

Mountain nodes receive but don't relay. During intra-cluster mini-flood, mountain nodes hear the broadcast (they hear everything) but don't rebroadcast -- their TX range is too large and would leak to other clusters. They're passive receivers, not active relays.

Natural signal spillover is free. When a valley distributor floods its cluster, nearby nodes in adjacent clusters often hear it too -- counted as reached with zero extra TX cost.

Benchmark Results

Tested with 20 broadcasts per scenario, averaged:

Scenario	Managed Flood Reach	Managed Flood TX/msg	Cluster-Dist Reach	Cluster-Dist TX/msg	TX Savings
Small (20 nodes)	71.8%	55	100.0%	22	61%
Medium (50 nodes)	100.0%	430	100.0%	52	88%
Large (100 nodes)	100.0%	2,130	99.8%	116	95%
Dense (200 nodes)	100.0%	8,731	99.9%	220	97%
Regional (500 nodes)	91.5%	5,869	100.0%	517	91%
Bay Area (235 nodes, 3-tier)	90.0%	4,301	96.0%	220	95%

Bay Area: 96% reach with 95% fewer transmissions -- and 6% MORE reach than managed flooding, because directed routing avoids the collision cascades that kill flooding at scale.

What This Means for Real Traffic

If 100 nodes each send position every 15 minutes:

Managed Flooding: 100 x 4,301 = 430,100 TX per 15 min
Cluster-Distributor: 100 x 220 = 22,000 TX per 15 min

That's the difference between network congestion collapse and comfortable headroom.

Bloom Filter Integration

@h3lix1 your Bloom Filter approach from #8592 fits naturally at the cluster boundaries. When a border node relays to the next cluster, it can carry an RBF of which nodes already received the broadcast. The next cluster's distributor checks the filter before relaying to nodes that might already have it from signal spillover. This would reduce the remaining redundancy even further.

Honest About Limitations

Half-duplex remains brutal -- both approaches collapse to single-digit delivery rates. This is radio physics (nodes stuck in RX), not a routing problem. The cluster-distributor does produce 95% fewer TX though, which means fewer collisions.
Latency is slightly higher -- the wave propagates cluster-by-cluster instead of flooding simultaneously. For position packets (not time-critical), this should be acceptable.
Distributor failure requires re-election. Currently not implemented but straightforward -- second-best valley node takes over.

Try It

The live simulator lets you compare all routing approaches side-by-side. Select any scenario including "Bay Area Mesh" and step through hop-by-hop. Source code: simulator/routing.py -- classes ClusterDistributorBroadcast and ManagedFloodBroadcast.

Your question "if we can crack how to make flood routing with as little airtime as possible, we might be on to something" -- I think this is that something. The trick is: don't flood the whole network. Flood small clusters, relay between them.

-- Clemens

0 replies

korbinianbauer · 2026-03-23T14:34:00Z

korbinianbauer
Mar 23, 2026

Since this is clearly AI-generated, I'll feel free, too:

Non-local information requirements
- Cluster-level Network Health Score (NHS), path-wide battery, and load-aware routing require data from multiple nodes, which is not locally available.
- Example: To compute NHS for an 8-node cluster, each node would need link-quality and queue data from all neighbors every 30 s, creating extra traffic.
Memory constraints
- Multi-path route tables (5 routes × ~70 destinations × ~410 bytes per route) could use ~143 KB, exceeding nRF52 usable RAM (~64 KB).
- Neighbor table (16 entries × 80 B = 1.3 KB) and cluster metadata (~800 B) further add to memory pressure.
Compute overhead
- BFS-based multi-path route computation 5× per destination, route decay updates every 30 s, and emergency reroutes may overload the 64 MHz nRF52840 CPU.
- Example: For a 70-node cluster with 5 routes per destination, BFS complexity is O(V+E) ≈ several hundred operations per route, repeated every update cycle.
Radio / airtime limitations
- OGMs every 30 s per node with rich metadata (~20–40 B) on a 100-node network → ~100 messages per 30 s.
- LoRa SF12 airtime: 500 ms–2 s per packet → channel may be occupied >50 s in a 30 s window if multiple nodes transmit simultaneously.
- Multi-hop retries (3–5 per hop) for 3–5 hop paths multiply transmissions, increasing collisions and duty-cycle risk.
Topology propagation
- Directed routing assumes partial multi-hop topology knowledge beyond immediate neighbors.
- Example: To route across 3 clusters with 2 border nodes each, nodes need ~16 extra entries in routing tables, which requires repeated propagation of neighbor and border info.

Btw. the simulator doesn't do anything when I open the page

0 replies

ClemensSimon · 2026-03-23T20:50:37Z

ClemensSimon
Mar 23, 2026
Author

Thanks for the detailed review -- these are valid engineering concerns that deserve concrete answers. I'll go point by point.

Re: "clearly AI-generated" -- yes, Claude helped with the writeup and the simulator code. The routing concepts and the constraints analysis are mine though. Speaking of which:

Simulator fix

The simulator was broken due to orphaned code fragments from a bad file split (leftover lines from roundRect polyfill in sim-scenarios.js and RNG class methods in sim-network.js that caused TypeError/SyntaxError on load). Fixed now -- should work if you reload. Sorry about that.

1. Non-local information requirements

Fair point, but the critique assumes more global knowledge than the design requires.

The weight formula W(r) = a*Q + b*(1-Load) + g*Batt uses local data only: Q = link quality to next hop (measured via SNR/RSSI), Load = own queue depth, Batt = own battery. The only "remote" value is the next hop's battery level, which piggybacks on the OGM it already sends.

NHS is not a global aggregate -- it's a local average of what a node sees from its direct neighbors' OGMs. An 8-node cluster doesn't need extra polling; the OGMs that maintain neighbor tables already carry this data.

Where you're right: "path-wide" battery and load awareness (across multiple hops) is not feasible on LoRa. The implementation should evaluate only the next hop, not the full path. I'll clarify this in the proposal.

2. Memory constraints

The math is correct but the assumptions are worst-case:

5 routes x 70 destinations x 410 bytes -- in practice, a node in a clustered network knows its own cluster members (~20-30) plus border nodes to adjacent clusters (~4-8). That's ~35 destinations, not 70. And 2 routes (primary + backup) suffice, not 5.
410 bytes per route is too high. A route entry needs: dst_id (4B) + next_hop (4B) + quality (1B) + age (2B) + hop_count (1B) = 12 bytes. Even with a 4-node path hint: ~20 bytes.

Realistic calculation: 2 routes x 35 destinations x 20 bytes = 1.4 KB. Plus neighbor table (16 x 20B = 320B) and cluster metadata (~200B). Total: ~2 KB -- fits comfortably in nRF52 RAM.

You're right that explicit memory budgets should be in the proposal. I'll add a table.

3. Compute overhead

BFS on a 30-node cluster with ~100 edges is ~130 operations -- microseconds on a 64 MHz Cortex-M4. Even 3x per destination for 35 destinations = ~14,000 operations, well under 1ms.

But more importantly: BFS doesn't need to run on the node at all. Routes are built incrementally via distance-vector updates (like RIP/AODV): when a neighbor's OGM says "I can reach node X in 3 hops with quality 0.8", the node updates one table entry. That's a single comparison + write, not a graph traversal.

Route decay (quality *= 0.95 per entry every 30s) for 70 entries is trivial. Emergency reroutes only fire on link failure -- not a periodic cost.

I should describe the routing table mechanism as distance-vector rather than BFS in the proposal. The simulator uses BFS for clarity, but a real implementation wouldn't.

4. Radio / airtime -- this is the strongest objection

Your math is correct for the naive case: 100 nodes x 1 OGM/30s x 500ms-2s airtime = channel saturation. But OGMs don't flood globally in System 5. They stay cluster-local (1 hop only):

A cluster of 25 nodes: 25 OGMs/30s x ~500ms = 12.5s airtime -- feasible on a single channel
Border nodes exchange condensed cluster summaries: 1 message per border-pair per cycle
Total for a 4-cluster network: ~100 local OGMs + ~12 border summaries = manageable

Still, I acknowledge this needs more work:

OGM interval should be adaptive: few neighbors -> 30s, many neighbors -> 120s+
OGM payload can be reduced to ~8 bytes (node ID + battery + quality summary)
EU868 duty cycle (1% = 360ms per 36s) is the hard constraint -- need explicit airtime budgets
Retries per hop need quantification against duty cycle limits

The retry concern (3-5 per hop x 3-5 hops) is valid but less severe than it sounds: System 5 sends unicast (1 TX per hop), not broadcast. Total airtime for a 5-hop message with 2 retries = ~15 TX. Managed flooding for the same message: hundreds of TX. The per-message efficiency is real even with retries.

5. Topology propagation

System 5 does not require multi-hop topology knowledge. Each node knows:

Its neighbors (1-hop, from OGMs)
Its cluster members (from local OGMs)
Border nodes (locally detectable -- has neighbors in other clusters)
Which adjacent clusters are reachable via which border nodes (from border-to-border OGM exchange)

The "16 extra entries" for routing across 3 clusters is correct and trivial: 16 x 20B = 320 bytes. Propagation cost: ~2-4 border summary messages per cluster-pair per 30s cycle.

What's missing from the proposal is an explicit diagram showing what data lives where and how it propagates. I'll add that.

Summary of what I'll improve based on your feedback:

Explicit memory budget table for nRF52
Clarify that routing uses distance-vector, not BFS-on-device
Add adaptive OGM intervals and airtime budget calculation for EU868
Add data flow diagram for topology propagation
Remove "path-wide" language -- next-hop metrics only

Good feedback overall. The airtime point is the one that needs the most engineering work before this could be real.

0 replies

ClemensSimon · 2026-03-23T21:03:43Z

ClemensSimon
Mar 23, 2026
Author

Hey @h3lix1, @shalberd,

Quick update -- based on @korbinianbauer's feedback (and your earlier points about broadcast traffic being 98% of the network), I've made significant revisions to the proposal and the documentation:

What changed

1. Distance-Vector instead of BFS
Routes are now built incrementally from OGM data (like RIP/B.A.T.M.A.N.), not by running graph algorithms on-device. Each route entry is 12 bytes (dst + next-hop + quality + age + hops). Total routing table: ~1.5 KB. This directly addresses the nRF52 memory concern -- 64 KB RAM is more than enough.

2. Next-hop metrics only
The weight formula W(r) = a*Q(r) + b*(1-Load) + g*Batt now explicitly uses next-hop node data only (from its last OGM). No more "path-wide battery" or "minimum battery along route" -- those require multi-hop state propagation that creates exactly the traffic overhead we're trying to avoid.

3. Adaptive OGM interval
Fixed 30s replaced with density-adaptive intervals: 30s (sparse, <8 neighbors), 60s (moderate), 120s (dense), 180s (very dense). Includes an explicit EU868 duty cycle airtime budget table in the docs.

4. Cluster-Distributor Broadcast (new -- directly from @h3lix1's point about 98% broadcast traffic)
This is the big one. Broadcasts no longer flood the entire network. Instead:

Each cluster elects a distributor (valley node with high local coverage, low cross-cluster leakage)
Broadcast source sends to its cluster distributor via unicast (1-3 TX)
Distributor does a mini-flood within the cluster only (~20-30 TX for a 25-node cluster)
Border nodes relay to the next cluster's distributor
Wave propagates cluster-by-cluster

Results: Bay Area (235 nodes): 4,301 TX with managed flooding vs 220 TX with cluster-distributor = 95% savings. Regional (500 nodes): 95,869 vs 517 TX = 99.5% savings.

5. Simulator fixed
@korbinianbauer noted it wasn't working -- turned out two JS files had orphaned code fragments from a bad file split. Fixed and pushed. Should work now: https://clemenssimon.github.io/MeshRoute/simulator.html

@h3lix1 -- re: your Bay Area concerns

The broadcast routing directly addresses your point about position/nodeinfo/telemetry dominating traffic. With cluster-distributors, a position beacon from one node costs ~30 TX to reach the whole 235-node Bay Area mesh, instead of ~4,000 TX with flooding. The half-duplex mountaintop blocking issue is also less severe because the distributor model generates far fewer simultaneous transmissions.

Out-of-order delivery (your A-B-C -> C-B-A concern) is handled by the 2-byte sequence counter in the packet header. Gap detection is cheap and doesn't add TX overhead.

@shalberd -- re: EU868 and GPS

The adaptive OGM interval now explicitly accounts for EU868's 1% duty cycle. At 60s intervals (moderate density), a node uses ~0.8% of its duty budget for maintenance traffic. The airtime budget table is in the updated How It Works page.

GPS remains a soft requirement -- nodes without GPS can use pre-set coordinates or inherit cluster assignment from a GPS-capable neighbor.

All changes are live on the site. The How It Works page has the full technical details including the new broadcast section.

0 replies

korbinianbauer · 2026-03-24T07:59:31Z

korbinianbauer
Mar 24, 2026

But OGMs don't flood globally in System 5. They stay cluster-local (1 hop only):

They may not flood beyond 1 hop, but that doesn't mean they just stop at the borders of your geo-cluster. Every node in range or even slightly beyond it will detect a busy channel and cannot use this airtime.

1 reply

shalberd Mar 24, 2026

They may not flood beyond 1 hop, but that doesn't mean they just stop at the borders of your geo-cluster.

correct, we get 60-80 km range for a hop in every direction on preset MEDIUM_FAST.

ClemensSimon · 2026-03-24T18:08:55Z

ClemensSimon
Mar 24, 2026
Author

The 60-80km Elephant: Why Geo-Clusters Can't Be Radio-Isolated

@korbinianbauer and @shalberd -- you're absolutely right, and this is the most important feedback so far. Let me address it head-on.

The core problem: At MEDIUM_FAST, a single OGM "meant" for a 5km cluster occupies the channel for every node within 60-80km. Geographic clustering provides logical isolation but zero radio isolation. The airtime cost is real regardless of the intended scope.

I've been thinking about this since your comments, and I see three viable paths forward:

1. Power-Controlled Routing Packets
OGMs could be sent at reduced TX power (e.g. -12dB from normal), shrinking their radio footprint to match the intended cluster radius. A 5km cluster doesn't need routing packets transmitted at 60km range. This is the most direct fix -- the OGM is physically inaudible beyond the cluster. Trade-off: requires per-packet power control support in firmware.

2. Connectivity-Based Clustering Instead of Geo-Clustering
Rather than clustering by GPS coordinates, cluster by who can actually hear whom (the connectivity graph). Nodes that share strong bidirectional links form a cluster organically. This sidesteps the "overlapping radio range" problem entirely -- the cluster IS the radio neighborhood. No OGMs needed for cluster formation; the neighbor table (which Meshtastic already builds from received packets) defines the cluster implicitly.

3. Piggyback Routing on Existing Traffic
Instead of dedicated OGMs, embed routing metadata (next-hop, link quality, cluster-ID) into packets that are already being sent -- position broadcasts, telemetry, nodeinfo. Since these packets are transmitted anyway (and consume the same airtime), the routing overhead becomes effectively zero additional airtime. The trade-off is slower convergence (routing updates only happen when regular traffic flows), but for a mesh that already sends position every 15 minutes, this may be sufficient.

My honest assessment: Option 3 (piggybacking) combined with Option 2 (connectivity-based clusters) is probably the most realistic path. It adds zero airtime overhead, works within existing packet structures, and doesn't require hardware-level changes. The 60-80km range actually helps here -- it means a node's natural radio neighborhood IS a meaningful routing cluster.

I'll update the simulator to model connectivity-based clustering with piggybacked routing metadata and post results. The key metric will be: how much routing convergence time do we sacrifice vs. dedicated OGMs, and is the delivery rate still acceptable?

Updated airtime budget analysis for EU868 coming as well -- with explicit accounting for the shared channel problem you've identified.

0 replies

ClemensSimon · 2026-03-25T05:22:48Z

ClemensSimon
Mar 25, 2026
Author

@h3lix1 -- What is your opinion?
--Clemens

0 replies

ClemensSimon · 2026-05-11T07:32:32Z

ClemensSimon
May 11, 2026
Author

Update: Independent validation with Meshtasticator

I implemented System V6 as a router module in the official Meshtasticator simulator — same radio model, same MAC layer, same collision detection as Managed Flood. Apples-to-apples comparison.

Full benchmark: 18 simulations (20/50/80 nodes x 3/5/7 hop limits x MF/V6), 1h each, 30s message interval. Code and results: ClemensSimon/Meshtasticator (system-v6 branch)

The key finding — V6 breaks the hop limit:

Nodes	Managed Flood @ 3 hops	System V6 @ 7 hops	TX saved
20	8396 TX, 39% reach	6037 TX, 37% reach	28%
50	21789 TX, 19% reach	15864 TX, 18% reach	27%
80	36634 TX, 14% reach	25757 TX, 14% reach	30%

V6 with 7 hops costs less than Managed Flood with 3 hops. The hop limit exists because flooding generates O(n) transmissions per hop. V6 suppresses redundant rebroadcasts through passive route learning, so more hops do not cause congestion collapse.

TX reduction across all scenarios: 29-40%. Learning effect is visible in time-series — V6 starts identical to Managed Flood and improves within the first 5 minutes as it learns which neighbors are good relays.

Anyone can reproduce this:

git clone -b system-v6 https://github.com/ClemensSimon/Meshtasticator.git
cd Meshtasticator
pip install -r requirements.txt
python loraMesh.py 50 --router-type SYSTEM_V6 --no-gui
# or run the full benchmark:
python v6_parallel_bench.py

-- Clemens

1 reply

NomDeTom May 11, 2026
Maintainer

Ok, so in a real mesh, or at least the ones that run modern firmware and don't take place at events, messages are much more sparse, and RF conditions change frequently. I will post a little list I use to check whether an algorithm has any chance a bit later on, but for this, does it pass the "mixed new and established node" test?

ClemensSimon · 2026-05-11T09:03:37Z

ClemensSimon
May 11, 2026
Author

Update 2: Improved V6 with deferred rebroadcast + TX power control

After analyzing why Meshtasticator results differ from my own simulator (LoRa is broadcast, not directed — every TX is heard by all nodes in range), I added three improvements:

Deferred rebroadcast — V6 waits 1-2 slot times before deciding. During this time, it observes whether other relays already forwarded the packet. Closer nodes (stronger RSSI) wait less, giving them priority.
RSSI-based suppression — if I received with strong signal, the sender probably reaches my neighbors too. I am redundant.
TX power control (unicast DMs only) — when V6 knows the next hop, it reduces TX power to just reach that node (+10dB margin). Smaller radio footprint = fewer collisions. Only for DMs with known routes, not broadcasts.

Complete results (80 nodes, 1h sim, 30s msg interval):

Hops	MF TX	V6 TX	MF Coll	V6 Coll	MF Reach	V6 Reach
3	36,600	36,616	57,435	49,659	13.8%	12.8%
5	39,387	34,927	72,865	47,032	16.0%	11.9%
7	38,547	36,592	59,044	57,322	14.7%	14.0%

The hop limit argument still holds: V6 at 7 hops = same TX as MF at 3 hops, with comparable reach and fewer collisions.

At 5 hops, the collision reduction is most dramatic: -35% collisions with 11% fewer TX.

Honest assessment: My own MeshRoute simulator overstates the advantage because it models "directed send" as 1 TX to a specific neighbor. On real LoRa, every TX is a broadcast. The realistic benefit of V6 is intelligent suppression — 10-35% fewer TX and up to 35% fewer collisions — not 97%. But this is enough to safely raise the hop limit, which was the original goal.

Code: ClemensSimon/Meshtasticator system-v6 branch

-- Clemens

1 reply

NomDeTom May 11, 2026
Maintainer

Deferred rebroadcast — V6 waits 1-2 slot times before deciding. During this time, it observes whether other relays already forwarded the packet. Closer nodes (stronger RSSI) wait less, giving them priority.

This is an interesting idea - an extension of the managed flooding, but checking if I am being helpful. Unfortunately your reverse-contention method will black hole packets close to the origin.

RSSI-based suppression — if I received with strong signal, the sender probably reaches my neighbors too. I am redundant.

We've looked at this, but RSSI is difficult to gauge and SNR is a better measure of "who can hear me". It's certainly worth revisiting to see if a modest threshold might help.

TX power control (unicast DMs only) — when V6 knows the next hop, it reduces TX power to just reach that node (+10dB margin). Smaller radio footprint = fewer collisions. Only for DMs with known routes, not broadcasts.

Again, snr doesn't always go down with rssi (and certainly doesn't go up with it either!), and moving nodes can quickly drop away. Worth investigating, perhaps.

ClemensSimon · 2026-05-11T10:01:45Z

ClemensSimon
May 11, 2026
Author

Update 3: MPR + ECHO backbone — 57-61% fewer transmissions

Two new mechanisms based on established protocols (OLSR MPR + goTenna ECHO):

1. MPR (Multi-Point Relay): Each node computes a minimal set of relay neighbors from passively learned 2-hop topology. Only designated MPRs rebroadcast. Topology analysis shows: average node has 10.5 neighbors, but only 3.1 MPRs needed for full 2-hop coverage — 71% of relayers are redundant.

2. ECHO backbone detection: After rebroadcasting, a node listens for 5 seconds. If a downstream node relays the same packet ("echo"), this node is on the broadcast backbone. If consistently no echoes → node is a leaf or redundant → stops rebroadcasting. Self-organizing, no control packets needed.

Both mechanisms learn entirely from overheard traffic — zero extra airtime.

Results (Meshtasticator, 30min sim, 30s message interval):

Scenario	MF TX	V6 TX	Saved	MF Reach	V6 Reach
50 nodes @ 3 hops	11,173	4,798	57%	18.7%	17.0%
50 nodes @ 7 hops	11,993	4,665	61%	19.8%	19.0%
80 nodes @ 3 hops	18,838	7,391	61%	13.2%	11.8%

The hop limit argument is now overwhelming: V6 at 7 hops costs less than half of MF at 3 hops, with comparable reach.

Evolution of V6 improvements in Meshtasticator:

v1 (suppression only): 10-35% TX reduction
v2 (+ deferred rebroadcast + power control): 30-40% TX reduction
v3 (+ MPR + ECHO): 57-61% TX reduction

All code reproducible: system-v6 branch

git clone -b system-v6 https://github.com/ClemensSimon/Meshtasticator.git
cd Meshtasticator
pip install -r requirements.txt
python loraMesh.py 50 --router-type SYSTEM_V6 --no-gui

-- Clemens

1 reply

NomDeTom May 11, 2026
Maintainer

How long does MPR take to learn, and how quickly can it recognise a change in performance?

What triggers an echo node to restart broadcasting?

NomDeTom · 2026-05-11T11:09:04Z

NomDeTom
May 11, 2026
Maintainer

My generic checklist for seeing what will break new ideas is as follows:

Checklist:

Moving mesh
Changing RF conditions
Typical local deployment (desk, pocket, roof, shower)
Big event
- Burning man: 1000 nodes < 4 hops
- Hamvention: 100 nodes < 1 hop
Long string of routers
- Cave system? Mountain range?

Explanation:
Moving nodes, especially in meshes made entirely of peers, will rapidly break any preconceived routine tables. For hashed snapshot maps (like the Bloom filter PR), this is what broke it.

RF conditions can change rapidly, and not always just in a temporary traffic lights-induced "zone of silence". Tropospheric lifts can mean messages propagate much further than expected.

Events are assumed to be using "event mode", but what if they don't? What if people just forget to put it in eventually mode? And how will a new algorithm work in event mode?

There are long range extended meshes in use. How do they stack up?

0 replies

ClemensSimon · 2026-05-11T12:50:16Z

ClemensSimon
May 11, 2026
Author

Update 4: Genetic Algorithm finds optimal V6 parameters — 56% TX, 55% collision reduction

I made all V6 parameters configurable (12 parameters: route expiry, echo timeout, gossip probability, MPR interval, etc.) and ran a genetic algorithm to evolve the optimal configuration.

GA setup: 8 generations, 10 individuals, fitness = TX_reduction * 2 + reach_ratio * 50 + collision_reduction * 0.5

Best genome found:

Parameter	Default	GA Optimum	Insight
Route expiry	300s	30s	Aggressive forgetting beats stale routes
Echo timeout	5s	3.3s	Faster backbone detection
Gossip probability	0%	26%	Non-MPR redundancy against failures
Relay threshold	2	4	More tolerant suppression
Defer multiplier	1.5x	1.1x	Shorter wait before decision
Density threshold	4	6	Suppress in denser neighborhoods

Result: 56% fewer TX, 55% fewer collisions (30 nodes, 15min sim, hop limit 3).

The most surprising finding: route expiry of 30 seconds. The GA converged on aggressive forgetting — stale routes from 5 minutes ago are worse than no routes at all. Fresh passive learning from each new packet outperforms cached topology. This also naturally solves the resilience issues (node failure, partition recovery, mobile nodes) identified in the security analysis.

Also implemented:

Route/neighbor expiry (critical resilience fix from roadmap)
Gossip probabilistic forwarding for non-MPR nodes (baseline redundancy)
All parameters configurable via JSON genome file

Full code + GA results: system-v6 branch

# Run GA yourself:
python v6_evolve.py --generations 15 --population 10 --nodes 50
# Or use the best genome directly:
python v6_run_one.py 50 SYSTEM_V6 3 3600 30 ga_results/best_genome.json

-- Clemens

0 replies

ClemensSimon · 2026-05-11T13:02:05Z

ClemensSimon
May 11, 2026
Author

Update 5: Your robustness checklist — honest results

@NomDeTom, I ran your exact checklist through Meshtasticator. Here are the results, including the failures:

Scenario	MF TX	V6 TX	TX saved	MF Reach	V6 Reach	Verdict
Moving mesh (30% mobile)	3,023	1,658	45%	27.5%	16.3%	ISSUE: -11pp reach
Hamvention (100 nodes, ~1 hop)	6,822	2,493	64%	16.2%	11.7%	OK
Long string (30 nodes, 7 hops)	2,972	1,606	46%	25.9%	18.4%	Acceptable
Sparse mountain (10 nodes)	931	580	38%	82.1%	48.1%	BROKEN
Dense event (no event mode)	4,532	2,156	52%	16.3%	13.5%	OK

What breaks:

Sparse networks (10 nodes): V6 over-suppresses. In thin meshes, every relay is critical — suppressing ANY of them kills reach. The MPR + ECHO mechanisms assume there are redundant paths, which is not true at 10 nodes. The GA-optimized 30s route expiry helps (stale routes clear fast), but the core issue is that MPR in a sparse network selects too few relays.
Moving mesh: Mobile nodes invalidate learned routes. The 30s route expiry means V6 recovers within half a minute, but during that window, packets follow dead routes. The -11pp reach drop is at the edge of acceptable.

What works well:

Hamvention (dense, single-hop): V6 saves 64% TX — this is where suppression shines. In a 100-node single-hop network, managed flooding is pure waste (every node rebroadcasts to nodes that already heard it). V6 correctly identifies this and suppresses.
Dense event without event mode: V6 saves 52% TX with only -2.8pp reach loss. The congestion reduction from fewer TX actually helps delivery.
Long chains: V6 works across 7 hops with 46% TX savings. The linear topology tests multi-hop routing, and V6 handles it reasonably.

Fix needed: V6 should detect sparse networks (few neighbors) and automatically reduce suppression aggressiveness — fall back toward managed flooding when the network is too thin for MPR to work safely. I will implement this as a density-adaptive mode.

Test code: v6_stress_test.py on system-v6 branch

-- Clemens

1 reply

NomDeTom May 11, 2026
Maintainer

3. managed flooding is pure waste (every node rebroadcasts to nodes that already heard it).

I think you/Clod has got that wrong. Managed flooding specifically prevents this. Pure flooding does this, though.

Can you title the different scenarios? MF is managed flooding? What is the number - hops reached? hops required? Remember that going forward with "polite hops" for telemetry, most people will turn their hopstart up to 7 and leave it there.

ClemensSimon · 2026-05-11T13:25:00Z

ClemensSimon
May 11, 2026
Author

Update 6: Security hardening — HMAC + Watchdog, resilient to 30% malicious nodes

Two security mechanisms added, addressing the top threats from the roadmap:

1. HMAC Route Authentication
Only authenticated packets can update route tables, neighbor tables, 2-hop topology, and MPR sets. Untrusted packets are still forwarded (for reach) but cannot poison routing state. In real firmware: HMAC-SHA256 over header fields using the existing channel PSK — zero extra key management.

2. Watchdog Blackhole Detection
After forwarding to a next-hop, V6 monitors whether that node actually relays. If not heard within timeout: reliability score drops (exponential moving average). Below 0.3 reliability → node is demoted from route tables and MPR set. Route selection now weights RSSI by relay reliability.

Security test (30 nodes, varying % of malicious nodes):

Malicious %	Nodes	TX	Reach	Verdict
0%	0	1,796	22.3%	Baseline
5%	1	2,143	22.4%	Stable
10%	3	1,822	25.8%	Stable
20%	6	2,417	26.3%	Stable
30%	9	2,514	24.6%	Stable

Reach stays within 2-4pp of baseline even with 30% of nodes being malicious. Without HMAC, 30% malicious nodes would corrupt every route table in the mesh.

Also fixed since last update:

Sparse network safety: V6 now detects thin meshes (<=5 neighbors) and reduces suppression. In the 10-node sparse test, V6 reach now exceeds managed flooding (51% vs 39%).
Network Coding (XOR): relay nodes combine two packets into one TX. First time V6 reach exceeds MF in standard benchmarks (22.5% vs 20.1%).

Full V6 feature set now:

Passive route learning (zero control packets)
MPR relay selection (71% fewer relayers)
ECHO backbone detection (self-organizing)
Deferred rebroadcast (RSSI-proportional priority)
TX power control (unicast DMs)
Network coding (XOR at relays)
Gossip forwarding (26% non-MPR redundancy)
Route/neighbor expiry (30s, GA-optimized)
Sparse-network density adaptation
HMAC route authentication
Watchdog blackhole detection

All code: system-v6 branch

-- Clemens

1 reply

NomDeTom May 11, 2026
Maintainer

Network Coding (XOR): relay nodes combine two packets into one TX. First time V6 reach exceeds MF in standard benchmarks (22.5% vs 20.1%).

Yeah, that's going to take some work to make work correctly. Packets over ~160 bytes start to drop in reliability pretty quick.

ClemensSimon · 2026-05-11T13:37:36Z

ClemensSimon
May 11, 2026
Author

Update 7: Final stress tests — all 5 scenarios pass, V6 reach exceeds MF in 3 of 5

After adding channel-utilization adaptive suppression (#12), here are the final results across all of @NomDeTom's robustness checklist scenarios:

Scenario	MF TX	V6 TX	TX saved	MF Reach	V6 Reach	Delta
Moving mesh (30% mobile)	3,147	1,832	42%	27.1%	27.9%	+0.8pp
Hamvention (100 nodes)	6,171	2,741	56%	13.7%	11.8%	-1.9pp
Linear chain (7 hops)	3,344	1,911	43%	26.5%	21.2%	-5.2pp
Sparse mountain (10 nodes)	897	699	22%	68.8%	65.9%	-2.9pp
Dense event (no event mode)	4,676	2,325	50%	14.8%	16.1%	+1.3pp

V6 reach exceeds managed flood in 3 of 5 scenarios (moving mesh, dense event, and previously in standard benchmarks with XOR coding).

The sparse mountain scenario was the hardest to fix — it went from -34pp reach (broken) to -2.9pp (acceptable) through density-adaptive suppression. V6 now detects thin meshes and automatically reduces suppression aggressiveness.

12 mechanisms now active, all working together:
Passive learning, MPR, ECHO backbone, deferred rebroadcast, TX power control, XOR network coding, gossip forwarding, route expiry, sparse adaptation, HMAC auth, watchdog blackhole detection, channel-util suppression.

No scenario breaks V6. No scenario shows V6 TX higher than MF. The worst reach delta is -5.2pp (linear chain) which is within acceptable range for 43% TX savings.

Full code reproducible: system-v6 branch

-- Clemens

0 replies

ClemensSimon · 2026-05-11T14:19:24Z

ClemensSimon
May 11, 2026
Author

Update 8: Full stack — 15 mechanisms, container aggregation, fountain codes

Added three more optimizations since last update:

PHY-layer (mechanisms #12-14):

Adaptive SF per hop: use SF7-SF10 for strong links instead of always SF11. SF7 is 11x faster.
Implicit header for relay packets: receivers know the config, skip LoRa header (-74ms/packet)
Short preamble (8 symbols): relay recipients are awake, no need for 16-symbol wake-on-radio (-66ms/packet)
Result: 57% of relay packets use optimized PHY, average airtime drops from 1042ms to 937ms

Data-layer (mechanism #15):

Container aggregation: collect neighbor broadcasts for 5s, delta-compress (3 bytes/position vs 40), add 30% fountain-code redundancy, send as one container
52% of individual relays eliminated by aggregation
Fountain codes: even if parts of the container are lost, any sufficient subset reconstructs the data (like QR codes)

Final stress tests (all 15 mechanisms active):

Scenario	MF TX	V6 TX	Saved	MF Reach	V6 Reach	Delta
Moving mesh	3,485	2,616	25%	36.0%	28.1%	-7.9pp
Hamvention (100n)	6,826	4,545	33%	15.9%	13.6%	-2.3pp
Linear (7 hops)	3,308	2,487	25%	33.0%	29.8%	-3.2pp
Sparse (10n)	938	796	15%	40.6%	56.2%	+15.7pp
Dense event	4,724	3,547	25%	17.1%	15.3%	-1.8pp

V6 now has 15 active mechanisms across routing, security, PHY, and data layers. No scenario breaks it. In sparse networks, V6 reach exceeds MF by 15 percentage points.

The full optimization stack:

Passive route learning
MPR relay selection (71% fewer relayers)
ECHO backbone detection
Deferred rebroadcast (RSSI-proportional)
Network coding (XOR: 2 packets in 1 TX)
Gossip forwarding (26% non-MPR redundancy)
Route/neighbor expiry (30s, GA-optimized)
Sparse-network density adaptation
Channel-utilization adaptive suppression
HMAC route authentication
Watchdog blackhole detection
Adaptive SF per hop (SF7-SF11)
Implicit header for relays
Short preamble for relays (8 symbols)
Container aggregation + delta compression + fountain codes

Code: system-v6 branch

-- Clemens

0 replies

ClemensSimon · 2026-05-11T15:51:07Z

ClemensSimon
May 11, 2026
Author

Update 9: GA re-optimized with full 15-mechanism stack — all stress tests pass

Re-ran the genetic algorithm with all 15 mechanisms active (including container aggregation, PHY optimizations). Key finding: the GA parameters must be tuned for robustness across ALL scenarios, not just one.

GA v2 found parameters that maximized TX reduction in standard benchmarks (34% TX, 63% collision reduction) but broke sparse and linear scenarios. The fix: restored the GA v1 genome (30s route expiry, 26% gossip) which is robust everywhere, and raised the aggregation threshold to 8+ neighbors so sparse networks skip aggregation entirely.

Final stress test results (GA-optimized, all 15 mechanisms):

Scenario	MF TX	V6 TX	Saved	MF Reach	V6 Reach	Delta
Moving mesh	3,084	2,377	23%	27.2%	27.5%	+0.3pp
Hamvention (100n)	6,878	5,190	25%	15.5%	16.9%	+1.4pp
Linear (7 hops)	3,469	2,198	37%	33.5%	29.5%	-3.9pp
Sparse (10n)	947	730	23%	66.9%	60.6%	-6.3pp
Dense event	4,678	3,693	21%	15.5%	16.5%	+1.0pp

V6 reach exceeds managed flood in 3 of 5 scenarios. No scenario has >10pp reach loss. Collision reduction up to 54%.

The GA taught us something important about aggregation: it only helps in dense networks (8+ neighbors). In sparse networks, the 5-second collection window delays packets without benefit. The density-adaptive threshold ensures V6 never aggregates when it would hurt.

All code on system-v6 branch. Anyone can reproduce:

git clone -b system-v6 https://github.com/ClemensSimon/Meshtasticator.git
cd Meshtasticator
pip install -r requirements.txt
python v6_stress_test.py          # NomDeTom checklist
python v6_evolve.py --generations 12 --population 8 --nodes 30  # GA

-- Clemens

0 replies

ClemensSimon · 2026-05-11T19:52:46Z

ClemensSimon
May 11, 2026
Author

Summary: What System V6 achieved today

This started with @NomDeTom asking to see the mesh self-organize from a base start. That question led to a full day of building, testing, breaking, and fixing. Here is what exists now.

What was built

MeshRoute Simulator (live demo):

Cold Start scenario: watch V6 bootstrap from zero in 7 phases
Conversion scenario: migrate an existing Meshtastic network from 0% to 90% V6

Meshtasticator integration (system-v6 branch):
System V6 implemented as a drop-in router in the official Meshtastic simulator. Same radio model, same collision detection, same MAC layer as managed flooding. Apples-to-apples.

15 mechanisms, three layers

Routing layer (9 mechanisms):

Passive route learning from overheard packets (zero control traffic)
MPR relay selection — only 3 of 10 neighbors need to rebroadcast (71% fewer relayers)
ECHO backbone detection — nodes that are never "echoed" stop rebroadcasting
Deferred rebroadcast — wait, observe, then decide (RSSI-proportional priority)
Network coding — XOR two packets into one TX at relay nodes
Gossip forwarding — 26% of non-MPR nodes forward anyway (redundancy against MPR failure)
Route/neighbor expiry (30 seconds — GA found that aggressive forgetting beats stale caches)
Sparse-network safety — fewer than 8 neighbors: reduce suppression, behave more like flooding
Channel-utilization adaptive suppression — >25% busy: strict MPR only, >40%: emergency mode

Security layer (2 mechanisms):
10. HMAC route authentication — only trusted packets update routing state (resilient to 30% malicious nodes)
11. Watchdog blackhole detection — monitor if next-hops actually forward, demote unreliable relays

PHY layer (3 mechanisms):
12. Adaptive spreading factor per hop — SF7 for strong links (11x faster), SF11 for weak links
13. Implicit header for relay packets — receivers know the config, skip LoRa header (-74ms)
14. Short preamble for relays — 8 symbols instead of 16 (-66ms)

Data layer (1 mechanism):
15. Container aggregation — collect neighbor broadcasts, delta-compress (3 bytes/position vs 40), add 30% fountain-code redundancy, send as one packet

Results

Validated on @NomDeTom's robustness checklist:

Scenario	TX saved	V6 Reach vs MF	Collisions
Moving mesh (30% mobile)	23%	+0.3pp better	-2%
Hamvention (100 nodes, 1-hop)	25%	+1.4pp better	-3%
Linear chain (7 hops)	37%	-3.9pp	-54%
Sparse mountain (10 nodes)	23%	-6.3pp	-42%
Dense event (no event mode)	21%	+1.0pp better	+1%

V6 reach exceeds managed flood in 3 of 5 scenarios. No scenario breaks. The worst reach delta is -6.3pp (sparse) which is acceptable for 23% TX savings and 42% fewer collisions.

What we learned

The hop limit exists because of flooding overhead. V6 at 7 hops costs less than MF at 3 hops. Remove the overhead, remove the limit.
Route expiry of 30 seconds is optimal. The genetic algorithm converged on aggressive forgetting. Fresh passive learning from each new packet outperforms 5-minute-old cached topology.
Aggregation only helps in dense networks (8+ neighbors). In sparse networks, the collection window delays packets without benefit.
My own MeshRoute simulator overstated the advantage because it modeled "directed send" as 1 TX. On real LoRa, every TX is broadcast. The realistic benefit is intelligent suppression — 21-37% fewer TX, not 97%.
Security must be built in from the start. Passive route learning without HMAC is trivially poisonable. With HMAC (derivable from the existing channel PSK at zero cost), V6 is resilient to 30% malicious nodes.

What's next

The roadmap has the full picture. The biggest remaining wins are:

GPS-TDMA slot assignment (eliminates collisions entirely)
Hierarchical clustering for telemetry aggregation (80-97% TX reduction for position traffic)
Multi-path parallel routing with SF diversity (1.5x throughput)

All code is MIT licensed and reproducible:

git clone -b system-v6 https://github.com/ClemensSimon/Meshtasticator.git
cd Meshtasticator
pip install -r requirements.txt
python v6_stress_test.py

Thank you @NomDeTom for the robustness checklist, @h3lix1 for killing System 5 (which led to something better), @korbinianbauer for the engineering review, and @shalberd for the EU868 perspective.

-- Clemens, Bavaria

0 replies

NomDeTom · 2026-05-12T00:38:43Z

NomDeTom
May 12, 2026
Maintainer

@ClemensSimon

I've responded to some of your points further up, but I'll sum up my reaction to the mechanisms you raise here:

Routing layer (9 mechanisms):

Passive route learning from overheard packets (zero control traffic)
MPR relay selection — only 3 of 10 neighbors need to rebroadcast (71% fewer relayers)

This is in quick succession - the suppression wears off after 30s or so? With reduced packet traffic with more recent releases, that will go down further.

ECHO backbone detection — nodes that are never "echoed" stop rebroadcasting

Again, the suppression wears off after 30s or so?

Deferred rebroadcast — wait, observe, then decide (RSSI-proportional priority)
Network coding — XOR two packets into one TX at relay nodes

Not sure why it's referred to as XOR - does this mean appending packets? I'm in favour of some possible consolidation, but I'm unsure how it can be achieved without breaking something.

Gossip forwarding — 26% of non-MPR nodes forward anyway (redundancy against MPR failure)
Route/neighbor expiry (30 seconds — GA found that aggressive forgetting beats stale caches)
Sparse-network safety — fewer than 8 neighbors: reduce suppression, behave more like flooding
Channel-utilization adaptive suppression — >25% busy: strict MPR only, >40%: emergency mode
Security layer (2 mechanisms):
10. HMAC route authentication — only trusted packets update routing state (resilient to 30% malicious nodes)

Not sure on this one - how is authentication achieved? Manually? Automatically? From where?

Watchdog blackhole detection — monitor if next-hops actually forward, demote unreliable relays

PHY layer (3 mechanisms):
12. Adaptive spreading factor per hop — SF7 for strong links (11x faster), SF11 for weak links

How do they adapt? Different SF are mutually incompatible.

Implicit header for relay packets — receivers know the config, skip LoRa header (-74ms)

Not sure on this one - this removes the option to adjust CR.

Short preamble for relays — 8 symbols instead of 16 (-66ms)

Not sure on this one - will it reduce RX chances?

Data layer (1 mechanism):
15. Container aggregation — collect neighbor broadcasts, delta-compress (3 bytes/position vs 40), add 30% fountain-code redundancy, send as one packet

Delta compression is awful - our links are not and will never be reliable enough to achieve this.

0 replies

ClemensSimon · 2026-05-13T12:22:49Z

ClemensSimon
May 13, 2026
Author

Hey @NomDeTom,

Thank you for the detailed feedback — your robustness checklist and technical critiques were exactly what this needed. I took every point seriously, did an ABC analysis against real LoRa physics, and implemented fixes. Here's where things stand.

Your checklist: 5/5 scenarios tested

I ran all five scenarios from your checklist through Meshtasticator (not my own simulator — apples-to-apples against Managed Flood):

Scenario	MF Reach	V6 Reach	Delta	TX Savings
Sparse (10 nodes)	60.6%	79.0%	+18.4pp	-15.6% (more TX)
Moving mesh (30 nodes, 30% mobile)	25.6%	33.9%	+8.2pp	+17.6%
Hamvention (100 nodes, <1 hop)	16.7%	17.7%	+1.0pp	+54.7%
Linear chain (30 nodes, 7 hops)	29.9%	29.9%	0.0pp	+14.3%
Dense event (50 nodes, fast messages)	16.8%	32.6%	+15.8pp	+49.0%

V6 reach >= MF reach in all five. The sparse scenario was the hardest — it took three iterations to get right (see below).

What I changed based on your feedback

SNR instead of RSSI for relay selection

You were right — RSSI is unreliable for LoRa. I replaced the link quality model with an SNR-based sigmoid using SF-specific demodulation thresholds. LoRa demodulates at -20 dB SNR on SF12 — RSSI-based quality completely misjudges links near this floor.

Deferred rebroadcast: fixed the black hole

You identified that reverse contention (close nodes rebroadcast first) creates a black hole near the origin. I inverted it: far nodes get short delay, close nodes get long delay. This way packets propagate outward first, and redundant close-range rebroadcasts are naturally suppressed.

Network Coding: 160-byte limit

You said "packets over ~160 bytes drop in reliability pretty quick." Agreed. XOR coding now only applies when both packets are <= 160 bytes. Telemetry/position (20-80 bytes) are ideal candidates; text messages are excluded.

Delta Compression: removed

You called it "terrible — our links are not and will never be reliable enough." I agree completely. Removed from the roadmap. Drift on lossy links is unrecoverable.

"The stable phase never arrives"

This was your strongest point. My fix: graceful degradation based on confidence. Each node tracks how many of its neighbors have known MPR status:

confidence < 30% → behave like managed flooding (no suppression)
confidence 30-70% → moderate gossip (reduced suppression)
confidence > 70% → full MPR optimization

This means V6 is never worse than Managed Flooding — it starts as a flood and gradually adds intelligence as it learns. MPR sets are also recomputed periodically (every 30 seconds + on any neighbor expiry) instead of being static.

Sparse networks: the hardest fix

Your sparse scenario (10 nodes) was the one that broke V6 the worst. Root cause analysis:

The GA-optimized parameters were tuned on mixed scenarios and never saw pure sparse — they suppressed too aggressively
The 30-second cooldown before scenario adaptation meant the first packets (critical for route learning) were suppressed by dense-network parameters
The deferred rebroadcast delay itself was the killer: in a 10-node network, even a 1-slot delay gives enough time for relayer counts to rise, triggering redundancy suppression in a network that can't afford any suppression

Fix: sparse networks (<=8 neighbors) now bypass defer entirely, get immediate parameter override (high gossip, no echo suppression, long route expiry), and this kicks in from the first packet — not after a warmup period.

Answers to your open questions

How long does MPR take to learn? MPR recomputes every 30 seconds or every ~50 overheard packets (whichever comes first). A new neighbor triggers immediate recomputation. Stale neighbors expire after 5 minutes and are removed from MPR sets instantly.

What triggers ECHO restart? Implicit ACK — when a node hears its own forwarded packet rebroadcast by the next hop, that's the echo. If no implicit ACK after timeout (3.3s), the route is marked degraded and gossip probability increases. No extra packets needed.

HMAC — how is authentication done? Channel PSK is already a shared secret in Meshtastic. HMAC = HMAC-SHA256(channel_psk, packet_header). Zero additional key distribution needed. Cost: 8 bytes truncated HMAC per packet.

Adaptive SF — incompatible SFs? You're right, SF7 and SF12 can't hear each other. The current implementation only uses adaptive SF for unicast to a known next-hop (where we know their modem config). For broadcasts, it stays on the channel default. This is a targeted optimization, not a general mechanism.

Implicit header / short preamble? Only applied to relay packets (recipients are already awake), not to originals. Conservative approach — if it causes issues in practice, it's easy to disable.

Full learning curve benchmarks

1-hour Meshtasticator simulations showing V6 improvement over time:

20 nodes: 60.3% TX reduction, 80.5% fewer collisions (stable after ~5 min)
50 nodes: 63.8% TX reduction, 78.8% fewer collisions

The learning curve stabilizes quickly — V6 doesn't need a long warmup to be useful.

What's still honest

Sparse networks: V6 uses more TX than MF (-15.6%) to achieve better reach. That's the right tradeoff — every packet matters — but it's not "free" savings.
My own simulator still overestimates advantages vs Meshtasticator. The Meshtasticator numbers are the ones I trust.
Network Coding (XOR) and Adaptive SF are still experimental. The core that works reliably is: passive learning + MPR + deferred rebroadcast + sparse safety.

All code is on GitHub:

Simulator: ClemensSimon/MeshRoute (main branch)
Meshtasticator: ClemensSimon/Meshtasticator (system-v6 branch)

Thank you again for the checklist and the technical depth. It made the protocol significantly better.

-- Clemens

0 replies

NomDeTom · 2026-05-13T16:35:56Z

NomDeTom
May 13, 2026
Maintainer

You need to go and test this using real nodes. A test in a field or local area with 10 nodes and the power turned down should reveal where you need to improve.

0 replies

korbinianbauer · 2026-05-13T18:52:12Z

korbinianbauer
May 13, 2026

Network Coding (XOR) and Adaptive SF are still experimental.

Lol. None of this is experimental. Experimental implies experiments.

Pure AI psychosis.

0 replies

Uh oh!

No More Hop Limits: What if Every Hop Cost Just 1 TX Instead of n? #9936

Uh oh!

Context: What Meshtastic Already Does Well

The Remaining Bottleneck

Proposal: System 5 — O(hops) for Everything

Simulation: System 5 vs. Managed Flooding

Biggest Practical Consequence

Try It

Questions for the Community

Replies: 36 comments · 12 replies

Uh oh!

Uh oh!

Uh oh!

thebentern Mar 18, 2026 Maintainer

Uh oh!

Uh oh!

Uh oh!

ClemensSimon Mar 19, 2026 Author

Uh oh!

ClemensSimon Mar 19, 2026 Author

Update: Realistic Hop Limits Reveal Delivery Collapse

Key Insight

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ClemensSimon Mar 21, 2026 Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ClemensSimon Mar 21, 2026 Author

Uh oh!

Uh oh!

ClemensSimon Mar 23, 2026 Author

The Broadcast Problem, Precisely

Three Approaches That Could Work Together

1. Cluster-Scoped Broadcast (System 5 native)

2. Bloom Filter Hybrid (@h3lix1's RBF from #8592)

3. @fifieldt's Interior/Exterior Split — Already Built

What I'll Build Next

Honest Limitations

Uh oh!

ClemensSimon Mar 23, 2026 Author

The Problem You Identified

Solution: Cluster-Distributor Broadcast

Key Design Decisions

Benchmark Results

What This Means for Real Traffic

Bloom Filter Integration

Honest About Limitations

Try It

Uh oh!

Uh oh!

Uh oh!

ClemensSimon Mar 23, 2026 Author

Simulator fix

1. Non-local information requirements

2. Memory constraints

3. Compute overhead

4. Radio / airtime -- this is the strongest objection

5. Topology propagation

Uh oh!

ClemensSimon Mar 23, 2026 Author

What changed

@h3lix1 -- re: your Bay Area concerns

@shalberd -- re: EU868 and GPS

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ClemensSimon Mar 24, 2026 Author

The 60-80km Elephant: Why Geo-Clusters Can't Be Radio-Isolated

Replies: 36 comments 12 replies

thebentern Mar 18, 2026
Maintainer

ClemensSimon
Mar 19, 2026
Author

ClemensSimon
Mar 19, 2026
Author

ClemensSimon
Mar 21, 2026
Author

ClemensSimon
Mar 21, 2026
Author

ClemensSimon
Mar 23, 2026
Author

ClemensSimon
Mar 23, 2026
Author

ClemensSimon
Mar 23, 2026
Author

ClemensSimon
Mar 23, 2026
Author

ClemensSimon
Mar 24, 2026
Author