perf(shm): add lock-free segment pool to PosixShmProviderBackendTalc#2514
Open
YuanYuYuan wants to merge 12 commits into
Open
perf(shm): add lock-free segment pool to PosixShmProviderBackendTalc#2514YuanYuYuan wants to merge 12 commits into
YuanYuYuan wants to merge 12 commits into
Conversation
The Waiter::wait() hot path in the transport pipeline created and dropped
an event_listener::EventListener on every batch send cycle. In perf
profiles of ros-z-pingpong this showed as 12.18% CPU in
drop_in_place<EventListener> on the tx-0 thread.
Replace EventInner { EventLib + AtomicU8 } with { Mutex<u8> + Condvar }.
This eliminates per-wakeup EventListener allocation and intrusive-list
register/deregister overhead. Semantics are preserved:
- Sticky notification: if notify() fires with no waiter, the next wait()
returns immediately (flag stored in the mutex-protected state).
- Error propagation: last Notifier drop sets ERR and wakes all waiters.
- wait_async: delegated to spawn_blocking (not on the hot path).
All existing zenoh-sync tests pass.
The original EventListener alloc+intrusive-list per wait() cost CPU on the transport hot path (12% in perf profiles). Previous fix attempt used spawn_blocking for wait_async(), which is not cancellation-safe: tokio::time::timeout() cancels the outer future but the blocking thread detaches and eventually consumes the next real notification, causing a hang at 64KB+ payloads (fragmented messages). New approach: - sync paths (wait/wait_deadline/wait_timeout): std Mutex<u8> + Condvar (zero allocation, no intrusive list) - async path (wait_async): tokio::sync::Notify with enable() pattern (cancellation-safe: dropped Notified futures return their permit) The enable()-before-state-check pattern ensures notifications are never missed between the state check and the .await.
…hroughput std::sync::Mutex uncontended lock/unlock costs ~90 ns; parking_lot::Mutex costs ~20 ns. The ev-fix design requires notify() to briefly acquire the condvar mutex on every call (lost-wakeup fence). With std that made the sticky (already-set) wait() path cost 179 ns vs event_listener's 31 ns. With parking_lot the sticky path measures 32 ns — matching event_listener — while the cross-thread wakeup path improves to 4.7 µs (from 5.1 µs with event_listener). No semantic change; parking_lot is already in the lockfile.
2affdc2 to
41067eb
Compare
e5cf1e8 to
1bf318f
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2514 +/- ##
==========================================
+ Coverage 72.48% 72.53% +0.04%
==========================================
Files 390 392 +2
Lines 63374 63455 +81
==========================================
+ Hits 45937 46026 +89
+ Misses 17437 17429 -8 ☔ View full report in Codecov by Sentry. |
Measures the sticky (hot) path and cross-thread wakeup path. Used to compare event_listener vs the Mutex+Condvar+parking_lot replacement introduced in the preceding commits.
1bf318f to
54e6264
Compare
Sort [[bench]] keys alphabetically (harness before name) to satisfy taplo reorder_keys=true. Clear the default peer listen endpoint in pub_config of test_session_from_cloned_config to prevent the OS from assigning port 38446 to pub_session's random listener, which collided with sub_session's explicit bind on that port.
Introduce ShmSegmentPool: a per-size pool of idle POSIX SHM segments that eliminates per-message shm_open+mmap / munmap+shm_unlink cycles on the hot publish path. Design: - ShmSegmentPool stores segments in per-size ArrayQueue (crossbeam, bounded MPMC lock-free). Hot path (take/give after warmup) requires only a shared RwLock read + single CAS — no mutex waiting under concurrent stream access. - PooledPosixShmSegment: Drop impl returns the segment to the pool instead of unmapping, keeping TLB and page-cache entries hot. - New builder_with_pool(layout, ShmPoolConfig) constructor enables the pool path; existing builder() retains the traditional Talc sub- allocation path unchanged. - ShmPoolConfig::max_idle_per_size (default 8) caps resident memory at 8 MB per provider at 1 MB payload size.
Measures the two steady-state paths of ShmSegmentPool: alloc/cold/1MB ~314 µs shm_open+ftruncate+mmap+mlock per alloc alloc/warm/1MB ~134 ns ArrayQueue pop+push (two CAS, no syscalls) Ratio: ~2300x. Demonstrates what pool reuse eliminates on the hot publish path at 1 MB payload size.
54e6264 to
0ddfe24
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add a lock-free segment pool to
PosixShmProviderBackendTalcso that SHM segments are reused across sends instead of being created and destroyed per-message. Previously every send calledshm_open+mmap+ftruncate, which incurred filesystem quota accounting (__dquot_alloc_spaceat ~10% CPU in profiling). The pool reduces the hot path to a singleArrayQueueCAS.Key Changes
commons/zenoh-shm/src/api/protocol_implementations/posix/posix_shm_provider_backend_talc.rs: check pool first inalloc;freeis a no-op when pool is active (segment returned on drop instead)commons/zenoh-shm/src/api/protocol_implementations/posix/posix_shm_segment.rs: exposesize()accessor for pool keyingcommons/zenoh-shm/src/posix_shm/pool.rs:ShmSegmentPool— per-sizeArrayQueueof idle segments; hot path is read lock + single CAScommons/zenoh-shm/src/posix_shm/pooled_segment.rs:PooledPosixShmSegment—Dropreturns segment to pool instead of unmappingcommons/zenoh-shm/benches/shm_pool.rs: criterion benchmark measuring cold (pool miss) and warm (pool hit) allocation latencyBenchmark Results
Flamegraph:
__dquot_alloc_spaceat 9.56% before (filesystem quota accounting per new segment), absent after.Reproducing
Results land in
target/criterion/shm_pool/.Breaking Changes
None
🏷️ Label-Based Checklist
Based on the labels applied to this PR, please complete these additional requirements:
Labels:
new feature🆕 New Feature Requirements
Since this PR adds a new feature:
Consider: Can this feature be split into smaller, incremental PRs?
Instructions:
- [ ]to- [x])This checklist updates automatically when labels change, but preserves your checked boxes.