Skip to content

perf(shm): add lock-free segment pool to PosixShmProviderBackendTalc#2514

Open
YuanYuYuan wants to merge 12 commits into
eclipse-zenoh:mainfrom
ZettaScaleLabs:feat/shm-pool
Open

perf(shm): add lock-free segment pool to PosixShmProviderBackendTalc#2514
YuanYuYuan wants to merge 12 commits into
eclipse-zenoh:mainfrom
ZettaScaleLabs:feat/shm-pool

Conversation

@YuanYuYuan
Copy link
Copy Markdown
Contributor

@YuanYuYuan YuanYuYuan commented Mar 23, 2026

Summary

Stacks on #2513 — review that first; this PR's diff against main includes those commits too.

Add a lock-free segment pool to PosixShmProviderBackendTalc so that SHM segments are reused across sends instead of being created and destroyed per-message. Previously every send called shm_open + mmap + ftruncate, which incurred filesystem quota accounting (__dquot_alloc_space at ~10% CPU in profiling). The pool reduces the hot path to a single ArrayQueue CAS.

Key Changes

  • commons/zenoh-shm/src/api/protocol_implementations/posix/posix_shm_provider_backend_talc.rs: check pool first in alloc; free is a no-op when pool is active (segment returned on drop instead)
  • commons/zenoh-shm/src/api/protocol_implementations/posix/posix_shm_segment.rs: expose size() accessor for pool keying
  • commons/zenoh-shm/src/posix_shm/pool.rs: ShmSegmentPool — per-size ArrayQueue of idle segments; hot path is read lock + single CAS
  • commons/zenoh-shm/src/posix_shm/pooled_segment.rs: PooledPosixShmSegmentDrop returns segment to pool instead of unmapping
  • commons/zenoh-shm/benches/shm_pool.rs: criterion benchmark measuring cold (pool miss) and warm (pool hit) allocation latency

Benchmark Results

path latency
cold (shm_open + mmap) 314 µs
warm (pool hit, ArrayQueue CAS) 134 ns
speedup ~2340×

Flamegraph: __dquot_alloc_space at 9.56% before (filesystem quota accounting per new segment), absent after.

Reproducing

taskset -c 0,1 cargo bench -j4 -p zenoh-shm --bench shm_pool

Results land in target/criterion/shm_pool/.

Breaking Changes

None


🏷️ Label-Based Checklist

Based on the labels applied to this PR, please complete these additional requirements:

Labels: new feature

🆕 New Feature Requirements

Since this PR adds a new feature:

  • Feature scope documented - Clear description of what the feature does and why it's needed
  • Minimum necessary code - Implementation is as simple as possible, doesn't overcomplicate the system
  • New APIs well-designed - Public APIs are intuitive, consistent with existing APIs
  • Comprehensive tests - All functionality is tested (happy path + edge cases + error cases)
  • Examples provided - Usage examples in code comments or separate example files
  • Documentation added - New docs explaining the feature, its use cases, and API
  • Feature flag considered - Can the feature be enabled/disabled for gradual rollout?
  • Performance impact assessed - Memory, CPU, storage implications measured
  • Integration tested - Feature works with existing features

Consider: Can this feature be split into smaller, incremental PRs?

Instructions:

  1. Check off items as you complete them (change - [ ] to - [x])
  2. The PR checklist CI will verify these are completed

This checklist updates automatically when labels change, but preserves your checked boxes.

The Waiter::wait() hot path in the transport pipeline created and dropped
an event_listener::EventListener on every batch send cycle. In perf
profiles of ros-z-pingpong this showed as 12.18% CPU in
drop_in_place<EventListener> on the tx-0 thread.

Replace EventInner { EventLib + AtomicU8 } with { Mutex<u8> + Condvar }.
This eliminates per-wakeup EventListener allocation and intrusive-list
register/deregister overhead. Semantics are preserved:
- Sticky notification: if notify() fires with no waiter, the next wait()
  returns immediately (flag stored in the mutex-protected state).
- Error propagation: last Notifier drop sets ERR and wakes all waiters.
- wait_async: delegated to spawn_blocking (not on the hot path).

All existing zenoh-sync tests pass.
The original EventListener alloc+intrusive-list per wait() cost CPU on
the transport hot path (12% in perf profiles).

Previous fix attempt used spawn_blocking for wait_async(), which is not
cancellation-safe: tokio::time::timeout() cancels the outer future but
the blocking thread detaches and eventually consumes the next real
notification, causing a hang at 64KB+ payloads (fragmented messages).

New approach:
- sync paths (wait/wait_deadline/wait_timeout): std Mutex<u8> + Condvar
  (zero allocation, no intrusive list)
- async path (wait_async): tokio::sync::Notify with enable() pattern
  (cancellation-safe: dropped Notified futures return their permit)

The enable()-before-state-check pattern ensures notifications are never
missed between the state check and the .await.
…hroughput

std::sync::Mutex uncontended lock/unlock costs ~90 ns; parking_lot::Mutex
costs ~20 ns. The ev-fix design requires notify() to briefly acquire the
condvar mutex on every call (lost-wakeup fence). With std that made the
sticky (already-set) wait() path cost 179 ns vs event_listener's 31 ns.

With parking_lot the sticky path measures 32 ns — matching event_listener —
while the cross-thread wakeup path improves to 4.7 µs (from 5.1 µs with
event_listener). No semantic change; parking_lot is already in the lockfile.
@YuanYuYuan YuanYuYuan added the enhancement Existing things could work better label Mar 23, 2026
@YuanYuYuan YuanYuYuan force-pushed the feat/shm-pool branch 3 times, most recently from e5cf1e8 to 1bf318f Compare March 23, 2026 12:42
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 95.29412% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.53%. Comparing base (1fab5d0) to head (24aac22).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
commons/zenoh-sync/src/event.rs 92.40% 6 Missing ⚠️
...entations/posix/posix_shm_provider_backend_talc.rs 95.12% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2514      +/-   ##
==========================================
+ Coverage   72.48%   72.53%   +0.04%     
==========================================
  Files         390      392       +2     
  Lines       63374    63455      +81     
==========================================
+ Hits        45937    46026      +89     
+ Misses      17437    17429       -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Measures the sticky (hot) path and cross-thread wakeup path.
Used to compare event_listener vs the Mutex+Condvar+parking_lot
replacement introduced in the preceding commits.
Sort [[bench]] keys alphabetically (harness before name) to satisfy
taplo reorder_keys=true.

Clear the default peer listen endpoint in pub_config of
test_session_from_cloned_config to prevent the OS from assigning
port 38446 to pub_session's random listener, which collided with
sub_session's explicit bind on that port.
Introduce ShmSegmentPool: a per-size pool of idle POSIX SHM segments
that eliminates per-message shm_open+mmap / munmap+shm_unlink cycles
on the hot publish path.

Design:
- ShmSegmentPool stores segments in per-size ArrayQueue (crossbeam,
  bounded MPMC lock-free). Hot path (take/give after warmup) requires
  only a shared RwLock read + single CAS — no mutex waiting under
  concurrent stream access.
- PooledPosixShmSegment: Drop impl returns the segment to the pool
  instead of unmapping, keeping TLB and page-cache entries hot.
- New builder_with_pool(layout, ShmPoolConfig) constructor enables the
  pool path; existing builder() retains the traditional Talc sub-
  allocation path unchanged.
- ShmPoolConfig::max_idle_per_size (default 8) caps resident memory
  at 8 MB per provider at 1 MB payload size.
Measures the two steady-state paths of ShmSegmentPool:

  alloc/cold/1MB  ~314 µs  shm_open+ftruncate+mmap+mlock per alloc
  alloc/warm/1MB  ~134 ns  ArrayQueue pop+push (two CAS, no syscalls)

Ratio: ~2300x. Demonstrates what pool reuse eliminates on the hot
publish path at 1 MB payload size.
@YuanYuYuan YuanYuYuan added breaking-change Indicates that the issue implies a breaking change (be it at compile time or at runtime) new feature Something new is needed and removed breaking-change Indicates that the issue implies a breaking change (be it at compile time or at runtime) labels Mar 24, 2026
@YuanYuYuan YuanYuYuan removed the enhancement Existing things could work better label Mar 24, 2026
@YuanYuYuan YuanYuYuan requested a review from yellowhatter March 24, 2026 08:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new feature Something new is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant