I/O Performance Epic
Consolidated from #133 (FileHint + O_DIRECT + prefetching) + #173 (rate limiter).
Problem
- The OS page cache double-caches data already held in the lsm-tree BlockCache; cold operations (compaction, scans) pollute it
- readseq is 0.65× RocksDB (4.4M vs 6.8M ops/s) because there is no prefetching
- No O_DIRECT support — can't bypass the page cache for compaction I/O
- Compaction runs at full I/O speed and can saturate the disk, starving point reads (P99 spikes)
Solution (4 phases)
Phase 1: FileHint enum + posix_fadvise (2d)
```rust
pub enum FileHint {
    Default,    // Normal caching (point reads)
    Sequential, // POSIX_FADV_SEQUENTIAL (scans, compaction reads)
    WriteOnce,  // POSIX_FADV_DONTNEED after write (compaction output, flush)
    Random,     // POSIX_FADV_RANDOM (point-read SST files — disable readahead)
}
```

- Add `hint()` method to `FsFile` trait
- Platform-specific: posix_fadvise (Linux), fcntl F_RDADVISE (macOS), no-op (Windows)
- Apply Sequential hint on compaction input files
- Apply WriteOnce on compaction output + flush output
- Apply Random on SST files opened for point reads (disable kernel readahead)
- Increase compaction readahead buffer from 32KB → 2MB default
Impact: +10-15% compaction throughput, reduced page cache pollution.
Upstream reference: fjall-rs#57 (POSIX_FADV_RANDOM).
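As a sketch of the hint-to-advice mapping, assuming the standard Linux `POSIX_FADV_*` constant values from `<fcntl.h>` (the function name `advice_for` is illustrative, not part of the planned API):

```rust
/// Hypothetical sketch: map a FileHint to the Linux posix_fadvise advice
/// value. The constants below are the standard Linux definitions.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum FileHint {
    Default,    // Normal caching (point reads)
    Sequential, // Scans, compaction reads
    WriteOnce,  // Drop from cache after write (compaction output, flush)
    Random,     // Point-read SST files — disable kernel readahead
}

// Linux <fcntl.h> advice constants.
const POSIX_FADV_NORMAL: i32 = 0;
const POSIX_FADV_RANDOM: i32 = 1;
const POSIX_FADV_SEQUENTIAL: i32 = 2;
const POSIX_FADV_DONTNEED: i32 = 4;

/// Advice to pass to posix_fadvise when the file is opened (for WriteOnce,
/// the DONTNEED call would be issued after the data has been written out).
fn advice_for(hint: FileHint) -> i32 {
    match hint {
        FileHint::Default => POSIX_FADV_NORMAL,
        FileHint::Sequential => POSIX_FADV_SEQUENTIAL,
        FileHint::WriteOnce => POSIX_FADV_DONTNEED,
        FileHint::Random => POSIX_FADV_RANDOM,
    }
}
```

On macOS the same hints would translate to `fcntl` with `F_RDADVISE`/`F_NOCACHE`, and on Windows the mapping is a no-op, matching the platform matrix above.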
Phase 2: O_DIRECT + AlignedBuf (2d)
```rust
pub struct AlignedBuf {
    ptr: *mut u8,
    len: usize,
    capacity: usize,
    alignment: usize, // 4096 for O_DIRECT
}
```

- `O_DIRECT` open flag support in Fs trait
- `AlignedBuf` wrapper: alloc aligned to 4096, read/write aligned to 512
- Config: `direct_io: bool` (default false, opt-in)
- Apply to compaction I/O first (biggest win — bypasses page cache entirely)
Impact: +5-10% on write-heavy workloads (eliminates double-buffering).
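A minimal sketch of the `AlignedBuf` allocation using `std::alloc` with an explicit `Layout`; the constructor and accessors here are assumed, only the field layout comes from the struct above. O_DIRECT requires the buffer address (and typically length and file offset) to satisfy the alignment, so capacity is rounded up to a multiple of it:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Sketch of an alignment-aware buffer for O_DIRECT I/O.
pub struct AlignedBuf {
    ptr: *mut u8,
    len: usize,       // bytes of valid data (unused in this sketch)
    capacity: usize,  // always a multiple of `alignment`
    alignment: usize, // 4096 for O_DIRECT
}

impl AlignedBuf {
    pub fn new(capacity: usize, alignment: usize) -> Self {
        assert!(alignment.is_power_of_two() && capacity > 0);
        // Round capacity up so full-buffer writes stay O_DIRECT-legal.
        let capacity = (capacity + alignment - 1) / alignment * alignment;
        let layout = Layout::from_size_align(capacity, alignment).unwrap();
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        Self { ptr, len: 0, capacity, alignment }
    }

    pub fn capacity(&self) -> usize {
        self.capacity
    }

    /// True if the allocation meets the requested alignment.
    pub fn is_aligned(&self) -> bool {
        self.ptr as usize % self.alignment == 0
    }

    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.capacity) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        let layout = Layout::from_size_align(self.capacity, self.alignment).unwrap();
        unsafe { dealloc(self.ptr, layout) };
    }
}
```

`Layout::from_size_align` guarantees the allocator honors the 4096-byte alignment, which is what makes the buffer usable for direct reads without a copy.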
Phase 3: Adaptive block prefetching (3d)
Three sub-phases:
- 3.1: OS-level readahead via Fs trait (posix_fadvise WILLNEED) — detects sequential access after 2 consecutive reads, exponential growth
- 3.2: User-space prefetch buffer for O_DIRECT mode — when OS readahead is unavailable
- 3.3: Double-buffering with async I/O — overlap compute and I/O
RocksDB reference: FilePrefetchBuffer with sequential detection, exponential growth, double-buffering.
Impact: Phase 3.1: +20-30% readseq, Phase 3.2: +10-15%, Phase 3.3: +5-10%. Combined target: 0.85-0.95× RocksDB readseq.
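The sequential-detection logic can be sketched as below, in the spirit of RocksDB's FilePrefetchBuffer. The type name `ReadaheadState` and the exact thresholds (2 consecutive sequential reads before prefetching, 32 KiB initial window doubling to a 2 MiB cap) are assumptions for illustration:

```rust
/// Sketch: detect sequential access and grow the readahead window
/// exponentially; reset on any non-sequential read.
pub struct ReadaheadState {
    next_expected_offset: u64,
    consecutive_seq_reads: u32,
    readahead_size: u64,
}

const INITIAL_READAHEAD: u64 = 32 * 1024;      // assumed starting window
const MAX_READAHEAD: u64 = 2 * 1024 * 1024;    // assumed cap

impl ReadaheadState {
    pub fn new() -> Self {
        Self {
            next_expected_offset: 0,
            consecutive_seq_reads: 0,
            readahead_size: INITIAL_READAHEAD,
        }
    }

    /// Record a read at `offset` of `len` bytes; returns how many bytes to
    /// prefetch past the read (0 until the pattern looks sequential).
    /// Note: the very first read at offset 0 counts toward the streak.
    pub fn on_read(&mut self, offset: u64, len: u64) -> u64 {
        if offset == self.next_expected_offset {
            self.consecutive_seq_reads += 1;
        } else {
            // Random access: reset detection and shrink the window again.
            self.consecutive_seq_reads = 0;
            self.readahead_size = INITIAL_READAHEAD;
        }
        self.next_expected_offset = offset + len;

        if self.consecutive_seq_reads >= 2 {
            let prefetch = self.readahead_size;
            // Exponential growth up to the cap.
            self.readahead_size = (self.readahead_size * 2).min(MAX_READAHEAD);
            prefetch
        } else {
            0
        }
    }
}
```

In sub-phase 3.1 the returned byte count would feed a posix_fadvise WILLNEED call; in 3.2 (O_DIRECT) it would size the user-space prefetch buffer instead.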
Phase 4: Compaction I/O rate limiter (2d) — from #173
RocksDB implements priority-based I/O rate limiting (util/rate_limiter.cc:122-195) to prevent compaction from starving user reads.
```rust
pub struct RateLimiter {
    rate_bytes_per_sec: AtomicU64,
    refill_period: Duration, // default 100ms
    available_bytes: AtomicI64,
    queues: [Mutex<VecDeque<Waker>>; 3], // Low, High, User
}

pub enum IoPriority {
    Low,  // Compaction reads/writes
    High, // Flush (memtable → SST)
    User, // Point reads, range scans
}
```

- Compaction worker calls `rate_limiter.request(bytes, IoPriority::Low)` before each I/O
- If budget exhausted → sleep until next refill period
- Higher-priority requests drain first
- Optional auto-tune: adjust rate based on drain frequency
- Config: `compaction_rate_limit: u64` (bytes/sec, 0 = unlimited)
Impact: P99 latency stability under compaction. The goal is tail-latency predictability, not throughput.
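The token-bucket arithmetic behind the limiter can be sketched single-threaded (priority queues and wakers omitted; the type name `TokenBucket` and the fixed 100ms refill are assumptions matching the struct above):

```rust
use std::time::Duration;

/// Single-threaded sketch of the token-bucket refill/charge logic.
/// The real limiter would use atomics, park waiters per priority,
/// and refill on a monotonic clock.
pub struct TokenBucket {
    rate_bytes_per_sec: u64,
    refill_period: Duration, // 100ms → 10 refills per second
    available_bytes: i64,
}

impl TokenBucket {
    pub fn new(rate_bytes_per_sec: u64) -> Self {
        // Start with one refill period's worth of budget.
        let per_refill = (rate_bytes_per_sec / 10) as i64;
        Self {
            rate_bytes_per_sec,
            refill_period: Duration::from_millis(100),
            available_bytes: per_refill,
        }
    }

    /// Called once per elapsed refill period.
    pub fn refill(&mut self) {
        let per_refill = (self.rate_bytes_per_sec / 10) as i64;
        // Cap the budget so idle periods don't accumulate a huge burst.
        self.available_bytes = (self.available_bytes + per_refill).min(per_refill);
    }

    /// Try to charge `bytes` against the budget. Ok(()) if within budget,
    /// Err(wait) with the refill period as a back-off hint otherwise.
    pub fn request(&mut self, bytes: u64) -> Result<(), Duration> {
        if self.available_bytes >= bytes as i64 {
            self.available_bytes -= bytes as i64;
            Ok(())
        } else {
            Err(self.refill_period)
        }
    }
}
```

At 10 MiB/s each 100ms refill grants 1 MiB of budget, so a compaction worker issuing 512 KiB reads gets two through per period and then sleeps, which is exactly the smoothing that keeps user-read P99 stable.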
Files
- `src/fs/mod.rs` — FileHint trait extension
- `src/fs/std_fs.rs` — Platform-specific fadvise/fcntl
- `src/fs/aligned_buf.rs` — NEW: AlignedBuf for O_DIRECT
- `src/rate_limiter.rs` — NEW: token bucket rate limiter
- `src/table/block/mod.rs` — Apply hints on block read
- `src/table/scanner.rs` — Prefetch buffer integration
- `src/table/iter.rs` — Sequential access detection
- `src/run_reader.rs` — Multi-table prefetch coordination
- `src/compaction/worker.rs` — Apply WriteOnce hint + rate limiter integration
- `src/config/mod.rs` — rate limit + direct_io configuration
Depends on
- feat: Fs trait — filesystem abstraction for pluggable I/O backends #75 (Fs trait) ✅ DONE
- refactor: replace std::fs with Fs trait across all I/O call sites #76 (refactor I/O to use Fs) ✅ DONE
Estimate
Total: 9d (Phase 1: 2d, Phase 2: 2d, Phase 3: 3d, Phase 4: 2d)