
perf(fs): I/O performance epic — FileHint + O_DIRECT + prefetching + rate limiter #133

@polaz

Description

I/O Performance Epic

Consolidated from #133 (FileHint + O_DIRECT + prefetching) and #173 (rate limiter).


Problem

  1. The OS page cache double-caches data already held in the lsm-tree BlockCache, and cold operations (compaction, scans) pollute it
  2. readseq runs at 0.65× RocksDB (4.4M vs 6.8M ops/s) because there is no prefetching
  3. No O_DIRECT support — can't bypass page cache for compaction I/O
  4. Compaction runs at full I/O speed — can saturate disk, starving point reads (P99 spikes)

Solution (4 phases)

Phase 1: FileHint enum + posix_fadvise (2d)

pub enum FileHint {
    Default,       // Normal caching (point reads)
    Sequential,    // POSIX_FADV_SEQUENTIAL (scans, compaction reads)
    WriteOnce,     // POSIX_FADV_DONTNEED after write (compaction output, flush)
    Random,        // POSIX_FADV_RANDOM (point-read SST files — disable readahead)
}
  • Add hint() method to FsFile trait
  • Platform-specific: posix_fadvise (Linux), fcntl F_RDADVISE (macOS), no-op (Windows)
  • Apply Sequential hint on compaction input files
  • Apply WriteOnce on compaction output + flush output
  • Apply Random on SST files opened for point reads (disable kernel readahead)
  • Increase compaction readahead buffer from 32KB → 2MB default

Impact: +10-15% compaction throughput, reduced page cache pollution.

Upstream reference: fjall-rs#57 (POSIX_FADV_RANDOM).
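A minimal sketch of how the FileHint variants could map onto posix_fadvise on Linux. The `apply_hint` free function and the inlined constants are illustrative (the epic puts this behind a `hint()` method on the FsFile trait, and real code would go through the libc crate):

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Declared here directly so the sketch needs no crates; constants are the
// Linux <fcntl.h> values.
extern "C" {
    fn posix_fadvise(fd: i32, offset: i64, len: i64, advice: i32) -> i32;
}
const POSIX_FADV_NORMAL: i32 = 0;
const POSIX_FADV_RANDOM: i32 = 1;
const POSIX_FADV_SEQUENTIAL: i32 = 2;
const POSIX_FADV_DONTNEED: i32 = 4;

pub enum FileHint {
    Default,
    Sequential,
    WriteOnce,
    Random,
}

// Hypothetical helper; the final API lives on the FsFile trait.
pub fn apply_hint(file: &File, hint: FileHint) -> io::Result<()> {
    let advice = match hint {
        FileHint::Default => POSIX_FADV_NORMAL,
        FileHint::Sequential => POSIX_FADV_SEQUENTIAL,
        // issued after compaction/flush output is written and synced
        FileHint::WriteOnce => POSIX_FADV_DONTNEED,
        FileHint::Random => POSIX_FADV_RANDOM,
    };
    // offset = 0, len = 0 advises the whole file
    let ret = unsafe { posix_fadvise(file.as_raw_fd(), 0, 0, advice) };
    if ret == 0 {
        Ok(())
    } else {
        // posix_fadvise returns the error number directly (not via errno)
        Err(io::Error::from_raw_os_error(ret))
    }
}
```

On macOS the same entry point would dispatch to fcntl F_RDADVISE, and on Windows it would be a no-op, per the platform matrix above.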

Phase 2: O_DIRECT + AlignedBuf (2d)

pub struct AlignedBuf {
    ptr: *mut u8,
    len: usize,
    capacity: usize,
    alignment: usize, // 4096 for O_DIRECT
}
  • O_DIRECT open flag support in Fs trait
  • AlignedBuf wrapper: alloc aligned to 4096, read/write aligned to 512
  • Config: direct_io: bool (default false, opt-in)
  • Apply to compaction I/O first (biggest win — bypasses page cache entirely)

Impact: +5-10% on write-heavy workloads (eliminates double-buffering).
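A possible shape for the aligned allocation using std::alloc directly. This sketch is simplified: it tracks only capacity (the struct above also tracks a logical len), and the method names are assumptions:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

pub struct AlignedBuf {
    ptr: *mut u8,
    capacity: usize,
    alignment: usize, // 4096 for O_DIRECT
}

impl AlignedBuf {
    pub fn new(capacity: usize, alignment: usize) -> Self {
        // O_DIRECT needs the buffer start AND the transfer length aligned
        assert!(alignment.is_power_of_two());
        assert!(capacity > 0 && capacity % alignment == 0);
        let layout = Layout::from_size_align(capacity, alignment).unwrap();
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "aligned allocation failed");
        Self { ptr, capacity, alignment }
    }

    pub fn as_ptr(&self) -> *const u8 {
        self.ptr
    }

    pub fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.capacity) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // must free with the same layout used to allocate
        let layout = Layout::from_size_align(self.capacity, self.alignment).unwrap();
        unsafe { dealloc(self.ptr, layout) };
    }
}
```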

Phase 3: Adaptive block prefetching (3d)

Three sub-phases:

  1. OS-level readahead via Fs trait (posix_fadvise WILLNEED) — detects sequential access after 2 consecutive reads, exponential growth
  2. User-space prefetch buffer for O_DIRECT mode — when OS readahead unavailable
  3. Double-buffering with async I/O — overlap compute and I/O

RocksDB reference: FilePrefetchBuffer with sequential detection, exponential growth, double-buffering.

Impact: Phase 3.1: +20-30% readseq, Phase 3.2: +10-15%, Phase 3.3: +5-10%. Combined target: 0.85-0.95× RocksDB readseq.
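The sequential-detection logic of sub-phase 1 could look roughly like this; RocksDB's FilePrefetchBuffer is the model, but the struct, thresholds, and window sizes here are assumptions for illustration:

```rust
// Tracks the previous read and grows a readahead window once the
// access pattern is confirmed sequential.
struct ReadaheadState {
    last: Option<(u64, u64)>, // (offset, len) of the previous read
    consecutive: u32,
    readahead: u64,
}

const INITIAL_READAHEAD: u64 = 8 * 1024;       // assumed starting window
const MAX_READAHEAD: u64 = 2 * 1024 * 1024;    // assumed cap

impl ReadaheadState {
    fn new() -> Self {
        Self { last: None, consecutive: 0, readahead: 0 }
    }

    /// Returns how many bytes to prefetch past the current read,
    /// or 0 while the access pattern still looks random.
    fn on_read(&mut self, offset: u64, len: u64) -> u64 {
        let sequential = matches!(self.last, Some((o, l)) if offset == o + l);
        if sequential {
            self.consecutive += 1;
        } else {
            // pattern broken: reset detection and the readahead window
            self.consecutive = 0;
            self.readahead = 0;
        }
        self.last = Some((offset, len));
        if self.consecutive >= 2 {
            // sequential confirmed after 2 consecutive reads; grow exponentially
            self.readahead = if self.readahead == 0 {
                INITIAL_READAHEAD
            } else {
                (self.readahead * 2).min(MAX_READAHEAD)
            };
            self.readahead
        } else {
            0
        }
    }
}
```

In sub-phase 1 the returned window would feed a POSIX_FADV_WILLNEED call; in sub-phase 2 (O_DIRECT) it would size the user-space prefetch buffer instead.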

Phase 4: Compaction I/O rate limiter (2d) — from #173

RocksDB implements priority-based I/O rate limiting (util/rate_limiter.cc:122-195) to prevent compaction from starving user reads.

pub struct RateLimiter {
    rate_bytes_per_sec: AtomicU64,
    refill_period: Duration,      // default 100ms
    available_bytes: AtomicI64,
    queues: [Mutex<VecDeque<Waker>>; 3], // Low, High, User
}

pub enum IoPriority {
    Low,    // Compaction reads/writes
    High,   // Flush (memtable → SST)
    User,   // Point reads, range scans
}
  • Compaction worker calls rate_limiter.request(bytes, IoPriority::Low) before each I/O
  • If budget exhausted → sleep until next refill period
  • Higher priority requests drain first
  • Optional auto-tune: adjust rate based on drain frequency
  • Config: compaction_rate_limit: u64 (bytes/sec, 0 = unlimited)

Impact: P99 latency stability under compaction. Not throughput — tail latency predictability.
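A blocking, single-priority token-bucket sketch of the request path. This is illustrative only: the design above adds the three per-priority waker queues and atomic counters on top of this refill arithmetic:

```rust
use std::time::{Duration, Instant};

struct RateLimiter {
    rate_bytes_per_sec: u64,
    refill_period: Duration, // default 100ms, as in the struct above
    available: i64,
    last_refill: Instant,
}

impl RateLimiter {
    fn new(rate_bytes_per_sec: u64) -> Self {
        Self {
            rate_bytes_per_sec,
            refill_period: Duration::from_millis(100),
            available: 0,
            last_refill: Instant::now(),
        }
    }

    /// Budget added per refill period.
    fn per_period(&self) -> i64 {
        (self.rate_bytes_per_sec as u128 * self.refill_period.as_millis() / 1000) as i64
    }

    /// Blocks until `bytes` of budget is available, then consumes it.
    fn request(&mut self, bytes: u64) {
        // this simplified version can't satisfy oversized single requests
        assert!(bytes as i64 <= self.per_period(), "request exceeds one period's budget");
        loop {
            self.refill();
            if self.available >= bytes as i64 {
                self.available -= bytes as i64;
                return;
            }
            std::thread::sleep(self.refill_period);
        }
    }

    fn refill(&mut self) {
        if self.last_refill.elapsed() >= self.refill_period {
            // cap the bucket at one period's budget to bound bursts
            self.available = (self.available + self.per_period()).min(self.per_period());
            self.last_refill = Instant::now();
        }
    }
}
```

The compaction worker would call this before each read/write with IoPriority::Low; the priority queues then let High and User requests drain the shared budget first.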

Files

  • src/fs/mod.rs — FileHint trait extension
  • src/fs/std_fs.rs — Platform-specific fadvise/fcntl
  • src/fs/aligned_buf.rs — NEW: AlignedBuf for O_DIRECT
  • src/rate_limiter.rs — NEW: token bucket rate limiter
  • src/table/block/mod.rs — Apply hints on block read
  • src/table/scanner.rs — Prefetch buffer integration
  • src/table/iter.rs — Sequential access detection
  • src/run_reader.rs — Multi-table prefetch coordination
  • src/compaction/worker.rs — Apply WriteOnce hint + rate limiter integration
  • src/config/mod.rs — rate limit + direct_io configuration

Estimate

Total: 9d (Phase 1: 2d, Phase 2: 2d, Phase 3: 3d, Phase 4: 2d)

Labels

  • compaction: Compaction logic, leveled/tiered strategy
  • enhancement: New feature, new API, new capability
  • fs-trait: Filesystem abstraction, io_uring, per-level routing
  • performance: Optimization, reduced allocations, faster path
