feat: add range tombstones (delete_range / delete_prefix) #21
Conversation
- Add RangeTombstone, ActiveTombstoneSet, IntervalTree core types (ported from upstream PR fjall-rs#242 algorithms, own SST persistence)
- Add RangeTombstoneFilter for bidirectional iteration suppression
- Integrate into Memtable with interval tree for O(log n) queries
- Add remove_range(start, end, seqno) and remove_prefix(prefix, seqno) to AbstractTree trait, implemented for Tree and BlobTree
- Wire suppression into point reads (memtable + sealed + SST layers)
- Wire RangeTombstoneFilter into range/prefix iteration pipeline
- SST persistence: raw wire format in BlockType::RangeTombstone block
- Flush: collect range tombstones from sealed memtables, write to SSTs
- Compaction: propagate RTs from input to output tables with clipping
- GC: evict range tombstones below watermark at bottom level
- Table-skip: skip tables fully covered by a range tombstone
- MultiWriter: clip RTs to each output table's key range on rotation
- Handle RT-only tables (derive key range from tombstone bounds)
- 16 integration tests covering all paths

Closes #16
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds end-to-end range-tombstone support: new RangeTombstone types and filters, interval-tree-backed memtable storage and active trackers, SST on-disk tombstone blocks and writer distribution, flush/compaction wiring, read-path suppression, public remove_range/remove_prefix APIs, tests, and architecture/known-gaps docs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Tree
    participant Memtable
    participant Flusher as Flush/Writer
    participant SST as SST Files
    participant Compactor
    rect rgba(200,230,255,0.5)
        Client->>Tree: remove_range(start,end,seq)
        Tree->>Memtable: insert_range_tombstone(start,end,seq)
    end
    rect rgba(220,255,200,0.5)
        Client->>Tree: read / iterate (read_seqno)
        Tree->>Memtable: is_suppressed_by_range_tombstones(key, read_seqno)
        Note over Tree,Memtable: suppression may consult sealed memtables and SSTs
    end
    rect rgba(255,240,200,0.5)
        Flusher->>Memtable: flush_to_tables_with_rt(stream, tombstones)
        Flusher->>SST: write_range_tombstone block into table
    end
    rect rgba(255,220,240,0.5)
        Compactor->>SST: collect range tombstones from inputs
        Compactor->>SST: write surviving tombstones to output tables
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
🚥 Pre-merge checks: ✅ 5 passed
Pull request overview
Adds first-class range tombstones to the LSM-tree to enable efficient range/prefix deletes (remove_range, remove_prefix) with MVCC-aware visibility, and persists tombstones through flush/compaction/recovery via a new SST block and table metadata.
Changes:
- Introduces core range-tombstone types and in-memory structures (interval tree in memtables; sweep-line active sets + iterator filter).
- Extends flush/compaction/table IO to write/read range tombstones from SSTs (BlockType::RangeTombstone, TOC region, metadata count).
- Adds integration tests validating suppression semantics and persistence across flush/compaction/recovery.
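As a rough model of the semantics described above (half-open key intervals with MVCC-aware visibility), here is a self-contained sketch. The type and method names echo the PR's, but the bodies are illustrative rather than the crate's actual code, and the exclusive read-snapshot boundary follows the visible_at convention discussed later in this thread:

```rust
// Minimal model of an MVCC-aware range tombstone: a half-open key
// interval [start, end) stamped with the sequence number of the delete.
#[derive(Debug, Clone, PartialEq)]
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>, // exclusive
    seqno: u64,
}

impl RangeTombstone {
    fn new(start: Vec<u8>, end: Vec<u8>, seqno: u64) -> Self {
        debug_assert!(start < end);
        Self { start, end, seqno }
    }

    /// A key written at `key_seqno` is suppressed for a reader at
    /// `read_seqno` if it falls inside the interval, was written before
    /// the delete, and the delete is visible to the reader.
    fn should_suppress(&self, key: &[u8], key_seqno: u64, read_seqno: u64) -> bool {
        self.start.as_slice() <= key
            && key < self.end.as_slice()
            && key_seqno < self.seqno
            && self.seqno < read_seqno // exclusive snapshot boundary
    }
}

fn main() {
    let rt = RangeTombstone::new(b"b".to_vec(), b"d".to_vec(), 10);
    // "c" written at seqno 5 is deleted for readers past seqno 10...
    assert!(rt.should_suppress(b"c", 5, 11));
    // ...but not for a snapshot taken at or before the delete,
    assert!(!rt.should_suppress(b"c", 5, 10));
    // and keys outside [b, d) are never affected.
    assert!(!rt.should_suppress(b"d", 5, 11));
    println!("ok");
}
```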
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration tests for range tombstone behavior and persistence. |
| src/abstract_tree.rs | Adds remove_range API and remove_prefix helper; extends flush path to pass tombstones. |
| src/tree/mod.rs | Tree integration: point-read suppression check + flush now accepts tombstones. |
| src/blob_tree/mod.rs | BlobTree integration: passes tombstones through flush; exposes remove_range. |
| src/memtable/mod.rs | Stores memtable range tombstones in a mutex-protected interval tree and exposes query/flush helpers. |
| src/memtable/interval_tree.rs | AVL interval tree backing for memtable tombstone queries. |
| src/range_tombstone.rs | Defines RangeTombstone, clipping/cover helpers, and key upper-bound helper. |
| src/active_tombstone_set.rs | Sweep-line “active tombstone” trackers for forward/reverse iteration. |
| src/range_tombstone_filter.rs | Bidirectional iterator wrapper to suppress entries covered by visible range tombstones. |
| src/range.rs | Collects tombstones across layers, adds table-skip optimization, and wraps iterators with tombstone filter. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/writer/mod.rs | Writes range tombstone blocks + metadata; supports RT-only tables by deriving key range. |
| src/table/regions.rs | Adds range_tombstones section handle in parsed table regions. |
| src/table/multi_writer.rs | Distributes/clips tombstones per output table via set_range_tombstones. |
| src/table/inner.rs | Stores loaded range tombstones in table inner state. |
| src/table/mod.rs | Loads/decodes range tombstone block on table recovery; exposes range_tombstones(). |
| src/compaction/flavour.rs | Extends compaction flavour to accept output range tombstones. |
| src/compaction/worker.rs | Collects tombstones from input tables and propagates them to compaction outputs (with last-level GC filtering). |
| src/lib.rs | Registers new internal modules for range tombstone support. |
| .forge/16/known-gaps.md | Documents remaining limitations and known gaps for Issue #16. |
| .forge/16/analysis.md | Architecture/design analysis and rationale for the chosen approach. |
Actionable comments posted: 9
🧹 Nitpick comments (2)
src/table/mod.rs (2)
570-577: Consider validating the block type for range tombstones.

The recovery path loads the range tombstone block with `CompressionType::None` but doesn't validate that the loaded block has `BlockType::RangeTombstone`. This differs from other block loading paths (e.g., filter block at lines 553-560) that verify the block type.

♻️ Proposed fix

```diff
 let range_tombstones = if let Some(rt_handle) = regions.range_tombstones {
     log::trace!("Loading range tombstone block, with rt_ptr={rt_handle:?}");
     let block = Block::from_file(&file, rt_handle, crate::CompressionType::None)?;
+    if block.header.block_type != BlockType::RangeTombstone {
+        return Err(crate::Error::InvalidTag((
+            "BlockType",
+            block.header.block_type.into(),
+        )));
+    }
     Self::decode_range_tombstones(block)?
 } else {
     Vec::new()
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/mod.rs` around lines 570 - 577, The range tombstone loading path doesn't validate the loaded block type; when regions.range_tombstones is Some you call Block::from_file(..., crate::CompressionType::None) and then Self::decode_range_tombstones(block) without checking block.block_type. Update the code to verify the loaded block's type equals BlockType::RangeTombstone (or return an error if not) before calling Self::decode_range_tombstones; reference the regions.range_tombstones branch, the Block::from_file call, crate::CompressionType::None, and Self::decode_range_tombstones to locate and modify the logic.
620-660: Consider validating tombstone invariant `start < end` during decoding.

`RangeTombstone::new` has a `debug_assert!(start < end)` that only fires in debug builds. Corrupted on-disk data could produce invalid tombstones with `start >= end` in release builds, potentially causing incorrect behavior in the interval tree or filtering logic.

🛡️ Proposed validation

```diff
 tombstones.push(RangeTombstone::new(
     UserKey::from(start_buf),
     UserKey::from(end_buf),
     seqno,
 ));
+// Validate tombstone invariant
+if let Some(rt) = tombstones.last() {
+    if rt.start >= rt.end {
+        return Err(crate::Error::Unrecoverable);
+    }
+}
```

Or validate before constructing:

```diff
+let start_key = UserKey::from(start_buf);
+let end_key = UserKey::from(end_buf);
+if start_key >= end_key {
+    return Err(crate::Error::Unrecoverable);
+}
 tombstones.push(RangeTombstone::new(
-    UserKey::from(start_buf),
-    UserKey::from(end_buf),
+    start_key,
+    end_key,
     seqno,
 ));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/mod.rs` around lines 620 - 660, The decoder (decode_range_tombstones) must validate the invariant start < end after reading start_buf and end_buf and before calling RangeTombstone::new; if the condition fails, return an appropriate crate::Error (e.g., a corruption/unrecoverable error) instead of constructing the tombstone so release builds don't accept invalid on-disk data; update the logic around UserKey::from(start_buf)/UserKey::from(end_buf) to compare the keys and early-return Err(...) when start >= end.
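For illustration, here is a standalone sketch of a decoder enforcing both reviewer suggestions, bounds-checked lengths and the `start < end` invariant, against the wire format quoted in this review ([start_len][start][end_len][end][seqno], little-endian). The field widths and the string error type are assumptions for the sketch, not the crate's actual definitions:

```rust
use std::convert::TryInto;

/// Advance `pos` by `n` bytes, failing cleanly on truncated input
/// instead of panicking (or over-allocating on a corrupt length field).
fn read_slice<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Result<&'a [u8], &'static str> {
    let end = pos.checked_add(n).ok_or("length overflow")?;
    let s = buf.get(*pos..end).ok_or("truncated block")?;
    *pos = end;
    Ok(s)
}

/// Decode one tombstone from the wire format sketched in the review:
/// [start_len: u16 LE][start][end_len: u16 LE][end][seqno: u64 LE].
fn decode_one(buf: &[u8]) -> Result<(Vec<u8>, Vec<u8>, u64), &'static str> {
    let mut pos = 0usize;
    let start_len = u16::from_le_bytes(read_slice(buf, &mut pos, 2)?.try_into().unwrap()) as usize;
    let start = read_slice(buf, &mut pos, start_len)?.to_vec();
    let end_len = u16::from_le_bytes(read_slice(buf, &mut pos, 2)?.try_into().unwrap()) as usize;
    let end = read_slice(buf, &mut pos, end_len)?.to_vec();
    let seqno = u64::from_le_bytes(read_slice(buf, &mut pos, 8)?.try_into().unwrap());

    // The invariant the review asks to enforce in release builds too.
    if start >= end {
        return Err("corrupt range tombstone: start >= end");
    }
    Ok((start, end, seqno))
}

fn main() {
    let mut good = Vec::new();
    good.extend_from_slice(&1u16.to_le_bytes());
    good.push(b'a');
    good.extend_from_slice(&1u16.to_le_bytes());
    good.push(b'z');
    good.extend_from_slice(&7u64.to_le_bytes());
    assert_eq!(decode_one(&good), Ok((b"a".to_vec(), b"z".to_vec(), 7)));

    // Reversed interval: must be rejected, not silently accepted.
    let mut bad = good.clone();
    bad[2] = b'z';
    bad[5] = b'a';
    assert!(decode_one(&bad).is_err());
    println!("ok");
}
```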
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.forge/16/analysis.md:
- Around line 85-95: MD040 is triggered because the two fenced diff-stat blocks
in .forge/16/analysis.md lack a language tag; update the opening fences for the
two blocks that start with "origin/main vs upstream/main:" and "origin/main vs
upstream-pr-242:" to use a language tag (e.g., change ``` to ```text) so the
blocks become ```text; keep the content of the blocks unchanged.
In `@src/abstract_tree.rs`:
- Around line 722-743: The current remove_prefix implementation incorrectly
treats Bound::Unbounded by synthesizing a truncated exclusive end via
range_tombstone::upper_bound_exclusive(prefix), which doesn't form a true
lexicographic upper bound (e.g., b"\xff" still leaves b"\xff\x01"); fix by
adding an explicit unbounded-path instead of synthesizing an end: when
prefix_to_range yields hi == Bound::Unbounded, call a true unbounded-range
removal API (e.g., implement/use self.remove_range_unbounded(start, seqno) or
self.remove_range_from(start, seqno)) to remove [start, +∞), or if an unbounded
removal is not supported, explicitly reject the operation (return 0 or propagate
an error) rather than creating a truncated tombstone; update remove_prefix, and
add/modify remove_range* helper(s) accordingly (referencing remove_prefix,
prefix_to_range, range_tombstone::upper_bound_exclusive, and remove_range).
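The remove_prefix concern above hinges on a property of prefix ranges: the exclusive upper bound of a prefix is obtained by incrementing its last non-0xFF byte, and no finite bound exists when the prefix is all 0xFF bytes. A self-contained sketch (the function name is illustrative, not the crate's):

```rust
/// Smallest key strictly greater than every key carrying `prefix`,
/// or `None` when no such finite key exists (prefix is all 0xFF bytes).
fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    // Walk from the right, looking for a byte we can increment.
    let mut out = prefix.to_vec();
    while let Some(last) = out.pop() {
        if last < 0xFF {
            out.push(last + 1);
            return Some(out); // e.g. "abc" -> "abd", "a\xff" -> "b"
        }
        // Trailing 0xFF bytes are dropped: incrementing them would wrap.
    }
    None // every byte was 0xFF: the range [prefix, ..) is unbounded above
}

fn main() {
    assert_eq!(prefix_upper_bound(b"abc"), Some(b"abd".to_vec()));
    assert_eq!(prefix_upper_bound(b"a\xff"), Some(b"b".to_vec()));
    // The case the review flags: a plain truncation of 0xFF would still
    // leave keys like [0xFF, 0x01] alive; only an unbounded delete works.
    assert_eq!(prefix_upper_bound(&[0xFF]), None);
    println!("ok");
}
```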
In `@src/active_tombstone_set.rs`:
- Around line 110-123: The initialize_from method currently activates every
provided RangeTombstone via activate(&rt) even if it starts in the future
relative to the iterator's current_key; change initialize_from to only activate
truly overlapping tombstones by either (A) adding a current_key parameter (e.g.,
initialize_from(&mut self, current_key: &KeyType, tombstones: impl
IntoIterator<Item = RangeTombstone>)) and filtering each rt with rt.start <=
*current_key && *current_key < rt.end before calling activate, or (B) make
initialize_from private/internal and require callers to pre-filter by that same
overlap predicate; apply the same fix to the analogous method/block at the
second occurrence (the code referenced around lines 226–230) and ensure
expire_until usage still enforces the end bound after activation.
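Option (A) from the comment above can be sketched independently of the crate. The overlap predicate is the half-open containment check `rt.start <= current_key < rt.end`; names here mirror the review, bodies are illustrative:

```rust
#[derive(Debug, Clone)]
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>,
    seqno: u64,
}

/// Only activate tombstones whose interval actually contains the
/// iterator's current position; future-starting ones are skipped.
fn initialize_from(current_key: &[u8], tombstones: &[RangeTombstone]) -> Vec<RangeTombstone> {
    tombstones
        .iter()
        .filter(|rt| rt.start.as_slice() <= current_key && current_key < rt.end.as_slice())
        .cloned()
        .collect()
}

fn main() {
    let ts = vec![
        RangeTombstone { start: b"a".to_vec(), end: b"c".to_vec(), seqno: 1 },
        RangeTombstone { start: b"f".to_vec(), end: b"h".to_vec(), seqno: 2 }, // future
    ];
    let active = initialize_from(b"b", &ts);
    assert_eq!(active.len(), 1); // only the overlapping tombstone is active
    assert_eq!(active[0].start, b"a");
    println!("ok");
}
```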
In `@src/memtable/interval_tree.rs`:
- Around line 266-269: The insert method incorrectly increments self.len
unconditionally; change insert_node (the helper called by insert) to return
whether it actually created a new node (e.g., return a (node, bool) or a Node
plus created flag) when handling RangeTombstone duplicates, then update insert
to set self.root from the returned node and only increment self.len when the
created flag is true; update any calling sites or signatures for insert_node and
keep references to RangeTombstone, insert_node, and insert consistent.
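The accounting fix above can be illustrated with a toy tree backed by a BTreeSet. The real structure is an AVL interval tree; this sketch only shows the created-flag bookkeeping:

```rust
use std::collections::BTreeSet;

/// Toy stand-in for the memtable interval tree: tracks len explicitly
/// so the duplicate-insert accounting bug is visible.
struct IntervalTree {
    items: BTreeSet<(Vec<u8>, Vec<u8>, u64)>, // (start, end, seqno)
    len: usize,
}

impl IntervalTree {
    fn new() -> Self {
        Self { items: BTreeSet::new(), len: 0 }
    }

    /// Returns whether a new node was created; `len` is only bumped
    /// when the insert wasn't a duplicate (the review's fix).
    fn insert(&mut self, start: Vec<u8>, end: Vec<u8>, seqno: u64) -> bool {
        let created = self.items.insert((start, end, seqno));
        if created {
            self.len += 1;
        }
        created
    }
}

fn main() {
    let mut t = IntervalTree::new();
    assert!(t.insert(b"a".to_vec(), b"c".to_vec(), 1));
    assert!(!t.insert(b"a".to_vec(), b"c".to_vec(), 1)); // duplicate
    assert_eq!(t.len, 1); // an unconditional `self.len += 1` would say 2
    println!("ok");
}
```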
In `@src/memtable/mod.rs`:
- Around line 31-36: Memtables that only contain range tombstones are treated as
empty because Memtable::is_empty() only checks items; update logic so
Memtable::is_empty() also inspects range_tombstones by locking the Mutex and
calling the interval tree's is_empty (e.g., check
self.range_tombstones.lock().is_empty()), and propagate this change to callers
(Tree::rotate_memtable(), Tree::clear_active_memtable()) so they seal/flush
memtables and compute get_highest_seqno() correctly when only tombstones exist;
ensure get_highest_seqno() and any durability paths (remove_range/remove_prefix)
treat a memtable with non-empty range_tombstones as non-empty.
- Around line 183-190: The insert_range_tombstone method must reject empty or
reversed intervals before persisting: in insert_range_tombstone (and before
calling RangeTombstone::new or inserting into self.range_tombstones), validate
that start < end (using UserKey ordering) and if not, return 0 (do not
compute/record the tombstone). Move or gate the size calculation and the
lock+insert behind this check so reversed/empty ranges are never stored; keep
the rest of the method behavior unchanged.
In `@src/range_tombstone_filter.rs`:
- Around line 39-52: The constructor new() should ensure fwd_tombstones are
ordered by start ascending (and tie-break by seqno descending) before storing,
because fwd_activate_up_to() relies on forward tombstones being sorted by start;
add a sort on fwd_tombstones in new() such as sorting by (start asc, seqno desc)
prior to assigning Self.fwd_tombstones so out-of-order input Vecs can't skip
activation.
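The (start ascending, seqno descending) ordering the comment asks `new()` to establish looks like this in isolation (standalone struct, illustrative names):

```rust
#[derive(Debug, Clone, PartialEq)]
struct RangeTombstone {
    start: Vec<u8>,
    end: Vec<u8>,
    seqno: u64,
}

/// Sort so that a forward sweep sees tombstones in start-key order,
/// with the newest (highest seqno) first among equal starts.
fn sort_for_forward_sweep(tombstones: &mut [RangeTombstone]) {
    tombstones.sort_by(|a, b| {
        a.start
            .cmp(&b.start)
            .then_with(|| b.seqno.cmp(&a.seqno)) // seqno descending
    });
}

fn main() {
    let mut ts = vec![
        RangeTombstone { start: b"m".to_vec(), end: b"z".to_vec(), seqno: 3 },
        RangeTombstone { start: b"a".to_vec(), end: b"c".to_vec(), seqno: 1 },
        RangeTombstone { start: b"a".to_vec(), end: b"b".to_vec(), seqno: 9 },
    ];
    sort_for_forward_sweep(&mut ts);
    let order: Vec<(&[u8], u64)> =
        ts.iter().map(|t| (t.start.as_slice(), t.seqno)).collect();
    // Equal starts: newest first; then starts ascending.
    assert_eq!(
        order,
        vec![(b"a".as_slice(), 9), (b"a".as_slice(), 1), (b"m".as_slice(), 3)]
    );
    println!("ok");
}
```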
In `@tests/range_tombstone.rs`:
- Around line 330-352: Add a new test that writes a range tombstone then tampers
with the SST on-disk range_tombstones block and asserts recovery fails: use
get_tmp_folder(), open_tree(), remove_range(), and flush_active_memtable(0) to
create an SST with the new range tombstone format
([start_len][start][end_len][end][seqno]), then locate the produced SST file in
the folder, corrupt/truncate bytes from the range_tombstones payload (e.g.
remove bytes from the end of the block or zero out length fields), attempt to
reopen the tree with open_tree(folder.path()) and assert that reopening returns
an Err (i.e. recovery fails) rather than succeeding; keep the test focused on
both truncated and malformed length scenarios so it fails on invalid RT
encoding.
- Around line 168-176: The test currently relies on behavior that could pass
even if the tombstone never left the active memtable; modify the test around the
remove_range/rotate_memtable sequence (the block using tree.remove_range,
tree.rotate_memtable, tree.insert, and tree.get) to explicitly assert that the
tombstone was actually sealed/flushed by checking rotate_memtable() returned
Some (or that tree.sealed_memtable_count() > 0) immediately after rotation, and
similarly for the later analogous block (lines ~367-374) assert table_count()
increases or reopen the DB to force RT-only flush visibility; update the
assertions in both places so the test fails if the tombstone stayed in the
active memtable.
---
Nitpick comments:
In `@src/table/mod.rs`:
- Around line 570-577: The range tombstone loading path doesn't validate the
loaded block type; when regions.range_tombstones is Some you call
Block::from_file(..., crate::CompressionType::None) and then
Self::decode_range_tombstones(block) without checking block.block_type. Update
the code to verify the loaded block's type equals BlockType::RangeTombstone (or
return an error if not) before calling Self::decode_range_tombstones; reference
the regions.range_tombstones branch, the Block::from_file call,
crate::CompressionType::None, and Self::decode_range_tombstones to locate and
modify the logic.
- Around line 620-660: The decoder (decode_range_tombstones) must validate the
invariant start < end after reading start_buf and end_buf and before calling
RangeTombstone::new; if the condition fails, return an appropriate crate::Error
(e.g., a corruption/unrecoverable error) instead of constructing the tombstone
so release builds don't accept invalid on-disk data; update the logic around
UserKey::from(start_buf)/UserKey::from(end_buf) to compare the keys and
early-return Err(...) when start >= end.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0a4b651a-d997-4ee5-9e1a-ba1e0f088648
📒 Files selected for processing (21)
- .forge/16/analysis.md
- .forge/16/known-gaps.md
- src/abstract_tree.rs
- src/active_tombstone_set.rs
- src/blob_tree/mod.rs
- src/compaction/flavour.rs
- src/compaction/worker.rs
- src/lib.rs
- src/memtable/interval_tree.rs
- src/memtable/mod.rs
- src/range.rs
- src/range_tombstone.rs
- src/range_tombstone_filter.rs
- src/table/block/type.rs
- src/table/inner.rs
- src/table/mod.rs
- src/table/multi_writer.rs
- src/table/regions.rs
- src/table/writer/mod.rs
- src/tree/mod.rs
- tests/range_tombstone.rs
- Fix non-exhaustive BlockType match in table/util.rs (metrics feature)
- Add #[allow(dead_code)] for future-use API (CoveringRt, initialize_from)
- Add #[allow(clippy::unnecessary_box_returns)] for AVL tree functions
- Fix Option<&T> vs &Option<T> in interval_tree traversal functions
- Add missing # Panics and # Errors doc sections
- Add backticks in doc comments for identifiers
- Use let...else patterns where clippy suggests
- Fix redundant clone, pass-by-value, pub(crate) visibility
- Remove unfulfilled lint expectations (cfg-gated params)
- Validate start < end on SST decode (reject corrupted RT blocks)
- Validate start >= end in insert_range_tombstone (release-safe guard)
- Fix duplicate insert len accounting in IntervalTree
- Fix 0xFF prefix edge case in remove_prefix (return 0 for unrepresentable)
- Re-insert tombstones into active memtable when RT-only flush produces no valid SST (no index possible without KV items)
- Account for range tombstones in Memtable::is_empty
- Add cheap pre-check before collecting range tombstones in iteration
- Sort fwd_tombstones inside RangeTombstoneFilter::new for safety
- Harden test assertions (sealed count, table count)
Verify loaded block has BlockType::RangeTombstone before decoding, consistent with filter block validation pattern.
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/table/iter.rs (1)
122: Prefer `#[expect(clippy::too_many_arguments)]` over `#[allow(...)]`.

Per coding guidelines, `#[expect]` should be used instead of `#[allow]` because it warns when the suppression becomes unnecessary (e.g., if future refactoring reduces the argument count).

♻️ Suggested fix

```diff
-#[allow(clippy::too_many_arguments)]
+#[expect(clippy::too_many_arguments)]
```

As per coding guidelines: "Prefer `#[expect(lint)]` over `#[allow(lint)]` — `#[expect]` warns when suppression becomes unnecessary"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/iter.rs` at line 122, Replace the #[allow(clippy::too_many_arguments)] attribute on the function/impl annotated in this diff with #[expect(clippy::too_many_arguments)] so the suppression still applies but will warn if it becomes unnecessary; locate the exact attribute instance (the attribute decorating the function that currently uses #[allow(clippy::too_many_arguments)]) and change the attribute name only, leaving the lint and surrounding code unchanged.

src/abstract_tree.rs (1)
239-245: Consider `#[expect]` instead of `#[warn]` for lint attribute.

The `#[warn(clippy::type_complexity)]` attribute is unusual here — it would emit warnings rather than suppress them. If the intent is to suppress the lint, use `#[expect(clippy::type_complexity)]` or `#[allow(clippy::type_complexity)]`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/abstract_tree.rs` around lines 239 - 245, The attribute on function flush_to_tables uses #[warn(clippy::type_complexity)] which will emit warnings instead of suppressing them; change it to #[expect(clippy::type_complexity)] (or #[allow(clippy::type_complexity)] if you prefer permanent suppression) placed on the flush_to_tables function declaration so the clippy type_complexity lint is treated as expected/allowed for the flush_to_tables / flush_to_tables_with_rt code path.

src/range.rs (1)
101: Prefer `#[expect(clippy::too_many_lines)]` over `#[allow(...)]`.

Per coding guidelines, use `#[expect]` so you get a warning when the suppression becomes unnecessary.

♻️ Suggested fix

```diff
-#[allow(clippy::too_many_lines)]
+#[expect(clippy::too_many_lines)]
```

As per coding guidelines: "Prefer `#[expect(lint)]` over `#[allow(lint)]` — `#[expect]` warns when suppression becomes unnecessary"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/range.rs` at line 101, Replace the attribute #[allow(clippy::too_many_lines)] with #[expect(clippy::too_many_lines)] to follow the coding guideline; locate the attribute (attached to the function/module declared near the top of src/range.rs) and change the lint suppression to an expectation so you receive a warning if the suppression becomes unnecessary.

src/tree/mod.rs (1)
778-816: Performance concern: Linear scan of all SST tables for range tombstone suppression.

The current implementation iterates over every table in every level (lines 802-813) to check range tombstones. For point reads, this could be expensive with many tables.

Consider filtering tables by key range before checking their tombstones, similar to how `get_for_key` filters tables in `get_internal_entry_from_tables`. The PR notes this as a known limitation ("linear SST scan for point-read suppression").

💡 Potential optimization sketch

```diff
-for table in super_version
-    .version
-    .iter_levels()
-    .flat_map(|lvl| lvl.iter())
-    .flat_map(|run| run.iter())
-{
+// Only check tables whose key range contains the key
+for table in super_version
+    .version
+    .iter_levels()
+    .flat_map(|lvl| lvl.iter())
+    .filter_map(|run| run.get_for_key(key))
+{
     for rt in table.range_tombstones() {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tree/mod.rs` around lines 778 - 816, The is_suppressed_by_range_tombstones function currently scans every SST table; limit that by first filtering tables whose key range could contain the target key (like get_for_key / get_internal_entry_from_tables does) before iterating range_tombstones(): for each table from super_version.version.iter_levels().flat_map(...).flat_map(...), check the table's key range (e.g., smallest/largest key or a key_range() accessor on the Table) and skip tables where key is outside that range, then only call table.range_tombstones() and rt.should_suppress; keep the existing active_memtable/sealed_memtables checks and ensure you use the same table-range helpers used elsewhere to avoid duplicating range logic.
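The pruning suggested in the optimization sketch above can be modeled standalone: filter tables by key range first, then consult only the survivors' tombstones. `Table` here is a stand-in for illustration, not the crate's type:

```rust
/// Stand-in for an SST table: its key range plus the tombstones it holds.
struct Table {
    min_key: Vec<u8>,
    max_key: Vec<u8>, // inclusive
    range_tombstones: Vec<(Vec<u8>, Vec<u8>)>, // (start, end) intervals
}

impl Table {
    fn key_range_contains(&self, key: &[u8]) -> bool {
        self.min_key.as_slice() <= key && key <= self.max_key.as_slice()
    }
}

/// Only tables whose key range can contain `key` are consulted, turning
/// the linear "every table in every level" scan into a range-pruned one.
fn is_suppressed(tables: &[Table], key: &[u8]) -> bool {
    tables
        .iter()
        .filter(|t| t.key_range_contains(key))
        .flat_map(|t| t.range_tombstones.iter())
        .any(|(start, end)| start.as_slice() <= key && key < end.as_slice())
}

fn main() {
    let tables = vec![
        Table {
            min_key: b"a".to_vec(),
            max_key: b"m".to_vec(),
            range_tombstones: vec![(b"c".to_vec(), b"g".to_vec())],
        },
        Table {
            min_key: b"n".to_vec(),
            max_key: b"z".to_vec(),
            range_tombstones: vec![],
        },
    ];
    assert!(is_suppressed(&tables, b"d"));  // inside [c, g)
    assert!(!is_suppressed(&tables, b"p")); // second table pruned, no RTs
    println!("ok");
}
```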
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/table/util.rs`:
- Line 31: The attribute #[allow(clippy::too_many_arguments)] in
src/table/util.rs should be changed to #[expect(clippy::too_many_arguments)] to
comply with the repo's policy; locate the function or item annotated with
#[allow(clippy::too_many_arguments)] (the attribute on the item in util.rs) and
replace the allow attribute with #[expect(clippy::too_many_arguments)]
preserving any existing whitespace/comments and ensuring the attribute exactly
matches the expect form.
---
Nitpick comments:
In `@src/abstract_tree.rs`:
- Around line 239-245: The attribute on function flush_to_tables uses
#[warn(clippy::type_complexity)] which will emit warnings instead of suppressing
them; change it to #[expect(clippy::type_complexity)] (or
#[allow(clippy::type_complexity)] if you prefer permanent suppression) placed on
the flush_to_tables function declaration so the clippy type_complexity lint is
treated as expected/allowed for the flush_to_tables / flush_to_tables_with_rt
code path.
In `@src/range.rs`:
- Line 101: Replace the attribute #[allow(clippy::too_many_lines)] with
#[expect(clippy::too_many_lines)] to follow the coding guideline; locate the
attribute (attached to the function/module declared near the top of
src/range.rs) and change the lint suppression to an expectation so you receive a
warning if the suppression becomes unnecessary.
In `@src/table/iter.rs`:
- Line 122: Replace the #[allow(clippy::too_many_arguments)] attribute on the
function/impl annotated in this diff with #[expect(clippy::too_many_arguments)]
so the suppression still applies but will warn if it becomes unnecessary; locate
the exact attribute instance (the attribute decorating the function that
currently uses #[allow(clippy::too_many_arguments)]) and change the attribute
name only, leaving the lint and surrounding code unchanged.
In `@src/tree/mod.rs`:
- Around line 778-816: The is_suppressed_by_range_tombstones function currently
scans every SST table; limit that by first filtering tables whose key range
could contain the target key (like get_for_key / get_internal_entry_from_tables
does) before iterating range_tombstones(): for each table from
super_version.version.iter_levels().flat_map(...).flat_map(...), check the
table's key range (e.g., smallest/largest key or a key_range() accessor on the
Table) and skip tables where key is outside that range, then only call
table.range_tombstones() and rt.should_suppress; keep the existing
active_memtable/sealed_memtables checks and ensure you use the same table-range
helpers used elsewhere to avoid duplicating range logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b5037095-c6a0-47c3-a880-07c3d65eaa33
📒 Files selected for processing (14)
- .forge/16/analysis.md
- src/abstract_tree.rs
- src/active_tombstone_set.rs
- src/memtable/interval_tree.rs
- src/memtable/mod.rs
- src/range.rs
- src/range_tombstone.rs
- src/range_tombstone_filter.rs
- src/table/iter.rs
- src/table/mod.rs
- src/table/util.rs
- src/table/writer/mod.rs
- src/tree/mod.rs
- tests/range_tombstone.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/range_tombstone.rs
Pull request overview
Adds MVCC-aware range tombstones to the LSM-tree, enabling efficient contiguous deletions via remove_range / remove_prefix, and persists/propagates tombstones through flush + compaction + recovery.
Changes:
- Introduces core range-tombstone data structures (RangeTombstone, memtable IntervalTree, sweep-line ActiveTombstoneSet, and RangeTombstoneFilter).
- Extends read/write paths to apply range-tombstone suppression (point reads, iterators, flush, compaction, and SST recovery).
- Adds integration + regression tests for correctness across memtables/SSTs/compaction/recovery.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration tests covering range tombstone behavior across layers and lifecycle events. |
| src/abstract_tree.rs | Adds remove_range API and default remove_prefix; extends flush to pass RTs through. |
| src/tree/mod.rs | Implements remove_range; applies RT suppression on point reads; flush now supports RT distribution. |
| src/blob_tree/mod.rs | Forwards range deletes through BlobTree to underlying index tree; flush now supports RT distribution. |
| src/memtable/mod.rs | Stores RTs in a mutex-protected interval tree; updates emptiness + adds RT query/insert APIs. |
| src/memtable/interval_tree.rs | New AVL interval tree for efficient RT overlap/suppression queries in memtables. |
| src/active_tombstone_set.rs | New forward/reverse active RT tracking for sweep-line iteration filtering. |
| src/range_tombstone.rs | Defines the RT type, ordering, and helper utilities (intersection, coverage checks). |
| src/range_tombstone_filter.rs | New bidirectional iterator wrapper that suppresses KV entries covered by RTs. |
| src/range.rs | Collects RTs from all layers; wraps merged iterator with RangeTombstoneFilter; adds table-skip optimization. |
| src/table/block/type.rs | Adds new BlockType::RangeTombstone = 4. |
| src/table/regions.rs | Extends table TOC parsing to include optional range_tombstones region. |
| src/table/inner.rs | Stores recovered table RTs in Inner. |
| src/table/mod.rs | Loads/decodes RT blocks on table open; exposes range_tombstones() accessor. |
| src/table/writer/mod.rs | Adds RT collection and writes RT block + metadata count during table finish. |
| src/table/multi_writer.rs | Distributes/clips RTs across rotated output tables (flush/compaction output). |
| src/table/util.rs | Treats RT blocks like data/meta blocks for metrics counters; changes lint suppression. |
| src/table/iter.rs | Changes lint suppression for constructor with many args. |
| src/compaction/flavour.rs | Adds write_range_tombstones hook so compaction can propagate RTs to outputs. |
| src/compaction/worker.rs | Collects RTs from input tables and propagates them to compaction outputs (with GC at last level). |
| src/lib.rs | Exposes new internal modules for RT support. |
| .forge/16/known-gaps.md | Documents known limitations and optimization opportunities for RT feature. |
| .forge/16/analysis.md | Design/architecture analysis and upstream context for RT implementation choices. |
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/table/mod.rs (1)
648-666: Consider preserving error context for easier debugging.

The `map_err(|_| crate::Error::Unrecoverable)` discards the original I/O error. While `Unrecoverable` is semantically appropriate for corrupted data, preserving context (e.g., which field failed, byte offset) could aid debugging in production.

💡 Optional: Add context to decode errors

If a more descriptive error variant exists or is added later (e.g., `Error::RangeTombstoneCorrupted { offset, reason }`), consider using it here to preserve the cursor position and failure reason.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/mod.rs` around lines 648 - 666, The current reads (cursor.read_u16::<LE>(), cursor.read_exact(...), cursor.read_u64::<LE>()) map all IO errors to crate::Error::Unrecoverable, discarding the original error and offset; update these map_err calls to capture and wrap the original error and context (e.g., which field: "start_len"/"start_buf"/"end_len"/"end_buf"/"seqno" and the cursor position/byte offset) into a more informative Error variant or into Unrecoverable with a message, so failures when reading start_len, start_buf, end_len, end_buf, and seqno preserve the underlying io::Error and an indication of the failing field/offset for easier debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/table/mod.rs`:
- Around line 671-675: Add a unit test named
decode_range_tombstones_invalid_interval_returns_error() that builds a
serialized range tombstone blob matching the production format (little-endian
lengths and seqno fields), intentionally sets the interval so start >= end,
calls decode_range_tombstones(...) and asserts it returns
Err(crate::Error::Unrecoverable); reference the decode_range_tombstones function
and use the same byteorder::<LE> encoding for length/seqno fields so the
deserializer exercises the start >= end validation path.
---
Nitpick comments:
In `@src/table/mod.rs`:
- Around line 648-666: The current reads (cursor.read_u16::<LE>(),
cursor.read_exact(...), cursor.read_u64::<LE>()) map all IO errors to
crate::Error::Unrecoverable, discarding the original error and offset; update
these map_err calls to capture and wrap the original error and context (e.g.,
which field: "start_len"/"start_buf"/"end_len"/"end_buf"/"seqno" and the cursor
position/byte offset) into a more informative Error variant or into
Unrecoverable with a message, so failures when reading start_len, start_buf,
end_len, end_buf, and seqno preserve the underlying io::Error and an indication
of the failing field/offset for easier debugging.
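The wire format implied by the comment above (little-endian `u16` key lengths, raw key bytes, little-endian `u64` seqno) can be decoded with the standard library alone while preserving the failing field and byte offset. This is a hedged sketch, not the crate's actual code: `DecodeError`, `decode_one`, and the field names are illustrative stand-ins for whatever error variant the crate adopts.

```rust
use std::io::{Cursor, Read};

/// Illustrative error carrying the failing field and byte offset,
/// as the review suggests (NOT the crate's real Error type).
#[derive(Debug)]
struct DecodeError {
    field: &'static str,
    offset: u64,
}

#[derive(Debug, PartialEq)]
struct RangeTombstone {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

fn read_u16_le(cur: &mut Cursor<&[u8]>, field: &'static str) -> Result<u16, DecodeError> {
    let offset = cur.position();
    let mut b = [0u8; 2];
    cur.read_exact(&mut b).map_err(|_| DecodeError { field, offset })?;
    Ok(u16::from_le_bytes(b))
}

fn read_u64_le(cur: &mut Cursor<&[u8]>, field: &'static str) -> Result<u64, DecodeError> {
    let offset = cur.position();
    let mut b = [0u8; 8];
    cur.read_exact(&mut b).map_err(|_| DecodeError { field, offset })?;
    Ok(u64::from_le_bytes(b))
}

fn decode_one(cur: &mut Cursor<&[u8]>) -> Result<RangeTombstone, DecodeError> {
    let start_len = read_u16_le(cur, "start_len")? as usize;
    let mut start = vec![0u8; start_len];
    let offset = cur.position();
    cur.read_exact(&mut start)
        .map_err(|_| DecodeError { field: "start_buf", offset })?;

    let end_len = read_u16_le(cur, "end_len")? as usize;
    let mut end = vec![0u8; end_len];
    let offset = cur.position();
    cur.read_exact(&mut end)
        .map_err(|_| DecodeError { field: "end_buf", offset })?;

    let seqno = read_u64_le(cur, "seqno")?;

    // The start >= end validation path the review asks to test.
    if start >= end {
        return Err(DecodeError { field: "interval", offset: cur.position() });
    }
    Ok(RangeTombstone { start, end, seqno })
}

fn main() {
    // Encode [a, b) @ seqno 7 and round-trip it.
    let mut buf = Vec::new();
    buf.extend_from_slice(&1u16.to_le_bytes());
    buf.push(b'a');
    buf.extend_from_slice(&1u16.to_le_bytes());
    buf.push(b'b');
    buf.extend_from_slice(&7u64.to_le_bytes());

    let mut cur = Cursor::new(buf.as_slice());
    let rt = decode_one(&mut cur).unwrap();
    assert_eq!(rt, RangeTombstone { start: b"a".to_vec(), end: b"b".to_vec(), seqno: 7 });

    // Truncated input surfaces the failing field and offset.
    let short = [1u8, 0]; // start_len = 1, but no key bytes follow
    let mut cur = Cursor::new(&short[..]);
    let err = decode_one(&mut cur).unwrap_err();
    assert_eq!(err.field, "start_buf");
    assert_eq!(err.offset, 2);
}
```

Carrying the pre-read cursor position in each error is what makes a corrupted-block report actionable without re-parsing.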
- Fix visible_at to use exclusive boundary (seqno < read_seqno), consistent with seqno_filter and Memtable::get conventions
- Validate decoded field lengths against remaining block data before allocating buffers (prevent OOM on corrupted blocks)
- Skip SST tables by key_range in point read RT suppression check
- Add code comments explaining #[allow] vs #[expect] for cfg-gated params
- Use #[expect] with reason for too_many_lines on create_range
- Fix remove_range doc to match no-op behavior for invalid intervals
- Update PR body to reflect RT-only flush semantics
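The exclusive-boundary convention the commit above settles on (`seqno < read_seqno`) can be sketched in isolation. This is a minimal stand-in, not the crate's implementation: the struct fields and the `suppresses` helper are assumptions based on the PR description of half-open `[start, end)` tombstones.

```rust
/// Minimal sketch of the MVCC visibility rule described above (assumed shape,
/// not the crate's actual type).
#[derive(Debug, Clone)]
struct RangeTombstone {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

impl RangeTombstone {
    /// Exclusive boundary: a tombstone written at `seqno` is visible only to
    /// reads whose snapshot is strictly newer, matching seqno_filter.
    fn visible_at(&self, read_seqno: u64) -> bool {
        self.seqno < read_seqno
    }

    /// Suppress `key` written at `item_seqno` if this tombstone is visible,
    /// is newer than the item, and covers the key in [start, end).
    fn suppresses(&self, key: &[u8], item_seqno: u64, read_seqno: u64) -> bool {
        self.visible_at(read_seqno)
            && item_seqno < self.seqno
            && self.start.as_slice() <= key
            && key < self.end.as_slice()
    }
}

fn main() {
    let rt = RangeTombstone { start: b"b".to_vec(), end: b"d".to_vec(), seqno: 10 };
    assert!(rt.visible_at(11));
    assert!(!rt.visible_at(10)); // exclusive boundary: same-seqno read does not see it
    assert!(rt.suppresses(b"c", 5, 11));
    assert!(!rt.suppresses(b"d", 5, 11)); // end is exclusive
    assert!(!rt.suppresses(b"c", 12, 13)); // item newer than tombstone survives
}
```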
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/table/util.rs (1)
31-32: ⚠️ Potential issue | 🟡 Minor — Use `cfg_attr(..., expect(...))` instead of a permanent `#[allow]`.

The feature-gated parameter count is exactly what `cfg_attr(feature = "metrics", expect(...))` is for. Leaving `#[allow]` here suppresses the lint in the one build where it should still fire.

Suggested patch

```diff
-// NOTE: #[allow] not #[expect] because arg count changes with cfg(feature = "metrics")
-#[allow(clippy::too_many_arguments)]
+#[cfg_attr(
+    feature = "metrics",
+    expect(
+        clippy::too_many_arguments,
+        reason = "metrics adds the extra parameter; without that feature this stays at the lint threshold"
+    )
+)]
 pub fn load_block(
```

As per coding guidelines, "Prefer `#[expect(lint)]` over `#[allow(lint)]` — `#[expect]` warns when suppression becomes unnecessary."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/table/util.rs` around lines 31 - 32, Replace the permanent attribute #[allow(clippy::too_many_arguments)] with a conditional expect so the lint still fires when the metrics feature is disabled; specifically change the attribute to #[cfg_attr(feature = "metrics", expect(clippy::too_many_arguments))] (or the inverse depending on feature semantics) at the same item (the attribute applied in src/table/util.rs) so the suppression only applies under the cfg where extra arguments exist and the lint will warn in builds where it should.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/range.rs`:
- Around line 217-226: The table-cover check uses table.metadata.key_range.max()
which is an exclusive upper bound, but RangeTombstone::fully_covers() expects an
inclusive max; convert the exclusive upper bound to an inclusive one before
calling fully_covers (e.g., compute an inclusive_max from
table.metadata.key_range.max() using the appropriate predecessor/prev-key API or
method on the key type, handle empty ranges by skipping the check if min >
inclusive_max), and then call rt.fully_covers(table.metadata.key_range.min(),
inclusive_max) inside the all_range_tombstones.iter().any(...) closure while
keeping the visible_at and seqno comparison logic unchanged.
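The boundary-convention bug flagged above can also be avoided without computing a predecessor key at all: when both the tombstone end and the table upper bound are exclusive, comparing them directly is sufficient (the lexicographic predecessor of an arbitrary byte string is not finitely representable, e.g. the predecessor of `"b"` is `"a"` followed by infinitely many `0xFF` bytes). This sketch shows that alternative formulation; it is an assumption-laden stand-in, not the crate's `fully_covers`.

```rust
/// Sketch: does a half-open tombstone [ts_start, ts_end) fully cover a table
/// whose key range is expressed with an EXCLUSIVE upper bound? Comparing the
/// two exclusive ends directly sidesteps the predecessor-key problem.
fn fully_covers_exclusive(
    ts_start: &[u8],
    ts_end: &[u8],       // tombstone end, exclusive
    table_min: &[u8],
    table_max_excl: &[u8], // table upper bound, exclusive
) -> bool {
    ts_start <= table_min && ts_end >= table_max_excl
}

fn main() {
    // [a, z) covers a table spanning [b, c).
    assert!(fully_covers_exclusive(b"a", b"z", b"b", b"c"));
    // Equal exclusive ends still cover: [a, c) vs table [b, c).
    assert!(fully_covers_exclusive(b"a", b"c", b"b", b"c"));
    // Tombstone ends before the table does: no cover.
    assert!(!fully_covers_exclusive(b"a", b"c", b"b", b"d"));
    // Tombstone starts after the table's min: no cover.
    assert!(!fully_covers_exclusive(b"c", b"z", b"b", b"d"));
}
```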
---
Duplicate comments:
In `@src/table/util.rs`:
- Around line 31-32: Replace the permanent attribute
#[allow(clippy::too_many_arguments)] with a conditional expect so the lint still
fires when the metrics feature is disabled; specifically change the attribute to
#[cfg_attr(feature = "metrics", expect(clippy::too_many_arguments))] (or the
inverse depending on feature semantics) at the same item (the attribute applied
in src/table/util.rs) so the suppression only applies under the cfg where extra
arguments exist and the lint will warn in builds where it should.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f4d3251b-ba68-46f2-8418-870e509cdcef
📒 Files selected for processing (7)
- src/abstract_tree.rs
- src/range.rs
- src/range_tombstone.rs
- src/table/iter.rs
- src/table/mod.rs
- src/table/util.rs
- src/tree/mod.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/table/iter.rs
Replace #[allow(clippy::too_many_arguments)] with #[cfg_attr(feature = "metrics", expect(...))] so the suppression only activates when the extra metrics parameter is compiled in.
Pull request overview
This PR introduces MVCC-correct range tombstones to support efficient delete_range / delete_prefix semantics across memtables and SSTs, including persistence through flush/compaction/recovery and iterator-time suppression.
Changes:
- Add `remove_range`/`remove_prefix` to `AbstractTree`, with memtable storage via an AVL interval tree and iterator-time filtering via a sweep-line `RangeTombstoneFilter`.
- Persist range tombstones into SSTs via a new `BlockType::RangeTombstone` and load/validate them on table recovery.
- Propagate/GC range tombstones during compaction and add an integration test suite for end-to-end semantics.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration tests for point reads, iteration, flush/compaction/recovery, prefix deletes, and RT-only flush behavior |
| src/abstract_tree.rs | Adds public remove_range/remove_prefix API + flush plumbing for range tombstones |
| src/tree/mod.rs | Flush writes RTs to SSTs and point-read path checks suppression by RTs across layers |
| src/blob_tree/mod.rs | RT support for blob-tree flush path + forwards remove_range |
| src/memtable/mod.rs | Stores RTs in a mutex-protected interval tree; exposes query/sorted export helpers |
| src/memtable/interval_tree.rs | New AVL interval tree implementation for efficient overlap/suppression queries |
| src/range.rs | Wraps merged iterators with RangeTombstoneFilter and adds table-skip optimization |
| src/range_tombstone.rs | Defines RangeTombstone semantics, ordering, clipping, and helpers |
| src/range_tombstone_filter.rs | Bidirectional sweep-line iterator filter for RT suppression |
| src/active_tombstone_set.rs | Active set structures used by the iterator filter (forward + reverse) |
| src/table/block/type.rs | Adds new on-disk block tag RangeTombstone = 4 |
| src/table/writer/mod.rs | Writes an RT block into SSTs and records range_tombstone_count in metadata |
| src/table/regions.rs | TOC parsing extended to locate the range_tombstones section |
| src/table/inner.rs | Stores loaded RTs in table inner state |
| src/table/mod.rs | Loads/decodes RT blocks on recovery and exposes range_tombstones() |
| src/table/multi_writer.rs | Attempts to distribute/clamp RTs across rotated output tables |
| src/table/util.rs | Treats RT blocks like data/meta for metrics bookkeeping; lint suppression change |
| src/table/iter.rs | Lint suppression change for constructor arg count |
| src/compaction/worker.rs | Collects RTs from input tables and attempts to propagate/GC them |
| src/compaction/flavour.rs | Adds write_range_tombstones hook and wires it to the table writer |
| src/lib.rs | Exposes new internal modules for range tombstone support |
| .forge/16/known-gaps.md | Documents limitations/known gaps for the feature |
| .forge/16/analysis.md | Architecture analysis and design trade-offs documentation |
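The summary above credits `src/memtable/interval_tree.rs` with O(log n) overlap/suppression queries. A minimal way to see the core idea is a BST keyed by interval start and augmented with the maximum exclusive end in each subtree, which lets a point ("stabbing") query prune whole subtrees. This sketch omits the AVL rebalancing and seqno payloads the PR's version carries; names and structure are illustrative, not the crate's.

```rust
/// Interval-tree sketch: BST keyed by interval start, augmented with the max
/// exclusive end in each subtree. The PR's version adds AVL rebalancing and
/// per-tombstone seqnos; this stand-in omits both for brevity.
struct Node {
    start: Vec<u8>,
    end: Vec<u8>,     // exclusive
    max_end: Vec<u8>, // max end over this node and its subtrees
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

impl Node {
    fn insert(node: &mut Option<Box<Node>>, start: Vec<u8>, end: Vec<u8>) {
        match node {
            None => {
                *node = Some(Box::new(Node {
                    max_end: end.clone(),
                    start,
                    end,
                    left: None,
                    right: None,
                }));
            }
            Some(n) => {
                if end > n.max_end {
                    n.max_end = end.clone();
                }
                if start < n.start {
                    Self::insert(&mut n.left, start, end);
                } else {
                    Self::insert(&mut n.right, start, end);
                }
            }
        }
    }

    /// Collect all intervals containing `key`. Prune any subtree whose
    /// max_end <= key (with exclusive ends, nothing there can cover the point),
    /// and skip the right subtree when this node's start already exceeds `key`.
    fn stab<'a>(node: &'a Option<Box<Node>>, key: &[u8], out: &mut Vec<(&'a [u8], &'a [u8])>) {
        let Some(n) = node else { return };
        if n.max_end.as_slice() <= key {
            return; // prune: every end below here is <= key
        }
        Self::stab(&n.left, key, out);
        if n.start.as_slice() <= key && key < n.end.as_slice() {
            out.push((n.start.as_slice(), n.end.as_slice()));
        }
        if n.start.as_slice() <= key {
            Self::stab(&n.right, key, out);
        }
    }
}

fn main() {
    let mut root: Option<Box<Node>> = None;
    Node::insert(&mut root, b"b".to_vec(), b"d".to_vec());
    Node::insert(&mut root, b"a".to_vec(), b"c".to_vec());
    Node::insert(&mut root, b"e".to_vec(), b"g".to_vec());

    let mut hits = Vec::new();
    Node::stab(&root, b"b", &mut hits);
    assert_eq!(hits.len(), 2); // [a,c) and [b,d) both cover "b"

    hits.clear();
    Node::stab(&root, b"d", &mut hits);
    assert!(hits.is_empty()); // ends are exclusive, so "d" is uncovered
}
```

With balancing, both pruning rules bound a stabbing query at O(log n + k) for k overlapping intervals.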
🧹 Nitpick comments (2)
src/range.rs (2)
302-307: Strengthen the fast path to skip wrapping when no tombstone is visible.

A non-empty `all_range_tombstones` can still have zero entries visible at `seqno`, so this currently wraps with a no-op filter in those cases.

⚡ Proposed patch

```diff
-    // Fast path: skip filter wrapping when no range tombstones exist
-    if all_range_tombstones.is_empty() {
+    // Fast path: skip wrapping when no tombstone is visible for this read seqno
+    if !all_range_tombstones.iter().any(|rt| rt.visible_at(seqno)) {
         Box::new(iter)
     } else {
         Box::new(RangeTombstoneFilter::new(iter, all_range_tombstones, seqno))
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/range.rs` around lines 302 - 307, The fast-path should skip wrapping not only when all_range_tombstones.is_empty() but also when none of the tombstones are visible at the given seqno; update the conditional around Box::new(...) to inspect all_range_tombstones for any entry visible at seqno (e.g. using the visibility predicate/method used by RangeTombstoneFilter or by filtering/map-checking the collection) and only construct RangeTombstoneFilter::new(iter, all_range_tombstones, seqno) if at least one tombstone is visible; otherwise return Box::new(iter) unchanged.
176-196: Remove redundant tombstone sorting in this path.

`rts.sort()` here is duplicate work: `RangeTombstoneFilter::new` already sorts the forward list (and builds a reverse-sorted copy). This adds avoidable `O(n log n)` cost per range open.

♻️ Proposed patch

```diff
     let all_range_tombstones = if rt_count > 0 {
         let mut rts: Vec<RangeTombstone> = Vec::with_capacity(rt_count);
         rts.extend(lock.version.active_memtable.range_tombstones_sorted());
         for mt in lock.version.sealed_memtables.iter() {
             rts.extend(mt.range_tombstones_sorted());
         }
         if let Some((mt, _)) = &lock.ephemeral {
             rts.extend(mt.range_tombstones_sorted());
         }
         for table in lock
             .version
             .version
             .iter_levels()
             .flat_map(|lvl| lvl.iter())
             .flat_map(|run| run.iter())
         {
             rts.extend(table.range_tombstones().iter().cloned());
         }
-        rts.sort();
         rts
     } else {
         Vec::new()
     };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/range.rs` around lines 176 - 196, The vector `rts` constructed for `all_range_tombstones` is being sorted with `rts.sort()` redundantly; `RangeTombstoneFilter::new` already sorts the forward list and produces the reverse-sorted copy, so remove the `rts.sort()` call in the block that builds `rts` (the code that extends from `lock.version.active_memtable.range_tombstones_sorted()`, sealed memtables, `lock.ephemeral`, and version runs) and leave the unsorted `rts` to be passed into `RangeTombstoneFilter::new`; this avoids the extra O(n log n) work while preserving correctness for `RangeTombstone` handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/range.rs`:
- Around line 302-307: The fast-path should skip wrapping not only when
all_range_tombstones.is_empty() but also when none of the tombstones are visible
at the given seqno; update the conditional around Box::new(...) to inspect
all_range_tombstones for any entry visible at seqno (e.g. using the visibility
predicate/method used by RangeTombstoneFilter or by filtering/map-checking the
collection) and only construct RangeTombstoneFilter::new(iter,
all_range_tombstones, seqno) if at least one tombstone is visible; otherwise
return Box::new(iter) unchanged.
- Around line 176-196: The vector `rts` constructed for `all_range_tombstones`
is being sorted with `rts.sort()` redundantly; `RangeTombstoneFilter::new`
already sorts the forward list and produces the reverse-sorted copy, so remove
the `rts.sort()` call in the block that builds `rts` (the code that extends from
`lock.version.active_memtable.range_tombstones_sorted()`, sealed memtables,
`lock.ephemeral`, and version runs) and leave the unsorted `rts` to be passed
into `RangeTombstoneFilter::new`; this avoids the extra O(n log n) work while
preserving correctness for `RangeTombstone` handling.
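The sweep-line filtering these comments keep referring to can be sketched in its forward-only form: because both the merged KV stream and the tombstone list are sorted, tombstones are "activated" once the cursor passes their start and expired once it passes their exclusive end, so each tombstone is touched O(1) times. This is an assumption-based stand-in for `RangeTombstoneFilter` (which is bidirectional and iterator-wrapping); names are illustrative.

```rust
/// Forward-only sweep-line sketch of range-tombstone suppression.
/// Illustrative stand-in for the PR's bidirectional RangeTombstoneFilter.
#[derive(Clone)]
struct Rt {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

fn filter_forward(
    items: Vec<(Vec<u8>, u64)>, // (key, seqno), sorted by key
    mut rts: Vec<Rt>,
    read_seqno: u64,
) -> Vec<(Vec<u8>, u64)> {
    // The real filter sorts once in its constructor; we do so here for the sketch.
    rts.sort_by(|a, b| a.start.cmp(&b.start));
    let mut active: Vec<Rt> = Vec::new();
    let mut next = 0;
    let mut out = Vec::new();

    for (key, seqno) in items {
        // Activate tombstones whose start is at or before the cursor.
        while next < rts.len() && rts[next].start <= key {
            active.push(rts[next].clone());
            next += 1;
        }
        // Expire tombstones whose exclusive end is at or before the cursor.
        active.retain(|rt| rt.end.as_slice() > key.as_slice());
        // Suppress if any active tombstone is visible and newer than the item.
        let suppressed = active
            .iter()
            .any(|rt| rt.seqno < read_seqno && seqno < rt.seqno);
        if !suppressed {
            out.push((key, seqno));
        }
    }
    out
}

fn main() {
    let items = vec![
        (b"a".to_vec(), 1u64),
        (b"b".to_vec(), 2),
        (b"c".to_vec(), 8),
    ];
    let rts = vec![Rt { start: b"b".to_vec(), end: b"d".to_vec(), seqno: 5 }];
    let out = filter_forward(items, rts, 10);
    // "b" (seqno 2 < 5) is suppressed; "c" (seqno 8 > 5) survives the tombstone.
    assert_eq!(out.len(), 2);
    assert_eq!(out[0].0, b"a".to_vec());
    assert_eq!(out[1].0, b"c".to_vec());
}
```

The reverse direction needs its own active set ordered by end key, which is why the PR ships separate forward and reverse `ActiveTombstoneSet` structures.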
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 36192350-8de8-4946-a95f-d0e3da61329a
📒 Files selected for processing (3)
- src/range.rs
- src/table/iter.rs
- src/table/util.rs
Codecov Report: ❌ Patch coverage is
…unds

- Set range tombstones on MultiWriter before KV write loop in flush (Tree, BlobTree) so rotations carry RT metadata in earlier tables
- Propagate RTs before merge_iter loop in compaction worker
- Assert u16 bound on start/end key lengths in insert_range_tombstone
- Use exclusive seqno boundary in visible_at doc strings
- Strengthen iteration fast path: check visible_at not just is_empty
- Remove redundant sort (RangeTombstoneFilter::new sorts internally)
- Compaction with MultiWriter rotation preserves RTs across tables
- Table-skip optimization skips fully-covered tables
- BlobTree range tombstone insert, suppress, flush
- Invalid interval (start >= end) silently returns 0
- Multiple compaction rounds preserve range tombstones
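Clipping a tombstone to each rotated output table, as the MultiWriter work above describes, is interval intersection over half-open ranges. A hedged sketch (`clip` and its tuple signature are illustrative, not the crate's API):

```rust
/// Sketch of clipping a half-open tombstone [start, end) to an output table's
/// half-open range [lo, hi) during MultiWriter rotation. `None` means the
/// tombstone does not overlap this table and should be dropped from it.
fn clip(ts: (&[u8], &[u8]), table: (&[u8], &[u8])) -> Option<(Vec<u8>, Vec<u8>)> {
    let start = ts.0.max(table.0); // later of the two starts
    let end = ts.1.min(table.1);   // earlier of the two exclusive ends
    if start < end {
        Some((start.to_vec(), end.to_vec()))
    } else {
        None // empty or inverted intersection
    }
}

fn main() {
    // [a, m) clipped to table [d, h) becomes [d, h).
    assert_eq!(
        clip((&b"a"[..], &b"m"[..]), (&b"d"[..], &b"h"[..])),
        Some((b"d".to_vec(), b"h".to_vec()))
    );
    // [a, c) does not intersect table [d, h).
    assert_eq!(clip((&b"a"[..], &b"c"[..]), (&b"d"[..], &b"h"[..])), None);
    // Exact-touch at the boundary is empty under half-open semantics.
    assert_eq!(clip((&b"a"[..], &b"d"[..]), (&b"d"[..], &b"h"[..])), None);
}
```

The strict `start < end` check is also why an invalid interval can be treated as a silent no-op, matching the commit note above.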
Pull request overview
Adds MVCC-aware range tombstones to enable efficient range/prefix deletion (remove_range / remove_prefix) with persistence across flush/compaction/recovery, plus iterator- and point-read suppression.
Changes:
- Introduces core range-tombstone data structures (range tombstone model, sweep-line active sets, memtable interval tree, iterator filter).
- Persists range tombstones in SSTs via a new
BlockType::RangeTombstoneand TOC section, propagating them through compaction. - Updates read path (point reads + range/prefix iteration) to suppress covered entries and adds comprehensive tests/docs.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration test suite for RT semantics across memtable/SST/compaction/recovery. |
| src/abstract_tree.rs | Adds remove_range API + remove_prefix helper; flush now forwards RTs to table writers. |
| src/blob_tree/mod.rs | Wires RT-aware flush through blob tree and forwards remove_range. |
| src/tree/mod.rs | Implements remove_range for Tree, RT-aware flush-to-tables, and point-read suppression checks. |
| src/range.rs | Collects RTs across layers for range/prefix iteration; adds RT filtering and table-skip optimization. |
| src/memtable/mod.rs | Stores memtable RTs (mutex-protected interval tree); updates emptiness semantics and exposes query/flush helpers. |
| src/memtable/interval_tree.rs | New AVL interval tree for RT overlap and suppression queries in memtables. |
| src/active_tombstone_set.rs | New forward/reverse sweep-line active tombstone sets for iteration. |
| src/range_tombstone.rs | Defines RangeTombstone and helpers (visibility, suppression, clipping, coverage). |
| src/range_tombstone_filter.rs | New bidirectional iterator wrapper that suppresses KV entries covered by visible RTs. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4 to the SST block type enum. |
| src/table/writer/mod.rs | Writes RT blocks + count metadata during SST creation; updates writer state to track RTs. |
| src/table/regions.rs | Parses new "range_tombstones" TOC section into table regions. |
| src/table/mod.rs | Loads/decodes RT blocks on table recovery and exposes Table::range_tombstones(). |
| src/table/inner.rs | Stores recovered table RTs in Table inner state. |
| src/table/multi_writer.rs | Distributes/clips RTs across rotated output tables during flush/compaction. |
| src/table/util.rs | Includes RT blocks in metrics accounting for block loads (cache + IO). |
| src/table/iter.rs | Adjusts clippy lint gating for metrics-conditional argument count. |
| src/compaction/worker.rs | Collects RTs from compaction inputs and propagates them to outputs (with GC eviction at last level). |
| src/compaction/flavour.rs | Extends compaction flavour interface to pass RTs into table writers. |
| src/lib.rs | Exposes new internal modules for RT support. |
| .forge/16/known-gaps.md | Documents known limitations/optimization gaps for Issue #16 implementation. |
| .forge/16/analysis.md | Architecture/design analysis and upstream context for range tombstones. |
- Write synthetic weak tombstone in RT-only tables to produce valid index
- Fix interval_tree pruning: use >= for exclusive seqno boundary
- Replace #[allow(dead_code)] with #[expect(dead_code, reason)] throughout
- Remove key_range filter from SST RT suppression (RT range may exceed table key range)
Pull request overview
This PR introduces native range tombstones (half-open [start, end) deletes) to the LSM-tree, integrating them into point reads, range/prefix iteration, flush/compaction, and SST persistence so contiguous deletes don’t require per-key tombstones.
Changes:
- Add `remove_range`/`remove_prefix` APIs and memtable storage/querying via an AVL interval tree.
- Persist range tombstones in SSTs via a new TOC section (`"range_tombstones"`) and `BlockType::RangeTombstone`.
- Apply MVCC-correct suppression on point reads and iteration, including a table-skip optimization during range scans.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration coverage for MVCC suppression, iteration behavior, persistence, compaction, and BlobTree support. |
| src/abstract_tree.rs | Flush pipeline extended to carry sealed-memtable range tombstones; adds remove_range / remove_prefix API. |
| src/tree/mod.rs | Implements remove_range and point-read suppression by active/sealed/table range tombstones. |
| src/blob_tree/mod.rs | Plumbs remove_range through BlobTree and extends flush to write range tombstones. |
| src/memtable/mod.rs | Stores range tombstones in a mutex-protected interval tree; updates emptiness semantics and exposes query/collection helpers. |
| src/memtable/interval_tree.rs | New AVL interval tree supporting O(log n) suppression queries and sorted extraction for flush. |
| src/range_tombstone.rs | Defines RangeTombstone, MVCC visibility/suppression rules, intersection helpers, and upper_bound_exclusive. |
| src/active_tombstone_set.rs | Sweep-line active-set structures used by iterator filtering (forward + reverse). |
| src/range_tombstone_filter.rs | Bidirectional iterator wrapper that suppresses KV entries covered by visible range tombstones. |
| src/range.rs | Collects tombstones across layers, adds table-skip optimization, and wraps iterators with RangeTombstoneFilter when needed. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/regions.rs | Parses "range_tombstones" TOC handle into table regions. |
| src/table/inner.rs | Stores loaded range tombstones in the table inner struct. |
| src/table/mod.rs | Loads/validates/decodes range tombstone blocks and exposes pub(crate) range_tombstones(). |
| src/table/writer/mod.rs | Writes range tombstone blocks and supports RT-only tables via a synthetic weak-tombstone sentinel. |
| src/table/multi_writer.rs | Distributes RTs across rotated output tables; supports optional clipping for compaction outputs. |
| src/table/util.rs | Extends metrics accounting to include RangeTombstone blocks; refines cfg-gated clippy expect. |
| src/table/iter.rs | Refines cfg-gated clippy expect for constructor arg count under metrics. |
| src/compaction/worker.rs | Collects/dedups RTs from inputs, propagates them to outputs, and evicts below GC watermark at last level. |
| src/compaction/flavour.rs | Enables RT clipping for compaction writers and adds a write_range_tombstones hook to flavours. |
| src/lib.rs | Registers new internal modules for range tombstone functionality. |
| .github/instructions/rust.instructions.md | Adds review guidance around caller-handled edge cases / type-system guarantees / documented design decisions. |
| .github/copilot-instructions.md | Adds “Design Decision Analysis” guidance and documents new module locations. |
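Several summaries above mention an `upper_bound_exclusive` helper, which is what turns `remove_prefix(p)` into `remove_range(p, upper_bound_exclusive(p))`: the smallest key strictly greater than every key sharing the prefix. A hedged sketch of the standard carry-based construction (the function body is an assumption; only the helper's name and role come from the PR summaries):

```rust
/// Sketch of an `upper_bound_exclusive`-style helper: the smallest byte string
/// greater than every key with the given prefix, so that
/// remove_prefix(p) == remove_range(p, upper_bound_exclusive(p)).
/// `None` means "unbounded" (empty prefix or all bytes are 0xFF).
fn upper_bound_exclusive(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut out = prefix.to_vec();
    while let Some(last) = out.last_mut() {
        if *last < 0xFF {
            *last += 1; // bump the last incrementable byte...
            return Some(out);
        }
        out.pop(); // ...carrying past any trailing 0xFF bytes
    }
    None
}

fn main() {
    assert_eq!(upper_bound_exclusive(b"abc"), Some(b"abd".to_vec()));
    // Trailing 0xFF carries into the previous byte.
    assert_eq!(upper_bound_exclusive(b"a\xff"), Some(b"b".to_vec()));
    // All-0xFF prefixes have no finite exclusive upper bound.
    assert_eq!(upper_bound_exclusive(b"\xff\xff"), None);
    assert_eq!(upper_bound_exclusive(b""), None);
}
```

The `None` case is why clipping code needs a fallback path when the upper bound is unbounded, as the compaction-fallback commits in this thread discuss.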
…mpaction fallback key_range

- Restore meta.item_count/key_count after sentinel write so RT-only tables report 0 user-visible items (sentinel is implementation detail)
- Widen key_range in compaction clip fallback (upper_bound_exclusive=None) to include RT coverage, preventing RT loss during future compactions
Pull request overview
Adds first-class range tombstones to the LSM tree, enabling efficient delete_range / delete_prefix semantics with MVCC-aware suppression across memtables and SSTs, plus persistence through flush/compaction/recovery.
Changes:
- Introduces range tombstone core types and iterator-time suppression (`RangeTombstoneFilter`, active sets, memtable interval tree).
- Persists range tombstones in SSTs via a new `BlockType::RangeTombstone` and TOC section, and propagates them through flush/compaction.
- Extends `AbstractTree` with `remove_range` and `remove_prefix`, and adds broad integration test coverage.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration test suite covering RT MVCC behavior, persistence, compaction, and iteration. |
| src/abstract_tree.rs | Adds remove_range / remove_prefix APIs and extends flush path to pass RTs into table writers. |
| src/tree/mod.rs | Implements remove_range, adds RT suppression checks for point reads, and updates flush implementation hook. |
| src/blob_tree/mod.rs | Wires remove_range through BlobTree and updates flush hook to write RTs. |
| src/memtable/mod.rs | Adds RT storage/querying to memtables via a mutex-protected interval tree; updates emptiness semantics. |
| src/memtable/interval_tree.rs | New AVL interval tree implementation for efficient RT suppression queries. |
| src/range.rs | Extends range iteration to collect/apply RT filtering and adds table-skip optimization for fully covered tables. |
| src/range_tombstone.rs | Defines RangeTombstone, intersection/coverage helpers, and upper_bound_exclusive. |
| src/range_tombstone_filter.rs | New bidirectional filter that suppresses KV entries covered by visible RTs during iteration. |
| src/active_tombstone_set.rs | New active RT tracking structures for forward/reverse sweepline iteration. |
| src/lib.rs | Registers new internal modules for RT support. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone with tag 4. |
| src/table/regions.rs | Parses the new range_tombstones TOC section handle. |
| src/table/inner.rs | Stores loaded per-table RTs in table inner state. |
| src/table/mod.rs | Loads and decodes the RT block during table open; exposes range_tombstones() internally. |
| src/table/writer/mod.rs | Adds RT writing, RT-only table handling, and metadata for RT count. |
| src/table/multi_writer.rs | Distributes RTs across output tables; adds clip mode for compaction vs flush semantics. |
| src/table/util.rs | Updates metrics accounting to classify RangeTombstone blocks with data/meta loads. |
| src/table/iter.rs | Adjusts lint gating for too-many-args under metrics feature. |
| src/compaction/worker.rs | Collects/dedups input RTs and propagates them into compaction outputs with GC eviction at last level. |
| src/compaction/flavour.rs | Adds a flavour hook to write RTs into compaction outputs; enables RT clipping for compaction writers. |
| .github/instructions/rust.instructions.md | Adds review guidance about caller-handled edge cases and type-level guarantees. |
| .github/copilot-instructions.md | Adds “Design Decision Analysis” guidance and RT module references. |
…able-skip

- Use SeqNo::MAX (u64::MAX) instead of MAX_SEQNO for sentinel seqno: MVCC filter (item_seqno < read_seqno) excludes it at all snapshots including SeqNo::MAX unbounded reads
- Check table key_range overlap before RT scan in range iteration, avoiding O(rt_count) work for non-overlapping tables
Pull request overview
Adds end-to-end “range tombstone” support (delete-range / delete-prefix) to the LSM tree, including memtable indexing, iterator-time suppression, SST persistence, compaction propagation, and recovery, with an expanded integration test suite.
Changes:
- Introduces core range-tombstone types and iteration-time suppression (`RangeTombstone`, active sets, interval tree, `RangeTombstoneFilter`).
- Persists range tombstones in SSTs via a dedicated TOC section/block type and propagates/clips them through compaction.
- Adds `remove_range`/`remove_prefix` APIs on `AbstractTree` and extensive integration tests covering MVCC visibility and persistence.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration tests for range tombstones across memtable/SST/compaction/recovery and APIs. |
| src/abstract_tree.rs | Adds remove_range / remove_prefix and RT-aware flush plumbing. |
| src/tree/mod.rs | Implements remove_range, RT-aware flush, and point-read RT suppression. |
| src/blob_tree/mod.rs | Wires remove_range and RT-aware flush for BlobTree. |
| src/memtable/mod.rs | Stores/query range tombstones in a mutex-protected interval tree; updates emptiness semantics. |
| src/memtable/interval_tree.rs | New AVL interval tree implementation for efficient memtable RT suppression queries. |
| src/range.rs | Collects RTs across layers, adds table-skip optimization, and wraps iterators with RangeTombstoneFilter. |
| src/range_tombstone.rs | Defines RangeTombstone, ordering, intersection helpers, and upper-bound helper for clipping. |
| src/range_tombstone_filter.rs | New bidirectional iterator filter that suppresses entries covered by RTs. |
| src/active_tombstone_set.rs | New active tombstone tracking sets used by the iterator filter. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone (tag 4). |
| src/table/regions.rs | Parses "range_tombstones" TOC section handle. |
| src/table/inner.rs | Stores loaded per-table RTs in Inner. |
| src/table/mod.rs | Loads/decodes RT block on open and exposes range_tombstones() internally. |
| src/table/writer/mod.rs | Adds RT writing support (RT block + meta count) and RT-only table sentinel behavior. |
| src/table/multi_writer.rs | Distributes RTs across output tables; supports optional clipping for compaction. |
| src/table/util.rs | Extends metrics accounting to include RangeTombstone blocks. |
| src/table/iter.rs | Adjusts clippy gating for metrics-only parameter count changes. |
| src/compaction/worker.rs | Collects/dedups RTs from inputs, GC-evicts at last level, and propagates to outputs. |
| src/compaction/flavour.rs | Adds RT propagation hook and enables RT clipping for compaction writer. |
| src/lib.rs | Adds internal modules for RT implementation pieces. |
| .github/instructions/rust.instructions.md | Updates review guidance to require checking call chains/type guarantees/design comments. |
| .github/copilot-instructions.md | Adds “Design Decision Analysis” guidance and RT architecture notes. |
… visibility

- Revert sentinel seqno from SeqNo::MAX to MAX_SEQNO: respects MSB reservation for transaction layer, visible at unbounded reads but harmless (WeakTombstone filtered by is_tombstone/CompactionStream)
- Make write_range_tombstone pub(crate): internal API only
Pull request overview
Adds first-class range tombstone support (delete range / delete prefix) across memtables, SST persistence, compaction, and iteration, enabling efficient bulk deletes while preserving MVCC semantics.
Changes:
- Introduces `RangeTombstone` + in-mem interval tree and sweep-line iterator filtering (`RangeTombstoneFilter`, active tombstone sets).
- Persists range tombstones to SSTs via a dedicated TOC section/block and wires them through flush + compaction (including clipping behavior).
- Updates read paths for point reads and range/prefix iteration, including a table-skip optimization for range scans.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration coverage for point reads, iteration, flush/compaction/recovery, prefix deletes, and RT-only flush scenarios. |
| src/abstract_tree.rs | Adds remove_range/remove_prefix API and extends flush pipeline to pass range tombstones into table writers. |
| src/tree/mod.rs | Implements remove_range and adds point-read suppression using memtable + SST range tombstones. |
| src/blob_tree/mod.rs | Plumbs range tombstones through BlobTree flush and delegates remove_range to the index tree. |
| src/memtable/mod.rs | Stores range tombstones in a mutex-protected interval tree and integrates them into memtable emptiness/size/seqno tracking. |
| src/memtable/interval_tree.rs | New AVL interval tree supporting suppression queries and sorted RT extraction for flush. |
| src/range_tombstone.rs | Core RT model + MVCC visibility/suppression rules + intersection/cover logic + exclusive-upper-bound helper. |
| src/active_tombstone_set.rs | Sweep-line “active set” implementations for forward and reverse iteration suppression. |
| src/range_tombstone_filter.rs | Bidirectional iterator wrapper that suppresses KV entries covered by visible RTs. |
| src/range.rs | Collects RTs from all layers, applies filtering in iterators, and adds a table-skip fast-path. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/regions.rs | Adds "range_tombstones" TOC section parsing. |
| src/table/writer/mod.rs | Writes RT blocks into SSTs and handles RT-only tables via a synthetic sentinel key. |
| src/table/multi_writer.rs | Distributes RTs across rotated output tables; supports compaction clipping vs flush passthrough. |
| src/table/mod.rs | Loads and decodes RT blocks on table open; exposes range_tombstones() for read/compaction paths. |
| src/table/inner.rs | Stores loaded table RTs in Inner. |
| src/table/util.rs | Extends metrics categorization to include the RT block type. |
| src/table/iter.rs | Adjusts clippy too_many_arguments expectation under metrics feature. |
| src/compaction/worker.rs | Propagates/dedups RTs from input to output and applies GC eviction at bottom level. |
| src/compaction/flavour.rs | Adds write_range_tombstones hook and enables RT clipping for compaction output. |
| src/lib.rs | Registers new internal modules for RT machinery. |
| .github/instructions/rust.instructions.md | Adds review guidance on caller-handled edge cases, type-level guarantees, and documented design decisions. |
| .github/copilot-instructions.md | Adds “Design Decision Analysis” guidance and documents new RT module locations. |
… KV bounds

Remove flush and compaction-fallback key_range widening that mixed inclusive KV bounds with exclusive RT ends. Flush writes ALL RTs to ALL output tables, so widening is unnecessary. Document the table-skip seqno limitation (highest_seqno includes RT seqnos).
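The table-skip check that commit documents combines three conditions: the tombstone must be visible at the read snapshot, strictly newer than everything in the table, and must fully cover the table's key range. A hedged sketch of that predicate (names and the exclusive-upper-bound convention are assumptions; note that when `highest_seqno` already includes RT seqnos, the `>` comparison is conservative, which is exactly the limitation the commit calls out):

```rust
/// Sketch of the table-skip predicate for range scans: skip a table only if
/// some tombstone is visible at the snapshot, newer than every entry in the
/// table, and fully covers its (exclusive-upper-bound) key range.
struct Rt {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

fn can_skip_table(
    rts: &[Rt],
    table_min: &[u8],
    table_max_excl: &[u8],
    table_highest_seqno: u64,
    read_seqno: u64,
) -> bool {
    rts.iter().any(|rt| {
        rt.seqno < read_seqno                 // visible at this snapshot
            && rt.seqno > table_highest_seqno // newer than everything in the table
            && rt.start.as_slice() <= table_min
            && rt.end.as_slice() >= table_max_excl
    })
}

fn main() {
    let rts = vec![Rt { start: b"a".to_vec(), end: b"z".to_vec(), seqno: 50 }];
    // Covered and strictly newer than the table's contents: skippable.
    assert!(can_skip_table(&rts, b"b", b"c", 40, 100));
    // Not visible at this snapshot: not skippable.
    assert!(!can_skip_table(&rts, b"b", b"c", 40, 50));
    // Older than something in the table: not skippable.
    assert!(!can_skip_table(&rts, b"b", b"c", 60, 100));
}
```

A false negative here only costs a table read; a false positive would drop live data, so conservatism is the right trade.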
Pull request overview
Adds native range tombstones (delete_range / delete_prefix) to the LSM implementation, including in-memory tracking, iterator-time suppression, SST persistence, and compaction propagation.
Changes:
- Introduces `RangeTombstone` + in-mem `IntervalTree` storage and range-iteration suppression via `RangeTombstoneFilter`.
- Persists range tombstones in SSTs via a new `BlockType::RangeTombstone` TOC section and loads/decodes them on table recovery.
- Propagates/deduplicates range tombstones during compaction and wires new `remove_range`/`remove_prefix` APIs through `AbstractTree` (Tree + BlobTree).
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration coverage for range tombstones across memtable/SST/compaction/recovery. |
| src/abstract_tree.rs | Adds remove_range/remove_prefix and threads RTs through flush via flush_to_tables_with_rt. |
| src/tree/mod.rs | Implements remove_range, flush RT plumbing, and point-read suppression against RTs. |
| src/blob_tree/mod.rs | Threads RT-aware flush + exposes remove_range by delegating to index tree. |
| src/memtable/mod.rs | Stores RTs in a mutex-protected interval tree and exposes suppression + export for flush. |
| src/memtable/interval_tree.rs | New AVL interval tree implementation for memtable RT queries. |
| src/range.rs | Collects RTs across layers, adds table-skip optimization, and wraps iterators with RT filter. |
| src/range_tombstone.rs | New RT model, visibility/suppression logic, intersection helpers, and bounds helper. |
| src/range_tombstone_filter.rs | New bidirectional iterator filter using sweep-line activation/expiry. |
| src/active_tombstone_set.rs | Active-set data structures for forward/reverse RT activation during scans. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/writer/mod.rs | Writes RT blocks to SSTs and supports RT-only tables via a sentinel KV entry. |
| src/table/multi_writer.rs | Distributes RTs across rotated output tables; adds optional clipping for compaction. |
| src/table/regions.rs | Parses optional "range_tombstones" TOC section. |
| src/table/mod.rs | Loads/decodes RT blocks on recover; exposes range_tombstones() to callers. |
| src/table/inner.rs | Stores loaded RTs in table inner state. |
| src/table/util.rs | Treats RT blocks in metrics accounting (cache/io counters). |
| src/table/iter.rs | Adjusts clippy suppression gating for metrics-only arg count changes. |
| src/compaction/worker.rs | Collects/dedups RTs from inputs and propagates them to outputs (with bottom-level GC). |
| src/compaction/flavour.rs | Extends compaction flavour to accept RT propagation; enables RT clipping on output. |
| src/lib.rs | Registers new internal modules for RT support. |
| .github/instructions/rust.instructions.md | Adds review guidance about caller-handled edge cases/type-level guarantees. |
| .github/copilot-instructions.md | Adds “Design Decision Analysis” guidance + RT module pointers. |
…T discovery Without widening, disjoint RTs (delete_range on keys only in older SSTs) are invisible to leveled compaction overlap selection — the table's KV key_range doesn't cover the RT's target keys, so compaction never picks up the table, leaving the RT permanently stuck. Use rt.end (exclusive) as conservative inclusive upper bound: over-approximates actual KV max but ensures compaction discovers all RTs.
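The widening described in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions: `widen_key_range` and the `(lo, hi)` tuple stand in for the crate's actual key-range type and flush logic.

```rust
// Extend a table's KV key_range to cover its range tombstones so leveled
// compaction's overlap check can discover tables whose RTs target keys
// outside their KV range.
type Key = Vec<u8>;

fn widen_key_range(
    kv_range: Option<(Key, Key)>, // (min KV key, max KV key), both inclusive
    rts: &[(Key, Key)],           // (start inclusive, end exclusive)
) -> Option<(Key, Key)> {
    let mut range = kv_range;
    for (start, end) in rts {
        range = Some(match range {
            // RT-only table: derive the range from the tombstone bounds.
            None => (start.clone(), end.clone()),
            Some((lo, hi)) => (
                lo.min(start.clone()),
                // rt.end is exclusive; using it as an inclusive upper bound
                // over-approximates but guarantees overlap discovery.
                hi.max(end.clone()),
            ),
        });
    }
    range
}

fn main() {
    // KVs only span ["a","c"], but an RT targets ["x","z") in older SSTs.
    let widened = widen_key_range(
        Some((b"a".to_vec(), b"c".to_vec())),
        &[(b"x".to_vec(), b"z".to_vec())],
    );
    assert_eq!(widened, Some((b"a".to_vec(), b"z".to_vec())));
}
```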
…action Verify that a range tombstone disjoint from the memtable's KV key range survives multiple compaction rounds. Without flush key_range widening, leveled compaction overlap selection would miss the table carrying the disjoint RT.
Pull request overview
Adds native range tombstones (delete-range/delete-prefix) to the LSM tree, integrating them into the memtable, flush/compaction pipelines, on-disk SST format, and read/iteration paths to enable efficient bulk deletions with MVCC-correct visibility.
Changes:
- Introduce core range-tombstone types and iteration filtering (sweep-line activation/expiry, memtable interval tree queries).
- Persist range tombstones in SSTs via a new `range_tombstones` TOC section / `BlockType::RangeTombstone`, and propagate them through flush + compaction (with compaction-time clipping and dedup).
- Extend `AbstractTree` with `remove_range`/`remove_prefix`, and add extensive integration tests covering correctness and persistence.
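The core types mentioned above rest on two rules: half-open `[start, end)` containment, and MVCC visibility (a tombstone only suppresses entries it is newer than, and only at snapshots where the tombstone itself is visible). A minimal model, with illustrative field and method names rather than the crate's actual API:

```rust
// Half-open range tombstone with an MVCC-aware suppression rule.
struct RangeTombstone {
    start: Vec<u8>, // inclusive
    end: Vec<u8>,   // exclusive
    seqno: u64,
}

impl RangeTombstone {
    /// Half-open containment: start <= key < end.
    fn covers(&self, key: &[u8]) -> bool {
        self.start.as_slice() <= key && key < self.end.as_slice()
    }

    /// A tombstone suppresses an entry only if it is visible at the read
    /// snapshot AND is newer than the entry being checked.
    fn suppresses(&self, key: &[u8], entry_seqno: u64, snapshot: u64) -> bool {
        self.covers(key) && self.seqno <= snapshot && self.seqno > entry_seqno
    }
}

fn main() {
    let rt = RangeTombstone { start: b"b".to_vec(), end: b"d".to_vec(), seqno: 5 };
    assert!(rt.covers(b"b"));             // start is inclusive
    assert!(!rt.covers(b"d"));            // end is exclusive
    assert!(rt.suppresses(b"c", 3, 10));  // older entry, visible RT
    assert!(!rt.suppresses(b"c", 3, 4));  // RT not yet visible at this snapshot
    assert!(!rt.suppresses(b"c", 9, 10)); // entry written after the RT
}
```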
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration tests for point/range/prefix reads, MVCC visibility, persistence, compaction behavior, and RT-only flush scenarios. |
| src/abstract_tree.rs | Adds remove_range/remove_prefix API; plumbs range tombstones through flush via flush_to_tables_with_rt. |
| src/tree/mod.rs | Implements remove_range; applies range-tombstone suppression on point reads across memtables and SSTs; updates flush implementation signature. |
| src/blob_tree/mod.rs | Wires remove_range through to the underlying tree; updates flush path to accept RTs. |
| src/memtable/mod.rs | Stores range tombstones in a mutex-protected interval tree; updates is_empty semantics; adds query/collection helpers. |
| src/memtable/interval_tree.rs | New AVL interval tree for efficient range tombstone suppression queries and sorted iteration for flush. |
| src/range_tombstone.rs | New RangeTombstone model (half-open intervals), clipping helpers, coverage checks, and upper_bound_exclusive. |
| src/active_tombstone_set.rs | New active-set trackers for forward/reverse iteration (activation/expiry + max-seqno suppression checks). |
| src/range_tombstone_filter.rs | New bidirectional iterator wrapper that suppresses KV entries covered by visible range tombstones. |
| src/range.rs | Collects RTs from all layers for iteration; adds table-skip optimization for fully covered tables; wraps merged iterator with RT filter when needed. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4 for SST block tagging. |
| src/table/regions.rs | Adds parsing of the range_tombstones TOC section handle. |
| src/table/inner.rs | Stores loaded range tombstones in Table inner state. |
| src/table/mod.rs | Loads/decodes range tombstones during table recovery; exposes range_tombstones() accessor. |
| src/table/writer/mod.rs | Adds RT block writing and metadata; introduces RT-only table handling via a synthetic sentinel KV. |
| src/table/multi_writer.rs | Distributes RTs across output tables; supports compaction clipping vs flush “unclipped” behavior; writes RTs on rotate/finish. |
| src/table/util.rs | Updates metrics accounting to treat RT blocks similarly to data/meta blocks; adjusts too_many_arguments lint gating. |
| src/table/iter.rs | Adjusts too_many_arguments lint gating for metrics feature combinations. |
| src/compaction/worker.rs | Collects/dedups input RTs; propagates RTs to compaction outputs and evicts below GC watermark at last level. |
| src/compaction/flavour.rs | Enables RT clipping in compaction MultiWriter; adds a write_range_tombstones hook for compaction flavours. |
| src/lib.rs | Adds internal modules for RT components. |
| .github/instructions/rust.instructions.md | Updates review guidance around caller-handled edge cases/type-level guarantees/documented decisions. |
| .github/copilot-instructions.md | Adds design-decision analysis checklist and documents RT-related module architecture notes. |
…ance With MAX_SEQNO, sentinel sorted FIRST in InternalKey ordering (Reverse(seqno)) and could dominate real entries during compaction. Seqno 0 sorts LAST — real entries always appear before sentinel in merge, preventing any data loss. Sentinel gets GC'd at last level.
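The ordering argument in this commit message can be demonstrated directly: internal keys compare by `(user_key, Reverse(seqno))`, so for the same user key a higher seqno sorts first. A small sketch (the tuple shape is illustrative; the crate's `InternalKey` carries more fields):

```rust
use std::cmp::Reverse;

fn main() {
    let mut keys: Vec<(&[u8], u64)> = vec![
        (b"k", 0),        // sentinel at seqno 0
        (b"k", 7),        // real entry
        (b"k", u64::MAX), // sentinel at MAX_SEQNO
    ];
    // InternalKey-style ordering: (user_key, Reverse(seqno)).
    keys.sort_by_key(|&(user_key, seqno)| (user_key.to_vec(), Reverse(seqno)));

    // A MAX_SEQNO sentinel sorts FIRST and would dominate real entries
    // during a compaction merge; a seqno-0 sentinel sorts LAST, so the
    // real entry always wins.
    assert_eq!(keys[0].1, u64::MAX);
    assert_eq!(keys[1].1, 7);
    assert_eq!(keys[2].1, 0);
}
```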
Pull request overview
Adds native range tombstone support to the LSM-tree to enable efficient contiguous deletions via remove_range / remove_prefix, including MVCC-aware reads and full persistence through flush/compaction/recovery.
Changes:
- Introduces range tombstone core types (range model, active sets, memtable interval tree) and iterator filtering.
- Persists range tombstones in SSTs via a dedicated `"range_tombstones"` block and wires them through flush/compaction.
- Adds extensive integration tests covering MVCC visibility, iteration suppression, persistence, and compaction behaviors.
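The persisted block uses the raw wire format stated in the PR summary: `[start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le]`, repeated per tombstone. A sketch of an encode/decode round-trip under that format (error handling simplified; the crate's actual reader/writer types are not shown):

```rust
// Encode/decode the RT block payload:
// [start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le] repeated.
fn encode(rts: &[(Vec<u8>, Vec<u8>, u64)]) -> Vec<u8> {
    let mut buf = Vec::new();
    for (start, end, seqno) in rts {
        // Key lengths are u16; oversized keys would be rejected here.
        let start_len = u16::try_from(start.len()).expect("oversized key");
        let end_len = u16::try_from(end.len()).expect("oversized key");
        buf.extend_from_slice(&start_len.to_le_bytes());
        buf.extend_from_slice(start);
        buf.extend_from_slice(&end_len.to_le_bytes());
        buf.extend_from_slice(end);
        buf.extend_from_slice(&seqno.to_le_bytes());
    }
    buf
}

fn decode(mut buf: &[u8]) -> Vec<(Vec<u8>, Vec<u8>, u64)> {
    let mut out = Vec::new();
    while !buf.is_empty() {
        let take_u16 = |b: &mut &[u8]| {
            let v = u16::from_le_bytes([b[0], b[1]]) as usize;
            *b = &b[2..];
            v
        };
        let start_len = take_u16(&mut buf);
        let start = buf[..start_len].to_vec();
        buf = &buf[start_len..];
        let end_len = take_u16(&mut buf);
        let end = buf[..end_len].to_vec();
        buf = &buf[end_len..];
        let seqno = u64::from_le_bytes(buf[..8].try_into().unwrap());
        buf = &buf[8..];
        // Enforce the start < end invariant during decoding.
        assert!(start < end, "start < end invariant violated");
        out.push((start, end, seqno));
    }
    out
}

fn main() {
    let rts = vec![(b"a".to_vec(), b"m".to_vec(), 42u64)];
    assert_eq!(decode(&encode(&rts)), rts);
}
```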
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | Integration tests for point reads, iteration, persistence, compaction, and prefix/range deletes. |
| src/abstract_tree.rs | Adds remove_range API + remove_prefix helper; extends flush to pass range tombstones into table writers. |
| src/tree/mod.rs | Implements remove_range; integrates RT suppression into point reads; updates flush to accept RTs. |
| src/blob_tree/mod.rs | Propagates remove_range to underlying index; updates flush to accept RTs. |
| src/memtable/mod.rs | Adds mutex-protected interval tree storage for range tombstones and query APIs. |
| src/memtable/interval_tree.rs | New AVL interval tree for efficient RT suppression queries in memtables. |
| src/range_tombstone.rs | New RangeTombstone model, ordering, intersection helpers, and upper_bound_exclusive. |
| src/active_tombstone_set.rs | New forward/reverse active tombstone tracking for iterator-time suppression. |
| src/range_tombstone_filter.rs | New bidirectional iterator wrapper that suppresses keys covered by visible RTs. |
| src/range.rs | Collects RTs across layers for range iteration, adds table-skip optimization, wraps iterator with RT filter. |
| src/lib.rs | Wires new internal modules into the crate. |
| src/table/block/type.rs | Adds BlockType::RangeTombstone tag (4). |
| src/table/regions.rs | Adds TOC region parsing for "range_tombstones". |
| src/table/inner.rs | Stores loaded range tombstones in the table inner state. |
| src/table/mod.rs | Loads/validates/decodes the range tombstone block on table recovery; exposes range_tombstones(). |
| src/table/writer/mod.rs | Adds RT collection + writes RT block; adds RT-only-table handling via synthetic sentinel logic. |
| src/table/multi_writer.rs | Distributes RTs across rotated output tables; supports clipping in compaction vs unclipped flush. |
| src/table/util.rs | Extends metrics categorization to include RangeTombstone block type. |
| src/table/iter.rs | Adjusts clippy cfg_attr for too_many_arguments under metrics. |
| src/compaction/worker.rs | Collects/dedups RTs from inputs and propagates them into compaction outputs with GC eviction at last level. |
| src/compaction/flavour.rs | Plumbs RT propagation into compaction flavours; enables RT clipping for compaction writers. |
| .github/instructions/rust.instructions.md | Adds review guidance about caller-handled edge cases/type-level guarantees. |
| .github/copilot-instructions.md | Adds “Design Decision Analysis” guidance and notes new RT-related modules. |
Verify that RT-only table sentinel at min(rt.start) does not mask real values at the same key in older SSTs when the RT is not yet visible at the read snapshot.
Sentinel seqno derived from saved_lo (lowest RT seqno in the table) instead of hardcoded 0. Ensures sentinel visibility aligns with RT visibility — sentinel only becomes visible when the RTs themselves are visible, preventing masking of older values at intermediate snapshots.
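The visibility alignment described above can be shown with the usual snapshot rule (an entry is visible iff its seqno is at or below the snapshot; `visible` and `saved_lo` below are illustrative names for this sketch):

```rust
// Simplified MVCC visibility: an entry/tombstone with seqno s is visible
// at snapshot t iff s <= t.
fn visible(seqno: u64, snapshot: u64) -> bool {
    seqno <= snapshot
}

fn main() {
    let rt_seqnos = [7u64, 9, 12];
    // saved_lo = lowest RT seqno in the table.
    let saved_lo = *rt_seqnos.iter().min().unwrap();

    // At a snapshot before any RT, the sentinel is invisible too, so it
    // cannot mask real values at the same key in older SSTs.
    assert!(!visible(saved_lo, 5));
    assert!(rt_seqnos.iter().all(|&s| !visible(s, 5)));

    // Once the earliest RT becomes visible, so does the sentinel.
    assert!(visible(saved_lo, 8));
}
```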
Pull request overview
This PR introduces native range tombstones to the lsm-tree fork, enabling efficient deletion of contiguous key ranges and prefixes while preserving MVCC/snapshot semantics across memtables, flush, compaction, and recovery.
Changes:
- Adds `remove_range`/`remove_prefix` APIs on `AbstractTree`, plus internal plumbing to carry range tombstones through flush + compaction.
- Persists range tombstones in SSTs via a new TOC section (`"range_tombstones"`) and `BlockType::RangeTombstone`.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/range_tombstone.rs | New integration tests covering MVCC visibility, flush/compaction/recovery persistence, prefix delete, and regressions. |
| src/abstract_tree.rs | Adds remove_range/remove_prefix APIs and extends flush to pass range tombstones to table writers. |
| src/tree/mod.rs | Implements remove_range, adds point-read suppression via range tombstones, and threads RTs into flush-to-tables. |
| src/blob_tree/mod.rs | Forwards remove_range to the index tree and threads RTs into blob-tree flush-to-tables. |
| src/memtable/mod.rs | Stores range tombstones in a mutex-protected interval tree and exposes query/iteration helpers. |
| src/memtable/interval_tree.rs | New AVL-based interval tree for fast memtable RT suppression and covering-RT queries. |
| src/range.rs | Wraps range iteration with a bidirectional RangeTombstoneFilter and adds table-skip optimization logic. |
| src/range_tombstone_filter.rs | New bidirectional iterator filter that suppresses KVs covered by visible RTs. |
| src/active_tombstone_set.rs | New sweep-line active-set trackers for RT activation/expiry during iteration. |
| src/range_tombstone.rs | New RT data model, ordering, intersection helpers, and upper-bound helper for clipping. |
| src/lib.rs | Wires new internal modules (range_tombstone*, active_tombstone_set). |
| src/table/block/type.rs | Adds BlockType::RangeTombstone = 4. |
| src/table/regions.rs | Adds "range_tombstones" TOC section handle parsing. |
| src/table/inner.rs | Stores loaded per-table range tombstones in Inner. |
| src/table/mod.rs | Loads/decodes the range tombstone block on table open and exposes range_tombstones(). |
| src/table/writer/mod.rs | Writes RT blocks, includes RT counts in meta, and synthesizes a sentinel entry for RT-only tables. |
| src/table/multi_writer.rs | Distributes RTs across output tables; supports compaction clipping vs flush no-clipping + key_range widening. |
| src/table/util.rs | Updates metrics accounting to include BlockType::RangeTombstone; adjusts cfg-based clippy expect. |
| src/table/iter.rs | Adjusts cfg-based clippy expect around too_many_arguments. |
| src/compaction/worker.rs | Collects/dedups RTs from inputs, applies GC eviction at last level, and propagates RTs to outputs. |
| src/compaction/flavour.rs | Adds write_range_tombstones hook and enables RT clipping for compaction output writers. |
| .github/instructions/rust.instructions.md | Adds guidance about tracing call chains/type guarantees before flagging issues. |
| .github/copilot-instructions.md | Adds “Design Decision Analysis” guidance and notes new RT-related modules in architecture. |
```rust
// sentinel seqno from the table's lowest RT seqno (saved_lo),
// ensuring it only becomes visible at or after the earliest
// real entry in this table and cannot cause snapshot/MVCC
// regressions by masking older values in lower tables.
// With Reverse(seqno) ordering, saved_lo sorts after any entry
// with a higher seqno, so it cannot dominate during merges.
let sentinel_seqno: crate::SeqNo = saved_lo;
self.write(InternalValue::new_weak_tombstone(
    start.clone(),
    sentinel_seqno,
))?;
```
Summary

- `remove_range(start, end, seqno)` and `remove_prefix(prefix, seqno)` API on `AbstractTree`

Technical Details
Core types (ported algorithms from upstream PR fjall-rs#242, own SST format):
- `RangeTombstone` — half-open `[start, end)` interval with seqno
- `ActiveTombstoneSet` / `ActiveTombstoneSetReverse` — sweep-line trackers for iteration
- `IntervalTree` — AVL-balanced interval tree in memtable for O(log n) queries
- `RangeTombstoneFilter` — bidirectional wrapper that suppresses covered entries

Read path:
- `RangeTombstoneFilter` with sweep-line activation/expiry

Write path:
- Flush: `key_range` widened to include RT coverage so compaction overlap detection includes these tables; `MultiWriter` uses `clip_range_tombstones=false` for flush
- Compaction: RTs clipped to each output table's key range via `intersect_opt()`; `MultiWriter` uses `clip_range_tombstones=true` (via `.use_clip_range_tombstones()`)
- Deduplication (sort + dedup) to prevent RT accumulation from `MultiWriter` rotation
- RT-only tables: `WeakTombstone` sentinel at `MAX_SEQNO` (reserved, never assigned to real writes) for index creation; `meta.first_key`/`meta.last_key` set to `min(rt.start)`/`max(rt.end)` for compaction overlap

Visibility:
- Helpers (`is_key_suppressed_by_range_tombstone`, `range_tombstones_sorted`, `range_tombstones()`) are `pub(crate)` — internal implementation detail, not part of the public API
- The `RangeTombstone` type is `pub(crate)` (module not publicly exported)

SST format:
- `BlockType::RangeTombstone = 4`
- Wire format: `[start_len:u16_le][start][end_len:u16_le][end][seqno:u64_le]` (repeated)
- `"range_tombstones"` section in table TOC
- `start < end` invariant enforced during decoding
- Key lengths via `u16::try_from` — oversized keys rejected with `log::warn` diagnostic
- `upper_bound_exclusive` returns `Option<UserKey>` (`None` when key at `u16::MAX` length)
Test Plan
- `remove_prefix` API

Closes #16