Building RocksDB-like KV: Range Tombstones and MVCC Point Lookups

January 13, 2026
DBMS, LSM, RocksDB, cckv, C++
The implementation details for Range Tombstones and MVCC Point Lookups in cckv.

This is the first post in a series documenting the development of cckv, a RocksDB-like KV store written in C++20 for my experimental database sboxdb.

I am writing these posts as notes to record the journey of building cckv for my own future reference. I wish I had documented the development of sboxdb from the beginning, but better late than never.

An LSM-tree storage engine typically needs to support two key features for efficient lookups:

  1. Range Deletion: Efficiently deleting large ranges of keys (e.g., DELETE WHERE id > 1000).
  2. MVCC (Multi-Version Concurrency Control): Allowing readers to see a consistent snapshot of the database at a specific Sequence Number, ignoring newer writes.

For every lookup, we must answer two questions: is this key visible at the given version? And is this key covered by any range deletion that happened at or before the given version?

This post explains how cckv solves this using Fragmented Range Tombstones, adapting the approach used by RocksDB.

The Challenge: Overlapping Intervals

Range deletions often overlap. Consider this scenario from the unit tests:

// Visualization of Range Tombstones
//            |--------4---------|
//            |--------3---------|
//      |--1---------|     |----5-------|
// -----b-----d------f-----h----j-------l---

We have four range deletions (Tombstones) with different Sequence Numbers:

  1. [b, f) @ 1
  2. [d, j) @ 3
  3. [d, j) @ 4
  4. [h, l) @ 5

If we read key d at Snapshot 2, it should be deleted (covered by @1). If we read key d at Snapshot 0, it should be visible (tombstone @1 is too new).

Storing these as a simple list requires an O(N) scan for every lookup to find the “max covering sequence” that is visible to our snapshot.
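To make that cost concrete, here is what the naive scan looks like. The types and names (`RangeTombstone`, `NaiveMaxCoveringSeq`) are illustrative, not cckv's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative types; cckv's actual structures differ.
struct RangeTombstone {
  std::string start, end;  // half-open range [start, end)
  uint64_t seq;            // sequence number of the deletion
};

// O(N) baseline: scan every tombstone and keep the largest sequence
// that both covers the key and is visible at the snapshot.
uint64_t NaiveMaxCoveringSeq(const std::vector<RangeTombstone>& tombstones,
                             const std::string& key, uint64_t snapshot_seq) {
  uint64_t max_seq = 0;
  for (const auto& t : tombstones) {
    if (t.start <= key && key < t.end &&  // tombstone covers the key
        t.seq <= snapshot_seq)            // tombstone visible at the snapshot
      max_seq = std::max(max_seq, t.seq);
  }
  return max_seq;
}
```

For the four tombstones above, looking up key `d` at Snapshot 2 visits every tombstone but can only count `[b, f) @ 1`, returning 1.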

The Solution: Fragmentation via Sweep-Line

To optimize this, cckv “fragments” these overlapping intervals into disjoint (non-overlapping) pieces using a sweep-line algorithm, identical to the approach used in RocksDB.

The implementation can be found here.

How the Algorithm Works

The algorithm is driven by the stream of incoming tombstones, which are sorted by their Start Key.

We iterate through these tombstones. Whenever we encounter a new Start Key, we must “flush” our active state up to this new point. This “flush” process handles any tombstones that might have ended in the gap.

  1. Iterate Start Keys: We loop through the input tombstones.
  2. Flush on Change: When the start key changes (next_start_key), we look at our set of Active End Keys:
    • Expired Tombstones: If an active tombstone ends before the next_start_key, we create a fragment ending at that point and remove it from the active set.
    • Current Interval: We create a fragment up to next_start_key for all remaining active tombstones.
  3. Add New: Finally, we add the current tombstone’s end key to the active set and continue.

This ensures that we carve out every necessary boundary defined by either a start key or an end key, but the loop itself is driven purely by the start keys.

For the example above, the timeline is sliced at b, d, f, h, j, l.

The Resulting Fragments:

| Fragment | Active Sequence Numbers (Sorted Descending) |
|----------|---------------------------------------------|
| [b, d)   | {1}                                         |
| [d, f)   | {4, 3, 1}                                   |
| [f, h)   | {4, 3}                                      |
| [h, j)   | {5, 4, 3}                                   |
| [j, l)   | {5}                                         |

Now the fragments are sorted by key and do not overlap.
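The sweep described above can be sketched in self-contained C++. The names (`FragmentTombstones`, `RangeTombstone`, `Fragment`) and the `std::multimap` used for the active set are illustrative choices, not cckv's actual implementation:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Illustrative types; cckv's actual implementation differs in detail.
struct RangeTombstone { std::string start, end; uint64_t seq; };
struct Fragment { std::string start, end; std::vector<uint64_t> seqs; };

std::vector<Fragment> FragmentTombstones(std::vector<RangeTombstone> input) {
  // The sweep is driven by start keys, so order the input by start key.
  std::sort(input.begin(), input.end(),
            [](const RangeTombstone& a, const RangeTombstone& b) {
              return a.start < b.start;
            });

  std::vector<Fragment> out;
  // Active tombstones, ordered by end key; a tombstone stays active
  // until the sweep line passes its end key.
  std::multimap<std::string, uint64_t> active;
  std::string cur;  // current position of the sweep line

  // Emit disjoint fragments covering [cur, upto), slicing at every active
  // end key inside the interval, and drop expired tombstones as we go.
  auto flush = [&](const std::string& upto) {
    while (!active.empty() && cur < upto) {
      std::string right = std::min(active.begin()->first, upto);
      if (cur < right) {
        Fragment f{cur, right, {}};
        for (const auto& e : active) f.seqs.push_back(e.second);
        std::sort(f.seqs.rbegin(), f.seqs.rend());  // sort descending
        out.push_back(std::move(f));
        cur = right;
      }
      // Remove tombstones that end at or before the sweep line.
      while (!active.empty() && active.begin()->first <= cur)
        active.erase(active.begin());
    }
    if (cur < upto) cur = upto;
  };

  for (const auto& t : input) {
    flush(t.start);               // carve fragments up to the new start key
    active.emplace(t.end, t.seq);
  }
  if (!active.empty()) flush(active.rbegin()->first);  // drain the tail
  return out;
}
```

A `std::multimap` keyed by end key makes `begin()` the smallest active end key, which is exactly the next slice boundary the flush step needs. Feeding in the four tombstones from the example reproduces the table above.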

MVCC Point Lookup

With the fragments built, a point lookup becomes a two-step process: Binary Search + Version Check.

This is implemented in FragmentedRangeTombstones::MaxCoveringSeq.

The Algorithm

auto MaxCoveringSeq(key, snapshot_seq) -> seq_num {
  // 1. Find the fragment covering the key (O(log N) binary search)
  fragment = binary_search(fragments, key);
  if (!fragment) return 0; // Not deleted

  // 2. Find the newest sequence <= snapshot_seq.
  // The fragment stores its sequences sorted descending (e.g., {4, 3, 1}),
  // so scanning from the front returns the first visible sequence;
  // a binary search over the list makes this step O(log K).
  for (seq : fragment.seqs) {
    if (seq <= snapshot_seq) return seq;
  }
  return 0; // Every covering tombstone is newer than the snapshot
}
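A compilable sketch of this lookup, using binary search for both steps. The `Fragment` layout and the function signature are assumptions for illustration, not cckv's actual API:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative layout: fragments are sorted by start key and disjoint,
// and each seqs list is sorted descending.
struct Fragment { std::string start, end; std::vector<uint64_t> seqs; };

uint64_t MaxCoveringSeq(const std::vector<Fragment>& fragments,
                        const std::string& key, uint64_t snapshot_seq) {
  // 1. Binary search for the last fragment whose start <= key (O(log N)).
  auto it = std::upper_bound(
      fragments.begin(), fragments.end(), key,
      [](const std::string& k, const Fragment& f) { return k < f.start; });
  if (it == fragments.begin()) return 0;  // key precedes all fragments
  --it;
  if (key >= it->end) return 0;  // key falls in a gap after this fragment

  // 2. seqs is sorted descending, so with a greater-than comparator,
  //    lower_bound finds the first (i.e., largest) sequence <= snapshot_seq.
  auto sit = std::lower_bound(
      it->seqs.begin(), it->seqs.end(), snapshot_seq,
      [](uint64_t seq, uint64_t snap) { return seq > snap; });
  return sit == it->seqs.end() ? 0 : *sit;  // 0 means "not deleted"
}
```

Note the comparator in step 2: because the list is sorted descending, `std::lower_bound` needs `>` rather than `<` to preserve the partitioning it expects.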

Examples from the Test Case

Let’s trace lookups for key d (which falls into the [d, f) fragment) at different snapshots. The [d, f) fragment contains sequences {4, 3, 1}.

  1. Lookup d @ Snapshot 4:

    • Fragment [d, f) found.
    • Is 4 <= 4? Yes.
    • Result: Deleted (covered by seq 4).
  2. Lookup d @ Snapshot 3:

    • Fragment [d, f) found.
    • Is 4 <= 3? No.
    • Is 3 <= 3? Yes.
    • Result: Deleted (covered by seq 3).
  3. Lookup d @ Snapshot 2:

    • Fragment [d, f) found.
    • Is 4 <= 2? No.
    • Is 3 <= 2? No.
    • Is 1 <= 2? Yes.
    • Result: Deleted (covered by seq 1).
  4. Lookup d @ Snapshot 0:

    • Fragment [d, f) found.
    • Is 4 <= 0? No.
    • Is 3 <= 0? No.
    • Is 1 <= 0? No.
    • Result: Not Deleted (0).

Why this matters

This approach reduces the lookup cost from O(N) to O(log N), where N is the number of range tombstones. Point lookups therefore stay fast even when a large number of overlapping range deletions have accumulated.

What’s next?

Implement a version-aware scan method using the multi-way merging iterator approach.