João Duarte

Storage Foundations: How Databases Persist a Row

This article is part of a series.

Series: How Databases Work: Building a PostgreSQL-Inspired DB in Rust

The simplest question with the deepest answer

When you run INSERT INTO users (name) VALUES ('Alice'), what actually happens? Where does 'Alice' go? How does it survive a power failure? And how does the database find it again later?

These questions seem simple, but their answers touch the deepest layer of every database: the storage engine. This is our starting point for InoxDB, and it’s Milestone 1 of the series.

By the end of this post, you'll understand:

  1. How fixed-size pages organize variable-length rows with a slotted layout
  2. Why deletes leave tombstones instead of erasing data, and how compaction reclaims the space
  3. How a heap of pages becomes a table
  4. How pages reach disk with checksums and file locking for integrity

Let's build it.

Why fixed-size pages?

Imagine the naive approach: every row gets appended to a file, one after another. Simple! But how do you delete a row from the middle? How do you update a row that grows larger? How do you cache only the “hot” parts of your data in memory?

Every serious database (PostgreSQL, MySQL, SQLite) solves this by dividing storage into fixed-size pages, typically 4 KiB, 8 KiB, or 16 KiB blocks. InoxDB uses 8 KiB pages, the same as PostgreSQL's default.

Pages give us three critical properties:

  1. Addressability. Every page has a number. Page 0 starts at byte 0, page 1 at byte 8192, page 2 at byte 16384. Finding any page is just offset = page_id * 8192.
  2. Cacheability. We can load hot pages into memory and evict cold ones. This is the foundation for the buffer pool we’ll build later.
  3. Atomicity. Operating systems can write a single page atomically (mostly). This is the building block for crash recovery.
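The addressability property is worth seeing as code. A minimal sketch; the constant and function names here are illustrative, not InoxDB's actual identifiers:

```rust
// Illustrative only: mirrors the article's layout math.
const PAGE_SIZE: u64 = 8 * 1024; // 8 KiB

/// Byte offset of a page in the data file: page N starts at N * 8192.
fn page_offset(page_id: u64) -> u64 {
    page_id * PAGE_SIZE
}

fn main() {
    assert_eq!(page_offset(0), 0);
    assert_eq!(page_offset(1), 8192);
    assert_eq!(page_offset(2), 16384);
}
```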

But a page is just 8192 bytes. How do we organize data inside it?

The slotted page: a page that manages itself

Our page needs to store multiple variable-length rows. A row might be 10 bytes or 4000 bytes; we don’t know in advance. The slotted page layout solves this elegantly by growing from both ends:

┌──────────────────────────────────────────────────────────────────┐
│ HEADER (10 bytes)                                                │
│  lower(u16) │ upper(u16) │ nslots(u16) │ checksum(u32)           │
├──────────────────────────────────────────────────────────────────┤
│ SLOT DIRECTORY (grows →)                                         │
│  [off₀, len₀] [off₁, len₁] [off₂, len₂] ...                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│                    FREE SPACE                                    │
│                                                                  │
├──────────────────────────────────────────────────────────────────┤
│ PAYLOAD (grows ←)                                                │
│  ... [row₂ data] [row₁ data] [row₀ data]                         │
└──────────────────────────────────────────────────────────────────┘
  byte 0                                               byte 8191

The slot directory grows forward from the header. Each slot entry is 4 bytes: a 2-byte offset and a 2-byte length pointing to where the row’s payload lives.
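The 4-byte slot entry can be sketched as a tiny encode/decode pair. Little-endian byte order is an assumption here, chosen to match how the checksum field is stored later in this post:

```rust
/// One 4-byte slot-directory entry: payload offset and length.
#[derive(Debug, PartialEq)]
struct Slot {
    off: u16,
    len: u16,
}

/// Encode a slot entry into its 4 on-disk bytes.
fn encode_slot(s: &Slot) -> [u8; 4] {
    let mut b = [0u8; 4];
    b[0..2].copy_from_slice(&s.off.to_le_bytes());
    b[2..4].copy_from_slice(&s.len.to_le_bytes());
    b
}

/// Decode the 4 bytes back into a slot entry.
fn decode_slot(b: &[u8; 4]) -> Slot {
    Slot {
        off: u16::from_le_bytes([b[0], b[1]]),
        len: u16::from_le_bytes([b[2], b[3]]),
    }
}

fn main() {
    let s = Slot { off: 8100, len: 42 };
    assert_eq!(decode_slot(&encode_slot(&s)), s);
}
```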

The payload area grows backward from the end of the page. Each new row is placed just below the previous one.

The free space is the gap between them. When they meet, the page is full.

Two pointers in the header track the boundaries: lower (end of the slot directory) and upper (start of the payload). Free space is simply upper - lower.
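The header accessors themselves aren't shown in this post, so here is one plausible sketch. The field offsets (lower at bytes 0..2, upper at 2..4, nslots at 4..6) and little-endian encoding are assumptions, consistent with the checksum field living at bytes 6..10:

```rust
const PAGE_SIZE: usize = 8 * 1024;
const HEADER_SIZE: usize = 10;

struct Page {
    buf: [u8; PAGE_SIZE],
}

impl Page {
    fn get_u16(&self, at: usize) -> u16 {
        u16::from_le_bytes([self.buf[at], self.buf[at + 1]])
    }
    fn set_u16(&mut self, at: usize, v: u16) {
        self.buf[at..at + 2].copy_from_slice(&v.to_le_bytes());
    }

    // Assumed header layout: lower at 0..2, upper at 2..4, nslots at 4..6,
    // checksum at 6..10.
    fn lower(&self) -> usize { self.get_u16(0) as usize }
    fn set_lower(&mut self, v: usize) { self.set_u16(0, v as u16); }
    fn upper(&self) -> usize { self.get_u16(2) as usize }
    fn set_upper(&mut self, v: usize) { self.set_u16(2, v as u16); }
    fn nslots(&self) -> usize { self.get_u16(4) as usize }
    fn set_nslots(&mut self, v: usize) { self.set_u16(4, v as u16); }

    fn free_space(&self) -> usize { self.upper() - self.lower() }
}

fn main() {
    let mut p = Page { buf: [0u8; PAGE_SIZE] };
    p.set_lower(HEADER_SIZE);
    p.set_upper(PAGE_SIZE);
    p.set_nslots(0);
    assert_eq!(p.free_space(), 8182); // PAGE_SIZE - HEADER_SIZE
    assert_eq!(p.nslots(), 0);
}
```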

This layout lets us:

  1. Store variable-length rows without knowing their sizes in advance
  2. Address any row stably by its slot number
  3. Track free space with a single subtraction: upper - lower

Implementing the page in Rust

Here’s the core of our Page struct. It’s just a byte array with methods that interpret the layout:

pub const PAGE_SIZE: usize = 8 * 1024;   // 8 KiB
pub const HEADER_SIZE: usize = 10;       // lower + upper + nslots + checksum
pub const SLOT_SIZE: usize = 4;          // off(u16) + len(u16)

pub struct Page {
    buf: [u8; PAGE_SIZE],
}

No fancy structs or lifetimes. Just bytes. This is intentional: the page is the exact same bytes that will be written to disk. There’s no serialization step.

Creating an empty page initializes the boundaries:

impl Page {
    pub fn new() -> Self {
        let mut p = Self { buf: [0u8; PAGE_SIZE] };
        p.set_lower(HEADER_SIZE);  // slot directory starts right after header
        p.set_upper(PAGE_SIZE);    // payload starts at the end
        p.set_nslots(0);
        p
    }

    pub fn free_space(&self) -> usize {
        self.upper() - self.lower()
    }
}

A fresh page has PAGE_SIZE - HEADER_SIZE = 8182 bytes of free space. That’s the entire gap between the slot directory (which is empty) and the payload area (which starts at the end).

Inserting a row

When we insert a row, we need space for two things: the payload (the row data) and a slot entry (4 bytes to track it). Here’s the key logic:

pub fn insert(&mut self, rec: &[u8]) -> Option<SlotId> {
    let need = rec.len() + SLOT_SIZE;
    if need > self.free_space() {
        return None;  // doesn't fit
    }

    // Place payload at the end of the payload area
    let new_upper = self.upper() - rec.len();
    self.buf[new_upper..self.upper()].copy_from_slice(rec);
    self.set_upper(new_upper);

    // Add a slot entry pointing to the payload
    let slot_id = self.nslots();
    self.set_slot(slot_id, new_upper as u16, rec.len() as u16);
    self.set_nslots(slot_id + 1);
    self.set_lower(self.lower() + SLOT_SIZE);

    Some(slot_id as SlotId)
}

The returned SlotId is the row’s address within this page. Combined with a page ID, it forms a TupleId, the globally unique address of any row in the database:

pub struct TupleId {
    pub pid: PageId,   // which page
    pub slot: SlotId,  // which slot within that page
}

Reading a row back is just following the pointer in the slot directory:

pub fn read_copy(&self, slot: SlotId, dst: &mut Vec<u8>) -> bool {
    let meta = self.slot_meta(slot);    // read off and len
    if meta.len == 0 { return false; }  // tombstoned

    dst.clear();
    dst.extend_from_slice(&self.buf[meta.off..meta.off + meta.len]);
    true
}

Deletes without deleting: tombstones

Here’s a fundamental insight: databases don’t erase data on delete. They mark it as dead. In our slotted page, a delete just sets the slot’s length to zero:

pub fn delete(&mut self, slot: SlotId) -> bool {
    let meta = self.slot_meta(slot);
    if meta.len == 0 { return false; }   // already dead
    self.set_slot(slot as usize, 0, 0);  // tombstone it
    true
}

Why? Two reasons:

  1. Slot stability. Other rows reference slot numbers. If we physically removed a slot entry, every slot after it would shift, invalidating those references.
  2. Deferred cleanup. The payload bytes are still sitting in the page. We don’t need to move anything right now; we’ll reclaim the space later through compaction.

This is the same idea PostgreSQL uses (they call dead rows “dead tuples”), and it applies at every level of the storage engine.

Compaction: defragmenting a page

After many inserts and deletes, a page can have wasted space scattered through the payload area. The free space counter says there’s room, but it’s in non-contiguous holes. Compaction fixes this by repacking all live payloads toward the end of the page:

Before compaction:              After compaction:
┌────────────────────┐          ┌────────────────────┐
│ Header + Slots     │          │ Header + Slots     │
├────────────────────┤          ├────────────────────┤
│                    │          │                    │
│   FREE SPACE       │          │                    │
│                    │          │   FREE SPACE       │
├────────────────────┤          │  (now contiguous!) │
│   [row C]          │          │                    │
│   [  hole  ]       │          ├────────────────────┤
│   [row A]          │          │ [row C] [row A]    │
└────────────────────┘          └────────────────────┘

The key property: slot IDs don’t change. Only the payload offsets are updated in the slot directory. Any external reference to “page 3, slot 5” still works after compaction.
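To make the repacking concrete, here is a simplified compaction sketch. It operates on a standalone buffer and a separate slot list for brevity (the real page keeps its slots inside the buffer itself), but it preserves the key property: tombstones keep their positions, and only payload offsets change:

```rust
const PAGE_SIZE: usize = 8 * 1024;

/// Simplified compaction: repack live payloads toward the end of the page.
/// `slots` holds (offset, len) pairs; len == 0 marks a tombstone.
/// Returns the new `upper` boundary (everything below it is free space).
fn compact(buf: &mut [u8; PAGE_SIZE], slots: &mut [(u16, u16)]) -> usize {
    let mut scratch = [0u8; PAGE_SIZE];
    let mut upper = PAGE_SIZE;
    for slot in slots.iter_mut() {
        let (off, len) = (slot.0 as usize, slot.1 as usize);
        if len == 0 {
            continue; // tombstone: the slot entry stays, so slot IDs are stable
        }
        upper -= len;
        scratch[upper..upper + len].copy_from_slice(&buf[off..off + len]);
        slot.0 = upper as u16; // only the payload offset changes
    }
    buf[upper..].copy_from_slice(&scratch[upper..]);
    upper
}

fn main() {
    let mut buf = [0u8; PAGE_SIZE];
    buf[8187..8192].copy_from_slice(b"alice"); // slot 0's payload
    buf[8000..8003].copy_from_slice(b"bob");   // slot 2's payload, hole above it
    let mut slots = [(8187u16, 5u16), (0, 0), (8000, 3)];

    let new_upper = compact(&mut buf, &mut slots);
    assert_eq!(new_upper, 8184);
    assert_eq!(&buf[8187..8192], &b"alice"[..]);
    assert_eq!(&buf[8184..8187], &b"bob"[..]); // repacked right below "alice"
    assert_eq!(slots[2], (8184, 3)); // offset updated, slot index unchanged
}
```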

Our insert method uses this transparently with insert_with_compact:

pub fn insert_with_compact(&mut self, rec: &[u8]) -> Option<SlotId> {
    if let Some(sid) = self.insert(rec) {
        return Some(sid);  // fits without compaction
    }
    if !self.can_fit_after_compaction(rec.len()) {
        return None;  // won't fit even after compaction
    }
    self.compact();
    self.insert(rec)  // now it fits
}

There’s another optimization worth mentioning: dead slot reuse. When inserting a new row, we scan the slot directory for tombstoned entries (len == 0) and reuse them instead of appending a new slot entry. This saves 4 bytes per insert, which adds up when a page has been through many delete/insert cycles.
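The reuse scan itself is a one-liner. A sketch over a standalone slot list (`find_dead_slot` is a hypothetical helper name, not InoxDB's actual code):

```rust
/// Hypothetical helper: find the first tombstoned slot (len == 0) to reuse,
/// sketched over a standalone (offset, len) list rather than the page buffer.
fn find_dead_slot(slots: &[(u16, u16)]) -> Option<usize> {
    slots.iter().position(|&(_, len)| len == 0)
}

fn main() {
    let slots = [(8000u16, 20u16), (0, 0), (7900, 80)]; // slot 1 is a tombstone
    assert_eq!(find_dead_slot(&slots), Some(1));
    assert_eq!(find_dead_slot(&[(8000, 20)]), None);
}
```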

From pages to heaps

A single page holds a little under 8 KiB of data once the header and slot directory take their share. A real table needs many pages. A heap is simply a collection of pages with no particular ordering:

pub struct MemHeap {
    pages: Vec<Page>,
}

impl MemHeap {
    pub fn insert(&mut self, row: &[u8]) -> Option<TupleId> { ... }
    pub fn read_copy(&self, tid: TupleId, dst: &mut Vec<u8>) -> bool { ... }
    pub fn delete(&mut self, tid: TupleId) -> bool { ... }
    pub fn scan(&self) -> HeapScan<'_> { ... }
}

Insert tries the last page first (with compaction). If that fails, it allocates a new page. This is the simplest possible allocation strategy, and it’s good enough for now.
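The policy is easy to model in isolation. This toy sketch reduces each page to a free-byte counter, which is enough to show the "last page first, else allocate" behavior (the real heap also accounts for slot entries and the header):

```rust
/// Toy model of the allocation policy. pages[i] = free bytes left in page i.
struct ToyHeap {
    pages: Vec<usize>,
    cap: usize, // capacity of a fresh page
}

impl ToyHeap {
    fn new(cap: usize) -> Self {
        Self { pages: vec![cap], cap }
    }

    /// Insert a row needing `need` bytes; returns the page it landed on.
    fn insert(&mut self, need: usize) -> usize {
        let last = self.pages.len() - 1;
        if self.pages[last] >= need {
            self.pages[last] -= need; // fits in the last page
            last
        } else {
            self.pages.push(self.cap - need); // allocate a fresh page
            self.pages.len() - 1
        }
    }
}

fn main() {
    let mut h = ToyHeap::new(100);
    assert_eq!(h.insert(60), 0); // goes to page 0
    assert_eq!(h.insert(60), 1); // doesn't fit in page 0 -> new page 1
    assert_eq!(h.insert(30), 1); // fits in the last page again
}
```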

The sequential scan walks every page, every slot, skipping tombstones:

pub fn next(&mut self, dst: &mut Vec<u8>) -> Option<TupleId> {
    while self.pid < self.heap.pages_len() {
        let page = &self.heap.pages[self.pid];
        while self.slot < page.nslots() {
            let sid = self.slot as SlotId;
            self.slot += 1;
            if page.read_copy(sid, dst) {
                return Some(TupleId { pid: self.pid, slot: sid });
            }
        }
        self.pid += 1;
        self.slot = 0;
    }
    None
}

No indexes, no sorting. Just a brute-force walk through every live row. It’s O(n), but it’s the foundation that everything else builds on. Every query plan ultimately reads rows through a scan like this.

Persisting to disk

An in-memory heap is useful for tests, but a database that forgets everything on restart isn’t much of a database. We need to write pages to a file.

Our DiskManager trait defines the contract:

pub trait DiskManager {
    fn read_page(&self, pid: PageId, out: &mut [u8; PAGE_SIZE]) -> io::Result<()>;
    fn write_page(&mut self, pid: PageId, buf: &[u8; PAGE_SIZE]) -> io::Result<()>;
    fn allocate_page(&mut self) -> io::Result<PageId>;
    fn num_pages(&self) -> io::Result<PageId>;
    fn sync(&mut self) -> io::Result<()>;
}

The FileDisk implementation is straightforward: page N lives at byte offset N * PAGE_SIZE in a single file. No filesystem magic, no complicated layout. Just multiplication.

pub struct FileDisk {
    f: File,
}

impl FileDisk {
    fn seek_page(&self, pid: PageId) -> io::Result<()> {
        let off = (pid as u64) * (PAGE_SIZE as u64);
        (&self.f).seek(SeekFrom::Start(off))?;
        Ok(())
    }
}

Notice (&self.f).seek(): we take a shared reference to File because Rust’s standard library implements Read and Seek on &File. This lets read_page take &self instead of &mut self, which is important. It means we can read pages without exclusive access to the disk manager. Multiple scans can read concurrently.
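A quick self-contained demonstration of the idiom (the file name is an arbitrary scratch path, only for this demo): even though `f` is not declared `mut`, we can seek and read through `&f`:

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

fn main() -> std::io::Result<()> {
    // Hypothetical scratch file in the temp directory.
    let path = std::env::temp_dir().join("inoxdb_shared_read_demo.dat");
    let mut w = File::create(&path)?;
    w.write_all(b"hello")?;

    // `f` is NOT mut, yet seek and read work through `&f`:
    // the standard library implements Read and Seek for &File.
    let f = File::open(&path)?;
    let mut buf = [0u8; 5];
    (&f).seek(SeekFrom::Start(0))?;
    (&f).read_exact(&mut buf)?;
    assert_eq!(&buf, b"hello");

    std::fs::remove_file(&path)?;
    Ok(())
}
```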

Integrity: checksums and file locking

Two things can corrupt your data: silent bit rot and concurrent access. We defend against both.

Page checksums

Every page carries a CRC32 checksum in its header. The checksum covers everything in the page except the checksum field itself:

pub fn page_checksum(buf: &[u8; PAGE_SIZE]) -> u32 {
    // CRC32 over buf[0..6] ++ buf[10..PAGE_SIZE]
    // (skips the 4-byte checksum field at offset 6)
    let mut h = crc32fast::Hasher::new();  // one option: the crc32fast crate
    h.update(&buf[0..6]);
    h.update(&buf[10..PAGE_SIZE]);
    h.finalize()
}

On write, we compute the checksum and inject it before writing to disk:

fn write_page(&mut self, pid: PageId, buf: &[u8; PAGE_SIZE]) -> io::Result<()> {
    let mut page_buf = *buf;
    let cksum = page_checksum(&page_buf);
    page_buf[6..10].copy_from_slice(&cksum.to_le_bytes());
    self.seek_page(pid)?;
    self.f.write_all(&page_buf)?;
    Ok(())
}

On read, we verify it. If the checksum doesn’t match, the page is corrupt, maybe from a torn write, disk error, or someone editing the file by hand:

fn read_page(&self, pid: PageId, out: &mut [u8; PAGE_SIZE]) -> io::Result<()> {
    self.seek_page(pid)?;
    (&self.f).read_exact(out)?;

    let stored = u32::from_le_bytes(out[6..10].try_into().unwrap());
    let computed = page_checksum(out);
    if stored != computed {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("page {pid} checksum mismatch"),
        ));
    }
    Ok(())
}

This is a detection mechanism, not a repair mechanism. When we add a Write-Ahead Log in Milestone 2, we’ll be able to recover from corrupted pages. For now, we at least know something went wrong.
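To see the skip-the-checksum-field logic end to end, here is a self-contained round trip. The hand-rolled bitwise CRC-32 (IEEE polynomial, reflected form) stands in for whatever CRC library the real code uses:

```rust
const PAGE_SIZE: usize = 8 * 1024;

/// Bitwise CRC-32 (IEEE polynomial 0xEDB88320, reflected form) over a
/// sequence of byte slices. A stand-in for a real CRC library.
fn crc32_parts(parts: &[&[u8]]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for part in parts {
        for &b in *part {
            crc ^= b as u32;
            for _ in 0..8 {
                // mask is all-ones when the low bit is set, else zero
                let mask = (crc & 1).wrapping_neg();
                crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
            }
        }
    }
    !crc
}

/// Checksum covering everything except the checksum field at bytes 6..10.
fn page_checksum(buf: &[u8; PAGE_SIZE]) -> u32 {
    crc32_parts(&[&buf[0..6], &buf[10..PAGE_SIZE]])
}

fn main() {
    let mut page = [0u8; PAGE_SIZE];
    page[100] = 0xAB; // pretend the page holds some data

    // Write path: compute and inject the checksum.
    let cksum = page_checksum(&page);
    page[6..10].copy_from_slice(&cksum.to_le_bytes());

    // Read path: recompute and compare. Injecting the checksum didn't
    // change it, because bytes 6..10 are excluded from the computation.
    let stored = u32::from_le_bytes(page[6..10].try_into().unwrap());
    assert_eq!(stored, page_checksum(&page));

    page[100] ^= 0xFF; // simulated corruption
    assert_ne!(stored, page_checksum(&page)); // detected
}
```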

File locking

When FileDisk opens a data file, it acquires an exclusive advisory lock:

pub fn open(path: impl AsRef<Path>) -> io::Result<Self> {
    let f = OpenOptions::new()
        .read(true).write(true).create(true).truncate(false)
        .open(path)?;

    f.try_lock().map_err(|e| match e {
        TryLockError::WouldBlock => io::Error::new(
            io::ErrorKind::WouldBlock,
            "data file is locked by another process",
        ),
        TryLockError::Error(io_err) => io_err,
    })?;

    Ok(Self { f })
}

If another process already has the file open, we get a clear error instead of two writers silently corrupting each other’s pages.

The file-backed heap

The FsHeap brings pages and disk together with write-through semantics: every insert and delete immediately reads the page, modifies it, and writes it back.

pub struct FsHeap<D: DiskManager> {
    disk: D,
    last_pid: PageId,
    last_page: Page,  // cached copy of last page
}

impl<D: DiskManager> FsHeap<D> {
    pub fn insert(&mut self, row: &[u8]) -> io::Result<Option<TupleId>> { ... }
    pub fn read_copy(&self, tid: TupleId, dst: &mut Vec<u8>) -> io::Result<bool> { ... }
    pub fn delete(&mut self, tid: TupleId) -> io::Result<bool> { ... }
    pub fn scan(&self) -> io::Result<FsScan<'_, D>> { ... }
    pub fn sync(&mut self) -> io::Result<()> { ... }
}

One optimization: the last page is cached in memory. Since inserts almost always go to the last page, this avoids a disk read on every insert. The cache is write-through: after modification, the page is immediately written back to disk.

Data persists across restarts. We can close the process, reopen the file, and scan back every row we inserted:

// First session: insert data
let disk = FileDisk::open("heap.dat")?;
let mut h = FsHeap::open(disk)?;
h.insert(b"Alice")?;
h.insert(b"Bob")?;
h.sync()?;  // flush to stable storage
drop(h);

// Second session: data is still there
let disk = FileDisk::open("heap.dat")?;
let h = FsHeap::open(disk)?;
let mut scan = h.scan()?;
let mut buf = Vec::new();
while let Some(tid) = scan.next(&mut buf)? {
    println!("{}: {}", tid.pid, String::from_utf8_lossy(&buf));
}
// Output: "Alice", "Bob"

What we haven’t built (yet)

Our storage engine has real persistence and integrity checks, but it's missing features that a production database needs:

  1. A buffer pool, so reads don't hit disk every time
  2. A write-ahead log, so we can recover from crashes and torn writes instead of merely detecting them
  3. Indexes, so lookups don't require a full sequential scan
  4. Concurrency control, so multiple operations can run safely at once

What we learned

In this milestone we built InoxDB’s storage layer from the ground up:

  1. Pages are the fundamental unit: 8 KiB byte arrays that organize rows using a slot directory and a growing-from-both-ends layout
  2. Tombstones mark deleted rows without physically removing them, keeping slot IDs stable
  3. Compaction defragments a page by repacking live payloads, reclaiming scattered free space
  4. Heaps are unordered collections of pages, the simplest way to store a table
  5. DiskManager writes pages to a file with CRC32 checksums and file locking for integrity
  6. FsHeap ties it all together with write-through persistence

The code for this milestone is on GitHub. In the next post, we’ll tackle the buffer pool, adding an LRU page cache between the heap and disk that dramatically improves performance while introducing the challenge of managing dirty pages.


Building InoxDB alongside me? Have questions or insights to share? Find me on Twitter/X or jump into discussions on the GitHub repo. This is more fun when we learn together.
