Let's Build a Simple Database

Writing a sqlite clone from scratch in C

Overview

View on GitHub (pull requests welcome)

Part 18 - A Page Cache and Buffer Pool



“Cache rules everything around me.” – adapted from Wu-Tang Clan

Our pager has a dirty little secret: it writes every page to disk when the database closes, whether the page changed or not. And it happily loads as many pages into memory as there are pages in the file. For a small database that’s fine, but a real database could have millions of pages. We can’t fit them all in memory.

In this part we’re going to turn our naive pager into a proper buffer pool. That means three things:

  1. Dirty page tracking – only write back pages that actually changed
  2. LRU eviction – when the buffer is full, evict the least recently used page
  3. Bounded memory – limit the number of pages we hold in memory at once

Expanding the Pager

First, let’s add the new fields. dirty tracks which pages have been modified. access_time records when each page was last accessed. clock is a monotonically increasing counter:

+#define BUFFER_POOL_SIZE 100
+
 typedef struct {
   int file_descriptor;
   uint32_t file_length;
   uint32_t num_pages;
   void* pages[TABLE_MAX_PAGES];
+  bool dirty[TABLE_MAX_PAGES];
+  uint32_t access_time[TABLE_MAX_PAGES];
+  uint32_t clock;
 } Pager;
+
+void pager_flush(Pager* pager, uint32_t page_num);
+void pager_mark_dirty(Pager* pager, uint32_t page_num);
+uint32_t pager_pages_in_memory(Pager* pager);
+void pager_evict_lru(Pager* pager);

BUFFER_POOL_SIZE limits us to 100 pages in memory. With 4 KB pages, that’s about 400 KB of memory. Real databases like SQLite default to around 2000 pages (8 MB).

Initialize the new fields in pager_open():

+  pager->clock = 0;
   for (uint32_t i = 0; i < TABLE_MAX_PAGES; i++) {
     pager->pages[i] = NULL;
+    pager->dirty[i] = false;
+    pager->access_time[i] = 0;
   }

Marking Pages Dirty

Whenever we modify a page, we need to mark it dirty so we know to write it back:

+void pager_mark_dirty(Pager* pager, uint32_t page_num) {
+  pager->dirty[page_num] = true;
+}

We add calls to pager_mark_dirty() in every function that modifies page data: leaf_node_insert, leaf_node_delete, leaf_node_split_and_insert, create_new_root, and so on. Anywhere a page’s bytes change, we mark it dirty.

LRU Eviction

When we need to load a new page but the buffer pool is full, we evict the least recently used page. If it’s dirty, we flush it to disk first:

+uint32_t pager_pages_in_memory(Pager* pager) {
+  uint32_t count = 0;
+  for (uint32_t i = 0; i < TABLE_MAX_PAGES; i++) {
+    if (pager->pages[i] != NULL) count++;
+  }
+  return count;
+}
+
+void pager_evict_lru(Pager* pager) {
+  uint32_t lru_page = INVALID_PAGE_NUM;
+  uint32_t min_time = UINT32_MAX;
+
+  for (uint32_t i = 0; i < TABLE_MAX_PAGES; i++) {
+    if (pager->pages[i] != NULL && pager->access_time[i] < min_time) {
+      min_time = pager->access_time[i];
+      lru_page = i;
+    }
+  }
+
+  if (lru_page == INVALID_PAGE_NUM) return;
+
+  if (pager->dirty[lru_page]) {
+    pager_flush(pager, lru_page);
+    pager->dirty[lru_page] = false;
+  }
+
+  free(pager->pages[lru_page]);
+  pager->pages[lru_page] = NULL;
+}

This is the simplest possible LRU implementation: scan the array and find the page with the smallest access time. A real database would use a doubly-linked list to make eviction O(1), but for our purposes the linear scan is fine.

Updating get_page()

Now we integrate eviction into get_page(). On a cache miss, we check if the buffer is full and evict if necessary. On every access, we update the access time:

   if (pager->pages[page_num] == NULL) {
+    // Cache miss. Evict if the buffer pool is full.
+    if (pager_pages_in_memory(pager) >= BUFFER_POOL_SIZE) {
+      pager_evict_lru(pager);
+    }
+
     // Allocate memory and load from file.
     void* page = malloc(PAGE_SIZE);
     ...
     pager->pages[page_num] = page;
+    pager->dirty[page_num] = false;
   }

+  pager->access_time[page_num] = pager->clock++;
   return pager->pages[page_num];

Every time a page is accessed – whether it was already in memory or just loaded – the access time updates. Pages that haven’t been touched recently will have the lowest access times and get evicted first.

Only Flushing Dirty Pages

Finally, update db_close() to only write back pages that were modified:

   for (uint32_t i = 0; i < pager->num_pages; i++) {
     if (pager->pages[i] == NULL) {
       continue;
     }
-    pager_flush(pager, i);
+    if (pager->dirty[i]) {
+      pager_flush(pager, i);
+      pager->dirty[i] = false;
+    }
     free(pager->pages[i]);
     pager->pages[i] = NULL;
   }

This is a significant optimization. If you insert one row and then exit, we used to write back every page we’d ever read. Now we only write back the pages that actually changed.

A Note on Pinning

There’s a subtlety we’re not handling: page pinning. When a B-tree split is in progress, we might have several pages in flight that absolutely must not be evicted. A real buffer pool uses a pin count – get_page() increments it, and the caller decrements it when done. A pinned page is never evicted. Our BUFFER_POOL_SIZE of 100 is generous enough that we’ll never evict a page that’s in active use, but a production system would need proper pin management.

Testing

The existing tests continue to pass – dirty page tracking is invisible to the user. Let’s add one test to verify persistence still works with the new buffer pool:

+  it 'persists data correctly with dirty page tracking' do
+    script = (1..20).map do |i|
+      "insert #{i} user#{i} person#{i}@example.com"
+    end
+    script << ".exit"
+    run_script(script)
+
+    result = run_script([
+      "select where id = 10",
+      "select where id = 20",
+      ".exit",
+    ])
+    expect(result).to include("(10, user10, person10@example.com)")
+    expect(result).to include("(20, user20, person20@example.com)")
+  end

Next time we’ll tackle variable-length records so our database can store strings of different lengths efficiently.