Let's Build a Simple Database

Part 14 - Splitting Internal Nodes

Tue, 23 May 2023 00:00:00 +0000

The next leg of our journey will be splitting internal nodes which are unable to accommodate new keys. Consider the example below:

Example of splitting an internal

In this example, we add the key “11” to the tree. This will cause our root to split. When splitting an internal node, we will have to do a few things in order to keep everything straight:

1. Create a sibling node to store (n-1)/2 of the original node’s keys
2. Move these keys from the original node to the sibling node
3. Update the original node’s key in the parent to reflect its new max key after splitting
4. Insert the sibling node into the parent (could result in the parent also being split)
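
The steps above can be sketched on a simplified, in-memory node. This is only an illustration under assumed names (ToyInternalNode, toy_split and TOY_MAX_KEYS do not exist in our database); the real implementation later in this article works on pages and reuses internal_node_insert instead.

```c
#include <stdint.h>

/* Hypothetical toy layout, not the page format used by our database */
#define TOY_MAX_KEYS 3

typedef struct {
  uint32_t num_keys;
  uint32_t keys[TOY_MAX_KEYS];
  /* children[i] sits to the left of keys[i]; children[num_keys] is the right child */
  uint32_t children[TOY_MAX_KEYS + 1];
} ToyInternalNode;

/*
Split a full node. Keys above the middle move to the sibling together with
their children, and the middle key is returned so the caller can place it
in the parent.
*/
uint32_t toy_split(ToyInternalNode* node, ToyInternalNode* sibling) {
  uint32_t mid = node->num_keys / 2;
  uint32_t promoted = node->keys[mid];

  sibling->num_keys = 0;
  for (uint32_t i = mid + 1; i < node->num_keys; i++) {
    sibling->keys[sibling->num_keys] = node->keys[i];
    sibling->children[sibling->num_keys] = node->children[i];
    sibling->num_keys++;
  }
  /* The old right child becomes the sibling's right child */
  sibling->children[sibling->num_keys] = node->children[node->num_keys];

  /* Shrinking the count makes children[mid] the old node's right child */
  node->num_keys = mid;
  return promoted;
}
```

With keys 10, 20 and 30, the old node keeps 10, the sibling receives 30, and 20 is handed to the parent.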

We will begin by replacing our stub code with a call to internal_node_split_and_insert:

+void internal_node_split_and_insert(Table* table, uint32_t parent_page_num,
+                          uint32_t child_page_num);
+
 void internal_node_insert(Table* table, uint32_t parent_page_num,
                           uint32_t child_page_num) {
   /*
@@ -685,25 +714,39 @@ void internal_node_insert(Table* table, uint32_t parent_page_num,
 
   void* parent = get_page(table->pager, parent_page_num);
   void* child = get_page(table->pager, child_page_num);
-  uint32_t child_max_key = get_node_max_key(child);
+  uint32_t child_max_key = get_node_max_key(table->pager, child);
   uint32_t index = internal_node_find_child(parent, child_max_key);
 
   uint32_t original_num_keys = *internal_node_num_keys(parent);
-  *internal_node_num_keys(parent) = original_num_keys + 1;
 
   if (original_num_keys >= INTERNAL_NODE_MAX_CELLS) {
-    printf("Need to implement splitting internal node\n");
-    exit(EXIT_FAILURE);
+    internal_node_split_and_insert(table, parent_page_num, child_page_num);
+    return;
   }
 
   uint32_t right_child_page_num = *internal_node_right_child(parent);
+  /*
+  An internal node with a right child of INVALID_PAGE_NUM is empty
+  */
+  if (right_child_page_num == INVALID_PAGE_NUM) {
+    *internal_node_right_child(parent) = child_page_num;
+    return;
+  }
+
   void* right_child = get_page(table->pager, right_child_page_num);
+  /*
+  If we are already at the max number of cells for a node, we cannot increment
+  before splitting. Incrementing without inserting a new key/child pair
+  and immediately calling internal_node_split_and_insert has the effect
+  of creating a new key at (max_cells + 1) with an uninitialized value
+  */
+  *internal_node_num_keys(parent) = original_num_keys + 1;
 
-  if (child_max_key > get_node_max_key(right_child)) {
+  if (child_max_key > get_node_max_key(table->pager, right_child)) {
     /* Replace right child */
     *internal_node_child(parent, original_num_keys) = right_child_page_num;
     *internal_node_key(parent, original_num_keys) =
-        get_node_max_key(right_child);
+        get_node_max_key(table->pager, right_child);
     *internal_node_right_child(parent) = child_page_num;

There are three important changes we are making here aside from replacing the stub:

First, internal_node_split_and_insert is forward-declared: internal_node_insert calls it, and its own definition, further down, calls internal_node_insert in turn to avoid duplicating the insertion logic.
In addition, we are moving the logic which increments the parent’s number of keys further down in the function definition to ensure that this does not happen before the split.
Finally, we are ensuring that a child node inserted into an empty internal node will become that internal node’s right child without any other operations being performed, since an empty internal node has no keys to manipulate.

The changes above require that we be able to identify an empty node - to this end, we will first define a constant which represents an invalid page number that is the child of every empty node.

+#define INVALID_PAGE_NUM UINT32_MAX

Now, when an internal node is initialized, we initialize its right child with this invalid page number.

@@ -330,6 +335,12 @@ void initialize_internal_node(void* node) {
   set_node_type(node, NODE_INTERNAL);
   set_node_root(node, false);
   *internal_node_num_keys(node) = 0;
+  /*
+  Necessary because the root page number is 0; by not initializing an internal 
+  node's right child to an invalid page number when initializing the node, we may
+  end up with 0 as the node's right child, which makes the node a parent of the root
+  */
+  *internal_node_right_child(node) = INVALID_PAGE_NUM;
 }

This step was made necessary by a problem that the comment above attempts to summarize. When an internal node is initialized without explicitly setting its right child field, the value of that field at runtime could be 0, depending on the compiler or the architecture of the machine on which the program is being executed. Since we are using 0 as our root page number, a newly allocated internal node would then appear to be a parent of the root.
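
To see why the sentinel matters, consider a stripped-down header with only the two fields involved. The struct and helper names below are hypothetical and not part of our database. The sketch only shows that a zero-filled node cannot be told apart from one whose right child is page 0, while a node initialized with INVALID_PAGE_NUM can be recognized as empty.

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_PAGE_NUM UINT32_MAX

/* Hypothetical, simplified header; the real node stores these fields in a page */
typedef struct {
  uint32_t num_keys;
  uint32_t right_child;
} ToyInternalHeader;

/* Mirrors the idea of initialize_internal_node: mark the right child as absent */
void toy_init(ToyInternalHeader* header) {
  header->num_keys = 0;
  header->right_child = INVALID_PAGE_NUM;
}

/* An internal node whose right child is the sentinel has no children at all */
bool toy_is_empty(const ToyInternalHeader* header) {
  return header->right_child == INVALID_PAGE_NUM;
}
```

A header that was merely zero-filled reports right_child == 0, which is the root page number, so toy_is_empty returns false for it even though nothing was ever inserted.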

We have introduced some guards in our internal_node_child function to throw an error in the case of an attempt to access an invalid page.

@@ -186,9 +188,19 @@ uint32_t* internal_node_child(void* node, uint32_t child_num) {
     printf("Tried to access child_num %d > num_keys %d\n", child_num, num_keys);
     exit(EXIT_FAILURE);
   } else if (child_num == num_keys) {
-    return internal_node_right_child(node);
+    uint32_t* right_child = internal_node_right_child(node);
+    if (*right_child == INVALID_PAGE_NUM) {
+      printf("Tried to access right child of node, but was invalid page\n");
+      exit(EXIT_FAILURE);
+    }
+    return right_child;
   } else {
-    return internal_node_cell(node, child_num);
+    uint32_t* child = internal_node_cell(node, child_num);
+    if (*child == INVALID_PAGE_NUM) {
+      printf("Tried to access child %d of node, but was invalid page\n", child_num);
+      exit(EXIT_FAILURE);
+    }
+    return child;
   }
 }

One additional guard is needed in our print_tree function to ensure that we do not attempt to print an empty node, as that would involve trying to access an invalid page.

@@ -294,15 +305,17 @@ void print_tree(Pager* pager, uint32_t page_num, uint32_t indentation_level) {
       num_keys = *internal_node_num_keys(node);
       indent(indentation_level);
       printf("- internal (size %d)\n", num_keys);
-      for (uint32_t i = 0; i < num_keys; i++) {
-        child = *internal_node_child(node, i);
+      if (num_keys > 0) {
+        for (uint32_t i = 0; i < num_keys; i++) {
+          child = *internal_node_child(node, i);
+          print_tree(pager, child, indentation_level + 1);
+
+          indent(indentation_level + 1);
+          printf("- key %d\n", *internal_node_key(node, i));
+        }
+        child = *internal_node_right_child(node);
         print_tree(pager, child, indentation_level + 1);
-
-        indent(indentation_level + 1);
-        printf("- key %d\n", *internal_node_key(node, i));
       }
-      child = *internal_node_right_child(node);
-      print_tree(pager, child, indentation_level + 1);
       break;
   }
 }

Now for the headliner, internal_node_split_and_insert. We will first provide it in its entirety, and then break it down by steps.

+void internal_node_split_and_insert(Table* table, uint32_t parent_page_num,
+                          uint32_t child_page_num) {
+  uint32_t old_page_num = parent_page_num;
+  void* old_node = get_page(table->pager,parent_page_num);
+  uint32_t old_max = get_node_max_key(table->pager, old_node);
+
+  void* child = get_page(table->pager, child_page_num); 
+  uint32_t child_max = get_node_max_key(table->pager, child);
+
+  uint32_t new_page_num = get_unused_page_num(table->pager);
+
+  /*
+  Declaring a flag before updating pointers which
+  records whether this operation involves splitting the root -
+  if it does, we will insert our newly created node during
+  the step where the table's new root is created. If it does
+  not, we have to insert the newly created node into its parent
+  after the old node's keys have been transferred over. We are not
+  able to do this if the newly created node's parent is not a newly
+  initialized root node, because in that case its parent may have existing
+  keys aside from our old node which we are splitting. If that is true, we
+  need to find a place for our newly created node in its parent, and we
+  cannot insert it at the correct index if it does not yet have any keys
+  */
+  uint32_t splitting_root = is_node_root(old_node);
+
+  void* parent;
+  void* new_node;
+  if (splitting_root) {
+    create_new_root(table, new_page_num);
+    parent = get_page(table->pager,table->root_page_num);
+    /*
+    If we are splitting the root, we need to update old_node to point
+    to the new root's left child, new_page_num will already point to
+    the new root's right child
+    */
+    old_page_num = *internal_node_child(parent,0);
+    old_node = get_page(table->pager, old_page_num);
+  } else {
+    parent = get_page(table->pager,*node_parent(old_node));
+    new_node = get_page(table->pager, new_page_num);
+    initialize_internal_node(new_node);
+  }
+  
+  uint32_t* old_num_keys = internal_node_num_keys(old_node);
+
+  uint32_t cur_page_num = *internal_node_right_child(old_node);
+  void* cur = get_page(table->pager, cur_page_num);
+
+  /*
+  First put right child into new node and set right child of old node to invalid page number
+  */
+  internal_node_insert(table, new_page_num, cur_page_num);
+  *node_parent(cur) = new_page_num;
+  *internal_node_right_child(old_node) = INVALID_PAGE_NUM;
+  /*
+  For each key until you get to the middle key, move the key and the child to the new node
+  */
+  for (int i = INTERNAL_NODE_MAX_CELLS - 1; i > INTERNAL_NODE_MAX_CELLS / 2; i--) {
+    cur_page_num = *internal_node_child(old_node, i);
+    cur = get_page(table->pager, cur_page_num);
+
+    internal_node_insert(table, new_page_num, cur_page_num);
+    *node_parent(cur) = new_page_num;
+
+    (*old_num_keys)--;
+  }
+
+  /*
+  Set child before middle key, which is now the highest key, to be node's right child,
+  and decrement number of keys
+  */
+  *internal_node_right_child(old_node) = *internal_node_child(old_node,*old_num_keys - 1);
+  (*old_num_keys)--;
+
+  /*
+  Determine which of the two nodes after the split should contain the child to be inserted,
+  and insert the child
+  */
+  uint32_t max_after_split = get_node_max_key(table->pager, old_node);
+
+  uint32_t destination_page_num = child_max < max_after_split ? old_page_num : new_page_num;
+
+  internal_node_insert(table, destination_page_num, child_page_num);
+  *node_parent(child) = destination_page_num;
+
+  update_internal_node_key(parent, old_max, get_node_max_key(table->pager, old_node));
+
+  if (!splitting_root) {
+    internal_node_insert(table,*node_parent(old_node),new_page_num);
+    *node_parent(new_node) = *node_parent(old_node);
+  }
+}
+

The first thing we need to do is create a variable to store the page number of the node we are splitting (the old node from here out). This is necessary because the page number of the old node will change if it happens to be the table’s root node. We also need to remember what the node’s current max is, because that value represents its key in the parent, and that key will need to be updated with the old node’s new maximum after the split occurs.

+  uint32_t old_page_num = parent_page_num;
+  void* old_node = get_page(table->pager,parent_page_num);
+  uint32_t old_max = get_node_max_key(table->pager, old_node);

The next important step is the branching logic, which depends on whether the old node is the table’s root node. We record this in a flag before any pointers are updated, because we need it again at the end of the function. As the comment attempts to convey, if we are not splitting the root, we cannot insert our newly created sibling node into the old node’s parent right away: the sibling does not yet contain any keys, so it would not be placed at the right index among the other key/child pairs which may or may not already be present in the parent node.

+  uint32_t splitting_root = is_node_root(old_node);
+
+  void* parent;
+  void* new_node;
+  if (splitting_root) {
+    create_new_root(table, new_page_num);
+    parent = get_page(table->pager,table->root_page_num);
+    /*
+    If we are splitting the root, we need to update old_node to point
+    to the new root's left child, new_page_num will already point to
+    the new root's right child
+    */
+    old_page_num = *internal_node_child(parent,0);
+    old_node = get_page(table->pager, old_page_num);
+  } else {
+    parent = get_page(table->pager,*node_parent(old_node));
+    new_node = get_page(table->pager, new_page_num);
+    initialize_internal_node(new_node);
+  }

Once we have settled the question of splitting or not splitting the root, we begin moving keys from the old node to its sibling. We must first move the old node’s right child and set its right child field to an invalid page to indicate that it is empty. Then we loop over the old node’s keys from the highest index down to, but not including, the middle key, performing the following steps on each iteration:

1. Obtain a reference to the old node’s child at the current index
2. Insert the child into the sibling node
3. Update the child’s parent value to point to the sibling node
4. Decrement the old node’s number of keys

+  uint32_t* old_num_keys = internal_node_num_keys(old_node);
+
+  uint32_t cur_page_num = *internal_node_right_child(old_node);
+  void* cur = get_page(table->pager, cur_page_num);
+
+  /*
+  First put right child into new node and set right child of old node to invalid page number
+  */
+  internal_node_insert(table, new_page_num, cur_page_num);
+  *node_parent(cur) = new_page_num;
+  *internal_node_right_child(old_node) = INVALID_PAGE_NUM;
+  /*
+  For each key until you get to the middle key, move the key and the child to the new node
+  */
+  for (int i = INTERNAL_NODE_MAX_CELLS - 1; i > INTERNAL_NODE_MAX_CELLS / 2; i--) {
+    cur_page_num = *internal_node_child(old_node, i);
+    cur = get_page(table->pager, cur_page_num);
+
+    internal_node_insert(table, new_page_num, cur_page_num);
+    *node_parent(cur) = new_page_num;
+
+    (*old_num_keys)--;
+  }

Step 4 is important, because it serves the purpose of “erasing” the key/child pair from the old node. Although we are not actually freeing the memory at that byte offset in the old node’s page, by decrementing the old node’s number of keys we are making that memory location inaccessible, and the bytes will be overwritten the next time a child is inserted into the old node.

Also note the behavior of our loop bounds. If our maximum number of internal node keys (n) changes in the future, our logic ensures that, for an odd maximum, both our old node and our sibling node will end up with (n-1)/2 keys after the split, with the 1 remaining key going to the parent. If an even number is chosen as the maximum number of keys, n/2 keys will remain with the old node while (n-1)/2, rounded down, will be moved to the sibling node. This logic would be straightforward to revise as needed.
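
The arithmetic can be checked with a small simulation that mirrors only the counting done by the loop and the final decrement. SplitCounts and simulate_split are hypothetical names introduced for this sketch, and nothing here touches pages.

```c
#include <stdint.h>

/* Hypothetical helper type: key counts in each node after a split */
typedef struct {
  uint32_t old_keys;
  uint32_t sibling_keys;
} SplitCounts;

/*
Count the keys each node holds after splitting a full internal node.
Assumes max_cells >= 1. Moving the old right child into the empty sibling
only sets the sibling's right child, so it adds no key.
*/
SplitCounts simulate_split(uint32_t max_cells) {
  SplitCounts counts = {max_cells, 0};

  /* Same bounds as the loop in internal_node_split_and_insert */
  for (int i = (int)max_cells - 1; i > (int)(max_cells / 2); i--) {
    counts.sibling_keys++; /* each moved child adds one key to the sibling */
    counts.old_keys--;     /* and removes one from the old node */
  }

  /* The highest remaining child becomes the right child; its key goes to the parent */
  counts.old_keys--;
  return counts;
}
```

For a maximum of 3 this gives 1 key in each node. For a maximum of 4, 2 keys stay in the old node and 1 moves to the sibling.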

Once those keys have been moved, we make the old node’s last remaining child its right child and decrement its number of keys.

+  /*
+  Set child before middle key, which is now the highest key, to be node's right child,
+  and decrement number of keys
+  */
+  *internal_node_right_child(old_node) = *internal_node_child(old_node,*old_num_keys - 1);
+  (*old_num_keys)--;

We then insert the child node into either the old node or the sibling node depending on the value of its max key.

+  uint32_t max_after_split = get_node_max_key(table->pager, old_node);
+
+  uint32_t destination_page_num = child_max < max_after_split ? old_page_num : new_page_num;
+
+  internal_node_insert(table, destination_page_num, child_page_num);
+  *node_parent(child) = destination_page_num;

Finally, we update the old node’s key in its parent, and insert the sibling node and update the sibling node’s parent pointer if necessary.

+  update_internal_node_key(parent, old_max, get_node_max_key(table->pager, old_node));
+
+  if (!splitting_root) {
+    internal_node_insert(table,*node_parent(old_node),new_page_num);
+    *node_parent(new_node) = *node_parent(old_node);
+  }

One important change required to support this new logic is in our create_new_root function. Before, we were only taking into account situations where the new root’s children would be leaf nodes. If the new root’s children are instead internal nodes, we need to do two things:

1. Correctly initialize the root’s new children to be internal nodes
2. In addition to the call to memcpy, which copies the old root’s keys and child pointers into its new left child, update the parent pointer of each of those children

@@ -661,22 +680,40 @@ void create_new_root(Table* table, uint32_t right_child_page_num) {
   uint32_t left_child_page_num = get_unused_page_num(table->pager);
   void* left_child = get_page(table->pager, left_child_page_num);
 
+  if (get_node_type(root) == NODE_INTERNAL) {
+    initialize_internal_node(right_child);
+    initialize_internal_node(left_child);
+  }
+
   /* Left child has data copied from old root */
   memcpy(left_child, root, PAGE_SIZE);
   set_node_root(left_child, false);
 
+  if (get_node_type(left_child) == NODE_INTERNAL) {
+    void* child;
+    for (int i = 0; i < *internal_node_num_keys(left_child); i++) {
+      child = get_page(table->pager, *internal_node_child(left_child,i));
+      *node_parent(child) = left_child_page_num;
+    }
+    child = get_page(table->pager, *internal_node_right_child(left_child));
+    *node_parent(child) = left_child_page_num;
+  }
+
   /* Root node is a new internal node with one key and two children */
   initialize_internal_node(root);
   set_node_root(root, true);
   *internal_node_num_keys(root) = 1;
   *internal_node_child(root, 0) = left_child_page_num;
-  uint32_t left_child_max_key = get_node_max_key(left_child);
+  uint32_t left_child_max_key = get_node_max_key(table->pager, left_child);
   *internal_node_key(root, 0) = left_child_max_key;
   *internal_node_right_child(root) = right_child_page_num;
   *node_parent(left_child) = table->root_page_num;
   *node_parent(right_child) = table->root_page_num;
 }
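
The reason for the extra loop is easier to see in isolation: copying a node to a new page leaves its children pointing at the old page number. Below is a minimal sketch over an in-memory array of pages. ToyPage and toy_relocate are assumptions made for illustration, and page numbers are simply array indexes.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_KEYS 3

/* Hypothetical, simplified page; the real layout is byte-offset based */
typedef struct {
  bool is_leaf;
  uint32_t parent;
  uint32_t num_keys;
  uint32_t children[TOY_MAX_KEYS];
  uint32_t right_child;
} ToyPage;

/*
Copy the node at page `from` to page `to`, then repoint every child of the
copied node at its new location. Leaves have no children to fix up.
*/
void toy_relocate(ToyPage* pages, uint32_t from, uint32_t to) {
  pages[to] = pages[from];
  if (pages[to].is_leaf) {
    return;
  }
  for (uint32_t i = 0; i < pages[to].num_keys; i++) {
    pages[pages[to].children[i]].parent = to;
  }
  pages[pages[to].right_child].parent = to;
}
```

Without the fix-up, a later walk from a child to its parent would land on the old page, which in create_new_root has been reinitialized as the new root.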

Another important change has been made to get_node_max_key, which now takes the pager as an additional argument, as the updated call sites in the diffs above show. Since an internal node’s key represents the maximum of the tree pointed to by the child to its left, and that child can be a tree of arbitrary depth, we need to walk down the right children of that tree until we get to a leaf node, and then take the maximum key of that leaf node.

+uint32_t get_node_max_key(Pager* pager, void* node) {
+  if (get_node_type(node) == NODE_LEAF) {
+    return *leaf_node_key(node, *leaf_node_num_cells(node) - 1);
+  }
+  void* right_child = get_page(pager,*internal_node_right_child(node));
+  return get_node_max_key(pager, right_child);
+}
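
The same walk can be written iteratively. The sketch below uses a hypothetical pointer-based ToyNode rather than pages and a pager, and it assumes every leaf it reaches holds at least one key.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory node; only the fields needed for the walk */
typedef struct ToyNode {
  int is_leaf;
  uint32_t num_keys;
  uint32_t keys[4];
  struct ToyNode* right_child; /* unused for leaves */
} ToyNode;

/* Follow right children down to a leaf, then return that leaf's last key */
uint32_t toy_max_key(const ToyNode* node) {
  while (!node->is_leaf) {
    node = node->right_child;
  }
  /* Assumes num_keys >= 1; an empty leaf would need separate handling */
  return node->keys[node->num_keys - 1];
}
```

The recursive version above does the same thing; the loop just makes it explicit that only the rightmost path of the subtree is visited.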

We have written a single test to demonstrate that our print_tree function still works after the introduction of internal node splitting.

+  it 'allows printing out the structure of a 7-leaf-node btree' do
+    script = [
+      "insert 58 user58 person58@example.com",
+      "insert 56 user56 person56@example.com",
+      "insert 8 user8 person8@example.com",
+      "insert 54 user54 person54@example.com",
+      "insert 77 user77 person77@example.com",
+      "insert 7 user7 person7@example.com",
+      "insert 25 user25 person25@example.com",
+      "insert 71 user71 person71@example.com",
+      "insert 13 user13 person13@example.com",
+      "insert 22 user22 person22@example.com",
+      "insert 53 user53 person53@example.com",
+      "insert 51 user51 person51@example.com",
+      "insert 59 user59 person59@example.com",
+      "insert 32 user32 person32@example.com",
+      "insert 36 user36 person36@example.com",
+      "insert 79 user79 person79@example.com",
+      "insert 10 user10 person10@example.com",
+      "insert 33 user33 person33@example.com",
+      "insert 20 user20 person20@example.com",
+      "insert 4 user4 person4@example.com",
+      "insert 35 user35 person35@example.com",
+      "insert 76 user76 person76@example.com",
+      "insert 49 user49 person49@example.com",
+      "insert 24 user24 person24@example.com",
+      "insert 70 user70 person70@example.com",
+      "insert 48 user48 person48@example.com",
+      "insert 39 user39 person39@example.com",
+      "insert 15 user15 person15@example.com",
+      "insert 47 user47 person47@example.com",
+      "insert 30 user30 person30@example.com",
+      "insert 86 user86 person86@example.com",
+      "insert 31 user31 person31@example.com",
+      "insert 68 user68 person68@example.com",
+      "insert 37 user37 person37@example.com",
+      "insert 66 user66 person66@example.com",
+      "insert 63 user63 person63@example.com",
+      "insert 40 user40 person40@example.com",
+      "insert 78 user78 person78@example.com",
+      "insert 19 user19 person19@example.com",
+      "insert 46 user46 person46@example.com",
+      "insert 14 user14 person14@example.com",
+      "insert 81 user81 person81@example.com",
+      "insert 72 user72 person72@example.com",
+      "insert 6 user6 person6@example.com",
+      "insert 50 user50 person50@example.com",
+      "insert 85 user85 person85@example.com",
+      "insert 67 user67 person67@example.com",
+      "insert 2 user2 person2@example.com",
+      "insert 55 user55 person55@example.com",
+      "insert 69 user69 person69@example.com",
+      "insert 5 user5 person5@example.com",
+      "insert 65 user65 person65@example.com",
+      "insert 52 user52 person52@example.com",
+      "insert 1 user1 person1@example.com",
+      "insert 29 user29 person29@example.com",
+      "insert 9 user9 person9@example.com",
+      "insert 43 user43 person43@example.com",
+      "insert 75 user75 person75@example.com",
+      "insert 21 user21 person21@example.com",
+      "insert 82 user82 person82@example.com",
+      "insert 12 user12 person12@example.com",
+      "insert 18 user18 person18@example.com",
+      "insert 60 user60 person60@example.com",
+      "insert 44 user44 person44@example.com",
+      ".btree",
+      ".exit",
+    ]
+    result = run_script(script)
+
+    expect(result[64...(result.length)]).to match_array([
+      "db > Tree:",
+      "- internal (size 1)",
+      "  - internal (size 2)",
+      "    - leaf (size 7)",
+      "      - 1",
+      "      - 2",
+      "      - 4",
+      "      - 5",
+      "      - 6",
+      "      - 7",
+      "      - 8",
+      "    - key 8",
+      "    - leaf (size 11)",
+      "      - 9",
+      "      - 10",
+      "      - 12",
+      "      - 13",
+      "      - 14",
+      "      - 15",
+      "      - 18",
+      "      - 19",
+      "      - 20",
+      "      - 21",
+      "      - 22",
+      "    - key 22",
+      "    - leaf (size 8)",
+      "      - 24",
+      "      - 25",
+      "      - 29",
+      "      - 30",
+      "      - 31",
+      "      - 32",
+      "      - 33",
+      "      - 35",
+      "  - key 35",
+      "  - internal (size 3)",
+      "    - leaf (size 12)",
+      "      - 36",
+      "      - 37",
+      "      - 39",
+      "      - 40",
+      "      - 43",
+      "      - 44",
+      "      - 46",
+      "      - 47",
+      "      - 48",
+      "      - 49",
+      "      - 50",
+      "      - 51",
+      "    - key 51",
+      "    - leaf (size 11)",
+      "      - 52",
+      "      - 53",
+      "      - 54",
+      "      - 55",
+      "      - 56",
+      "      - 58",
+      "      - 59",
+      "      - 60",
+      "      - 63",
+      "      - 65",
+      "      - 66",
+      "    - key 66",
+      "    - leaf (size 7)",
+      "      - 67",
+      "      - 68",
+      "      - 69",
+      "      - 70",
+      "      - 71",
+      "      - 72",
+      "      - 75",
+      "    - key 75",
+      "    - leaf (size 8)",
+      "      - 76",
+      "      - 77",
+      "      - 78",
+      "      - 79",
+      "      - 81",
+      "      - 82",
+      "      - 85",
+      "      - 86",
+      "db > ",
+    ])
+  end

Part 15 - Deleting Rows from a Leaf Node

Mon, 01 Apr 2024 00:00:00 +0000

We can insert rows and we can read them back out. But we can’t remove them. Every real database needs a delete command, and ours is no exception. In this article we’ll implement simple deletion from a leaf node.

I’m going to hold off on rebalancing the tree after deletion for now – we’ll tackle that in the next part. For now, deleting a row means finding it in the B-tree and removing it from its leaf node.

Parsing the Delete Statement

First, let’s add a new statement type:

-typedef enum { STATEMENT_INSERT, STATEMENT_SELECT } StatementType;
+typedef enum {
+  STATEMENT_INSERT,
+  STATEMENT_SELECT,
+  STATEMENT_DELETE
+} StatementType;

And a new execute result for when the key doesn’t exist:

 typedef enum {
   EXECUTE_SUCCESS,
   EXECUTE_DUPLICATE_KEY,
+  EXECUTE_KEY_NOT_FOUND,
 } ExecuteResult;

The syntax for delete will be delete <id>. Parsing is similar to insert, but we only need the id:

+PrepareResult prepare_delete(InputBuffer* input_buffer, Statement* statement) {
+  statement->type = STATEMENT_DELETE;
+
+  char* keyword = strtok(input_buffer->buffer, " ");
+  char* id_string = strtok(NULL, " ");
+
+  if (id_string == NULL) {
+    return PREPARE_SYNTAX_ERROR;
+  }
+
+  int id = atoi(id_string);
+  if (id < 0) {
+    return PREPARE_NEGATIVE_ID;
+  }
+
+  statement->row_to_insert.id = id;
+
+  return PREPARE_SUCCESS;
+}

We’re reusing row_to_insert.id to store the key we want to delete. It’s a bit of a hack, but it saves us from adding another field to Statement just to hold a single integer.
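
One caveat with atoi is that it returns 0 for input that is not a number at all, so delete abc would be treated as delete 0. If that matters to you, a stricter parser can be built on strtoll. The helper below is a hypothetical sketch, not something our database currently uses.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
Hypothetical helper: parse a non-negative row id.
Returns false for empty input, trailing garbage, negative values,
and values that do not fit in a uint32_t.
*/
bool parse_row_id(const char* text, uint32_t* out_id) {
  if (text == NULL || *text == '\0') {
    return false;
  }

  char* end = NULL;
  errno = 0;
  long long value = strtoll(text, &end, 10);

  /* errno is set on overflow; *end is not '\0' if any character was left unparsed */
  if (errno != 0 || *end != '\0') {
    return false;
  }
  if (value < 0 || value > UINT32_MAX) {
    return false;
  }

  *out_id = (uint32_t)value;
  return true;
}
```

prepare_delete could call a helper like this in place of atoi and return PREPARE_SYNTAX_ERROR when it fails. The same idea would apply to prepare_insert.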

Now wire it into prepare_statement():

 PrepareResult prepare_statement(InputBuffer* input_buffer,
                                 Statement* statement) {
   if (strncmp(input_buffer->buffer, "insert", 6) == 0) {
     return prepare_insert(input_buffer, statement);
   }
   if (strcmp(input_buffer->buffer, "select") == 0) {
     statement->type = STATEMENT_SELECT;
     return PREPARE_SUCCESS;
   }
+  if (strncmp(input_buffer->buffer, "delete", 6) == 0) {
+    return prepare_delete(input_buffer, statement);
+  }

   return PREPARE_UNRECOGNIZED_STATEMENT;
 }

Removing a Cell from a Leaf Node

The actual removal is straightforward. We shift all cells after the deleted one to the left by one position, then decrement the cell count:

+void leaf_node_delete(Cursor* cursor) {
+  void* node = get_page(cursor->table->pager, cursor->page_num);
+  uint32_t num_cells = *leaf_node_num_cells(node);
+
+  // Shift cells to fill the gap
+  for (uint32_t i = cursor->cell_num; i < num_cells - 1; i++) {
+    memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i + 1),
+           LEAF_NODE_CELL_SIZE);
+  }
+
+  *(leaf_node_num_cells(node)) -= 1;
+}

Think of it like removing an element from the middle of an array. Everything to the right slides over to fill the hole.
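
Here is that analogy on a plain array, as a sketch. remove_at is a hypothetical helper, and it uses a single memmove instead of the cell-by-cell loop above; memmove is the right call for this because the source and destination ranges overlap.

```c
#include <stdint.h>
#include <string.h>

/*
Hypothetical helper: remove items[index] from an array holding `count`
elements and return the new count. An out-of-range index leaves the
array untouched.
*/
uint32_t remove_at(uint32_t* items, uint32_t count, uint32_t index) {
  if (index >= count) {
    return count;
  }
  /* Slide everything after `index` one slot to the left */
  memmove(&items[index], &items[index + 1],
          (count - index - 1) * sizeof(items[0]));
  return count - 1;
}
```

Removing index 1 from {1, 2, 3, 4} leaves {1, 3, 4} in the first three slots. The fourth slot still holds stale data, just like the last cell in our leaf node, but the smaller count keeps it out of reach.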

Executing the Delete

Now we need execute_delete(). It uses table_find() to locate the key in the B-tree, checks that the key actually exists, and then calls leaf_node_delete():

+ExecuteResult execute_delete(Statement* statement, Table* table) {
+  uint32_t key_to_delete = statement->row_to_insert.id;
+  Cursor* cursor = table_find(table, key_to_delete);
+
+  void* node = get_page(table->pager, cursor->page_num);
+  uint32_t num_cells = *leaf_node_num_cells(node);
+
+  if (cursor->cell_num >= num_cells) {
+    free(cursor);
+    return EXECUTE_KEY_NOT_FOUND;
+  }
+
+  uint32_t key_at_index = *leaf_node_key(node, cursor->cell_num);
+  if (key_at_index != key_to_delete) {
+    free(cursor);
+    return EXECUTE_KEY_NOT_FOUND;
+  }
+
+  leaf_node_delete(cursor);
+
+  free(cursor);
+
+  return EXECUTE_SUCCESS;
+}

table_find() returns a cursor pointing to the position where the key should be. But the key might not actually be there – maybe we’re looking for a key that was never inserted. So we check two things: is the cursor past the end of the leaf, and does the key at the cursor’s position actually match? If either check fails, the key doesn’t exist.
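
The two checks are easier to see on a sorted array. This is a sketch with hypothetical helpers (find_position and key_exists); it assumes the search returns the index of the key or, if the key is absent, the index where it would be inserted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Binary search: index of `key`, or the position where it would be inserted */
uint32_t find_position(const uint32_t* keys, uint32_t num_keys, uint32_t key) {
  uint32_t lo = 0;
  uint32_t hi = num_keys;
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (keys[mid] < key) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

/* The key exists only if the position is in range AND holds that key */
bool key_exists(const uint32_t* keys, uint32_t num_keys, uint32_t key) {
  uint32_t pos = find_position(keys, num_keys, key);
  return pos < num_keys && keys[pos] == key;
}
```

With keys {1, 3}, looking up 5 gives position 2, which is past the end, and looking up 2 gives position 1, which holds 3. Both lookups fail, each for a different reason.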

Wire it into execute_statement():

 ExecuteResult execute_statement(Statement* statement, Table* table) {
   switch (statement->type) {
     case (STATEMENT_INSERT):
       return execute_insert(statement, table);
     case (STATEMENT_SELECT):
       return execute_select(statement, table);
+    case (STATEMENT_DELETE):
+      return execute_delete(statement, table);
   }
 }

And handle the new result in main():

     switch (execute_statement(&statement, table)) {
       case (EXECUTE_SUCCESS):
         printf("Executed.\n");
         break;
       case (EXECUTE_DUPLICATE_KEY):
         printf("Error: Duplicate key.\n");
         break;
+      case (EXECUTE_KEY_NOT_FOUND):
+        printf("Error: Key not found.\n");
+        break;
     }

Testing

Let’s try it out:

db > insert 1 user1 person1@example.com
Executed.
db > insert 2 user2 person2@example.com
Executed.
db > insert 3 user3 person3@example.com
Executed.
db > delete 2
Executed.
db > select
(1, user1, person1@example.com)
(3, user3, person3@example.com)
Executed.
db >

Sweet, it works! The row with id 2 is gone.

What happens if we try to delete a key that doesn’t exist?

db > delete 5
Error: Key not found.

Here are tests for both cases, plus one checking that our deletion persists across sessions too:

+  it 'deletes a row' do
+    script = [
+      "insert 1 user1 person1@example.com",
+      "insert 2 user2 person2@example.com",
+      "insert 3 user3 person3@example.com",
+      "delete 2",
+      "select",
+      ".exit",
+    ]
+    result = run_script(script)
+    expect(result).to match_array([
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > (1, user1, person1@example.com)",
+      "(3, user3, person3@example.com)",
+      "Executed.",
+      "db > ",
+    ])
+  end
+
+  it 'prints error message when deleting non-existent key' do
+    script = [
+      "insert 1 user1 person1@example.com",
+      "delete 5",
+      "select",
+      ".exit",
+    ]
+    result = run_script(script)
+    expect(result).to match_array([
+      "db > Executed.",
+      "db > Error: Key not found.",
+      "db > (1, user1, person1@example.com)",
+      "Executed.",
+      "db > ",
+    ])
+  end
+
+  it 'deletes rows and persists changes' do
+    result1 = run_script([
+      "insert 1 user1 person1@example.com",
+      "insert 2 user2 person2@example.com",
+      "insert 3 user3 person3@example.com",
+      "delete 2",
+      ".exit",
+    ])
+
+    result2 = run_script([
+      "select",
+      ".exit",
+    ])
+    expect(result2).to match_array([
+      "db > (1, user1, person1@example.com)",
+      "(3, user3, person3@example.com)",
+      "Executed.",
+      "db > ",
+    ])
+  end

A Looming Problem

This works great for small trees. But there’s a subtle issue we’re ignoring. In a B-tree, every non-root node must maintain a minimum number of keys. When we delete a cell from a leaf that’s already at its minimum occupancy, the node “underflows.” A well-behaved B-tree fixes this by borrowing from a sibling or merging two nodes together.

We’re not doing any of that yet. If you delete enough rows from a leaf, it could end up empty while its parent still points to it. That’s a problem.

Next time we’ll implement rebalancing: borrowing from siblings, merging underflowing nodes, and collapsing the tree when the root becomes unnecessary. It’s gonna be great.

Part 16 - Rebalancing the B-Tree After Deletion

Mon, 15 Apr 2024 00:00:00 +0000

Last time we added a simple delete command. It works, but it leaves the tree in a potentially invalid state. In a B+ tree, every non-root node must maintain a minimum number of keys. When deletion causes a node to drop below that minimum, the node “underflows” and the tree needs to be rebalanced.

This is the deletion counterpart to the splitting we implemented for insertion. Splitting handles overflow; rebalancing handles underflow.

Minimum Occupancy

First, let’s define how few cells a node is allowed to have. The standard rule is half the maximum:

+/*
+ * Minimum occupancy for non-root nodes
+ */
+const uint32_t LEAF_NODE_MIN_CELLS = LEAF_NODE_MAX_CELLS / 2;
+const uint32_t INTERNAL_NODE_MIN_KEYS = INTERNAL_NODE_MAX_KEYS / 2;

With LEAF_NODE_MAX_CELLS at 13, LEAF_NODE_MIN_CELLS is 6. With INTERNAL_NODE_MAX_KEYS at 3, INTERNAL_NODE_MIN_KEYS is 1. The root is exempt from this rule – it can have as few as zero cells.

The Strategy

When a leaf underflows, we have two strategies:

Borrow from a sibling that has more than the minimum. Shift one cell over.
Merge with a sibling if neither has cells to spare. Combine both nodes into one and remove the separator key from the parent.

If merging causes the parent to underflow, the same logic applies recursively up the tree. If the root ends up with zero keys, we promote its only child to be the new root, reducing the tree’s height.
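
The order of those checks can be sketched as a small pure function. This is only a model, not code from our database: RebalanceAction, choose_rebalance() and NO_SIBLING are hypothetical names, and the cell counts are passed in directly instead of being read from pages. It assumes the order used in the rest of this part: right sibling first, then left.

```c
#include <assert.h>
#include <stdint.h>

#define NO_SIBLING UINT32_MAX /* hypothetical marker: that sibling does not exist */

typedef enum {
  REBALANCE_NONE,
  REBALANCE_BORROW_RIGHT,
  REBALANCE_BORROW_LEFT,
  REBALANCE_MERGE_RIGHT,
  REBALANCE_MERGE_LEFT
} RebalanceAction;

/* Decide how to repair a non-root leaf that currently holds `cells` cells. */
RebalanceAction choose_rebalance(uint32_t cells, uint32_t min_cells,
                                 uint32_t left_cells, uint32_t right_cells) {
  /* A non-root node always has at least one sibling. */
  assert(left_cells != NO_SIBLING || right_cells != NO_SIBLING);

  if (cells >= min_cells) return REBALANCE_NONE;

  /* Borrowing is cheaper than merging, so try both siblings first. */
  if (right_cells != NO_SIBLING && right_cells > min_cells) {
    return REBALANCE_BORROW_RIGHT;
  }
  if (left_cells != NO_SIBLING && left_cells > min_cells) {
    return REBALANCE_BORROW_LEFT;
  }

  /* Neither sibling can spare a cell: merge with whichever one exists. */
  if (right_cells != NO_SIBLING) return REBALANCE_MERGE_RIGHT;
  return REBALANCE_MERGE_LEFT;
}
```

With LEAF_NODE_MIN_CELLS at 6, a leaf that has dropped to 5 cells between two siblings holding exactly 6 cells each gets REBALANCE_MERGE_RIGHT. The merged node holds 5 + 6 = 11 cells, which still fits under the maximum of 13.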

Finding a Child’s Position in its Parent

To rebalance, we need to know which position a node occupies among its parent’s children. This helper scans the parent to find it:

+uint32_t find_child_index(void* parent, uint32_t child_page_num) {
+  uint32_t num_keys = *internal_node_num_keys(parent);
+  for (uint32_t i = 0; i <= num_keys; i++) {
+    if (*internal_node_child(parent, i) == child_page_num) {
+      return i;
+    }
+  }
+  printf("Could not find child in parent node.\n");
+  exit(EXIT_FAILURE);
+}

This iterates through all children (including the right child at index num_keys) until it finds a match.

Leaf Node Rebalancing

Here’s the main rebalancing function for leaf nodes. It checks for underflow, then tries borrowing before falling back to merging:

+void leaf_node_rebalance(Table* table, uint32_t page_num) {
+  void* node = get_page(table->pager, page_num);
+
+  if (is_node_root(node)) {
+    return;
+  }
+
+  uint32_t num_cells = *leaf_node_num_cells(node);
+  if (num_cells >= LEAF_NODE_MIN_CELLS) {
+    return;
+  }
+
+  uint32_t parent_page_num = *node_parent(node);
+  void* parent = get_page(table->pager, parent_page_num);
+  uint32_t child_index = find_child_index(parent, page_num);
+  uint32_t parent_num_keys = *internal_node_num_keys(parent);

The root can have any number of cells, and if the leaf has at least the minimum, there’s nothing to do.

Borrowing from the Right Sibling

If the right sibling has more than the minimum, we take its first cell:

+  if (child_index < parent_num_keys) {
+    uint32_t right_page = *internal_node_child(parent, child_index + 1);
+    void* right_sibling = get_page(table->pager, right_page);
+
+    if (*leaf_node_num_cells(right_sibling) > LEAF_NODE_MIN_CELLS) {
+      memcpy(leaf_node_cell(node, num_cells),
+             leaf_node_cell(right_sibling, 0), LEAF_NODE_CELL_SIZE);
+      *(leaf_node_num_cells(node)) += 1;
+
+      uint32_t right_cells = *leaf_node_num_cells(right_sibling);
+      for (uint32_t i = 0; i < right_cells - 1; i++) {
+        memcpy(leaf_node_cell(right_sibling, i),
+               leaf_node_cell(right_sibling, i + 1), LEAF_NODE_CELL_SIZE);
+      }
+      *(leaf_node_num_cells(right_sibling)) -= 1;
+
+      *internal_node_key(parent, child_index) =
+          get_node_max_key(table->pager, node);
+      return;
+    }
+  }

The borrowed cell goes at the end of the current node (it has a higher key). Then we shift the right sibling’s remaining cells left and update the parent’s key for this node.
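
Stripped of pages and cells, the borrow is just “append one key here, drop the first key there”. The sketch below models a leaf as a small array of keys. ToyLeaf and borrow_from_right() are hypothetical names, and MAX_KEYS mirrors LEAF_NODE_MAX_CELLS.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS 13 /* assumption: mirrors LEAF_NODE_MAX_CELLS */

typedef struct {
  uint32_t keys[MAX_KEYS];
  uint32_t count;
} ToyLeaf;

/*
 * Move the smallest key of `right` to the end of `node`.
 * Returns the new max key of `node`, which is the value the parent
 * should store as the separator for `node`.
 */
uint32_t borrow_from_right(ToyLeaf* node, ToyLeaf* right) {
  assert(node->count < MAX_KEYS && right->count > 0);

  node->keys[node->count++] = right->keys[0];

  /* Close the gap at the front of the right sibling. memmove is used
     because the source and destination ranges overlap. */
  memmove(&right->keys[0], &right->keys[1],
          (right->count - 1) * sizeof(uint32_t));
  right->count--;

  return node->keys[node->count - 1];
}
```

Borrowing from a node holding {5, 6, 7} into one holding {1, 2} leaves {1, 2, 5} and {6, 7}, and the separator for the left node becomes 5.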

Borrowing from the Left Sibling

If there’s no right sibling, or it has no cells to spare, we try the left:

+  if (child_index > 0) {
+    uint32_t left_page = *internal_node_child(parent, child_index - 1);
+    void* left_sibling = get_page(table->pager, left_page);
+
+    if (*leaf_node_num_cells(left_sibling) > LEAF_NODE_MIN_CELLS) {
+      for (uint32_t i = num_cells; i > 0; i--) {
+        memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i - 1),
+               LEAF_NODE_CELL_SIZE);
+      }
+
+      uint32_t left_cells = *leaf_node_num_cells(left_sibling);
+      memcpy(leaf_node_cell(node, 0),
+             leaf_node_cell(left_sibling, left_cells - 1),
+             LEAF_NODE_CELL_SIZE);
+      *(leaf_node_num_cells(node)) += 1;
+      *(leaf_node_num_cells(left_sibling)) -= 1;
+
+      *internal_node_key(parent, child_index - 1) =
+          get_node_max_key(table->pager, left_sibling);
+      return;
+    }
+  }

This time the borrowed cell goes at the beginning of the current node (it has a lower key), so we have to shift our existing cells right first. Then update the parent’s key for the left sibling, since its max key has changed.

Merging

If neither sibling can lend a cell, we merge. If we can merge with the right sibling, we absorb its cells into the current node:

+  if (child_index < parent_num_keys) {
+    uint32_t right_page = *internal_node_child(parent, child_index + 1);
+    void* right_sibling = get_page(table->pager, right_page);
+    uint32_t right_cells = *leaf_node_num_cells(right_sibling);
+
+    for (uint32_t i = 0; i < right_cells; i++) {
+      memcpy(leaf_node_cell(node, num_cells + i),
+             leaf_node_cell(right_sibling, i), LEAF_NODE_CELL_SIZE);
+    }
+    *(leaf_node_num_cells(node)) = num_cells + right_cells;
+    *leaf_node_next_leaf(node) = *leaf_node_next_leaf(right_sibling);
+
+    *internal_node_key(parent, child_index) =
+        get_node_max_key(table->pager, node);
+    internal_node_remove_child(table, parent_page_num, child_index + 1);

We copy all cells from the right sibling, fix the next_leaf pointer chain, update the parent key, and remove the right sibling’s entry from the parent. If the current node is the rightmost child, we merge into the left sibling instead, using the same logic in reverse.
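
The reverse case is not shown above, so here is a sketch of its shape on a simplified leaf. ChainLeaf and merge_into_left() are hypothetical names, and plain keys stand in for whole cells. In the real code the parent would then drop the current node’s entry with internal_node_remove_child().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHAIN_MAX_KEYS 13 /* assumption: mirrors LEAF_NODE_MAX_CELLS */

typedef struct {
  uint32_t keys[CHAIN_MAX_KEYS];
  uint32_t count;
  uint32_t next_leaf; /* page number of the next leaf, 0 means "none" */
} ChainLeaf;

/* Append every key of `node` to `left`, then unlink `node` from the chain. */
void merge_into_left(ChainLeaf* left, ChainLeaf* node) {
  assert(left->count + node->count <= CHAIN_MAX_KEYS);

  /* Keys in `node` are all larger than keys in `left`, so appending
     keeps the merged leaf sorted. */
  memcpy(&left->keys[left->count], node->keys,
         node->count * sizeof(uint32_t));
  left->count += node->count;

  /* The left sibling now points past the node that is going away. */
  left->next_leaf = node->next_leaf;
  node->count = 0;
}
```

The only real difference from the right-sibling case is which node survives: here the left sibling absorbs the cells and inherits the current node’s next_leaf pointer.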

Removing a Child from an Internal Node

When two leaves merge, one of them disappears and we need to remove its entry from the parent. This function handles removal of a child at a given index:

+void internal_node_remove_child(Table* table, uint32_t page_num,
+                                uint32_t child_index) {
+  void* node = get_page(table->pager, page_num);
+  uint32_t num_keys = *internal_node_num_keys(node);
+
+  if (child_index == num_keys) {
+    if (num_keys > 0) {
+      *internal_node_right_child(node) =
+          *internal_node_child(node, num_keys - 1);
+      *(internal_node_num_keys(node)) = num_keys - 1;
+    } else {
+      *internal_node_right_child(node) = INVALID_PAGE_NUM;
+    }
+  } else {
+    for (uint32_t i = child_index; i < num_keys - 1; i++) {
+      memcpy(internal_node_cell(node, i), internal_node_cell(node, i + 1),
+             INTERNAL_NODE_CELL_SIZE);
+    }
+    *(internal_node_num_keys(node)) = num_keys - 1;
+  }

If we’re removing the rightmost child, the last regular cell’s child gets promoted to right child. Otherwise, we shift cells left to fill the gap.

After removing, this function also checks whether the internal node underflows, and if so, kicks off internal_node_rebalance().

Tree Height Reduction

The most satisfying part: when merges cascade up to the root and the root has zero keys left, its only child is promoted to be the new root. The tree gets shorter:

+  if (is_node_root(node) && *internal_node_num_keys(node) == 0) {
+    uint32_t child_page = *internal_node_right_child(node);
+    if (child_page == INVALID_PAGE_NUM) {
+      return;
+    }
+    void* child = get_page(table->pager, child_page);
+    memcpy(node, child, PAGE_SIZE);
+    set_node_root(node, true);
+
+    if (get_node_type(node) == NODE_INTERNAL) {
+      uint32_t promoted_keys = *internal_node_num_keys(node);
+      for (uint32_t i = 0; i < promoted_keys; i++) {
+        void* c = get_page(table->pager, *internal_node_child(node, i));
+        *node_parent(c) = table->root_page_num;
+      }
+      void* rc = get_page(table->pager, *internal_node_right_child(node));
+      *node_parent(rc) = table->root_page_num;
+    }
+  }

We copy the child’s contents into the root page (keeping page 0 as the root), mark it as the root, and update the parent pointers of the promoted node’s children. This is the mirror image of create_new_root() – one grows the tree, the other shrinks it.

Triggering Rebalancing

Finally, we update execute_delete() to update the parent key when the max key changes and to call leaf_node_rebalance() after every deletion:

+  uint32_t leaf_page_num = cursor->page_num;
+  uint32_t old_max = get_node_max_key(table->pager, node);
+
   leaf_node_delete(cursor);
-
-  free(cursor);
+  free(cursor);
+
+  node = get_page(table->pager, leaf_page_num);
+  if (!is_node_root(node) && *leaf_node_num_cells(node) > 0) {
+    uint32_t new_max = get_node_max_key(table->pager, node);
+    if (new_max != old_max) {
+      uint32_t parent_page = *node_parent(node);
+      void* parent = get_page(table->pager, parent_page);
+      update_internal_node_key(parent, old_max, new_max);
+    }
+  }
+
+  leaf_node_rebalance(table, leaf_page_num);

   return EXECUTE_SUCCESS;

Testing

Let’s make sure we can delete from multi-level trees:

+  it 'deletes rows from a multi-level tree' do
+    script = (1..15).map do |i|
+      "insert #{i} user#{i} person#{i}@example.com"
+    end
+    script << "delete 7"
+    script << ".btree"
+    script << "select"
+    script << ".exit"
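+    # run_script returns an array of lines, and the first row printed by
+    # select shares its line with the "db > " prompt, so match on substrings.
+    result = run_script(script).join("\n")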
+
+    expect(result).to include("Executed.")
+    expect(result).not_to include("(7, user7, person7@example.com)")
+    expect(result).to include("(1, user1, person1@example.com)")
+    expect(result).to include("(15, user15, person15@example.com)")
+  end

And verify that we can delete every row without crashing:

+  it 'handles deleting all rows' do
+    script = [
+      "insert 1 user1 person1@example.com",
+      "insert 2 user2 person2@example.com",
+      "insert 3 user3 person3@example.com",
+      "delete 1",
+      "delete 2",
+      "delete 3",
+      "select",
+      ".exit",
+    ]
+    result = run_script(script)
+    expect(result).to match_array([
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > ",
+    ])
+  end

The tree grows when we insert enough rows, and now it can shrink back down when we delete them. That’s the full B-tree lifecycle.

Next time we’ll add the ability to search for specific rows with a WHERE clause, putting our B-tree index to real use.

Part 17 - The WHERE Clause

Wed, 01 May 2024 00:00:00 +0000

Up until now, select dumps every row in the table. That’s fine for debugging, but a real database lets you ask for specific rows. Time to add a WHERE clause.

We’ll support filtering on the primary key (id), with five operators: =, >, <, >=, and <=. This is where our B-tree starts to really shine – instead of scanning every row, we can jump directly to the one we want.

Adding WHERE to the Statement

First, we need a way to represent the filter condition. We’ll add a WhereOp enum and two new fields to Statement:

+typedef enum {
+  WHERE_NONE,
+  WHERE_EQ,
+  WHERE_GT,
+  WHERE_LT,
+  WHERE_GTE,
+  WHERE_LTE,
+} WhereOp;
+
 typedef struct {
   StatementType type;
   Row row_to_insert;  // only used by insert statement
+  WhereOp where_op;   // only used by select statement
+  uint32_t where_id;  // only used by select statement
 } Statement;

WHERE_NONE means “no filter” – a full table scan, just like before.

Parsing the WHERE Clause

The syntax is select where id <op> <value>. We parse it by tokenizing after the select keyword:

-  if (strcmp(input_buffer->buffer, "select") == 0) {
+  if (strncmp(input_buffer->buffer, "select", 6) == 0) {
     statement->type = STATEMENT_SELECT;
+    statement->where_op = WHERE_NONE;
+
+    if (strlen(input_buffer->buffer) > 6) {
+      char* token = strtok(input_buffer->buffer, " ");  // "select"
+      token = strtok(NULL, " ");                         // "where"
+      if (token == NULL || strcmp(token, "where") != 0) {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      token = strtok(NULL, " ");  // "id"
+      if (token == NULL || strcmp(token, "id") != 0) {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      token = strtok(NULL, " ");  // operator
+      if (token == NULL) {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      if (strcmp(token, "=") == 0) {
+        statement->where_op = WHERE_EQ;
+      } else if (strcmp(token, ">") == 0) {
+        statement->where_op = WHERE_GT;
+      } else if (strcmp(token, "<") == 0) {
+        statement->where_op = WHERE_LT;
+      } else if (strcmp(token, ">=") == 0) {
+        statement->where_op = WHERE_GTE;
+      } else if (strcmp(token, "<=") == 0) {
+        statement->where_op = WHERE_LTE;
+      } else {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      token = strtok(NULL, " ");  // value
+      if (token == NULL) {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      statement->where_id = (uint32_t)atoi(token);
+    }
+
     return PREPARE_SUCCESS;
   }

If there’s nothing after select, we keep WHERE_NONE and do a full scan as before. If there is, we expect the exact pattern where id <op> <value>.
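
One weakness of atoi() is that it cannot report errors: select where id = abc quietly becomes id = 0, and a negative number wraps around when cast to uint32_t. A stricter version of the last two steps might look like the sketch below. Op and parse_condition() are hypothetical names used so the example compiles on its own; strtoul() and errno are standard C.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef enum { OP_EQ, OP_GT, OP_LT, OP_GTE, OP_LTE } Op;

/*
 * Parse the "<op> <value>" tokens of a WHERE clause.
 * Returns false on an unknown operator or on a value that is not a
 * plain decimal number fitting in a uint32_t.
 */
bool parse_condition(const char* op_token, const char* value_token,
                     Op* op_out, uint32_t* value_out) {
  assert(op_out != NULL && value_out != NULL);
  if (op_token == NULL || value_token == NULL) return false;

  if (strcmp(op_token, "=") == 0) {
    *op_out = OP_EQ;
  } else if (strcmp(op_token, ">") == 0) {
    *op_out = OP_GT;
  } else if (strcmp(op_token, "<") == 0) {
    *op_out = OP_LT;
  } else if (strcmp(op_token, ">=") == 0) {
    *op_out = OP_GTE;
  } else if (strcmp(op_token, "<=") == 0) {
    *op_out = OP_LTE;
  } else {
    return false;
  }

  /* Require a leading digit: strtoul() would otherwise accept "-1" or " 7". */
  if (value_token[0] < '0' || value_token[0] > '9') return false;

  errno = 0;
  char* end = NULL;
  unsigned long parsed = strtoul(value_token, &end, 10);
  /* Reject overflow, trailing junk such as "12abc", and values > UINT32_MAX. */
  if (errno != 0 || *end != '\0' || parsed > UINT32_MAX) return false;

  *value_out = (uint32_t)parsed;
  return true;
}
```

In prepare_statement() a false return would map to PREPARE_SYNTAX_ERROR, the same way the other checks do.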

Executing with WHERE

Here’s where it gets interesting. We rewrite execute_select() to use a switch on the operator:

Point Query (WHERE id = N)

For equality, we use table_find() to jump directly to the key. This is an O(log n) lookup – the whole reason we built a B-tree:

+    case WHERE_EQ: {
+      cursor = table_find(table, statement->where_id);
+      void* node = get_page(table->pager, cursor->page_num);
+      uint32_t num_cells = *leaf_node_num_cells(node);
+      if (cursor->cell_num < num_cells) {
+        uint32_t key = *leaf_node_key(node, cursor->cell_num);
+        if (key == statement->where_id) {
+          deserialize_row(cursor_value(cursor), &row);
+          print_row(&row);
+        }
+      }
+      free(cursor);
+      return EXECUTE_SUCCESS;
+    }

table_find() returns the position where the key should be. We still have to verify it’s actually there, since the key might not exist.

Range Scan (WHERE id > N, WHERE id >= N)

For greater-than queries, we position the cursor at the first qualifying key and scan forward through the sibling chain:

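+    case WHERE_GT:
+    case WHERE_GTE: {
+      if (statement->where_op == WHERE_GT && statement->where_id == UINT32_MAX) {
+        return EXECUTE_SUCCESS;  // where_id + 1 would wrap around to 0
+      }
+      uint32_t start_key = (statement->where_op == WHERE_GT)
+                               ? statement->where_id + 1
+                               : statement->where_id;
+      cursor = table_find(table, start_key);
+      void* node = get_page(table->pager, cursor->page_num);
+      if (cursor->cell_num >= *leaf_node_num_cells(node)) {
+        // start_key is past this leaf's last key: move on to the next leaf.
+        uint32_t next_page_num = *leaf_node_next_leaf(node);
+        if (next_page_num == 0) {
+          cursor->end_of_table = true;
+        } else {
+          cursor->page_num = next_page_num;
+          cursor->cell_num = 0;
+        }
+      }
+      while (!(cursor->end_of_table)) {
+        deserialize_row(cursor_value(cursor), &row);
+        print_row(&row);
+        cursor_advance(cursor);
+      }
+      free(cursor);
+      return EXECUTE_SUCCESS;
+    }

For WHERE id > 5, we search for key 6. table_find() returns the position of the first key >= 6 within the leaf it lands on, and we scan forward from there to the end of the table. If every key in that leaf is smaller than 6, that position is one past the leaf’s last cell, so we step to the next leaf before reading anything (or stop, if there is none). Without that check, a query like WHERE id > 5 on a table whose largest key is 5 would print one garbage row. The next_leaf pointers we added in Part 12 make this traversal seamless across leaf node boundaries.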

Less-Than Scan (WHERE id < N, WHERE id <= N)

For less-than, we start at the beginning and stop when we hit the boundary:

+    case WHERE_LT:
+    case WHERE_LTE: {
+      cursor = table_start(table);
+      uint32_t limit = statement->where_id;
+      while (!(cursor->end_of_table)) {
+        void* node = get_page(table->pager, cursor->page_num);
+        uint32_t key = *leaf_node_key(node, cursor->cell_num);
+        if (statement->where_op == WHERE_LT && key >= limit) break;
+        if (statement->where_op == WHERE_LTE && key > limit) break;
+        deserialize_row(cursor_value(cursor), &row);
+        print_row(&row);
+        cursor_advance(cursor);
+      }
+      free(cursor);
+      return EXECUTE_SUCCESS;
+    }

Because our keys are stored in sorted order, we can stop early the moment we see a key that’s too large. We don’t have to scan the whole table.
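
Both range scans boil down to finding a start or end position in sorted keys. As a sketch, here is the same idea on a plain sorted array standing in for the keys of the table. Cmp, Range, first_at_least(), first_above() and matching_range() are all hypothetical names. Note that the greater-than case searches for the first key above N directly, so it never has to compute N + 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { CMP_EQ, CMP_GT, CMP_LT, CMP_GTE, CMP_LTE } Cmp;

/* Half-open range of matching positions: [begin, end). */
typedef struct {
  size_t begin;
  size_t end;
} Range;

/* Index of the first key >= target, or n if there is none. */
size_t first_at_least(const uint32_t* keys, size_t n, uint32_t target) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (keys[mid] < target) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

/* Index of the first key > target, or n if there is none. */
size_t first_above(const uint32_t* keys, size_t n, uint32_t target) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (keys[mid] <= target) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

/* Which positions of the sorted array satisfy "key <op> id"? */
Range matching_range(const uint32_t* keys, size_t n, Cmp op, uint32_t id) {
  assert(keys != NULL || n == 0);
  Range r = {0, n};
  switch (op) {
    case CMP_EQ:
      r.begin = first_at_least(keys, n, id);
      r.end = first_above(keys, n, id);
      break;
    case CMP_GT:
      r.begin = first_above(keys, n, id);
      break;
    case CMP_GTE:
      r.begin = first_at_least(keys, n, id);
      break;
    case CMP_LT:
      r.end = first_at_least(keys, n, id);
      break;
    case CMP_LTE:
      r.end = first_above(keys, n, id);
      break;
  }
  return r;
}
```

Our less-than scan walks from the start and stops at the boundary instead of binary-searching for it, which costs the same number of row reads since every row before the boundary has to be printed anyway.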

Testing

Let’s try some queries:

db > insert 1 user1 person1@example.com
Executed.
db > insert 2 user2 person2@example.com
Executed.
db > insert 3 user3 person3@example.com
Executed.
db > select where id = 2
(2, user2, person2@example.com)
Executed.
db > select where id > 1
(2, user2, person2@example.com)
(3, user3, person3@example.com)
Executed.
db > select where id < 3
(1, user1, person1@example.com)
(2, user2, person2@example.com)
Executed.

And the automated tests:

+  it 'selects a single row with where id =' do
+    script = (1..5).map do |i|
+      "insert #{i} user#{i} person#{i}@example.com"
+    end
+    script << "select where id = 3"
+    script << ".exit"
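+    # run_script returns an array of lines, and the first row printed by
+    # select shares its line with the "db > " prompt, so match on substrings.
+    result = run_script(script).join("\n")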
+    expect(result).to include("(3, user3, person3@example.com)")
+    expect(result).not_to include("(1, user1, person1@example.com)")
+    expect(result).not_to include("(5, user5, person5@example.com)")
+  end
+
+  it 'selects rows with where id >' do
+    script = (1..5).map do |i|
+      "insert #{i} user#{i} person#{i}@example.com"
+    end
+    script << "select where id > 3"
+    script << ".exit"
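+    # run_script returns an array of lines, and the first row printed by
+    # select shares its line with the "db > " prompt, so match on substrings.
+    result = run_script(script).join("\n")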
+    expect(result).to include("(4, user4, person4@example.com)")
+    expect(result).to include("(5, user5, person5@example.com)")
+    expect(result).not_to include("(3, user3, person3@example.com)")
+  end
+
+  it 'selects rows with where id <' do
+    script = (1..5).map do |i|
+      "insert #{i} user#{i} person#{i}@example.com"
+    end
+    script << "select where id < 3"
+    script << ".exit"
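+    # run_script returns an array of lines, and the first row printed by
+    # select shares its line with the "db > " prompt, so match on substrings.
+    result = run_script(script).join("\n")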
+    expect(result).to include("(1, user1, person1@example.com)")
+    expect(result).to include("(2, user2, person2@example.com)")
+    expect(result).not_to include("(3, user3, person3@example.com)")
+  end
+
+  it 'returns nothing for where clause with no matches' do
+    script = [
+      "insert 1 user1 person1@example.com",
+      "select where id = 5",
+      ".exit",
+    ]
+    result = run_script(script)
+    expect(result).to match_array([
+      "db > Executed.",
+      "db > Executed.",
+      "db > ",
+    ])
+  end

Notice how the equality query uses table_find() – a single O(log n) B-tree traversal – rather than scanning every row. This is the payoff for all that B-tree work. A full table scan touches every row. A point query touches only the pages on the path from root to the target leaf. For a table with millions of rows, that’s the difference between milliseconds and minutes.

Next time we’ll overhaul our pager into a proper buffer pool with dirty page tracking and LRU eviction.

Part 18 - A Page Cache and Buffer Pool

Wed, 15 May 2024 00:00:00 +0000

“Cache rules everything around me.” – adapted from Wu-Tang Clan

Our pager has a dirty little secret: it writes every page to disk when the database closes, whether the page changed or not. And it happily loads as many pages into memory as there are pages in the file. For a small database that’s fine, but a real database could have millions of pages. We can’t fit them all in memory.

In this part we’re going to turn our naive pager into a proper buffer pool. That means three things:

Dirty page tracking – only write back pages that actually changed
LRU eviction – when the buffer is full, evict the least recently used page
Bounded memory – limit the number of pages we hold in memory at once

First, let’s add the new fields. dirty tracks which pages have been modified. access_time records when each page was last accessed. clock is a monotonically increasing counter:

+#define BUFFER_POOL_SIZE 100
+
 typedef struct {
   int file_descriptor;
   uint32_t file_length;
   uint32_t num_pages;
   void* pages[TABLE_MAX_PAGES];
+  bool dirty[TABLE_MAX_PAGES];
+  uint32_t access_time[TABLE_MAX_PAGES];
+  uint32_t clock;
 } Pager;
+
+void pager_flush(Pager* pager, uint32_t page_num);
+void pager_mark_dirty(Pager* pager, uint32_t page_num);
+uint32_t pager_pages_in_memory(Pager* pager);
+void pager_evict_lru(Pager* pager);

BUFFER_POOL_SIZE limits us to 100 pages in memory. With 4 KB pages, that’s about 400 KB of memory. For comparison, SQLite’s default page cache is about 2 MB, which is roughly 500 pages at its default 4 KB page size.

Initialize the new fields in pager_open():

+  pager->clock = 0;
   for (uint32_t i = 0; i < TABLE_MAX_PAGES; i++) {
     pager->pages[i] = NULL;
+    pager->dirty[i] = false;
+    pager->access_time[i] = 0;
   }

Marking Pages Dirty

Whenever we modify a page, we need to mark it dirty so we know to write it back:

+void pager_mark_dirty(Pager* pager, uint32_t page_num) {
+  pager->dirty[page_num] = true;
+}

We add calls to pager_mark_dirty() in every function that modifies page data: leaf_node_insert, leaf_node_delete, leaf_node_split_and_insert, create_new_root, and so on. Anywhere a page’s bytes change, we mark it dirty.

LRU Eviction

When we need to load a new page but the buffer pool is full, we evict the least recently used page. If it’s dirty, we flush it to disk first:

+uint32_t pager_pages_in_memory(Pager* pager) {
+  uint32_t count = 0;
+  for (uint32_t i = 0; i < TABLE_MAX_PAGES; i++) {
+    if (pager->pages[i] != NULL) count++;
+  }
+  return count;
+}
+
+void pager_evict_lru(Pager* pager) {
+  uint32_t lru_page = INVALID_PAGE_NUM;
+  uint32_t min_time = UINT32_MAX;
+
+  for (uint32_t i = 0; i < TABLE_MAX_PAGES; i++) {
+    if (pager->pages[i] != NULL && pager->access_time[i] < min_time) {
+      min_time = pager->access_time[i];
+      lru_page = i;
+    }
+  }
+
+  if (lru_page == INVALID_PAGE_NUM) return;
+
+  if (pager->dirty[lru_page]) {
+    pager_flush(pager, lru_page);
+    pager->dirty[lru_page] = false;
+  }
+
+  free(pager->pages[lru_page]);
+  pager->pages[lru_page] = NULL;
+}

This is the simplest possible LRU implementation: scan the array and find the page with the smallest access time. A real database would use a doubly-linked list to make eviction O(1), but for our purposes the linear scan is fine.

Updating get_page()

Now we integrate eviction into get_page(). On a cache miss, we check if the buffer is full and evict if necessary. On every access, we update the access time:

   if (pager->pages[page_num] == NULL) {
+    // Cache miss. Evict if the buffer pool is full.
+    if (pager_pages_in_memory(pager) >= BUFFER_POOL_SIZE) {
+      pager_evict_lru(pager);
+    }
+
     // Allocate memory and load from file.
     void* page = malloc(PAGE_SIZE);
     ...
     pager->pages[page_num] = page;
+    pager->dirty[page_num] = false;
   }

+  pager->access_time[page_num] = pager->clock++;
   return pager->pages[page_num];

Every time a page is accessed – whether it was already in memory or just loaded – the access time updates. Pages that haven’t been touched recently will have the lowest access times and get evicted first.

Only Flushing Dirty Pages

Finally, update db_close() to only write back pages that were modified:

   for (uint32_t i = 0; i < pager->num_pages; i++) {
     if (pager->pages[i] == NULL) {
       continue;
     }
-    pager_flush(pager, i);
+    if (pager->dirty[i]) {
+      pager_flush(pager, i);
+      pager->dirty[i] = false;
+    }
     free(pager->pages[i]);
     pager->pages[i] = NULL;
   }

This is a significant optimization. If you insert one row and then exit, we used to write back every page we’d ever read. Now we only write back the pages that actually changed.

A Note on Pinning

There’s a subtlety we’re not handling: page pinning. When a B-tree split is in progress, we might have several pages in flight that absolutely must not be evicted. A real buffer pool uses a pin count – get_page() increments it, and the caller decrements it when done. A pinned page is never evicted. Our BUFFER_POOL_SIZE of 100 is generous enough that we’ll never evict a page that’s in active use, but a production system would need proper pin management.
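
To make the idea concrete, here is a minimal model of pin counts layered on the same timestamp-based LRU. ToyPool, toy_pin(), toy_unpin() and toy_pick_victim() are hypothetical names, and the pool holds no real page data. It only shows how a victim is chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_SLOTS 4 /* tiny on purpose, to make eviction easy to follow */
#define NO_VICTIM UINT32_MAX

typedef struct {
  bool in_use[POOL_SLOTS];
  uint32_t pin_count[POOL_SLOTS];
  uint32_t access_time[POOL_SLOTS];
  uint32_t clock;
} ToyPool;

/* Called when a caller starts using a page: pin it and stamp the access. */
void toy_pin(ToyPool* pool, uint32_t slot) {
  assert(slot < POOL_SLOTS);
  pool->in_use[slot] = true;
  pool->pin_count[slot]++;
  pool->access_time[slot] = pool->clock++;
}

/* Called when the caller is done with the page. */
void toy_unpin(ToyPool* pool, uint32_t slot) {
  assert(slot < POOL_SLOTS && pool->pin_count[slot] > 0);
  pool->pin_count[slot]--;
}

/* Least recently used slot among the unpinned ones, or NO_VICTIM. */
uint32_t toy_pick_victim(const ToyPool* pool) {
  uint32_t victim = NO_VICTIM;
  uint32_t oldest = UINT32_MAX;
  for (uint32_t i = 0; i < POOL_SLOTS; i++) {
    if (!pool->in_use[i] || pool->pin_count[i] > 0) {
      continue; /* empty or pinned slots are never candidates */
    }
    if (victim == NO_VICTIM || pool->access_time[i] < oldest) {
      oldest = pool->access_time[i];
      victim = i;
    }
  }
  return victim;
}
```

The price of pinning is discipline: every get_page() needs a matching release, and a pool where every page is pinned has no victim at all, so the caller has to handle that case.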

Testing

The existing tests continue to pass – dirty page tracking is invisible to the user. Let’s add one test to verify persistence still works with the new buffer pool:

+  it 'persists data correctly with dirty page tracking' do
+    script = (1..20).map do |i|
+      "insert #{i} user#{i} person#{i}@example.com"
+    end
+    script << ".exit"
+    run_script(script)
+
+    result = run_script([
+      "select where id = 10",
+      "select where id = 20",
+      ".exit",
+    ]).join("\n")  # match substrings: a row can share its line with "db > "
+    expect(result).to include("(10, user10, person10@example.com)")
+    expect(result).to include("(20, user20, person20@example.com)")
+  end

Next time we’ll tackle variable-length records so our database can store strings of different lengths efficiently.

Part 19 - Variable-Length Records

Sat, 01 Jun 2024 00:00:00 +0000

Up until now, every row in our database takes the same amount of space on disk: 293 bytes. A username of “a” takes 33 bytes. A username of “abcdefghijklmnopqrstuvwxyz012345” also takes 33 bytes. That’s a lot of wasted space for short strings.

Real databases store strings using a variable-length format. Instead of allocating the maximum possible size for every string, they store the actual length followed by only the bytes that are used. Let’s implement that.

Length-Prefixed Strings

The standard approach for serializing variable-length data is length-prefixing: write the number of bytes first, then the actual data. Our new serialized row format looks like this:

field	size
id	4 bytes
username_len	4 bytes
username	32 bytes (max)
email_len	4 bytes
email	255 bytes (max)

The two length fields add 8 bytes, but we no longer store the two null terminators, so each row grows by 6 bytes overall. That changes our constants:

 const uint32_t ID_SIZE = size_of_attribute(Row, id);
-const uint32_t USERNAME_SIZE = size_of_attribute(Row, username);
-const uint32_t EMAIL_SIZE = size_of_attribute(Row, email);
+const uint32_t VARCHAR_LEN_SIZE = sizeof(uint32_t);
+
+/*
+ * Serialized Row Layout (length-prefixed strings)
+ *
+ * | id (4) | username_len (4) | username (32) | email_len (4) | email (255) |
+ */
 const uint32_t ID_OFFSET = 0;
-const uint32_t USERNAME_OFFSET = ID_OFFSET + ID_SIZE;
-const uint32_t EMAIL_OFFSET = USERNAME_OFFSET + USERNAME_SIZE;
-const uint32_t ROW_SIZE = ID_SIZE + USERNAME_SIZE + EMAIL_SIZE;
+const uint32_t USERNAME_LEN_OFFSET = ID_OFFSET + ID_SIZE;
+const uint32_t USERNAME_OFFSET = USERNAME_LEN_OFFSET + VARCHAR_LEN_SIZE;
+const uint32_t EMAIL_LEN_OFFSET = USERNAME_OFFSET + COLUMN_USERNAME_SIZE;
+const uint32_t EMAIL_OFFSET = EMAIL_LEN_OFFSET + VARCHAR_LEN_SIZE;
+const uint32_t ROW_SIZE =
+    ID_SIZE + VARCHAR_LEN_SIZE + COLUMN_USERNAME_SIZE + VARCHAR_LEN_SIZE +
+    COLUMN_EMAIL_SIZE;

ROW_SIZE goes from 293 to 299 bytes. LEAF_NODE_CELL_SIZE goes from 297 to 303. But LEAF_NODE_MAX_CELLS stays at 13 because 4082 / 303 = 13.
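
To double-check that arithmetic, here is a tiny stand-alone calculation. The constants mirror the values used in this series (a 4096-byte page, a 14-byte leaf header, a 4-byte key in front of each row); cells_per_leaf() itself is a hypothetical helper, not part of db.c.

```c
#include <assert.h>
#include <stdint.h>

enum {
  PAGE_BYTES = 4096,      /* page size */
  LEAF_HEADER_BYTES = 14, /* LEAF_NODE_HEADER_SIZE */
  KEY_BYTES = 4           /* each cell stores a uint32_t key before the row */
};

/* How many fixed-size cells fit in one leaf for a given serialized row size. */
uint32_t cells_per_leaf(uint32_t row_size) {
  assert(row_size > 0);
  uint32_t space_for_cells = PAGE_BYTES - LEAF_HEADER_BYTES; /* 4082 */
  uint32_t cell_size = KEY_BYTES + row_size;
  return space_for_cells / cell_size; /* integer division rounds down */
}
```

cells_per_leaf(293) and cells_per_leaf(299) both give 13. The count would only drop to 12 once a row reached 311 bytes, because 13 cells of 314 bytes fill the 4082 available bytes exactly.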

New Serialization

Here’s the new serialize_row(). We memset the entire destination to zero first – this ensures the unused bytes after each string are clean:

 void serialize_row(Row* source, void* destination) {
-  memcpy(destination + ID_OFFSET, &(source->id), ID_SIZE);
-  memcpy(destination + USERNAME_OFFSET, &(source->username), USERNAME_SIZE);
-  memcpy(destination + EMAIL_OFFSET, &(source->email), EMAIL_SIZE);
+  memset(destination, 0, ROW_SIZE);
+  memcpy(destination + ID_OFFSET, &(source->id), ID_SIZE);
+  uint32_t username_len = strlen(source->username);
+  memcpy(destination + USERNAME_LEN_OFFSET, &username_len, VARCHAR_LEN_SIZE);
+  memcpy(destination + USERNAME_OFFSET, source->username, username_len);
+  uint32_t email_len = strlen(source->email);
+  memcpy(destination + EMAIL_LEN_OFFSET, &email_len, VARCHAR_LEN_SIZE);
+  memcpy(destination + EMAIL_OFFSET, source->email, email_len);
 }

And deserialize_row() reads the length first, then copies only that many bytes:

 void deserialize_row(void* source, Row* destination) {
-  memcpy(&(destination->id), source + ID_OFFSET, ID_SIZE);
-  memcpy(&(destination->username), source + USERNAME_OFFSET, USERNAME_SIZE);
-  memcpy(&(destination->email), source + EMAIL_OFFSET, EMAIL_SIZE);
+  memcpy(&(destination->id), source + ID_OFFSET, ID_SIZE);
+  uint32_t username_len;
+  memcpy(&username_len, source + USERNAME_LEN_OFFSET, VARCHAR_LEN_SIZE);
+  memset(destination->username, 0, COLUMN_USERNAME_SIZE + 1);
+  memcpy(destination->username, source + USERNAME_OFFSET, username_len);
+  uint32_t email_len;
+  memcpy(&email_len, source + EMAIL_LEN_OFFSET, VARCHAR_LEN_SIZE);
+  memset(destination->email, 0, COLUMN_EMAIL_SIZE + 1);
+  memcpy(destination->email, source + EMAIL_OFFSET, email_len);
 }

The memset to zero before memcpy ensures the destination string is properly null-terminated. This is important because strlen() relies on that null byte.
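
Here is the same round trip for a single string field, as a stand-alone sketch. write_varchar() and read_varchar() are hypothetical helpers, FIELD_MAX stands in for COLUMN_USERNAME_SIZE, and the reader clamps the stored length, an extra safety check that deserialize_row() above does not perform.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FIELD_MAX 32 /* assumption: mirrors COLUMN_USERNAME_SIZE */
#define SLOT_SIZE (sizeof(uint32_t) + FIELD_MAX)

/* Write `text` into a fixed slot: 4-byte length, the bytes, zero padding. */
void write_varchar(const char* text, unsigned char* slot) {
  size_t len = strlen(text);
  assert(len <= FIELD_MAX);

  uint32_t stored_len = (uint32_t)len;
  memset(slot, 0, SLOT_SIZE);
  memcpy(slot, &stored_len, sizeof(stored_len));
  memcpy(slot + sizeof(stored_len), text, len);
}

/*
 * Read the slot back into `out`, which must hold FIELD_MAX + 1 bytes.
 * The stored length is clamped so that a corrupt slot cannot make us
 * write past the end of `out`. Returns the length actually used.
 */
uint32_t read_varchar(const unsigned char* slot, char* out) {
  uint32_t stored_len;
  memcpy(&stored_len, slot, sizeof(stored_len));
  if (stored_len > FIELD_MAX) {
    stored_len = FIELD_MAX;
  }
  memcpy(out, slot + sizeof(stored_len), stored_len);
  out[stored_len] = '\0';
  return stored_len;
}
```

Writing the terminator at out[stored_len] does the same job as zeroing the whole buffer first, and the clamp matters because a length read from disk is only as trustworthy as the file it came from.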

Actual vs Allocated Space

We can compute how much space a row actually uses versus how much it’s allocated:

+uint32_t row_data_size(Row* row) {
+  return ID_SIZE + VARCHAR_LEN_SIZE + strlen(row->username) + VARCHAR_LEN_SIZE +
+         strlen(row->email);
+}

For a row like (1, a, b@c.com), the actual data is only 4 + 4 + 1 + 4 + 7 = 20 bytes. But we allocate 299 bytes for it. That’s 279 bytes of wasted space!

The Elephant in the Room: Slotted Pages

Our cells still occupy fixed-size slots in the leaf node. Even though we serialize the data more efficiently, we pad the slot to ROW_SIZE. A real database solves this with a slotted page format:

+--------------------------------------------+
| Header | Ptr1 | Ptr2 | Ptr3 | ...          |
+--------------------------------------------+
|                 Free Space                 |
+--------------------------------------------+
| ... | Cell3 data | Cell2 data | Cell1 data |
+--------------------------------------------+

The cell pointer directory grows downward from the header. Actual cell data is packed upward from the bottom of the page. Each cell takes only as much space as it needs. The free space in the middle shrinks as you add cells.
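
For the curious, a slotted page can be sketched in a few dozen lines. This is a toy under stated assumptions, not the format used by SQLite or PostgreSQL: a 256-byte page, a 4-byte header (slot count, then the offset where cell data starts), 4-byte slots (offset, then length), and no deletion or compaction. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 256 /* small for illustration; our pages are 4096 */
enum { TOY_HEADER_SIZE = 4, TOY_SLOT_SIZE = 4 };

/* Read and write 16-bit fields with memcpy to avoid alignment issues. */
uint16_t get_u16(const unsigned char* page, size_t at) {
  uint16_t v;
  memcpy(&v, page + at, sizeof(v));
  return v;
}

void put_u16(unsigned char* page, size_t at, uint16_t v) {
  memcpy(page + at, &v, sizeof(v));
}

void page_init(unsigned char* page) {
  memset(page, 0, TOY_PAGE_SIZE);
  put_u16(page, 0, 0);             /* num_slots */
  put_u16(page, 2, TOY_PAGE_SIZE); /* data_start: no cells stored yet */
}

/* Store a cell of `len` bytes. Returns its slot index, or -1 if full. */
int page_insert(unsigned char* page, const void* cell, uint16_t len) {
  uint16_t num_slots = get_u16(page, 0);
  uint16_t data_start = get_u16(page, 2);

  /* Where the directory would end once our new slot is added. */
  size_t dir_end = TOY_HEADER_SIZE + ((size_t)num_slots + 1) * TOY_SLOT_SIZE;
  if (dir_end > data_start || len > data_start - dir_end) {
    return -1; /* the free space in the middle is too small */
  }

  /* Cell data is packed from the end of the page toward the front. */
  data_start = (uint16_t)(data_start - len);
  memcpy(page + data_start, cell, len);

  /* The directory grows from the header toward the back. */
  size_t slot_at = TOY_HEADER_SIZE + (size_t)num_slots * TOY_SLOT_SIZE;
  put_u16(page, slot_at, data_start);
  put_u16(page, slot_at + 2, len);
  put_u16(page, 0, (uint16_t)(num_slots + 1));
  put_u16(page, 2, data_start);
  return num_slots;
}

/* Look up a cell by slot index; its length is stored in *len_out. */
const unsigned char* page_cell(const unsigned char* page, uint16_t slot,
                               uint16_t* len_out) {
  assert(slot < get_u16(page, 0));
  size_t slot_at = TOY_HEADER_SIZE + (size_t)slot * TOY_SLOT_SIZE;
  *len_out = get_u16(page, slot_at + 2);
  return page + get_u16(page, slot_at);
}
```

A 3-byte username costs 3 bytes of cell data plus a 4-byte slot here, instead of a fixed 32-byte field. The catch is that every cell access goes through the directory, which is exactly the rewrite of our leaf node accessors mentioned in this section.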

This is how SQLite, PostgreSQL, and most real databases lay out their pages. We could implement this, but it would require rewriting every function that accesses leaf node cells. For now, our length-prefixed format gives us the serialization story without that complexity.

When Strings Outgrow a Page: Overflow Pages

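What about a TEXT column that holds a 10 KB blog post? It doesn’t fit in a single 4 KB page. Real databases handle this by storing large values out of line. SQLite keeps the start of the record in the cell and spills the rest into a linked chain of overflow pages once a table row no longer fits in a single page (index entries spill much earlier, at roughly a quarter of a page). PostgreSQL’s equivalent mechanism is TOAST, which compresses large values or moves them into a separate table once a row grows past about 2 KB.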

We don’t need overflow pages because our column sizes are bounded (32 and 255 bytes), but it’s worth knowing the pattern.

Testing

Short strings, long strings – they all work:

+  it 'handles variable-length strings correctly' do
+    script = [
+      "insert 1 a b@c.com",
+      "insert 2 longername longemail@example.com",
+      "select",
+      ".exit",
+    ]
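+    # run_script returns an array of lines, and the first row printed by
+    # select shares its line with the "db > " prompt, so match on substrings.
+    result = run_script(script).join("\n")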
+    expect(result).to include("(1, a, b@c.com)")
+    expect(result).to include("(2, longername, longemail@example.com)")
+  end

And our updated constants:

Constants:
ROW_SIZE: 299
COMMON_NODE_HEADER_SIZE: 6
LEAF_NODE_HEADER_SIZE: 14
LEAF_NODE_CELL_SIZE: 303
LEAF_NODE_SPACE_FOR_CELLS: 4082
LEAF_NODE_MAX_CELLS: 13
VARCHAR_LEN_SIZE: 4

Next time we’ll add secondary indexes – separate B-trees that let you look up rows by columns other than the primary key.

Part 20 - Secondary Indexes

Sat, 15 Jun 2024 00:00:00 +0000

Our WHERE clause can find rows by id efficiently because id is the primary key – it’s the key in our B-tree. But what if you want to find a user by their username? Right now, that means scanning every row. For a table with millions of rows, that’s unacceptable.

The answer is a secondary index: a separate data structure that maps a non-primary column to the primary key. Instead of scanning, you look up the username in the index, get back the id, then use the id to fetch the full row from the primary B-tree. Two lookups instead of a million.

How a Secondary Index Works

The primary B-tree stores rows keyed by id:

Primary B-tree: id -> (id, username, email)

A secondary index on username maps usernames to primary keys:

Username index: hash(username) -> id

To look up username = "bob":

Hash “bob” to get a key
Search the index for that key -> get id = 2
Search the primary B-tree for id = 2 -> get the full row

The Hash Function

We need a hash function to turn strings into integer keys. We’ll use djb2, a simple and effective string hash:

+uint32_t hash_string(const char* str) {
+  uint32_t hash = 5381;
+  int c;
+  while ((c = *str++)) {
+    hash = ((hash << 5) + hash) + c;
+  }
+  return hash;
+}

Index Page Format

For simplicity, we’ll store our index as a sorted array of (hash, primary_key) pairs packed into a single page. Each entry is 8 bytes, giving us room for 511 entries:

+const uint32_t INDEX_ENTRY_SIZE = 2 * sizeof(uint32_t);
+const uint32_t INDEX_HEADER_SIZE = sizeof(uint32_t);
+const uint32_t INDEX_MAX_ENTRIES =
+    (PAGE_SIZE - INDEX_HEADER_SIZE) / INDEX_ENTRY_SIZE;

A real database would use a full B-tree for the index (just like the primary table). We’re using a flat sorted array for clarity, but the concept is the same: a separate data structure that maps column values to primary keys.
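The index code in this part reads and writes entries through three accessors: index_num_entries(), index_entry_hash() and index_entry_pk(). Their definitions are not shown, so here is one plausible sketch. It assumes the page begins with the 4-byte entry count, followed by tightly packed (hash, primary_key) pairs. The constants are repeated as macros only so the snippet compiles on its own.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u
#define INDEX_HEADER_SIZE ((uint32_t)sizeof(uint32_t))
#define INDEX_ENTRY_SIZE ((uint32_t)(2 * sizeof(uint32_t)))
#define INDEX_MAX_ENTRIES ((PAGE_SIZE - INDEX_HEADER_SIZE) / INDEX_ENTRY_SIZE)

/* The entry count lives in the first 4 bytes of the page. */
uint32_t* index_num_entries(void* page) {
  return (uint32_t*)page;
}

/* The hash is the first half of entry i. */
uint32_t* index_entry_hash(void* page, uint32_t i) {
  return (uint32_t*)((char*)page + INDEX_HEADER_SIZE + i * INDEX_ENTRY_SIZE);
}

/* The primary key is the second half of entry i. */
uint32_t* index_entry_pk(void* page, uint32_t i) {
  return index_entry_hash(page, i) + 1;
}
```

With this layout, entry 510 (the last of the 511) ends at byte 4092, leaving the final 4 bytes of the page unused.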

Index Operations

Inserting into the index uses binary search to find the right position, then shifts entries right:

+void index_insert(Table* table, uint32_t hash_key, uint32_t primary_key) {
+  if (!table->has_index) return;
+  void* page = get_page(table->pager, table->index_page_num);
+  pager_mark_dirty(table->pager, table->index_page_num);
+  uint32_t num = *index_num_entries(page);
+  if (num >= INDEX_MAX_ENTRIES) {
+    printf("Index page is full.\n");
+    exit(EXIT_FAILURE);
+  }
+
+  uint32_t lo = 0, hi = num;
+  while (lo < hi) {
+    uint32_t mid = (lo + hi) / 2;
+    if (*index_entry_hash(page, mid) < hash_key) {
+      lo = mid + 1;
+    } else {
+      hi = mid;
+    }
+  }
+
+  for (uint32_t i = num; i > lo; i--) {
+    *index_entry_hash(page, i) = *index_entry_hash(page, i - 1);
+    *index_entry_pk(page, i) = *index_entry_pk(page, i - 1);
+  }
+
+  *index_entry_hash(page, lo) = hash_key;
+  *index_entry_pk(page, lo) = primary_key;
+  *index_num_entries(page) = num + 1;
+}

Lookup also uses binary search – O(log n):

+uint32_t index_find(Table* table, uint32_t hash_key) {
+  if (!table->has_index) return INVALID_PAGE_NUM;
+  void* page = get_page(table->pager, table->index_page_num);
+  uint32_t num = *index_num_entries(page);
+
+  uint32_t lo = 0, hi = num;
+  while (lo < hi) {
+    uint32_t mid = (lo + hi) / 2;
+    if (*index_entry_hash(page, mid) < hash_key) {
+      lo = mid + 1;
+    } else {
+      hi = mid;
+    }
+  }
+
+  if (lo < num && *index_entry_hash(page, lo) == hash_key) {
+    return *index_entry_pk(page, lo);
+  }
+  return INVALID_PAGE_NUM;
+}

Creating the Index

The create index on username command allocates a new page, scans the entire table, and populates the index:

+  } else if (strcmp(input_buffer->buffer, "create index on username") == 0) {
+    table->index_page_num = get_unused_page_num(table->pager);
+    void* index_page = get_page(table->pager, table->index_page_num);
+    memset(index_page, 0, PAGE_SIZE);
+    pager_mark_dirty(table->pager, table->index_page_num);
+    table->has_index = true;
+
+    Cursor* cursor = table_start(table);
+    Row row;
+    while (!(cursor->end_of_table)) {
+      deserialize_row(cursor_value(cursor), &row);
+      index_insert(table, hash_string(row.username), row.id);
+      cursor_advance(cursor);
+    }
+    free(cursor);
+    printf("Index created on username.\n");

Using the Index

When we see select where username = bob, we hash “bob”, look it up in the index, and use the returned primary key to fetch the full row:

+    case WHERE_USERNAME_EQ: {
+      uint32_t hash = hash_string(statement->where_username);
+      uint32_t pk = index_find(table, hash);
+      if (pk != INVALID_PAGE_NUM) {
+        cursor = table_find(table, pk);
+        void* node = get_page(table->pager, cursor->page_num);
+        uint32_t num_cells = *leaf_node_num_cells(node);
+        if (cursor->cell_num < num_cells) {
+          deserialize_row(cursor_value(cursor), &row);
+          if (strcmp(row.username, statement->where_username) == 0) {
+            print_row(&row);
+          }
+        }
+        free(cursor);
+      }
+      return EXECUTE_SUCCESS;
+    }

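Notice the strcmp check after the index lookup. Hash collisions are possible – two different usernames could hash to the same value. The strcmp confirms we actually found the right row. Our index_find() returns only the first entry with a matching hash, so when two usernames collide the lookup can miss the row we want. A production index would visit every colliding entry and check each of them.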
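Because the index is sorted by hash, entries that collide sit next to each other, so checking all of them is a short forward scan from the first match. The sketch below shows the idea on a plain in-memory array. `IndexEntry`, `index_lower_bound()` and `index_collect()` are hypothetical names, not part of our code.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint32_t hash;
  uint32_t pk;
} IndexEntry;

/* Binary search: position of the first entry whose hash is >= hash_key. */
size_t index_lower_bound(const IndexEntry* entries, size_t num,
                         uint32_t hash_key) {
  size_t lo = 0, hi = num;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (entries[mid].hash < hash_key) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

/* Copy up to max_out primary keys that share hash_key; return the count. */
size_t index_collect(const IndexEntry* entries, size_t num, uint32_t hash_key,
                     uint32_t* out, size_t max_out) {
  size_t count = 0;
  for (size_t i = index_lower_bound(entries, num, hash_key);
       i < num && entries[i].hash == hash_key && count < max_out; i++) {
    out[count++] = entries[i].pk;
  }
  return count;
}
```

The caller would then fetch each candidate row from the primary B-tree and keep the one whose username passes the strcmp check.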

Maintaining the Index

Every insert also inserts into the index:

   leaf_node_insert(cursor, row_to_insert->id, row_to_insert);
+  index_insert(table, hash_string(row_to_insert->username), row_to_insert->id);

The same applies for delete. Any write to the primary table must be reflected in all secondary indexes – this is the maintenance cost of indexes. More indexes mean faster reads but slower writes.
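The delete side is not shown in this part. Here is a minimal sketch of what it might look like, again on an in-memory array with hypothetical names (`IndexEntry`, `index_remove()`): find the matching (hash, primary key) pair, then shift the entries after it left by one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  uint32_t hash;
  uint32_t pk;
} IndexEntry;

/* Remove the entry matching both hash_key and primary_key.
   Returns true if an entry was removed; *num is updated in place. */
bool index_remove(IndexEntry* entries, size_t* num, uint32_t hash_key,
                  uint32_t primary_key) {
  for (size_t i = 0; i < *num; i++) {
    if (entries[i].hash == hash_key && entries[i].pk == primary_key) {
      /* Close the gap so the array stays sorted and contiguous. */
      memmove(&entries[i], &entries[i + 1],
              (*num - i - 1) * sizeof(IndexEntry));
      (*num)--;
      return true;
    }
  }
  return false;
}
```

Matching on the primary key as well as the hash matters when two usernames collide: only the entry for the deleted row should go. The linear scan keeps the sketch short; a binary search to the first matching hash would work too.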

Testing

+  it 'creates an index and looks up by username' do
+    script = [
+      "insert 1 alice alice@example.com",
+      "insert 2 bob bob@example.com",
+      "insert 3 charlie charlie@example.com",
+      "create index on username",
+      "select where username = bob",
+      ".exit",
+    ]
+    result = run_script(script)
+    expect(result).to include("Index created on username.")
+    expect(result).to include("(2, bob, bob@example.com)")
+    expect(result).not_to include("(1, alice, alice@example.com)")
+  end

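Without the index, finding “bob” requires scanning every row. With the index, it’s two O(log n) lookups: one in the index, one in the primary B-tree. For a million rows, that’s on the order of 20 key comparisons per lookup and a handful of page reads, instead of reading every leaf page. (Our single-page index tops out at 511 entries, so reaching that scale would need the B-tree-backed index mentioned earlier.)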

Next time we’ll add transactions so that a sequence of changes can be committed atomically or rolled back entirely.

Part 21 - Transactions

Mon, 01 Jul 2024 00:00:00 +0000

“Either all of it happens, or none of it does.” – the essence of atomicity

Until now, every statement we execute takes effect immediately and permanently. If you insert three rows, they’re all committed right away. If your program crashes halfway through a batch of inserts, you get a partially-updated database. That’s not great.

Real databases support transactions: a group of operations that either all succeed (commit) or all fail (rollback). This is the “A” in ACID – Atomicity.

The Commands

We’ll add three new statements:

begin – start a transaction
commit – make all changes since begin permanent
rollback – undo all changes since begin

 typedef enum {
   STATEMENT_INSERT,
   STATEMENT_SELECT,
-  STATEMENT_DELETE
+  STATEMENT_DELETE,
+  STATEMENT_BEGIN,
+  STATEMENT_COMMIT,
+  STATEMENT_ROLLBACK
 } StatementType;

The Undo Log

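Our approach is a simplified take on shadow paging: before modifying a page during a transaction, we save a copy of its original state. On commit, we throw away those copies (the changes are already in memory and will be flushed). On rollback, we restore the copies, effectively rewinding time. (Textbook shadow paging writes changes to fresh pages and swaps a page table on commit; we keep before-images in memory instead, which is closer to an undo log.)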

We add an undo log to the Pager:

 typedef struct {
   int file_descriptor;
   uint32_t file_length;
   uint32_t num_pages;
   void* pages[TABLE_MAX_PAGES];
   bool dirty[TABLE_MAX_PAGES];
   uint32_t access_time[TABLE_MAX_PAGES];
   uint32_t clock;
+  bool in_transaction;
+  #define MAX_UNDO_PAGES 64
+  uint32_t undo_page_nums[MAX_UNDO_PAGES];
+  void* undo_pages[MAX_UNDO_PAGES];
+  uint32_t num_undo_pages;
 } Pager;

The key insight: we hook into pager_mark_dirty(). This function is already called before every page modification. We piggyback on it to save the undo copy:

 void pager_mark_dirty(Pager* pager, uint32_t page_num) {
+  if (pager->in_transaction) {
+    bool already_saved = false;
+    for (uint32_t i = 0; i < pager->num_undo_pages; i++) {
+      if (pager->undo_page_nums[i] == page_num) {
+        already_saved = true;
+        break;
+      }
+    }
+    if (!already_saved && pager->num_undo_pages < MAX_UNDO_PAGES) {
+      void* copy = malloc(PAGE_SIZE);
+      memcpy(copy, pager->pages[page_num], PAGE_SIZE);
+      uint32_t idx = pager->num_undo_pages++;
+      pager->undo_page_nums[idx] = page_num;
+      pager->undo_pages[idx] = copy;
+    }
+  }
   pager->dirty[page_num] = true;
 }

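The first time a page is marked dirty during a transaction, we save a snapshot of its current (pre-modification) state. If the same page is modified again, we skip it – we already have the original saved. One caveat: once all MAX_UNDO_PAGES slots are in use, further pages are modified without a snapshot, so a transaction that touches more than 64 pages cannot be fully rolled back.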

Commit and Rollback

Commit is easy – just free the undo copies and exit the transaction. The modified pages are already in the buffer pool and will be flushed to disk when the database closes:

+ExecuteResult execute_commit(Statement* statement, Table* table) {
+  Pager* pager = table->pager;
+  for (uint32_t i = 0; i < pager->num_undo_pages; i++) {
+    free(pager->undo_pages[i]);
+  }
+  pager->num_undo_pages = 0;
+  pager->in_transaction = false;
+  return EXECUTE_SUCCESS;
+}

Rollback is the reverse – restore each undo copy:

+ExecuteResult execute_rollback(Statement* statement, Table* table) {
+  Pager* pager = table->pager;
+  for (uint32_t i = 0; i < pager->num_undo_pages; i++) {
+    uint32_t page_num = pager->undo_page_nums[i];
+    memcpy(pager->pages[page_num], pager->undo_pages[i], PAGE_SIZE);
+    /* Leave the dirty flag set: the page may have been dirty before begin */
+    free(pager->undo_pages[i]);
+  }
+  pager->num_undo_pages = 0;
+  pager->in_transaction = false;
+  return EXECUTE_SUCCESS;
+}

After rollback, the in-memory pages are exactly as they were before the transaction started. We leave the dirty flags alone: a page may already have been dirty when the transaction began (like the page holding row 1 in the test below), and clearing its flag would keep those earlier changes from ever reaching disk. Writing a restored page back is harmless.
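The begin handler is not shown above. The following stand-alone sketch packs the whole mechanism into a tiny in-memory pager, so the begin / modify / rollback cycle can be run in isolation. `MiniPager`, `mini_begin()`, `mini_mark_dirty()` and `mini_rollback()` are simplified stand-ins rather than the real functions, and the page size is shrunk to keep things small.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MINI_PAGE_SIZE 16
#define MINI_NUM_PAGES 4
#define MINI_MAX_UNDO 4

typedef struct {
  char pages[MINI_NUM_PAGES][MINI_PAGE_SIZE];
  bool in_transaction;
  uint32_t undo_page_nums[MINI_MAX_UNDO];
  char* undo_pages[MINI_MAX_UNDO];
  uint32_t num_undo_pages;
} MiniPager;

/* Assumed shape of "begin": start with an empty undo log. */
void mini_begin(MiniPager* p) {
  p->in_transaction = true;
  p->num_undo_pages = 0;
}

/* Save a before-image the first time a page is touched in a transaction. */
void mini_mark_dirty(MiniPager* p, uint32_t page_num) {
  if (!p->in_transaction) return;
  for (uint32_t i = 0; i < p->num_undo_pages; i++) {
    if (p->undo_page_nums[i] == page_num) return; /* already saved */
  }
  if (p->num_undo_pages >= MINI_MAX_UNDO) abort(); /* never lose undo data */
  char* copy = malloc(MINI_PAGE_SIZE);
  if (copy == NULL) abort();
  memcpy(copy, p->pages[page_num], MINI_PAGE_SIZE);
  p->undo_page_nums[p->num_undo_pages] = page_num;
  p->undo_pages[p->num_undo_pages] = copy;
  p->num_undo_pages++;
}

/* Restore every saved before-image and leave the transaction. */
void mini_rollback(MiniPager* p) {
  for (uint32_t i = 0; i < p->num_undo_pages; i++) {
    memcpy(p->pages[p->undo_page_nums[i]], p->undo_pages[i], MINI_PAGE_SIZE);
    free(p->undo_pages[i]);
  }
  p->num_undo_pages = 0;
  p->in_transaction = false;
}
```

Writing "before" into a page, calling mini_begin(), marking the page dirty, overwriting it twice and then calling mini_rollback() leaves the page reading "before" again, with only one snapshot ever taken. Unlike the code above, this sketch aborts when the undo log is full instead of skipping the snapshot, since a missing snapshot would make the rollback incomplete.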

Testing

Let’s verify that rollback actually undoes changes:

+  it 'rolls back a transaction' do
+    script = [
+      "insert 1 user1 person1@example.com",
+      "begin",
+      "insert 2 user2 person2@example.com",
+      "insert 3 user3 person3@example.com",
+      "rollback",
+      "select",
+      ".exit",
+    ]
+    result = run_script(script)
+    expect(result).to include("(1, user1, person1@example.com)")
+    expect(result).not_to include("(2, user2, person2@example.com)")
+    expect(result).not_to include("(3, user3, person3@example.com)")
+  end

Row 1 was inserted before the transaction, so it survives. Rows 2 and 3 were inserted during the transaction, so rollback erases them.

And commit makes things permanent:

+  it 'commits a transaction' do
+    script = [
+      "begin",
+      "insert 1 user1 person1@example.com",
+      "insert 2 user2 person2@example.com",
+      "commit",
+      "select",
+      ".exit",
+    ]
+    result = run_script(script)
+    expect(result).to include("(1, user1, person1@example.com)")
+    expect(result).to include("(2, user2, person2@example.com)")
+  end

Limitations

Our transaction implementation is minimal but teaches the core concept. A real database adds:

Durability: committed transactions survive crashes (we’ll add WAL next)
Isolation: concurrent readers don’t see uncommitted changes
Nested transactions / savepoints: rolling back to intermediate points
Write-ahead logging: instead of shadow paging, log changes before applying them

That last point is particularly important. Our page copies work, but they live only in memory, so they vanish in a crash, and they duplicate an entire 4 KB page even if only a few bytes changed. A write-ahead log lives on disk, where it can survive a crash – and that’s what we’ll implement next.

Part 22 - Write-Ahead Logging

Mon, 15 Jul 2024 00:00:00 +0000

In the last part we added transactions with rollback. But there’s still a durability problem: if the program crashes while writing a page to disk, the database file could be left in a corrupted state – half-written data where a valid page used to be.

The solution is the write-ahead log (WAL). The rule is simple: before writing a page to the main database file, first write it to a separate log file. If the program crashes mid-write, the log has a complete copy of the page that can be replayed on the next startup.

This is the “D” in ACID – Durability.

The WAL File

We store the WAL as a separate file alongside the database, named <dbfile>-wal. Each record in the WAL is a (page_number, page_data) pair:

| page_num (4 bytes) | page_data (4096 bytes) | page_num | page_data | ...
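Since every record has the same size, the position of a record and the number of complete records in a file follow from simple arithmetic. A small sketch, with hypothetical helper names (`wal_record_offset()`, `wal_complete_records()`):

```c
#include <stdint.h>

#define PAGE_SIZE 4096u
/* 4-byte page number followed by one full page of data. */
#define WAL_RECORD_SIZE ((uint64_t)sizeof(uint32_t) + PAGE_SIZE)

/* Byte offset of the i-th record (0-based) in the WAL file. */
uint64_t wal_record_offset(uint64_t i) {
  return i * WAL_RECORD_SIZE;
}

/* Complete records in a WAL of the given size; a trailing partial record
   (left by a crash mid-append) is not counted. */
uint64_t wal_complete_records(uint64_t wal_size) {
  return wal_size / WAL_RECORD_SIZE;
}
```

Each record is 4,100 bytes, so the third record starts at byte 8,200, and a file of 8,300 bytes holds two complete records plus 100 bytes of a torn one.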

We add a WAL file descriptor and filename to the Pager:

 typedef struct {
   ...
+  int wal_fd;
+  char wal_filename[256];
 } Pager;

Writing to the WAL

Before every write to the main database file, we append the page to the WAL:

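+void wal_write(Pager* pager, uint32_t page_num) {
+  if (pager->wal_fd == -1) return;
+  lseek(pager->wal_fd, 0, SEEK_END);
+  ssize_t n1 = write(pager->wal_fd, &page_num, sizeof(uint32_t));
+  ssize_t n2 = write(pager->wal_fd, pager->pages[page_num], PAGE_SIZE);
+  if (n1 != (ssize_t)sizeof(uint32_t) || n2 != (ssize_t)PAGE_SIZE) {
+    printf("Error writing to WAL: %d\n", errno);
+    exit(EXIT_FAILURE);
+  }
+  /* Force the record to disk before the database file is touched */
+  fsync(pager->wal_fd);
+}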

And we add a call to wal_write() at the beginning of pager_flush():

 void pager_flush(Pager* pager, uint32_t page_num) {
+  wal_write(pager, page_num);
   ...
   // then write to the database file as before

Now the WAL contains a complete copy of every page before it hits the database file. If the database file gets corrupted, the WAL has what we need.

Crash Recovery

On startup, we check if the WAL file has any records. If it does, we replay them – writing each page from the WAL into the correct location in the database file:

+void wal_replay(Pager* pager) {
+  if (pager->wal_fd == -1) return;
+  off_t wal_size = lseek(pager->wal_fd, 0, SEEK_END);
+  if (wal_size <= 0) return;
+
+  printf("Replaying WAL (%d records)...\n",
+         (int)(wal_size / (sizeof(uint32_t) + PAGE_SIZE)));
+  lseek(pager->wal_fd, 0, SEEK_SET);
+
+  while (1) {
+    uint32_t page_num;
+    ssize_t n = read(pager->wal_fd, &page_num, sizeof(uint32_t));
+    if (n <= 0) break;
+    void* page_data = malloc(PAGE_SIZE);
+    n = read(pager->wal_fd, page_data, PAGE_SIZE);
+    if (n < (ssize_t)PAGE_SIZE) {
+      free(page_data);
+      break;
+    }
+    lseek(pager->file_descriptor, page_num * PAGE_SIZE, SEEK_SET);
+    write(pager->file_descriptor, page_data, PAGE_SIZE);
+    free(page_data);
+  }
+
+  /* Make the replayed pages durable, then clear the WAL */
+  fsync(pager->file_descriptor);
+  close(pager->wal_fd);
+  pager->wal_fd =
+      open(pager->wal_filename, O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+}

If the WAL replay finds a truncated record (from a crash during WAL write), it stops. The partially-written WAL record is discarded, but all complete records before it are applied. This guarantees that any page write that completed in the WAL will be recovered.
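The stop-at-the-first-torn-record rule can be tried out without touching any files. The sketch below replays a WAL image held in a memory buffer into an in-memory "database". The names, the shrunken page size, and the extra check that rejects out-of-range page numbers are all assumptions of this example, not part of wal_replay().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 64 /* shrunk from 4096 to keep the demo small */
#define DEMO_RECORD_SIZE (sizeof(uint32_t) + DEMO_PAGE_SIZE)

/* Walk a WAL image held in memory and apply each complete record to db
   (an array of num_pages pages). Stops at the first torn record.
   Returns the number of records applied. */
size_t wal_replay_buffer(const unsigned char* wal, size_t wal_size,
                         unsigned char* db, uint32_t num_pages) {
  size_t applied = 0;
  size_t pos = 0;
  while (wal_size - pos >= DEMO_RECORD_SIZE) {
    uint32_t page_num;
    memcpy(&page_num, wal + pos, sizeof(page_num));
    if (page_num >= num_pages) break; /* refuse out-of-range page numbers */
    memcpy(db + (size_t)page_num * DEMO_PAGE_SIZE,
           wal + pos + sizeof(page_num), DEMO_PAGE_SIZE);
    pos += DEMO_RECORD_SIZE;
    applied++;
  }
  return applied;
}
```

Given a buffer holding two whole records followed by ten stray bytes, the function applies both records and ignores the tail, which mirrors what wal_replay() does when it hits a short read.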

Checkpointing

When the database is closed cleanly, we’ve already flushed all dirty pages. The WAL is no longer needed, so we clear it:

+void wal_checkpoint(Pager* pager) {
+  if (pager->wal_fd == -1) return;
+  close(pager->wal_fd);
+  pager->wal_fd =
+      open(pager->wal_filename, O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+}

This is called in db_close() after all dirty pages are flushed:

+  wal_checkpoint(pager);
+  if (pager->wal_fd != -1) {
+    close(pager->wal_fd);
+  }
   int result = close(pager->file_descriptor);

The Write Path

Let’s trace what happens on an insert now:

1. leaf_node_insert() modifies the page in memory
2. pager_mark_dirty() marks it for write-back (and saves an undo copy if in a transaction)
3. On db_close(), pager_flush() is called for each dirty page
4. pager_flush() first calls wal_write() – the page goes to the WAL file
5. Then pager_flush() writes the page to the database file
6. After all pages are flushed, wal_checkpoint() clears the WAL

If the program crashes between steps 4 and 5, the WAL has the page. On the next startup, wal_replay() writes it to the database file, and the data is not lost.

How SQLite Does It

SQLite’s WAL mode is more sophisticated. Instead of writing to the WAL before the database file, it writes only to the WAL during normal operation. Readers check both the WAL and the database file. Periodically, a checkpoint operation transfers WAL pages to the database file. This allows concurrent readers and writers, which our simple implementation doesn’t support.
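The "readers check both" idea can be modelled in a few lines. This is a toy model under loud assumptions: the WAL is an in-memory array of frames, the newest frame for a page wins, and the lookup is a linear scan from the end. The names `WalFrame` and `read_page()` are made up for this sketch, and SQLite itself keeps a separate index over the WAL so that readers do not have to scan it like this.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 8

typedef struct {
  uint32_t page_num;
  unsigned char data[DEMO_PAGE_SIZE];
} WalFrame;

/* Return the newest version of page_num: the last matching WAL frame if
   there is one, otherwise the copy in the main database image. */
const unsigned char* read_page(const WalFrame* wal, size_t num_frames,
                               const unsigned char* db, uint32_t page_num) {
  for (size_t i = num_frames; i > 0; i--) {
    if (wal[i - 1].page_num == page_num) {
      return wal[i - 1].data;
    }
  }
  return db + (size_t)page_num * DEMO_PAGE_SIZE;
}
```

A checkpoint in this model would copy the newest frame of each page into db and then empty the frame array, after which every read falls through to the database image again.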

But the core principle is the same: write changes to a log first, ensure the log is durable, then apply the changes. If anything goes wrong, the log tells you how to recover.

That wraps up the ACID properties we set out to implement. We have Atomicity (transactions), and now Durability (WAL). Consistency and Isolation are beyond what our single-connection database enforces. We’ll talk about what we’ve built and where to go from here in the next and final part.

Part 23 - Wrapping Up

Thu, 01 Aug 2024 00:00:00 +0000

“What I cannot create, I do not understand.” – Richard Feynman

We started with a question: how does a database work? And to answer it, we built one. Let’s take a step back and look at what we’ve created.

What We Built

Our database – all of it in a single C file – implements:

Storage engine:

A B+ tree with leaf and internal nodes, supporting insert, delete, and search
Leaf node splitting and internal node splitting to grow the tree
Rebalancing via borrowing and merging to shrink the tree
Tree height reduction when the root becomes unnecessary
Sibling pointers for efficient sequential scans across leaves

Persistence:

File-backed page storage with a buffer pool
Dirty page tracking so we only write back what changed
LRU eviction to bound memory usage
Write-ahead logging for crash recovery

Query processing:

A REPL that parses SQL-like commands: insert, select, delete
A WHERE clause with equality, greater-than, and less-than predicates on the primary key
Point queries that use O(log n) B-tree search instead of full table scans
Range scans that exploit the sorted key order

Indexing:

A primary B-tree index (the table itself)
A secondary index on username using a hash-based lookup
Index maintenance on insert and delete

Transactions:

BEGIN, COMMIT, and ROLLBACK statements
Shadow paging for undo: page copies saved before modification
Atomic rollback by restoring saved page copies

Data format:

Length-prefixed variable-length string serialization
Fixed-size cell slots in leaf nodes with zero-padded strings

That’s a lot of database. Not a toy, either – the fundamentals are the same ones used by SQLite, PostgreSQL, and MySQL. B-trees, page caches, WAL, secondary indexes – these aren’t academic curiosities. They’re what makes your favorite database tick.

The Architecture

Here’s how our components map to the SQLite architecture we looked at in Part 1:

SQLite Component	Our Implementation
Tokenizer / Parser	`prepare_statement()`, `prepare_insert()`, `prepare_delete()`
Code Generator	`execute_statement()` switch
Virtual Machine	`execute_insert()`, `execute_select()`, `execute_delete()`
B-Tree	`leaf_node_*()`, `internal_node_*()`, `table_find()`
Pager	`get_page()`, `pager_flush()`, LRU eviction
OS Interface	`open()`, `read()`, `write()`, `lseek()`

We skipped the bytecode layer (our “VM” calls functions directly), but the layering is the same.

What a Real Database Adds

There’s always more to build. Here are the biggest things a production database has that we don’t:

Multiple tables and joins. We have one hardcoded table. A real database has a schema catalog, multiple B-trees (one per table), and join algorithms (nested loop, hash join, sort-merge) for combining data across tables.

A query planner. We always use the B-tree index for primary key lookups and do a full scan otherwise. A real database estimates the cost of different access paths and picks the cheapest one. Sometimes a full scan beats an index scan (e.g., when selecting most of the table).

Concurrency control. We support one connection at a time. Real databases handle many concurrent readers and writers using locks, multiversion concurrency control (MVCC), or both.

A proper SQL parser. Our parser uses strcmp and strtok. A real parser uses a grammar (often generated by tools like Lemon or Bison) to handle the full SQL syntax.

Page compaction and free space management. When we delete rows, the space isn’t reclaimed for reuse. A real database maintains a free page list and compacts pages to avoid fragmentation.

Recovery beyond WAL. Our WAL is simple redo logging. Real databases combine redo and undo logging (ARIES protocol), support checkpoints that bound recovery time, and handle partial page writes.

What We Learned

Building a database from scratch taught us:

Why B-trees? Because disk I/O is expensive, and B-trees minimize it. A tree with a branching factor of 1,000 can index a billion rows in 3 levels – 3 page reads to find any row.
Why pages? Because disks read in fixed-size blocks. By aligning our data structures to page boundaries, we make every I/O operation useful.
Why write-ahead logging? Because writes can fail. By logging before applying, we ensure that committed data survives crashes.
Why indexes? Because scanning every row is O(n). An index makes point queries O(log n) – the difference between milliseconds and minutes.
Why transactions? Because partial updates are worse than no update. Atomicity ensures all-or-nothing semantics.
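The B-tree arithmetic is easy to check. The sketch below computes the smallest number of levels whose capacity (fanout raised to the number of levels) covers a given row count. `btree_levels()` is a hypothetical helper, and it assumes every node is completely full, which real trees are not.

```c
#include <stdint.h>

/* Smallest number of levels such that fanout^levels >= rows.
   Assumes completely full nodes; returns 0 for a fanout below 2. */
uint32_t btree_levels(uint64_t fanout, uint64_t rows) {
  if (fanout < 2) return 0;
  uint32_t levels = 0;
  uint64_t capacity = 1;
  while (capacity < rows) {
    /* Guard against overflow before multiplying: if the next level would
       exceed 64 bits, it certainly covers rows. */
    if (capacity > UINT64_MAX / fanout) {
      return levels + 1;
    }
    capacity *= fanout;
    levels++;
  }
  return levels;
}
```

A fanout of 1,000 reaches a billion rows in 3 levels, while a fanout of 500 tops out at 125 million rows in 3 levels and needs a fourth for a billion.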

These aren’t just database concepts. They’re fundamental computer science – the trade-offs between memory and disk, consistency and performance, simplicity and scalability.

Thank You

If you’ve followed along this far, you’ve done something remarkable. You’ve read thousands of lines of C, understood B-tree splitting and merging, implemented crash recovery, and built something that actually stores and retrieves data reliably. That’s not trivial.

The source code is yours to explore, extend, and break. Add multiple tables. Implement joins. Build a proper parser. Or just read through the code and see how it all fits together. The best way to learn is to build, and now you have a foundation to build on.

Happy building!
