<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Let&apos;s Build a Simple Database</title>
    <description>Writing a sqlite clone from scratch in C</description>
    <link>https://ibra.github.io/db_tutorial</link>
    <atom:link href="https://ibra.github.io/db_tutorial/feed.xml" rel="self" type="application/rss+xml" />
    
    
    
      <item>
        <title>Part 14 - Splitting Internal Nodes</title>
        <description>&lt;p&gt;The next leg of our journey will be splitting internal nodes which are unable to accommodate new keys. Consider the example below:&lt;/p&gt;

&lt;table class=&quot;image&quot;&gt;
&lt;caption align=&quot;bottom&quot;&gt;Example of splitting an internal node&lt;/caption&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https://ibra.github.io/db_tutorial/assets/images/splitting-internal-node.png&quot;&gt;&lt;img src=&quot;https://ibra.github.io/db_tutorial/assets/images/splitting-internal-node.png&quot; alt=&quot;Example of splitting an internal node&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;In this example, we add the key “11” to the tree. This will cause our root to split. When splitting an internal node, we will have to do a few things in order to keep everything straight:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create a sibling node to store (n-1)/2 of the original node’s keys&lt;/li&gt;
  &lt;li&gt;Move these keys from the original node to the sibling node&lt;/li&gt;
  &lt;li&gt;Update the original node’s key in the parent to reflect its new max key after splitting&lt;/li&gt;
  &lt;li&gt;Insert the sibling node into the parent (could result in the parent also being split)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We will begin by replacing our stub code with a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal_node_split_and_insert&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void internal_node_split_and_insert(Table* table, uint32_t parent_page_num,
+                          uint32_t child_page_num);
+
&lt;/span&gt; void internal_node_insert(Table* table, uint32_t parent_page_num,
                           uint32_t child_page_num) {
   /*
&lt;span class=&quot;p&quot;&gt;@@ -685,25 +714,39 @@&lt;/span&gt; void internal_node_insert(Table* table, uint32_t parent_page_num,
 
   void* parent = get_page(table-&amp;gt;pager, parent_page_num);
   void* child = get_page(table-&amp;gt;pager, child_page_num);
&lt;span class=&quot;gd&quot;&gt;-  uint32_t child_max_key = get_node_max_key(child);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+  uint32_t child_max_key = get_node_max_key(table-&amp;gt;pager, child);
&lt;/span&gt;   uint32_t index = internal_node_find_child(parent, child_max_key);
 
   uint32_t original_num_keys = *internal_node_num_keys(parent);
&lt;span class=&quot;gd&quot;&gt;-  *internal_node_num_keys(parent) = original_num_keys + 1;
&lt;/span&gt; 
   if (original_num_keys &amp;gt;= INTERNAL_NODE_MAX_CELLS) {
&lt;span class=&quot;gd&quot;&gt;-    printf(&quot;Need to implement splitting internal node\n&quot;);
-    exit(EXIT_FAILURE);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+    internal_node_split_and_insert(table, parent_page_num, child_page_num);
+    return;
&lt;/span&gt;   }
 
   uint32_t right_child_page_num = *internal_node_right_child(parent);
&lt;span class=&quot;gi&quot;&gt;+  /*
+  An internal node with a right child of INVALID_PAGE_NUM is empty
+  */
+  if (right_child_page_num == INVALID_PAGE_NUM) {
+    *internal_node_right_child(parent) = child_page_num;
+    return;
+  }
+
&lt;/span&gt;   void* right_child = get_page(table-&amp;gt;pager, right_child_page_num);
&lt;span class=&quot;gi&quot;&gt;+  /*
+  If we are already at the max number of cells for a node, we cannot increment
+  before splitting. Incrementing without inserting a new key/child pair
+  and immediately calling internal_node_split_and_insert has the effect
+  of creating a new key at (max_cells + 1) with an uninitialized value
+  */
+  *internal_node_num_keys(parent) = original_num_keys + 1;
&lt;/span&gt; 
&lt;span class=&quot;gd&quot;&gt;-  if (child_max_key &amp;gt; get_node_max_key(right_child)) {
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+  if (child_max_key &amp;gt; get_node_max_key(table-&amp;gt;pager, right_child)) {
&lt;/span&gt;     /* Replace right child */
     *internal_node_child(parent, original_num_keys) = right_child_page_num;
     *internal_node_key(parent, original_num_keys) =
&lt;span class=&quot;gd&quot;&gt;-        get_node_max_key(right_child);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+        get_node_max_key(table-&amp;gt;pager, right_child);
&lt;/span&gt;     *internal_node_right_child(parent) = child_page_num;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are three important changes we are making here aside from replacing the stub:&lt;/p&gt;
&lt;ul&gt;
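  &lt;li&gt;First, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal_node_split_and_insert&lt;/code&gt; is forward-declared because the two functions call each other: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal_node_insert&lt;/code&gt; now calls it when the parent is full, and its definition in turn calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal_node_insert&lt;/code&gt; to avoid code duplication.&lt;/li&gt;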
  &lt;li&gt;In addition, we are moving the logic which increments the parent’s number of keys further down in the function definition to ensure that this does not happen before the split.&lt;/li&gt;
  &lt;li&gt;Finally, we are ensuring that a child node inserted into an empty internal node will become that internal node’s right child without any other operations being performed, since an empty internal node has no keys to manipulate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The changes above require that we be able to identify an empty node. To this end, we first define a constant representing an invalid page number, which serves as the right child of every empty node.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+#define INVALID_PAGE_NUM UINT32_MAX
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Now, when an internal node is initialized, we initialize its right child with this invalid page number.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;@@ -330,6 +335,12 @@&lt;/span&gt; void initialize_internal_node(void* node) {
   set_node_type(node, NODE_INTERNAL);
   set_node_root(node, false);
   *internal_node_num_keys(node) = 0;
&lt;span class=&quot;gi&quot;&gt;+  /*
+  Necessary because the root page number is 0; by not initializing an internal 
+  node&apos;s right child to an invalid page number when initializing the node, we may
+  end up with 0 as the node&apos;s right child, which makes the node a parent of the root
+  */
+  *internal_node_right_child(node) = INVALID_PAGE_NUM;
&lt;/span&gt; }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This step was made necessary by the problem that the comment above summarizes. If an internal node is initialized without explicitly setting its right child field, the field holds whatever value happens to be in that memory, and that value could be 0 depending on the compiler or the architecture of the machine on which the program is being executed. Since we are using 0 as our root page number, a newly allocated internal node would then appear to be a parent of the root.&lt;/p&gt;
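
&lt;p&gt;To make the failure mode concrete, here is a minimal standalone sketch; it is not code from the tutorial. It assumes a hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RIGHT_CHILD_OFFSET&lt;/code&gt; of 10 bytes and two hypothetical helpers, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_right_child&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;write_right_child&lt;/code&gt;, and it simulates a page whose memory happens to be zeroed.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

#define PAGE_SIZE 4096
#define INVALID_PAGE_NUM UINT32_MAX
/* Hypothetical byte offset of the right-child field, for illustration only */
#define RIGHT_CHILD_OFFSET 10

/* Hypothetical helpers: copy the field out of and into a raw page buffer */
static uint32_t read_right_child(const unsigned char* page) {
  uint32_t value;
  memcpy(&amp;amp;value, page + RIGHT_CHILD_OFFSET, sizeof(value));
  return value;
}

static void write_right_child(unsigned char* page, uint32_t value) {
  memcpy(page + RIGHT_CHILD_OFFSET, &amp;amp;value, sizeof(value));
}

int main(void) {
  static unsigned char page[PAGE_SIZE];
  /* Simulate freshly allocated memory that happens to contain zeros */
  memset(page, 0, sizeof(page));
  printf(&quot;right child before init: %u\n&quot;, (unsigned)read_right_child(page));

  /* Mirrors what initialize_internal_node now does for this field */
  write_right_child(page, INVALID_PAGE_NUM);
  printf(&quot;is invalid after init: %d\n&quot;,
         read_right_child(page) == INVALID_PAGE_NUM);
  return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this simulation the field first reads as page 0, which is indistinguishable from the root page number, and only compares equal to the sentinel once it has been written explicitly.&lt;/p&gt;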

&lt;p&gt;We have introduced some guards in our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal_node_child&lt;/code&gt; function that print an error and exit on any attempt to access an invalid page.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;@@ -186,9 +188,19 @@&lt;/span&gt; uint32_t* internal_node_child(void* node, uint32_t child_num) {
     printf(&quot;Tried to access child_num %d &amp;gt; num_keys %d\n&quot;, child_num, num_keys);
     exit(EXIT_FAILURE);
   } else if (child_num == num_keys) {
&lt;span class=&quot;gd&quot;&gt;-    return internal_node_right_child(node);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+    uint32_t* right_child = internal_node_right_child(node);
+    if (*right_child == INVALID_PAGE_NUM) {
+      printf(&quot;Tried to access right child of node, but was invalid page\n&quot;);
+      exit(EXIT_FAILURE);
+    }
+    return right_child;
&lt;/span&gt;   } else {
&lt;span class=&quot;gd&quot;&gt;-    return internal_node_cell(node, child_num);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+    uint32_t* child = internal_node_cell(node, child_num);
+    if (*child == INVALID_PAGE_NUM) {
+      printf(&quot;Tried to access child %d of node, but was invalid page\n&quot;, child_num);
+      exit(EXIT_FAILURE);
+    }
+    return child;
&lt;/span&gt;   }
 }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One additional guard is needed in our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print_tree&lt;/code&gt; function to ensure that we do not attempt to print an empty node, as that would involve trying to access an invalid page.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;@@ -294,15 +305,17 @@&lt;/span&gt; void print_tree(Pager* pager, uint32_t page_num, uint32_t indentation_level) {
       num_keys = *internal_node_num_keys(node);
       indent(indentation_level);
       printf(&quot;- internal (size %d)\n&quot;, num_keys);
&lt;span class=&quot;gd&quot;&gt;-      for (uint32_t i = 0; i &amp;lt; num_keys; i++) {
-        child = *internal_node_child(node, i);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+      if (num_keys &amp;gt; 0) {
+        for (uint32_t i = 0; i &amp;lt; num_keys; i++) {
+          child = *internal_node_child(node, i);
+          print_tree(pager, child, indentation_level + 1);
+
+          indent(indentation_level + 1);
+          printf(&quot;- key %d\n&quot;, *internal_node_key(node, i));
+        }
+        child = *internal_node_right_child(node);
&lt;/span&gt;         print_tree(pager, child, indentation_level + 1);
&lt;span class=&quot;gd&quot;&gt;-
-        indent(indentation_level + 1);
-        printf(&quot;- key %d\n&quot;, *internal_node_key(node, i));
&lt;/span&gt;       }
&lt;span class=&quot;gd&quot;&gt;-      child = *internal_node_right_child(node);
-      print_tree(pager, child, indentation_level + 1);
&lt;/span&gt;       break;
   }
 }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now for the headliner, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal_node_split_and_insert&lt;/code&gt;. We will first provide it in its entirety, and then break it down by steps.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void internal_node_split_and_insert(Table* table, uint32_t parent_page_num,
+                          uint32_t child_page_num) {
+  uint32_t old_page_num = parent_page_num;
+  void* old_node = get_page(table-&amp;gt;pager,parent_page_num);
+  uint32_t old_max = get_node_max_key(table-&amp;gt;pager, old_node);
+
+  void* child = get_page(table-&amp;gt;pager, child_page_num); 
+  uint32_t child_max = get_node_max_key(table-&amp;gt;pager, child);
+
+  uint32_t new_page_num = get_unused_page_num(table-&amp;gt;pager);
+
+  /*
+  Declaring a flag before updating pointers which
+  records whether this operation involves splitting the root -
+  if it does, we will insert our newly created node during
+  the step where the table&apos;s new root is created. If it does
+  not, we have to insert the newly created node into its parent
+  after the old node&apos;s keys have been transferred over. We are not
+  able to do this if the newly created node&apos;s parent is not a newly
+  initialized root node, because in that case its parent may have existing
+  keys aside from our old node which we are splitting. If that is true, we
+  need to find a place for our newly created node in its parent, and we
+  cannot insert it at the correct index if it does not yet have any keys
+  */
+  uint32_t splitting_root = is_node_root(old_node);
+
+  void* parent;
+  void* new_node;
+  if (splitting_root) {
+    create_new_root(table, new_page_num);
+    parent = get_page(table-&amp;gt;pager,table-&amp;gt;root_page_num);
+    /*
+    If we are splitting the root, we need to update old_node to point
+    to the new root&apos;s left child, new_page_num will already point to
+    the new root&apos;s right child
+    */
+    old_page_num = *internal_node_child(parent,0);
+    old_node = get_page(table-&amp;gt;pager, old_page_num);
+  } else {
+    parent = get_page(table-&amp;gt;pager,*node_parent(old_node));
+    new_node = get_page(table-&amp;gt;pager, new_page_num);
+    initialize_internal_node(new_node);
+  }
+  
+  uint32_t* old_num_keys = internal_node_num_keys(old_node);
+
+  uint32_t cur_page_num = *internal_node_right_child(old_node);
+  void* cur = get_page(table-&amp;gt;pager, cur_page_num);
+
+  /*
+  First put right child into new node and set right child of old node to invalid page number
+  */
+  internal_node_insert(table, new_page_num, cur_page_num);
+  *node_parent(cur) = new_page_num;
+  *internal_node_right_child(old_node) = INVALID_PAGE_NUM;
+  /*
+  For each key until you get to the middle key, move the key and the child to the new node
+  */
+  for (int i = INTERNAL_NODE_MAX_CELLS - 1; i &amp;gt; INTERNAL_NODE_MAX_CELLS / 2; i--) {
+    cur_page_num = *internal_node_child(old_node, i);
+    cur = get_page(table-&amp;gt;pager, cur_page_num);
+
+    internal_node_insert(table, new_page_num, cur_page_num);
+    *node_parent(cur) = new_page_num;
+
+    (*old_num_keys)--;
+  }
+
+  /*
+  Set child before middle key, which is now the highest key, to be node&apos;s right child,
+  and decrement number of keys
+  */
+  *internal_node_right_child(old_node) = *internal_node_child(old_node,*old_num_keys - 1);
+  (*old_num_keys)--;
+
+  /*
+  Determine which of the two nodes after the split should contain the child to be inserted,
+  and insert the child
+  */
+  uint32_t max_after_split = get_node_max_key(table-&amp;gt;pager, old_node);
+
+  uint32_t destination_page_num = child_max &amp;lt; max_after_split ? old_page_num : new_page_num;
+
+  internal_node_insert(table, destination_page_num, child_page_num);
+  *node_parent(child) = destination_page_num;
+
+  update_internal_node_key(parent, old_max, get_node_max_key(table-&amp;gt;pager, old_node));
+
+  if (!splitting_root) {
+    internal_node_insert(table,*node_parent(old_node),new_page_num);
+    *node_parent(new_node) = *node_parent(old_node);
+  }
+}
+
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The first thing we need to do is create a variable to store the page number of the node we are splitting (the “old node” from here on). This is necessary because the page number of the old node will change if it happens to be the table’s root node. We also need to remember the node’s current max key, because that value is its key in the parent, and that key will need to be updated with the old node’s new maximum after the split occurs.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  uint32_t old_page_num = parent_page_num;
+  void* old_node = get_page(table-&amp;gt;pager,parent_page_num);
+  uint32_t old_max = get_node_max_key(table-&amp;gt;pager, old_node);
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The next important step is the branching logic that depends on whether the old node is the table’s root node. We need to record this at the beginning of the function for later use. As the comment explains, if we are not splitting the root, we cannot insert the newly created sibling node into the old node’s parent right away: the sibling does not yet contain any keys, so it could not be placed at the correct index among the key/child pairs that may already be present in the parent node.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  uint32_t splitting_root = is_node_root(old_node);
+
+  void* parent;
+  void* new_node;
+  if (splitting_root) {
+    create_new_root(table, new_page_num);
+    parent = get_page(table-&amp;gt;pager,table-&amp;gt;root_page_num);
+    /*
+    If we are splitting the root, we need to update old_node to point
+    to the new root&apos;s left child, new_page_num will already point to
+    the new root&apos;s right child
+    */
+    old_page_num = *internal_node_child(parent,0);
+    old_node = get_page(table-&amp;gt;pager, old_page_num);
+  } else {
+    parent = get_page(table-&amp;gt;pager,*node_parent(old_node));
+    new_node = get_page(table-&amp;gt;pager, new_page_num);
+    initialize_internal_node(new_node);
+  }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Once we have settled the question of splitting or not splitting the root, we begin moving keys from the old node to its sibling. We must first move the old node’s right child into the sibling and set the old node’s right child field to the invalid page number to indicate that it is empty. Then we loop backwards over the old node’s keys above the middle key, performing the following steps on each iteration:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Obtain a reference to the old node’s child at the current index&lt;/li&gt;
  &lt;li&gt;Insert the child into the sibling node&lt;/li&gt;
  &lt;li&gt;Update the child’s parent value to point to the sibling node&lt;/li&gt;
  &lt;li&gt;Decrement the old node’s number of keys&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  uint32_t* old_num_keys = internal_node_num_keys(old_node);
+
+  uint32_t cur_page_num = *internal_node_right_child(old_node);
+  void* cur = get_page(table-&amp;gt;pager, cur_page_num);
+
+  /*
+  First put right child into new node and set right child of old node to invalid page number
+  */
+  internal_node_insert(table, new_page_num, cur_page_num);
+  *node_parent(cur) = new_page_num;
+  *internal_node_right_child(old_node) = INVALID_PAGE_NUM;
+  /*
+  For each key until you get to the middle key, move the key and the child to the new node
+  */
+  for (int i = INTERNAL_NODE_MAX_CELLS - 1; i &amp;gt; INTERNAL_NODE_MAX_CELLS / 2; i--) {
+    cur_page_num = *internal_node_child(old_node, i);
+    cur = get_page(table-&amp;gt;pager, cur_page_num);
+
+    internal_node_insert(table, new_page_num, cur_page_num);
+    *node_parent(cur) = new_page_num;
+
+    (*old_num_keys)--;
+  }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Step 4 is important because it “erases” the key/child pair from the old node. We are not actually freeing the memory at that byte offset in the old node’s page, but decrementing the old node’s number of keys makes that cell logically inaccessible, and its bytes will be overwritten the next time a child is inserted into the old node.&lt;/p&gt;

&lt;p&gt;Also note the behavior of our loop bounds. If our maximum number of internal node keys n changes in the future and n is odd, our logic ensures that both the old node and the sibling node end up with (n-1)/2 keys after the split, with the one remaining key going to the parent. If an even number is chosen as the maximum, n/2 keys will remain with the old node while (n-1)/2 (rounded down) will be moved to the sibling node. This logic would be straightforward to revise as needed.&lt;/p&gt;
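
&lt;p&gt;These key counts can be checked with a small standalone sketch. It is not part of the tutorial’s code: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;split_counts&lt;/code&gt; is a hypothetical helper that only replays the counting implied by the loop bounds for a given maximum, without touching any pages, and it reports the counts before the pending child is inserted into one of the two nodes.&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#include &amp;lt;stdio.h&amp;gt;

/*
Hypothetical helper: counts the keys left in the old node and the keys
received by the sibling for a given maximum number of cells
*/
static void split_counts(int max_cells, int* old_keys, int* sibling_keys) {
  int remaining = max_cells;
  int moved = 0;
  /* Same bounds as the loop in internal_node_split_and_insert */
  for (int i = max_cells - 1; i &amp;gt; max_cells / 2; i--) {
    moved++;
    remaining--;
  }
  /* The middle key is promoted: its child becomes the old node&apos;s right child */
  remaining--;
  *old_keys = remaining;
  /*
  The old right child moves first without adding a key, so the sibling
  ends up with one key per loop iteration
  */
  *sibling_keys = moved;
}

int main(void) {
  for (int n = 3; n &amp;lt;= 6; n++) {
    int old_keys;
    int sibling_keys;
    split_counts(n, &amp;amp;old_keys, &amp;amp;sibling_keys);
    printf(&quot;max %d: old node keeps %d, sibling gets %d\n&quot;, n, old_keys,
           sibling_keys);
  }
  return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For maximums of 3, 4, 5 and 6 the sketch reports 1 and 1, 2 and 1, 2 and 2, and 3 and 2 keys for the old node and the sibling respectively, with one key promoted to the parent in each case.&lt;/p&gt;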

&lt;p&gt;Once the keys have been moved, we set the old node’s highest remaining child, the one to the left of the middle key, as its right child and decrement its number of keys once more.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  /*
+  Set child before middle key, which is now the highest key, to be node&apos;s right child,
+  and decrement number of keys
+  */
+  *internal_node_right_child(old_node) = *internal_node_child(old_node,*old_num_keys - 1);
+  (*old_num_keys)--;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We then insert the child node into either the old node or the sibling node depending on the value of its max key.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  uint32_t max_after_split = get_node_max_key(table-&amp;gt;pager, old_node);
+
+  uint32_t destination_page_num = child_max &amp;lt; max_after_split ? old_page_num : new_page_num;
+
+  internal_node_insert(table, destination_page_num, child_page_num);
+  *node_parent(child) = destination_page_num;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, we update the old node’s key in its parent. If we did not split the root, we also insert the sibling node into that parent and update the sibling node’s parent pointer.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  update_internal_node_key(parent, old_max, get_node_max_key(table-&amp;gt;pager, old_node));
+
+  if (!splitting_root) {
+    internal_node_insert(table,*node_parent(old_node),new_page_num);
+    *node_parent(new_node) = *node_parent(old_node);
+  }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;One important change required to support this new logic is in our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_new_root&lt;/code&gt; function. Before, we were only taking into account situations where the new root’s children would be leaf nodes. If the new root’s children are instead internal nodes, we need to do two things:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Correctly initialize the root’s new children to be internal nodes&lt;/li&gt;
  &lt;li&gt;After the call to memcpy has copied the root’s keys and children into its new left child, update the parent pointer of each of those children to point to the left child&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;@@ -661,22 +680,40 @@&lt;/span&gt; void create_new_root(Table* table, uint32_t right_child_page_num) {
   uint32_t left_child_page_num = get_unused_page_num(table-&amp;gt;pager);
   void* left_child = get_page(table-&amp;gt;pager, left_child_page_num);
 
&lt;span class=&quot;gi&quot;&gt;+  if (get_node_type(root) == NODE_INTERNAL) {
+    initialize_internal_node(right_child);
+    initialize_internal_node(left_child);
+  }
+
&lt;/span&gt;   /* Left child has data copied from old root */
   memcpy(left_child, root, PAGE_SIZE);
   set_node_root(left_child, false);
 
&lt;span class=&quot;gi&quot;&gt;+  if (get_node_type(left_child) == NODE_INTERNAL) {
+    void* child;
+    for (int i = 0; i &amp;lt; *internal_node_num_keys(left_child); i++) {
+      child = get_page(table-&amp;gt;pager, *internal_node_child(left_child,i));
+      *node_parent(child) = left_child_page_num;
+    }
+    child = get_page(table-&amp;gt;pager, *internal_node_right_child(left_child));
+    *node_parent(child) = left_child_page_num;
+  }
+
&lt;/span&gt;   /* Root node is a new internal node with one key and two children */
   initialize_internal_node(root);
   set_node_root(root, true);
   *internal_node_num_keys(root) = 1;
   *internal_node_child(root, 0) = left_child_page_num;
&lt;span class=&quot;gd&quot;&gt;-  uint32_t left_child_max_key = get_node_max_key(left_child);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+  uint32_t left_child_max_key = get_node_max_key(table-&amp;gt;pager, left_child);
&lt;/span&gt;   *internal_node_key(root, 0) = left_child_max_key;
   *internal_node_right_child(root) = right_child_page_num;
   *node_parent(left_child) = table-&amp;gt;root_page_num;
   *node_parent(right_child) = table-&amp;gt;root_page_num;
 }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Another important change has been made to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get_node_max_key&lt;/code&gt;, which now takes the pager as its first argument, as the updated calls in the diffs above show. Since an internal node’s key represents the maximum of the tree pointed to by the child to its left, and that child can be a tree of arbitrary depth, we need to walk down the right children of that tree until we get to a leaf node, and then take the maximum key of that leaf node.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+uint32_t get_node_max_key(Pager* pager, void* node) {
+  if (get_node_type(node) == NODE_LEAF) {
+    return *leaf_node_key(node, *leaf_node_num_cells(node) - 1);
+  }
+  void* right_child = get_page(pager,*internal_node_right_child(node));
+  return get_node_max_key(pager, right_child);
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We have written a single test to demonstrate that our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print_tree&lt;/code&gt; function still works after the introduction of internal node splitting.&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;allows printing out the structure of a 7-leaf-node btree&apos; do
+    script = [
+      &quot;insert 58 user58 person58@example.com&quot;,
+      &quot;insert 56 user56 person56@example.com&quot;,
+      &quot;insert 8 user8 person8@example.com&quot;,
+      &quot;insert 54 user54 person54@example.com&quot;,
+      &quot;insert 77 user77 person77@example.com&quot;,
+      &quot;insert 7 user7 person7@example.com&quot;,
+      &quot;insert 25 user25 person25@example.com&quot;,
+      &quot;insert 71 user71 person71@example.com&quot;,
+      &quot;insert 13 user13 person13@example.com&quot;,
+      &quot;insert 22 user22 person22@example.com&quot;,
+      &quot;insert 53 user53 person53@example.com&quot;,
+      &quot;insert 51 user51 person51@example.com&quot;,
+      &quot;insert 59 user59 person59@example.com&quot;,
+      &quot;insert 32 user32 person32@example.com&quot;,
+      &quot;insert 36 user36 person36@example.com&quot;,
+      &quot;insert 79 user79 person79@example.com&quot;,
+      &quot;insert 10 user10 person10@example.com&quot;,
+      &quot;insert 33 user33 person33@example.com&quot;,
+      &quot;insert 20 user20 person20@example.com&quot;,
+      &quot;insert 4 user4 person4@example.com&quot;,
+      &quot;insert 35 user35 person35@example.com&quot;,
+      &quot;insert 76 user76 person76@example.com&quot;,
+      &quot;insert 49 user49 person49@example.com&quot;,
+      &quot;insert 24 user24 person24@example.com&quot;,
+      &quot;insert 70 user70 person70@example.com&quot;,
+      &quot;insert 48 user48 person48@example.com&quot;,
+      &quot;insert 39 user39 person39@example.com&quot;,
+      &quot;insert 15 user15 person15@example.com&quot;,
+      &quot;insert 47 user47 person47@example.com&quot;,
+      &quot;insert 30 user30 person30@example.com&quot;,
+      &quot;insert 86 user86 person86@example.com&quot;,
+      &quot;insert 31 user31 person31@example.com&quot;,
+      &quot;insert 68 user68 person68@example.com&quot;,
+      &quot;insert 37 user37 person37@example.com&quot;,
+      &quot;insert 66 user66 person66@example.com&quot;,
+      &quot;insert 63 user63 person63@example.com&quot;,
+      &quot;insert 40 user40 person40@example.com&quot;,
+      &quot;insert 78 user78 person78@example.com&quot;,
+      &quot;insert 19 user19 person19@example.com&quot;,
+      &quot;insert 46 user46 person46@example.com&quot;,
+      &quot;insert 14 user14 person14@example.com&quot;,
+      &quot;insert 81 user81 person81@example.com&quot;,
+      &quot;insert 72 user72 person72@example.com&quot;,
+      &quot;insert 6 user6 person6@example.com&quot;,
+      &quot;insert 50 user50 person50@example.com&quot;,
+      &quot;insert 85 user85 person85@example.com&quot;,
+      &quot;insert 67 user67 person67@example.com&quot;,
+      &quot;insert 2 user2 person2@example.com&quot;,
+      &quot;insert 55 user55 person55@example.com&quot;,
+      &quot;insert 69 user69 person69@example.com&quot;,
+      &quot;insert 5 user5 person5@example.com&quot;,
+      &quot;insert 65 user65 person65@example.com&quot;,
+      &quot;insert 52 user52 person52@example.com&quot;,
+      &quot;insert 1 user1 person1@example.com&quot;,
+      &quot;insert 29 user29 person29@example.com&quot;,
+      &quot;insert 9 user9 person9@example.com&quot;,
+      &quot;insert 43 user43 person43@example.com&quot;,
+      &quot;insert 75 user75 person75@example.com&quot;,
+      &quot;insert 21 user21 person21@example.com&quot;,
+      &quot;insert 82 user82 person82@example.com&quot;,
+      &quot;insert 12 user12 person12@example.com&quot;,
+      &quot;insert 18 user18 person18@example.com&quot;,
+      &quot;insert 60 user60 person60@example.com&quot;,
+      &quot;insert 44 user44 person44@example.com&quot;,
+      &quot;.btree&quot;,
+      &quot;.exit&quot;,
+    ]
+    result = run_script(script)
+
+    expect(result[64...(result.length)]).to match_array([
+      &quot;db &amp;gt; Tree:&quot;,
+      &quot;- internal (size 1)&quot;,
+      &quot;  - internal (size 2)&quot;,
+      &quot;    - leaf (size 7)&quot;,
+      &quot;      - 1&quot;,
+      &quot;      - 2&quot;,
+      &quot;      - 4&quot;,
+      &quot;      - 5&quot;,
+      &quot;      - 6&quot;,
+      &quot;      - 7&quot;,
+      &quot;      - 8&quot;,
+      &quot;    - key 8&quot;,
+      &quot;    - leaf (size 11)&quot;,
+      &quot;      - 9&quot;,
+      &quot;      - 10&quot;,
+      &quot;      - 12&quot;,
+      &quot;      - 13&quot;,
+      &quot;      - 14&quot;,
+      &quot;      - 15&quot;,
+      &quot;      - 18&quot;,
+      &quot;      - 19&quot;,
+      &quot;      - 20&quot;,
+      &quot;      - 21&quot;,
+      &quot;      - 22&quot;,
+      &quot;    - key 22&quot;,
+      &quot;    - leaf (size 8)&quot;,
+      &quot;      - 24&quot;,
+      &quot;      - 25&quot;,
+      &quot;      - 29&quot;,
+      &quot;      - 30&quot;,
+      &quot;      - 31&quot;,
+      &quot;      - 32&quot;,
+      &quot;      - 33&quot;,
+      &quot;      - 35&quot;,
+      &quot;  - key 35&quot;,
+      &quot;  - internal (size 3)&quot;,
+      &quot;    - leaf (size 12)&quot;,
+      &quot;      - 36&quot;,
+      &quot;      - 37&quot;,
+      &quot;      - 39&quot;,
+      &quot;      - 40&quot;,
+      &quot;      - 43&quot;,
+      &quot;      - 44&quot;,
+      &quot;      - 46&quot;,
+      &quot;      - 47&quot;,
+      &quot;      - 48&quot;,
+      &quot;      - 49&quot;,
+      &quot;      - 50&quot;,
+      &quot;      - 51&quot;,
+      &quot;    - key 51&quot;,
+      &quot;    - leaf (size 11)&quot;,
+      &quot;      - 52&quot;,
+      &quot;      - 53&quot;,
+      &quot;      - 54&quot;,
+      &quot;      - 55&quot;,
+      &quot;      - 56&quot;,
+      &quot;      - 58&quot;,
+      &quot;      - 59&quot;,
+      &quot;      - 60&quot;,
+      &quot;      - 63&quot;,
+      &quot;      - 65&quot;,
+      &quot;      - 66&quot;,
+      &quot;    - key 66&quot;,
+      &quot;    - leaf (size 7)&quot;,
+      &quot;      - 67&quot;,
+      &quot;      - 68&quot;,
+      &quot;      - 69&quot;,
+      &quot;      - 70&quot;,
+      &quot;      - 71&quot;,
+      &quot;      - 72&quot;,
+      &quot;      - 75&quot;,
+      &quot;    - key 75&quot;,
+      &quot;    - leaf (size 8)&quot;,
+      &quot;      - 76&quot;,
+      &quot;      - 77&quot;,
+      &quot;      - 78&quot;,
+      &quot;      - 79&quot;,
+      &quot;      - 81&quot;,
+      &quot;      - 82&quot;,
+      &quot;      - 85&quot;,
+      &quot;      - 86&quot;,
+      &quot;db &amp;gt; &quot;,
+    ])
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</description>
        <pubDate>Tue, 23 May 2023 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part14.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part14.html</guid>
      </item>
    
      <item>
        <title>Part 15 - Deleting Rows from a Leaf Node</title>
        <description>&lt;p&gt;We can insert rows and we can read them back out. But we can’t remove them. Every real database needs a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete&lt;/code&gt; command, and ours is no exception. In this article we’ll implement simple deletion from a leaf node.&lt;/p&gt;

&lt;p&gt;I’m going to hold off on rebalancing the tree after deletion – we’ll tackle that in the next part. For now, deleting a row means finding it in the B-tree and removing it from its leaf node.&lt;/p&gt;

&lt;h2 id=&quot;parsing-the-delete-statement&quot;&gt;Parsing the Delete Statement&lt;/h2&gt;

&lt;p&gt;First, let’s add a new statement type:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gd&quot;&gt;-typedef enum { STATEMENT_INSERT, STATEMENT_SELECT } StatementType;
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+typedef enum {
+  STATEMENT_INSERT,
+  STATEMENT_SELECT,
+  STATEMENT_DELETE
+} StatementType;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And a new execute result for when the key doesn’t exist:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; typedef enum {
   EXECUTE_SUCCESS,
   EXECUTE_DUPLICATE_KEY,
&lt;span class=&quot;gi&quot;&gt;+  EXECUTE_KEY_NOT_FOUND,
&lt;/span&gt; } ExecuteResult;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The syntax for delete will be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete &amp;lt;id&amp;gt;&lt;/code&gt;. Parsing is similar to insert, but we only need the id:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+PrepareResult prepare_delete(InputBuffer* input_buffer, Statement* statement) {
+  statement-&amp;gt;type = STATEMENT_DELETE;
+
+  strtok(input_buffer-&amp;gt;buffer, &quot; &quot;);  // skip past the &quot;delete&quot; keyword
+  char* id_string = strtok(NULL, &quot; &quot;);
+
+  if (id_string == NULL) {
+    return PREPARE_SYNTAX_ERROR;
+  }
+
+  int id = atoi(id_string);
+  if (id &amp;lt; 0) {
+    return PREPARE_NEGATIVE_ID;
+  }
+
+  statement-&amp;gt;row_to_insert.id = id;
+
+  return PREPARE_SUCCESS;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’re reusing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;row_to_insert.id&lt;/code&gt; to store the key we want to delete. It’s a bit of a hack, but it saves us from adding another field to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Statement&lt;/code&gt; just to hold a single integer.&lt;/p&gt;

&lt;p&gt;Now wire it into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prepare_statement()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; PrepareResult prepare_statement(InputBuffer* input_buffer,
                                 Statement* statement) {
   if (strncmp(input_buffer-&amp;gt;buffer, &quot;insert&quot;, 6) == 0) {
     return prepare_insert(input_buffer, statement);
   }
   if (strcmp(input_buffer-&amp;gt;buffer, &quot;select&quot;) == 0) {
     statement-&amp;gt;type = STATEMENT_SELECT;
     return PREPARE_SUCCESS;
   }
&lt;span class=&quot;gi&quot;&gt;+  if (strncmp(input_buffer-&amp;gt;buffer, &quot;delete&quot;, 6) == 0) {
+    return prepare_delete(input_buffer, statement);
+  }
&lt;/span&gt;
   return PREPARE_UNRECOGNIZED_STATEMENT;
 }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;removing-a-cell-from-a-leaf-node&quot;&gt;Removing a Cell from a Leaf Node&lt;/h2&gt;

&lt;p&gt;The actual removal is straightforward. We shift all cells after the deleted one to the left by one position, then decrement the cell count:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void leaf_node_delete(Cursor* cursor) {
+  void* node = get_page(cursor-&amp;gt;table-&amp;gt;pager, cursor-&amp;gt;page_num);
+  uint32_t num_cells = *leaf_node_num_cells(node);
+
+  // Shift cells to fill the gap
+  for (uint32_t i = cursor-&amp;gt;cell_num; i &amp;lt; num_cells - 1; i++) {
+    memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i + 1),
+           LEAF_NODE_CELL_SIZE);
+  }
+
+  *(leaf_node_num_cells(node)) -= 1;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Think of it like removing an element from the middle of an array. Everything to the right slides over to fill the hole.&lt;/p&gt;
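
&lt;p&gt;To see the shift in isolation, here is a tiny model written in Ruby, the language of our spec suite. It’s only a sketch: the leaf is assumed to be a plain array of keys, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete_cell&lt;/code&gt; is a hypothetical helper, not part of the C program.&lt;/p&gt;

```ruby
# Toy model of leaf_node_delete: a leaf is just an array of keys here.
# delete_cell is a hypothetical helper used only for illustration.
def delete_cell(cells, cell_num)
  num_cells = cells.length
  # Slide every cell after the deleted one a single slot to the left.
  (cell_num...(num_cells - 1)).each do |i|
    cells[i] = cells[i + 1]
  end
  # Dropping the last slot plays the role of decrementing num_cells.
  cells.pop
  cells
end

p delete_cell([1, 2, 3, 4], 1)  # prints [1, 3, 4]
```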

&lt;h2 id=&quot;executing-the-delete&quot;&gt;Executing the Delete&lt;/h2&gt;

&lt;p&gt;Now we need &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_delete()&lt;/code&gt;. It uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_find()&lt;/code&gt; to locate the key in the B-tree, checks that the key actually exists, and then calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;leaf_node_delete()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+ExecuteResult execute_delete(Statement* statement, Table* table) {
+  uint32_t key_to_delete = statement-&amp;gt;row_to_insert.id;
+  Cursor* cursor = table_find(table, key_to_delete);
+
+  void* node = get_page(table-&amp;gt;pager, cursor-&amp;gt;page_num);
+  uint32_t num_cells = *leaf_node_num_cells(node);
+
+  if (cursor-&amp;gt;cell_num &amp;gt;= num_cells) {
+    free(cursor);
+    return EXECUTE_KEY_NOT_FOUND;
+  }
+
+  uint32_t key_at_index = *leaf_node_key(node, cursor-&amp;gt;cell_num);
+  if (key_at_index != key_to_delete) {
+    free(cursor);
+    return EXECUTE_KEY_NOT_FOUND;
+  }
+
+  leaf_node_delete(cursor);
+
+  free(cursor);
+
+  return EXECUTE_SUCCESS;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_find()&lt;/code&gt; returns a cursor pointing to the position where the key should be. But the key might not actually be there – maybe we’re looking for a key that was never inserted. So we check two things: is the cursor past the end of the leaf, and does the key at the cursor’s position actually match? If either check fails, the key doesn’t exist.&lt;/p&gt;
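
&lt;p&gt;The two checks are easy to model outside the database. The Ruby sketch below assumes a single leaf stored as a sorted array; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find_position&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;key_present?&lt;/code&gt; are hypothetical names standing in for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_find()&lt;/code&gt; and the checks in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_delete()&lt;/code&gt;.&lt;/p&gt;

```ruby
# Toy model of the lookup in execute_delete: keys is a sorted array standing
# in for one leaf. find_position is a hypothetical stand-in for table_find().
def find_position(keys, key)
  # Index of the first key >= the target, or keys.length if there is none.
  keys.bsearch_index { |k| k >= key } || keys.length
end

def key_present?(keys, key)
  pos = find_position(keys, key)
  return false if pos >= keys.length  # cursor is past the end of the leaf
  keys[pos] == key                    # cursor may sit on a different key
end

p key_present?([1, 3], 2)  # false: position 1 holds key 3, not 2
p key_present?([1, 3], 5)  # false: position 2 is past the end
p key_present?([1, 3], 3)  # true
```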

&lt;p&gt;Wire it into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_statement()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; ExecuteResult execute_statement(Statement* statement, Table* table) {
   switch (statement-&amp;gt;type) {
     case (STATEMENT_INSERT):
       return execute_insert(statement, table);
     case (STATEMENT_SELECT):
       return execute_select(statement, table);
&lt;span class=&quot;gi&quot;&gt;+    case (STATEMENT_DELETE):
+      return execute_delete(statement, table);
&lt;/span&gt;   }
 }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And handle the new result in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;     switch (execute_statement(&amp;amp;statement, table)) {
       case (EXECUTE_SUCCESS):
         printf(&quot;Executed.\n&quot;);
         break;
       case (EXECUTE_DUPLICATE_KEY):
         printf(&quot;Error: Duplicate key.\n&quot;);
         break;
&lt;span class=&quot;gi&quot;&gt;+      case (EXECUTE_KEY_NOT_FOUND):
+        printf(&quot;Error: Key not found.\n&quot;);
+        break;
&lt;/span&gt;     }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;testing&quot;&gt;Testing&lt;/h2&gt;

&lt;p&gt;Let’s try it out:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;db &amp;gt; insert 1 user1 person1@example.com
Executed.
db &amp;gt; insert 2 user2 person2@example.com
Executed.
db &amp;gt; insert 3 user3 person3@example.com
Executed.
db &amp;gt; delete 2
Executed.
db &amp;gt; select
(1, user1, person1@example.com)
(3, user3, person3@example.com)
Executed.
db &amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Sweet, it works! The row with id 2 is gone.&lt;/p&gt;

&lt;p&gt;What happens if we try to delete a key that doesn’t exist?&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;db &amp;gt; delete 5
Error: Key not found.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Let’s capture all of this in specs, including one that checks the deletion persists across sessions:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;deletes a row&apos; do
+    script = [
+      &quot;insert 1 user1 person1@example.com&quot;,
+      &quot;insert 2 user2 person2@example.com&quot;,
+      &quot;insert 3 user3 person3@example.com&quot;,
+      &quot;delete 2&quot;,
+      &quot;select&quot;,
+      &quot;.exit&quot;,
+    ]
+    result = run_script(script)
+    expect(result).to match_array([
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; (1, user1, person1@example.com)&quot;,
+      &quot;(3, user3, person3@example.com)&quot;,
+      &quot;Executed.&quot;,
+      &quot;db &amp;gt; &quot;,
+    ])
+  end
+
+  it &apos;prints error message when deleting non-existent key&apos; do
+    script = [
+      &quot;insert 1 user1 person1@example.com&quot;,
+      &quot;delete 5&quot;,
+      &quot;select&quot;,
+      &quot;.exit&quot;,
+    ]
+    result = run_script(script)
+    expect(result).to match_array([
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Error: Key not found.&quot;,
+      &quot;db &amp;gt; (1, user1, person1@example.com)&quot;,
+      &quot;Executed.&quot;,
+      &quot;db &amp;gt; &quot;,
+    ])
+  end
+
+  it &apos;deletes rows and persists changes&apos; do
+    result1 = run_script([
+      &quot;insert 1 user1 person1@example.com&quot;,
+      &quot;insert 2 user2 person2@example.com&quot;,
+      &quot;insert 3 user3 person3@example.com&quot;,
+      &quot;delete 2&quot;,
+      &quot;.exit&quot;,
+    ])
+
+    result2 = run_script([
+      &quot;select&quot;,
+      &quot;.exit&quot;,
+    ])
+    expect(result2).to match_array([
+      &quot;db &amp;gt; (1, user1, person1@example.com)&quot;,
+      &quot;(3, user3, person3@example.com)&quot;,
+      &quot;Executed.&quot;,
+      &quot;db &amp;gt; &quot;,
+    ])
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;a-looming-problem&quot;&gt;A Looming Problem&lt;/h2&gt;

&lt;p&gt;This works great for small trees. But there’s a subtle issue we’re ignoring. In a B-tree, every non-root node must maintain a minimum number of keys. When we delete a cell from a leaf that’s already at its minimum occupancy, the node “underflows.” A well-behaved B-tree fixes this by borrowing from a sibling or merging two nodes together.&lt;/p&gt;

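&lt;p&gt;We’re not doing any of that yet. If you delete enough rows from a leaf, it could end up empty while its parent still points to it. That leaves an empty page occupying a slot in its parent, and the tree no longer meets its minimum-occupancy guarantee.&lt;/p&gt;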

&lt;p&gt;Next time we’ll implement rebalancing: borrowing from siblings, merging underflowing nodes, and collapsing the tree when the root becomes unnecessary. It’s gonna be great.&lt;/p&gt;
</description>
        <pubDate>Mon, 01 Apr 2024 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part15.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part15.html</guid>
      </item>
    
      <item>
        <title>Part 16 - Rebalancing the B-Tree After Deletion</title>
        <description>&lt;p&gt;Last time we added a simple delete command. It works, but it leaves the tree in a potentially invalid state. In a B-tree, every non-root node must maintain a minimum number of keys. When deletion causes a node to drop below that minimum, the node “underflows” and the tree needs to be rebalanced.&lt;/p&gt;

&lt;p&gt;This is the deletion counterpart to the splitting we implemented for insertion. Splitting handles overflow; rebalancing handles underflow.&lt;/p&gt;

&lt;h2 id=&quot;minimum-occupancy&quot;&gt;Minimum Occupancy&lt;/h2&gt;

&lt;p&gt;First, let’s define how few cells a node is allowed to have. The standard rule is half the maximum:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+/*
+ * Minimum occupancy for non-root nodes
+ */
+const uint32_t LEAF_NODE_MIN_CELLS = LEAF_NODE_MAX_CELLS / 2;
+const uint32_t INTERNAL_NODE_MIN_KEYS = INTERNAL_NODE_MAX_KEYS / 2;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEAF_NODE_MAX_CELLS&lt;/code&gt; at 13, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEAF_NODE_MIN_CELLS&lt;/code&gt; is 6, because integer division rounds 6.5 down. With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INTERNAL_NODE_MAX_KEYS&lt;/code&gt; at 3, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INTERNAL_NODE_MIN_KEYS&lt;/code&gt; is 1 for the same reason. The root is exempt from this rule – it can have as few as zero cells.&lt;/p&gt;

&lt;h2 id=&quot;the-strategy&quot;&gt;The Strategy&lt;/h2&gt;

&lt;p&gt;When a leaf underflows, we have two strategies:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Borrow&lt;/strong&gt; from a sibling that has more than the minimum. Shift one cell over.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Merge&lt;/strong&gt; with a sibling if neither has cells to spare. Combine both nodes into one and remove the separator key from the parent.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If merging causes the parent to underflow, the same logic applies recursively up the tree. If the root ends up with zero keys, we promote its only child to be the new root, reducing the tree’s height.&lt;/p&gt;
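
&lt;p&gt;Before diving into the C, the decision for a single leaf can be sketched in a few lines of Ruby. This is an illustration under assumptions, not the real implementation: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;choose_fix&lt;/code&gt; is a hypothetical function, siblings are reduced to their cell counts, and the right sibling is tried before the left.&lt;/p&gt;

```ruby
# Sketch of the borrow-or-merge decision for leaves only.
# The constant mirrors the tutorial's value; choose_fix is hypothetical.
LEAF_NODE_MAX_CELLS = 13
LEAF_NODE_MIN_CELLS = LEAF_NODE_MAX_CELLS / 2  # integer division gives 6

# right_cells / left_cells are nil when that sibling does not exist.
def choose_fix(num_cells, right_cells, left_cells)
  return :nothing if num_cells >= LEAF_NODE_MIN_CELLS
  return :borrow_right if right_cells and (right_cells > LEAF_NODE_MIN_CELLS)
  return :borrow_left if left_cells and (left_cells > LEAF_NODE_MIN_CELLS)
  :merge
end

p choose_fix(6, 7, nil)  # :nothing, the leaf is not underflowing
p choose_fix(5, 7, nil)  # :borrow_right
p choose_fix(5, 6, 9)    # :borrow_left
p choose_fix(5, 6, nil)  # :merge
```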

&lt;h2 id=&quot;finding-a-childs-position-in-its-parent&quot;&gt;Finding a Child’s Position in its Parent&lt;/h2&gt;

&lt;p&gt;To rebalance, we need to know which position a node occupies among its parent’s children. This helper scans the parent to find it:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+uint32_t find_child_index(void* parent, uint32_t child_page_num) {
+  uint32_t num_keys = *internal_node_num_keys(parent);
+  for (uint32_t i = 0; i &amp;lt;= num_keys; i++) {
+    if (*internal_node_child(parent, i) == child_page_num) {
+      return i;
+    }
+  }
+  printf(&quot;Could not find child in parent node.\n&quot;);
+  exit(EXIT_FAILURE);
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This iterates through all children (including the right child at index &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;num_keys&lt;/code&gt;) until it finds a match.&lt;/p&gt;

&lt;h2 id=&quot;leaf-node-rebalancing&quot;&gt;Leaf Node Rebalancing&lt;/h2&gt;

&lt;p&gt;Here’s the main rebalancing function for leaf nodes. It checks for underflow, then tries borrowing before falling back to merging:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void leaf_node_rebalance(Table* table, uint32_t page_num) {
+  void* node = get_page(table-&amp;gt;pager, page_num);
+
+  if (is_node_root(node)) {
+    return;
+  }
+
+  uint32_t num_cells = *leaf_node_num_cells(node);
+  if (num_cells &amp;gt;= LEAF_NODE_MIN_CELLS) {
+    return;
+  }
+
+  uint32_t parent_page_num = *node_parent(node);
+  void* parent = get_page(table-&amp;gt;pager, parent_page_num);
+  uint32_t child_index = find_child_index(parent, page_num);
+  uint32_t parent_num_keys = *internal_node_num_keys(parent);
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

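&lt;p&gt;The two early returns cover the easy cases: the root is exempt from the minimum, and a leaf that still has at least &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEAF_NODE_MIN_CELLS&lt;/code&gt; cells needs no work. Otherwise we look up the parent and this leaf’s position among its children.&lt;/p&gt;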

&lt;h3 id=&quot;borrowing-from-the-right-sibling&quot;&gt;Borrowing from the Right Sibling&lt;/h3&gt;

&lt;p&gt;If the right sibling has more than the minimum, we take its first cell:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  if (child_index &amp;lt; parent_num_keys) {
+    uint32_t right_page = *internal_node_child(parent, child_index + 1);
+    void* right_sibling = get_page(table-&amp;gt;pager, right_page);
+
+    if (*leaf_node_num_cells(right_sibling) &amp;gt; LEAF_NODE_MIN_CELLS) {
+      memcpy(leaf_node_cell(node, num_cells),
+             leaf_node_cell(right_sibling, 0), LEAF_NODE_CELL_SIZE);
+      *(leaf_node_num_cells(node)) += 1;
+
+      uint32_t right_cells = *leaf_node_num_cells(right_sibling);
+      for (uint32_t i = 0; i &amp;lt; right_cells - 1; i++) {
+        memcpy(leaf_node_cell(right_sibling, i),
+               leaf_node_cell(right_sibling, i + 1), LEAF_NODE_CELL_SIZE);
+      }
+      *(leaf_node_num_cells(right_sibling)) -= 1;
+
+      *internal_node_key(parent, child_index) =
+          get_node_max_key(table-&amp;gt;pager, node);
+      return;
+    }
+  }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The borrowed cell goes at the end of the current node (it has a higher key). Then we shift the right sibling’s remaining cells left and update the parent’s key for this node.&lt;/p&gt;

&lt;h3 id=&quot;borrowing-from-the-left-sibling&quot;&gt;Borrowing from the Left Sibling&lt;/h3&gt;

&lt;p&gt;If there’s no right sibling, or it has no cells to spare, we try the left:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  if (child_index &amp;gt; 0) {
+    uint32_t left_page = *internal_node_child(parent, child_index - 1);
+    void* left_sibling = get_page(table-&amp;gt;pager, left_page);
+
+    if (*leaf_node_num_cells(left_sibling) &amp;gt; LEAF_NODE_MIN_CELLS) {
+      for (uint32_t i = num_cells; i &amp;gt; 0; i--) {
+        memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i - 1),
+               LEAF_NODE_CELL_SIZE);
+      }
+
+      uint32_t left_cells = *leaf_node_num_cells(left_sibling);
+      memcpy(leaf_node_cell(node, 0),
+             leaf_node_cell(left_sibling, left_cells - 1),
+             LEAF_NODE_CELL_SIZE);
+      *(leaf_node_num_cells(node)) += 1;
+      *(leaf_node_num_cells(left_sibling)) -= 1;
+
+      *internal_node_key(parent, child_index - 1) =
+          get_node_max_key(table-&amp;gt;pager, left_sibling);
+      return;
+    }
+  }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This time the borrowed cell goes at the beginning of the current node (it has a lower key), so we have to shift our existing cells right first. Then update the parent’s key for the left sibling, since its max key has changed.&lt;/p&gt;
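
&lt;p&gt;Both directions of borrowing can be modeled with ordinary array operations. In this Ruby sketch each leaf is assumed to be a sorted array of keys, and the return value is the new separator key the parent should store; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;borrow_from_right&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;borrow_from_left&lt;/code&gt; are hypothetical helpers.&lt;/p&gt;

```ruby
# Toy model of borrowing between sibling leaves. Each leaf is a sorted array
# of keys. Both helpers are hypothetical and return the separator key the
# parent should store for the left-hand leaf of the pair.
def borrow_from_right(node, right)
  node.push(right.shift)  # right's smallest key goes to the end of node
  node.last               # new separator: node's max key
end

def borrow_from_left(left, node)
  node.unshift(left.pop)  # left's largest key goes to the front of node
  left.last               # new separator: left's max key
end

left = [1, 2, 3, 4, 5, 6, 7]
node = [8, 9, 10, 11, 12]
p borrow_from_left(left, node)  # prints 6
p node.first                    # prints 7
```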

&lt;h3 id=&quot;merging&quot;&gt;Merging&lt;/h3&gt;

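&lt;p&gt;If neither sibling can lend a cell, we merge. The merged node always fits in one page: the underflowing leaf has fewer than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEAF_NODE_MIN_CELLS&lt;/code&gt; cells and a sibling that couldn’t lend has at most &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEAF_NODE_MIN_CELLS&lt;/code&gt;, so together they hold at most 5 + 6 = 11 cells, under the maximum of 13. If there is a right sibling, we absorb its cells into the current node:&lt;/p&gt;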

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  if (child_index &amp;lt; parent_num_keys) {
+    uint32_t right_page = *internal_node_child(parent, child_index + 1);
+    void* right_sibling = get_page(table-&amp;gt;pager, right_page);
+    uint32_t right_cells = *leaf_node_num_cells(right_sibling);
+
+    for (uint32_t i = 0; i &amp;lt; right_cells; i++) {
+      memcpy(leaf_node_cell(node, num_cells + i),
+             leaf_node_cell(right_sibling, i), LEAF_NODE_CELL_SIZE);
+    }
+    *(leaf_node_num_cells(node)) = num_cells + right_cells;
+    *leaf_node_next_leaf(node) = *leaf_node_next_leaf(right_sibling);
+
+    *internal_node_key(parent, child_index) =
+        get_node_max_key(table-&amp;gt;pager, node);
+    internal_node_remove_child(table, parent_page_num, child_index + 1);
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We copy all cells from the right sibling, fix the next_leaf pointer chain, update the parent key, and remove the right sibling’s entry from the parent. If the current node is the rightmost child, we merge into the left sibling instead, using the same logic in reverse.&lt;/p&gt;
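
&lt;p&gt;Here is the same merge as a Ruby sketch. It simplifies things on purpose: the parent is assumed to be an array of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[max_key, leaf]&lt;/code&gt; pairs, so it ignores the special right-child slot and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;next_leaf&lt;/code&gt; pointer, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;merge_with_right&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```ruby
# Toy model of merging a leaf with its right sibling. The parent is modeled
# as an array of [max_key, leaf] pairs; merge_with_right is hypothetical.
def merge_with_right(parent, child_index)
  _key, node = parent[child_index]
  _right_key, right = parent[child_index + 1]
  node.concat(right)                       # absorb the sibling's cells
  parent[child_index] = [node.last, node]  # separator becomes the new max
  parent.delete_at(child_index + 1)        # drop the sibling's entry
  parent
end

parent = [[5, [1, 2, 3, 4, 5]], [11, [6, 7, 8, 9, 10, 11]], [20, [15, 20]]]
merge_with_right(parent, 0)
p parent.length        # prints 2
p parent[0][0]         # prints 11
p parent[0][1].length  # prints 11
```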

&lt;h2 id=&quot;removing-a-child-from-an-internal-node&quot;&gt;Removing a Child from an Internal Node&lt;/h2&gt;

&lt;p&gt;When two leaves merge, one of them disappears and we need to remove its entry from the parent. This function handles removal of a child at a given index:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void internal_node_remove_child(Table* table, uint32_t page_num,
+                                uint32_t child_index) {
+  void* node = get_page(table-&amp;gt;pager, page_num);
+  uint32_t num_keys = *internal_node_num_keys(node);
+
+  if (child_index == num_keys) {
+    if (num_keys &amp;gt; 0) {
+      *internal_node_right_child(node) =
+          *internal_node_child(node, num_keys - 1);
+      *(internal_node_num_keys(node)) = num_keys - 1;
+    } else {
+      *internal_node_right_child(node) = INVALID_PAGE_NUM;
+    }
+  } else {
+    for (uint32_t i = child_index; i &amp;lt; num_keys - 1; i++) {
+      memcpy(internal_node_cell(node, i), internal_node_cell(node, i + 1),
+             INTERNAL_NODE_CELL_SIZE);
+    }
+    *(internal_node_num_keys(node)) = num_keys - 1;
+  }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If we’re removing the rightmost child, the last regular cell’s child gets promoted to right child. Otherwise, we shift cells left to fill the gap.&lt;/p&gt;

&lt;p&gt;After removing, this function also checks whether the internal node underflows, and if so, kicks off &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal_node_rebalance()&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;tree-height-reduction&quot;&gt;Tree Height Reduction&lt;/h2&gt;

&lt;p&gt;The most satisfying part: when merges cascade up to the root and the root has zero keys left, its only child is promoted to be the new root. The tree gets shorter:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  if (is_node_root(node) &amp;amp;&amp;amp; *internal_node_num_keys(node) == 0) {
+    uint32_t child_page = *internal_node_right_child(node);
+    if (child_page == INVALID_PAGE_NUM) {
+      return;
+    }
+    void* child = get_page(table-&amp;gt;pager, child_page);
+    memcpy(node, child, PAGE_SIZE);
+    set_node_root(node, true);
+
+    if (get_node_type(node) == NODE_INTERNAL) {
+      uint32_t promoted_keys = *internal_node_num_keys(node);
+      for (uint32_t i = 0; i &amp;lt; promoted_keys; i++) {
+        void* c = get_page(table-&amp;gt;pager, *internal_node_child(node, i));
+        *node_parent(c) = table-&amp;gt;root_page_num;
+      }
+      void* rc = get_page(table-&amp;gt;pager, *internal_node_right_child(node));
+      *node_parent(rc) = table-&amp;gt;root_page_num;
+    }
+  }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We copy the child’s contents into the root page (keeping page 0 as the root), mark it as the root, and update the parent pointers of the promoted node’s children. This is the mirror image of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_new_root()&lt;/code&gt; – one grows the tree, the other shrinks it.&lt;/p&gt;

&lt;h2 id=&quot;triggering-rebalancing&quot;&gt;Triggering Rebalancing&lt;/h2&gt;

&lt;p&gt;Finally, we change &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_delete()&lt;/code&gt; so that it refreshes the parent key when the leaf’s max key changes and calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;leaf_node_rebalance()&lt;/code&gt; after every deletion:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  uint32_t leaf_page_num = cursor-&amp;gt;page_num;
+  uint32_t old_max = get_node_max_key(table-&amp;gt;pager, node);
+
&lt;/span&gt;   leaf_node_delete(cursor);
&lt;span class=&quot;gd&quot;&gt;-
-  free(cursor);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+  free(cursor);
+
+  node = get_page(table-&amp;gt;pager, leaf_page_num);
+  if (!is_node_root(node) &amp;amp;&amp;amp; *leaf_node_num_cells(node) &amp;gt; 0) {
+    uint32_t new_max = get_node_max_key(table-&amp;gt;pager, node);
+    if (new_max != old_max) {
+      uint32_t parent_page = *node_parent(node);
+      void* parent = get_page(table-&amp;gt;pager, parent_page);
+      update_internal_node_key(parent, old_max, new_max);
+    }
+  }
+
+  leaf_node_rebalance(table, leaf_page_num);
&lt;/span&gt;
   return EXECUTE_SUCCESS;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;testing&quot;&gt;Testing&lt;/h2&gt;

&lt;p&gt;Let’s make sure we can delete from multi-level trees:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;deletes rows from a multi-level tree&apos; do
+    script = (1..15).map do |i|
+      &quot;insert #{i} user#{i} person#{i}@example.com&quot;
+    end
+    script &amp;lt;&amp;lt; &quot;delete 7&quot;
+    script &amp;lt;&amp;lt; &quot;.btree&quot;
+    script &amp;lt;&amp;lt; &quot;select&quot;
+    script &amp;lt;&amp;lt; &quot;.exit&quot;
+    result = run_script(script)
+
+    expect(result).to include(&quot;Executed.&quot;)
+    expect(result).not_to include(&quot;(7, user7, person7@example.com)&quot;)
+    expect(result).to include(&quot;db &amp;gt; (1, user1, person1@example.com)&quot;)
+    expect(result).to include(&quot;(15, user15, person15@example.com)&quot;)
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And verify that we can delete every row without crashing:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;handles deleting all rows&apos; do
+    script = [
+      &quot;insert 1 user1 person1@example.com&quot;,
+      &quot;insert 2 user2 person2@example.com&quot;,
+      &quot;insert 3 user3 person3@example.com&quot;,
+      &quot;delete 1&quot;,
+      &quot;delete 2&quot;,
+      &quot;delete 3&quot;,
+      &quot;select&quot;,
+      &quot;.exit&quot;,
+    ]
+    result = run_script(script)
+    expect(result).to match_array([
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; &quot;,
+    ])
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The tree grows when we insert enough rows, and now it can shrink back down when we delete them. That’s the full B-tree lifecycle.&lt;/p&gt;

&lt;p&gt;Next time we’ll add the ability to search for specific rows with a WHERE clause, putting our B-tree index to real use.&lt;/p&gt;
</description>
        <pubDate>Mon, 15 Apr 2024 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part16.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part16.html</guid>
      </item>
    
      <item>
        <title>Part 17 - The WHERE Clause</title>
        <description>&lt;p&gt;Up until now, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select&lt;/code&gt; dumps every row in the table. That’s fine for debugging, but a real database lets you ask for specific rows. Time to add a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause.&lt;/p&gt;

&lt;p&gt;We’ll support filtering on the primary key (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt;), with five operators: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;=&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;=&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;=&lt;/code&gt;. This is where our B-tree starts to really shine – instead of scanning every row, we can jump directly to the one we want.&lt;/p&gt;

&lt;h2 id=&quot;adding-where-to-the-statement&quot;&gt;Adding WHERE to the Statement&lt;/h2&gt;

&lt;p&gt;First, we need a way to represent the filter condition. We’ll add a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WhereOp&lt;/code&gt; enum and two new fields to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Statement&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+typedef enum {
+  WHERE_NONE,
+  WHERE_EQ,
+  WHERE_GT,
+  WHERE_LT,
+  WHERE_GTE,
+  WHERE_LTE,
+} WhereOp;
+
&lt;/span&gt; typedef struct {
   StatementType type;
   Row row_to_insert;  // only used by insert statement
&lt;span class=&quot;gi&quot;&gt;+  WhereOp where_op;   // only used by select statement
+  uint32_t where_id;  // only used by select statement
&lt;/span&gt; } Statement;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE_NONE&lt;/code&gt; means “no filter” – a full table scan, just like before.&lt;/p&gt;

&lt;h2 id=&quot;parsing-the-where-clause&quot;&gt;Parsing the WHERE Clause&lt;/h2&gt;

&lt;p&gt;The syntax is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select where id &amp;lt;op&amp;gt; &amp;lt;value&amp;gt;&lt;/code&gt;. We parse it by tokenizing after the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select&lt;/code&gt; keyword:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gd&quot;&gt;-  if (strcmp(input_buffer-&amp;gt;buffer, &quot;select&quot;) == 0) {
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+  if (strncmp(input_buffer-&amp;gt;buffer, &quot;select&quot;, 6) == 0) {
&lt;/span&gt;     statement-&amp;gt;type = STATEMENT_SELECT;
&lt;span class=&quot;gi&quot;&gt;+    statement-&amp;gt;where_op = WHERE_NONE;
+
+    if (strlen(input_buffer-&amp;gt;buffer) &amp;gt; 6) {
+      char* token = strtok(input_buffer-&amp;gt;buffer, &quot; &quot;);  // &quot;select&quot;
+      token = strtok(NULL, &quot; &quot;);                         // &quot;where&quot;
+      if (token == NULL || strcmp(token, &quot;where&quot;) != 0) {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      token = strtok(NULL, &quot; &quot;);  // &quot;id&quot;
+      if (token == NULL || strcmp(token, &quot;id&quot;) != 0) {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      token = strtok(NULL, &quot; &quot;);  // operator
+      if (token == NULL) {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      if (strcmp(token, &quot;=&quot;) == 0) {
+        statement-&amp;gt;where_op = WHERE_EQ;
+      } else if (strcmp(token, &quot;&amp;gt;&quot;) == 0) {
+        statement-&amp;gt;where_op = WHERE_GT;
+      } else if (strcmp(token, &quot;&amp;lt;&quot;) == 0) {
+        statement-&amp;gt;where_op = WHERE_LT;
+      } else if (strcmp(token, &quot;&amp;gt;=&quot;) == 0) {
+        statement-&amp;gt;where_op = WHERE_GTE;
+      } else if (strcmp(token, &quot;&amp;lt;=&quot;) == 0) {
+        statement-&amp;gt;where_op = WHERE_LTE;
+      } else {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      token = strtok(NULL, &quot; &quot;);  // value
+      if (token == NULL) {
+        return PREPARE_SYNTAX_ERROR;
+      }
+      statement-&amp;gt;where_id = (uint32_t)atoi(token);
+    }
+
&lt;/span&gt;     return PREPARE_SUCCESS;
   }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

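&lt;p&gt;If there’s nothing after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select&lt;/code&gt;, we keep &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE_NONE&lt;/code&gt; and do a full scan as before. If there is, we expect the pattern &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;where id &amp;lt;op&amp;gt; &amp;lt;value&amp;gt;&lt;/code&gt;, with the tokens separated by spaces. The value goes through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;atoi()&lt;/code&gt;, which does no validation: a non-numeric value silently becomes 0, and anything typed after the value is ignored.&lt;/p&gt;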
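
&lt;p&gt;If we wanted stricter input handling than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;atoi()&lt;/code&gt; gives us, the value token could be checked before it’s converted. Here’s a minimal sketch of what that might look like. The helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parse_u32()&lt;/code&gt; is made up for this example and isn’t part of our code. It accepts only decimal digits and rejects values that don’t fit in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uint32_t&lt;/code&gt;:&lt;/p&gt;

```c
#include "stdbool.h"
#include "stdint.h"

// Hypothetical helper, not part of the tutorial code.
// Returns true and stores the value only if text is a non-empty run of
// decimal digits that fits in 32 bits.
bool parse_u32(const char* text, uint32_t* out) {
  if (*text == '\0') {
    return false;  // empty token
  }
  uint64_t value = 0;
  for (const char* p = text; *p != '\0'; p++) {
    if ('0' > *p || *p > '9') {
      return false;  // sign, letter, or any other non-digit
    }
    value = value * 10 + (uint64_t)(*p - '0');
    if (value > UINT32_MAX) {
      return false;  // too large for a uint32_t key
    }
  }
  *out = (uint32_t)value;
  return true;
}
```

&lt;p&gt;In the parser, a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;false&lt;/code&gt; result would map to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PREPARE_SYNTAX_ERROR&lt;/code&gt;. The standard library’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strtoul()&lt;/code&gt; with an end pointer is another route, though it also accepts leading whitespace and a sign.&lt;/p&gt;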

&lt;h2 id=&quot;executing-with-where&quot;&gt;Executing with WHERE&lt;/h2&gt;

&lt;p&gt;Here’s where it gets interesting. We rewrite &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_select()&lt;/code&gt; to use a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;switch&lt;/code&gt; on the operator, with one case per kind of query.&lt;/p&gt;

&lt;h3 id=&quot;point-query-where-id--n&quot;&gt;Point Query (WHERE id = N)&lt;/h3&gt;

&lt;p&gt;For equality, we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_find()&lt;/code&gt; to jump directly to the key. This is an O(log n) lookup – the whole reason we built a B-tree:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+    case WHERE_EQ: {
+      cursor = table_find(table, statement-&amp;gt;where_id);
+      void* node = get_page(table-&amp;gt;pager, cursor-&amp;gt;page_num);
+      uint32_t num_cells = *leaf_node_num_cells(node);
+      if (cursor-&amp;gt;cell_num &amp;lt; num_cells) {
+        uint32_t key = *leaf_node_key(node, cursor-&amp;gt;cell_num);
+        if (key == statement-&amp;gt;where_id) {
+          deserialize_row(cursor_value(cursor), &amp;amp;row);
+          print_row(&amp;amp;row);
+        }
+      }
+      free(cursor);
+      return EXECUTE_SUCCESS;
+    }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_find()&lt;/code&gt; returns the position where the key &lt;em&gt;should&lt;/em&gt; be. We still have to verify it’s actually there, since the key might not exist.&lt;/p&gt;

&lt;h3 id=&quot;range-scan-where-id--n-where-id--n&quot;&gt;Range Scan (WHERE id &amp;gt; N, WHERE id &amp;gt;= N)&lt;/h3&gt;

&lt;p&gt;For greater-than queries, we position the cursor at the first qualifying key and scan forward through the sibling chain:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+    case WHERE_GT:
+    case WHERE_GTE: {
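+      if (statement-&amp;gt;where_op == WHERE_GT &amp;amp;&amp;amp;
+          statement-&amp;gt;where_id == UINT32_MAX) {
+        return EXECUTE_SUCCESS;  // no id is larger, and where_id + 1 would wrap
+      }
+      uint32_t start_key = (statement-&amp;gt;where_op == WHERE_GT)
+                               ? statement-&amp;gt;where_id + 1
+                               : statement-&amp;gt;where_id;
+      cursor = table_find(table, start_key);
+      void* node = get_page(table-&amp;gt;pager, cursor-&amp;gt;page_num);
+      if (cursor-&amp;gt;cell_num &amp;gt;= *leaf_node_num_cells(node)) {
+        // start_key is past the last cell of this leaf: continue in the
+        // next leaf, or stop if this is the rightmost one.
+        uint32_t next_page_num = *leaf_node_next_leaf(node);
+        cursor-&amp;gt;page_num = next_page_num;
+        cursor-&amp;gt;cell_num = 0;
+        cursor-&amp;gt;end_of_table = (next_page_num == 0);
+      }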
+      while (!(cursor-&amp;gt;end_of_table)) {
+        deserialize_row(cursor_value(cursor), &amp;amp;row);
+        print_row(&amp;amp;row);
+        cursor_advance(cursor);
+      }
+      free(cursor);
+      return EXECUTE_SUCCESS;
+    }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

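&lt;p&gt;For &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE id &amp;gt; 5&lt;/code&gt;, we search for key 6. Because ids are integers, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id &amp;gt; 5&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id &amp;gt;= 6&lt;/code&gt; select the same rows, so both operators can share one code path. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_find()&lt;/code&gt; positions us at the first key &amp;gt;= 6, and we scan to the end. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;next_leaf&lt;/code&gt; pointers we added in &lt;a href=&quot;/parts/part12.html&quot;&gt;Part 12&lt;/a&gt; make this traversal seamless across leaf node boundaries.&lt;/p&gt;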

&lt;h3 id=&quot;less-than-scan-where-id--n-where-id--n&quot;&gt;Less-Than Scan (WHERE id &amp;lt; N, WHERE id &amp;lt;= N)&lt;/h3&gt;

&lt;p&gt;For less-than, we start at the beginning and stop when we hit the boundary:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+    case WHERE_LT:
+    case WHERE_LTE: {
+      cursor = table_start(table);
+      uint32_t limit = statement-&amp;gt;where_id;
+      while (!(cursor-&amp;gt;end_of_table)) {
+        void* node = get_page(table-&amp;gt;pager, cursor-&amp;gt;page_num);
+        uint32_t key = *leaf_node_key(node, cursor-&amp;gt;cell_num);
+        if (statement-&amp;gt;where_op == WHERE_LT &amp;amp;&amp;amp; key &amp;gt;= limit) break;
+        if (statement-&amp;gt;where_op == WHERE_LTE &amp;amp;&amp;amp; key &amp;gt; limit) break;
+        deserialize_row(cursor_value(cursor), &amp;amp;row);
+        print_row(&amp;amp;row);
+        cursor_advance(cursor);
+      }
+      free(cursor);
+      return EXECUTE_SUCCESS;
+    }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because our keys are stored in sorted order, we can stop early the moment we see a key that’s too large. We don’t have to scan the whole table.&lt;/p&gt;

&lt;h2 id=&quot;testing&quot;&gt;Testing&lt;/h2&gt;

&lt;p&gt;Let’s try some queries:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;db &amp;gt; insert 1 user1 person1@example.com
Executed.
db &amp;gt; insert 2 user2 person2@example.com
Executed.
db &amp;gt; insert 3 user3 person3@example.com
Executed.
db &amp;gt; select where id = 2
(2, user2, person2@example.com)
Executed.
db &amp;gt; select where id &amp;gt; 1
(2, user2, person2@example.com)
(3, user3, person3@example.com)
Executed.
db &amp;gt; select where id &amp;lt; 3
(1, user1, person1@example.com)
(2, user2, person2@example.com)
Executed.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And the automated tests:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;selects a single row with where id =&apos; do
+    script = (1..5).map do |i|
+      &quot;insert #{i} user#{i} person#{i}@example.com&quot;
+    end
+    script &amp;lt;&amp;lt; &quot;select where id = 3&quot;
+    script &amp;lt;&amp;lt; &quot;.exit&quot;
+    result = run_script(script)
+    expect(result).to include(&quot;(3, user3, person3@example.com)&quot;)
+    expect(result).not_to include(&quot;(1, user1, person1@example.com)&quot;)
+    expect(result).not_to include(&quot;(5, user5, person5@example.com)&quot;)
+  end
+
+  it &apos;selects rows with where id &amp;gt;&apos; do
+    script = (1..5).map do |i|
+      &quot;insert #{i} user#{i} person#{i}@example.com&quot;
+    end
+    script &amp;lt;&amp;lt; &quot;select where id &amp;gt; 3&quot;
+    script &amp;lt;&amp;lt; &quot;.exit&quot;
+    result = run_script(script)
+    expect(result).to include(&quot;(4, user4, person4@example.com)&quot;)
+    expect(result).to include(&quot;(5, user5, person5@example.com)&quot;)
+    expect(result).not_to include(&quot;(3, user3, person3@example.com)&quot;)
+  end
+
+  it &apos;selects rows with where id &amp;lt;&apos; do
+    script = (1..5).map do |i|
+      &quot;insert #{i} user#{i} person#{i}@example.com&quot;
+    end
+    script &amp;lt;&amp;lt; &quot;select where id &amp;lt; 3&quot;
+    script &amp;lt;&amp;lt; &quot;.exit&quot;
+    result = run_script(script)
+    expect(result).to include(&quot;(1, user1, person1@example.com)&quot;)
+    expect(result).to include(&quot;(2, user2, person2@example.com)&quot;)
+    expect(result).not_to include(&quot;(3, user3, person3@example.com)&quot;)
+  end
+
+  it &apos;returns nothing for where clause with no matches&apos; do
+    script = [
+      &quot;insert 1 user1 person1@example.com&quot;,
+      &quot;select where id = 5&quot;,
+      &quot;.exit&quot;,
+    ]
+    result = run_script(script)
+    expect(result).to match_array([
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; Executed.&quot;,
+      &quot;db &amp;gt; &quot;,
+    ])
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice how the equality query uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_find()&lt;/code&gt; – a single O(log n) B-tree traversal – rather than scanning every row. This is the payoff for all that B-tree work. A full table scan touches every row. A point query touches only the pages on the path from root to the target leaf. For a table with millions of rows, that’s the difference between reading a handful of pages and reading every page in the file.&lt;/p&gt;

&lt;p&gt;Next time we’ll overhaul our pager into a proper buffer pool with dirty page tracking and LRU eviction.&lt;/p&gt;
</description>
        <pubDate>Wed, 01 May 2024 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part17.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part17.html</guid>
      </item>
    
      <item>
        <title>Part 18 - A Page Cache and Buffer Pool</title>
        <description>&lt;blockquote&gt;
  &lt;p&gt;“Cache rules everything around me.” – adapted from Wu-Tang Clan&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Our pager has a dirty little secret: when the database closes, it writes every page it has loaded back to disk, whether the page changed or not. And it happily loads as many pages into memory as there are pages in the file. For a small database that’s fine, but a real database could have millions of pages. We can’t fit them all in memory.&lt;/p&gt;

&lt;p&gt;In this part we’re going to turn our naive pager into a proper buffer pool. That means three things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Dirty page tracking&lt;/strong&gt; – only write back pages that actually changed&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;LRU eviction&lt;/strong&gt; – when the buffer is full, evict the least recently used page&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Bounded memory&lt;/strong&gt; – limit the number of pages we hold in memory at once&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;expanding-the-pager&quot;&gt;Expanding the Pager&lt;/h2&gt;

&lt;p&gt;First, let’s add the new fields. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dirty&lt;/code&gt; tracks which pages have been modified. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;access_time&lt;/code&gt; records when each page was last accessed. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clock&lt;/code&gt; is a monotonically increasing counter:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+#define BUFFER_POOL_SIZE 100
+
&lt;/span&gt; typedef struct {
   int file_descriptor;
   uint32_t file_length;
   uint32_t num_pages;
   void* pages[TABLE_MAX_PAGES];
&lt;span class=&quot;gi&quot;&gt;+  bool dirty[TABLE_MAX_PAGES];
+  uint32_t access_time[TABLE_MAX_PAGES];
+  uint32_t clock;
&lt;/span&gt; } Pager;
&lt;span class=&quot;gi&quot;&gt;+
+void pager_flush(Pager* pager, uint32_t page_num);
+void pager_mark_dirty(Pager* pager, uint32_t page_num);
+uint32_t pager_pages_in_memory(Pager* pager);
+void pager_evict_lru(Pager* pager);
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BUFFER_POOL_SIZE&lt;/code&gt; limits us to 100 pages in memory. With 4 KB pages, that’s about 400 KB of memory. For comparison, SQLite’s default page cache is about 2 MB, which is roughly 500 pages of this size.&lt;/p&gt;

&lt;p&gt;Initialize the new fields in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_open()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  pager-&amp;gt;clock = 0;
&lt;/span&gt;   for (uint32_t i = 0; i &amp;lt; TABLE_MAX_PAGES; i++) {
     pager-&amp;gt;pages[i] = NULL;
&lt;span class=&quot;gi&quot;&gt;+    pager-&amp;gt;dirty[i] = false;
+    pager-&amp;gt;access_time[i] = 0;
&lt;/span&gt;   }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;marking-pages-dirty&quot;&gt;Marking Pages Dirty&lt;/h2&gt;

&lt;p&gt;Whenever we modify a page, we need to mark it dirty so we know to write it back:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void pager_mark_dirty(Pager* pager, uint32_t page_num) {
+  pager-&amp;gt;dirty[page_num] = true;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We add calls to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_mark_dirty()&lt;/code&gt; in every function that modifies page data: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;leaf_node_insert&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;leaf_node_delete&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;leaf_node_split_and_insert&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_new_root&lt;/code&gt;, and so on. Anywhere a page’s bytes change, we mark it dirty.&lt;/p&gt;

&lt;h2 id=&quot;lru-eviction&quot;&gt;LRU Eviction&lt;/h2&gt;

&lt;p&gt;When we need to load a new page but the buffer pool is full, we evict the least recently used page. If it’s dirty, we flush it to disk first:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+uint32_t pager_pages_in_memory(Pager* pager) {
+  uint32_t count = 0;
+  for (uint32_t i = 0; i &amp;lt; TABLE_MAX_PAGES; i++) {
+    if (pager-&amp;gt;pages[i] != NULL) count++;
+  }
+  return count;
+}
+
+void pager_evict_lru(Pager* pager) {
+  uint32_t lru_page = INVALID_PAGE_NUM;
+  uint32_t min_time = UINT32_MAX;
+
+  for (uint32_t i = 0; i &amp;lt; TABLE_MAX_PAGES; i++) {
+    if (pager-&amp;gt;pages[i] != NULL &amp;amp;&amp;amp; pager-&amp;gt;access_time[i] &amp;lt; min_time) {
+      min_time = pager-&amp;gt;access_time[i];
+      lru_page = i;
+    }
+  }
+
+  if (lru_page == INVALID_PAGE_NUM) return;
+
+  if (pager-&amp;gt;dirty[lru_page]) {
+    pager_flush(pager, lru_page);
+    pager-&amp;gt;dirty[lru_page] = false;
+  }
+
+  free(pager-&amp;gt;pages[lru_page]);
+  pager-&amp;gt;pages[lru_page] = NULL;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is the simplest possible LRU implementation: scan the array and find the page with the smallest access time. A real database would use a doubly-linked list to make eviction O(1), but for our purposes the linear scan is fine.&lt;/p&gt;
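
&lt;p&gt;To make the linked-list idea concrete, here’s a rough standalone sketch. It isn’t something we’ll wire into the pager, and every name in it (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LruList&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lru_touch()&lt;/code&gt;, and so on) is invented for illustration. The list is stored as arrays of slot indices instead of heap-allocated nodes, which keeps the code short:&lt;/p&gt;

```c
#include "stdint.h"

// Hypothetical names throughout; a tiny pool keeps the example readable.
enum { LRU_SLOTS = 4, LRU_SENTINEL = LRU_SLOTS };

// Circular doubly-linked list stored as index arrays. The extra entry at
// LRU_SENTINEL is the list head: its next is the most recently used slot
// and its prev is the least recently used one.
typedef struct {
  uint32_t prev[LRU_SLOTS + 1];
  uint32_t next[LRU_SLOTS + 1];
} LruList;

void lru_init(LruList* list) {
  list->prev[LRU_SENTINEL] = LRU_SENTINEL;
  list->next[LRU_SENTINEL] = LRU_SENTINEL;
}

// Link a slot that is not yet in the list at the most-recent end.
void lru_push_front(LruList* list, uint32_t slot) {
  uint32_t old_first = list->next[LRU_SENTINEL];
  list->prev[slot] = LRU_SENTINEL;
  list->next[slot] = old_first;
  list->prev[old_first] = slot;
  list->next[LRU_SENTINEL] = slot;
}

// Remove a slot from the list, for example when its page is evicted.
void lru_unlink(LruList* list, uint32_t slot) {
  list->next[list->prev[slot]] = list->next[slot];
  list->prev[list->next[slot]] = list->prev[slot];
}

// Record an access to a slot that is already in the list. O(1).
void lru_touch(LruList* list, uint32_t slot) {
  lru_unlink(list, slot);
  lru_push_front(list, slot);
}

// Least recently used slot, or LRU_SENTINEL if the list is empty. O(1).
uint32_t lru_victim(const LruList* list) {
  return list->prev[LRU_SENTINEL];
}
```

&lt;p&gt;With something like this in place, a cache hit would call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lru_touch()&lt;/code&gt;, a freshly loaded page would call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lru_push_front()&lt;/code&gt;, and eviction would read &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lru_victim()&lt;/code&gt; instead of scanning every access time.&lt;/p&gt;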

&lt;h2 id=&quot;updating-get_page&quot;&gt;Updating get_page()&lt;/h2&gt;

&lt;p&gt;Now we integrate eviction into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get_page()&lt;/code&gt;. On a cache miss, we check if the buffer is full and evict if necessary. On every access, we update the access time:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;   if (pager-&amp;gt;pages[page_num] == NULL) {
&lt;span class=&quot;gi&quot;&gt;+    // Cache miss. Evict if the buffer pool is full.
+    if (pager_pages_in_memory(pager) &amp;gt;= BUFFER_POOL_SIZE) {
+      pager_evict_lru(pager);
+    }
+
&lt;/span&gt;     // Allocate memory and load from file.
     void* page = malloc(PAGE_SIZE);
     ...
     pager-&amp;gt;pages[page_num] = page;
&lt;span class=&quot;gi&quot;&gt;+    pager-&amp;gt;dirty[page_num] = false;
&lt;/span&gt;   }

&lt;span class=&quot;gi&quot;&gt;+  pager-&amp;gt;access_time[page_num] = pager-&amp;gt;clock++;
&lt;/span&gt;   return pager-&amp;gt;pages[page_num];
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Every time a page is accessed – whether it was already in memory or just loaded – the access time updates. Pages that haven’t been touched recently will have the lowest access times and get evicted first.&lt;/p&gt;

&lt;h2 id=&quot;only-flushing-dirty-pages&quot;&gt;Only Flushing Dirty Pages&lt;/h2&gt;

&lt;p&gt;Finally, update &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db_close()&lt;/code&gt; to only write back pages that were modified:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;   for (uint32_t i = 0; i &amp;lt; pager-&amp;gt;num_pages; i++) {
     if (pager-&amp;gt;pages[i] == NULL) {
       continue;
     }
&lt;span class=&quot;gd&quot;&gt;-    pager_flush(pager, i);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+    if (pager-&amp;gt;dirty[i]) {
+      pager_flush(pager, i);
+      pager-&amp;gt;dirty[i] = false;
+    }
&lt;/span&gt;     free(pager-&amp;gt;pages[i]);
     pager-&amp;gt;pages[i] = NULL;
   }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a significant optimization. If you insert one row and then exit, we used to write back every page we’d ever read. Now we only write back the pages that actually changed.&lt;/p&gt;

&lt;h2 id=&quot;a-note-on-pinning&quot;&gt;A Note on Pinning&lt;/h2&gt;

&lt;p&gt;There’s a subtlety we’re not handling: page pinning. When a B-tree split is in progress, we might have several pages in flight that absolutely must not be evicted. A real buffer pool uses a pin count – &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get_page()&lt;/code&gt; increments it, and the caller decrements it when done. A pinned page is never evicted. We get away without pins because each operation in our code touches only a handful of pages, and with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BUFFER_POOL_SIZE&lt;/code&gt; of 100 the least recently used page is not one of them. A production system would need proper pin management.&lt;/p&gt;
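
&lt;p&gt;For the curious, here’s roughly how pin counts could change victim selection. This is a sketch under assumptions, not code from our pager: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinTable&lt;/code&gt; struct and the function names are hypothetical, and the pool is shrunk to four pages to keep it small:&lt;/p&gt;

```c
#include "stdbool.h"
#include "stdint.h"

// Hypothetical names; a four-page pool is enough to show the idea.
enum { POOL_PAGES = 4 };
#define NO_VICTIM UINT32_MAX

typedef struct {
  bool resident[POOL_PAGES];         // is the page currently in memory?
  uint32_t pin_count[POOL_PAGES];    // callers currently using the page
  uint32_t access_time[POOL_PAGES];  // same idea as the pager clock
} PinTable;

void pin_page(PinTable* table, uint32_t page_num) {
  table->pin_count[page_num]++;
}

void unpin_page(PinTable* table, uint32_t page_num) {
  if (table->pin_count[page_num] > 0) {
    table->pin_count[page_num]--;
  }
}

// Least recently used page that is in memory and has no pins.
// Returns NO_VICTIM when every resident page is pinned.
uint32_t choose_victim(const PinTable* table) {
  uint32_t victim = NO_VICTIM;
  for (uint32_t i = 0; i != POOL_PAGES; i++) {
    if (!table->resident[i] || table->pin_count[i] != 0) {
      continue;  // absent or in use: never a candidate
    }
    if (victim == NO_VICTIM ||
        table->access_time[victim] > table->access_time[i]) {
      victim = i;
    }
  }
  return victim;
}
```

&lt;p&gt;The new failure mode is the interesting part: when every page is pinned there is nothing to evict, so the caller has to handle &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NO_VICTIM&lt;/code&gt; somehow, typically by failing the request or growing the pool.&lt;/p&gt;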

&lt;h2 id=&quot;testing&quot;&gt;Testing&lt;/h2&gt;

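&lt;p&gt;The existing tests continue to pass – dirty page tracking is invisible to the user. Let’s add one test to verify persistence still works with the new buffer pool. With only 20 rows the database spans just a few pages, so this test exercises dirty page tracking but never triggers eviction:&lt;/p&gt;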

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;persists data correctly with dirty page tracking&apos; do
+    script = (1..20).map do |i|
+      &quot;insert #{i} user#{i} person#{i}@example.com&quot;
+    end
+    script &amp;lt;&amp;lt; &quot;.exit&quot;
+    run_script(script)
+
+    result = run_script([
+      &quot;select where id = 10&quot;,
+      &quot;select where id = 20&quot;,
+      &quot;.exit&quot;,
+    ])
+    expect(result).to include(&quot;(10, user10, person10@example.com)&quot;)
+    expect(result).to include(&quot;(20, user20, person20@example.com)&quot;)
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next time we’ll tackle variable-length records so our database can store strings of different lengths efficiently.&lt;/p&gt;
</description>
        <pubDate>Wed, 15 May 2024 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part18.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part18.html</guid>
      </item>
    
      <item>
        <title>Part 19 - Variable-Length Records</title>
        <description>&lt;p&gt;Up until now, every row in our database takes the same amount of space on disk: 293 bytes. A username of “a” takes 33 bytes. A username of “abcdefghijklmnopqrstuvwxyz012345” also takes 33 bytes. That’s a lot of wasted space for short strings.&lt;/p&gt;

&lt;p&gt;Real databases store strings using a &lt;strong&gt;variable-length&lt;/strong&gt; format. Instead of allocating the maximum possible size for every string, they store the actual length followed by only the bytes that are used. Let’s implement that.&lt;/p&gt;

&lt;h2 id=&quot;length-prefixed-strings&quot;&gt;Length-Prefixed Strings&lt;/h2&gt;

&lt;p&gt;The standard approach for serializing variable-length data is &lt;strong&gt;length-prefixing&lt;/strong&gt;: write the number of bytes first, then the actual data. Our new serialized row format looks like this:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;field&lt;/th&gt;
      &lt;th&gt;size&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;id&lt;/td&gt;
      &lt;td&gt;4 bytes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;username_len&lt;/td&gt;
      &lt;td&gt;4 bytes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;username&lt;/td&gt;
      &lt;td&gt;32 bytes (max)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;email_len&lt;/td&gt;
      &lt;td&gt;4 bytes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;email&lt;/td&gt;
      &lt;td&gt;255 bytes (max)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

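&lt;p&gt;We’re adding 8 bytes of overhead for the two length fields, and dropping the 2 bytes we used to spend on null terminators, since the length now marks where each string ends. That changes our constants:&lt;/p&gt;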

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; const uint32_t ID_SIZE = size_of_attribute(Row, id);
&lt;span class=&quot;gd&quot;&gt;-const uint32_t USERNAME_SIZE = size_of_attribute(Row, username);
-const uint32_t EMAIL_SIZE = size_of_attribute(Row, email);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+const uint32_t VARCHAR_LEN_SIZE = sizeof(uint32_t);
+
+/*
+ * Serialized Row Layout (length-prefixed strings)
+ *
+ * | id (4) | username_len (4) | username (32) | email_len (4) | email (255) |
+ */
&lt;/span&gt; const uint32_t ID_OFFSET = 0;
&lt;span class=&quot;gd&quot;&gt;-const uint32_t USERNAME_OFFSET = ID_OFFSET + ID_SIZE;
-const uint32_t EMAIL_OFFSET = USERNAME_OFFSET + USERNAME_SIZE;
-const uint32_t ROW_SIZE = ID_SIZE + USERNAME_SIZE + EMAIL_SIZE;
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+const uint32_t USERNAME_LEN_OFFSET = ID_OFFSET + ID_SIZE;
+const uint32_t USERNAME_OFFSET = USERNAME_LEN_OFFSET + VARCHAR_LEN_SIZE;
+const uint32_t EMAIL_LEN_OFFSET = USERNAME_OFFSET + COLUMN_USERNAME_SIZE;
+const uint32_t EMAIL_OFFSET = EMAIL_LEN_OFFSET + VARCHAR_LEN_SIZE;
+const uint32_t ROW_SIZE =
+    ID_SIZE + VARCHAR_LEN_SIZE + COLUMN_USERNAME_SIZE + VARCHAR_LEN_SIZE +
+    COLUMN_EMAIL_SIZE;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW_SIZE&lt;/code&gt; goes from 293 to 299 bytes. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEAF_NODE_CELL_SIZE&lt;/code&gt; goes from 297 to 303. But &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LEAF_NODE_MAX_CELLS&lt;/code&gt; stays at 13: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4082 / 303&lt;/code&gt; is about 13.47, and a partial cell doesn’t count.&lt;/p&gt;
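
&lt;p&gt;If you’d like to check these numbers yourself, the arithmetic fits in a few lines. This is only a sketch: the constant and function names are made up for the example, and the 14-byte leaf header is an assumption inferred from the 4082 bytes of cell space quoted above (4096 minus 14):&lt;/p&gt;

```c
// Sizes restated from this series. The names are for illustration only.
enum {
  PAGE_BYTES = 4096,
  LEAF_HEADER_BYTES = 14,  // assumed: leaves 4082 bytes for cells
  KEY_BYTES = 4,
  LEN_PREFIX_BYTES = 4,
  OLD_ROW_BYTES = 4 + 33 + 256,  // id + username[33] + email[256]
  NEW_ROW_BYTES = 4 + LEN_PREFIX_BYTES + 32 + LEN_PREFIX_BYTES + 255
};

// Whole cells that fit in one leaf. Integer division drops the partial cell.
unsigned max_cells_per_leaf(unsigned row_bytes) {
  unsigned space_for_cells = PAGE_BYTES - LEAF_HEADER_BYTES;
  return space_for_cells / (KEY_BYTES + row_bytes);
}

// Largest row size that still leaves room for the given number of cells.
unsigned max_row_bytes_for(unsigned cells) {
  unsigned space_for_cells = PAGE_BYTES - LEAF_HEADER_BYTES;
  return space_for_cells / cells - KEY_BYTES;
}
```

&lt;p&gt;Under those assumptions, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_row_bytes_for(13)&lt;/code&gt; comes out to 310, so a row could grow by another 11 bytes before a leaf dropped to 12 cells.&lt;/p&gt;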

&lt;h2 id=&quot;new-serialization&quot;&gt;New Serialization&lt;/h2&gt;

&lt;p&gt;Here’s the new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;serialize_row()&lt;/code&gt;. We &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memset&lt;/code&gt; the entire destination to zero first – this ensures the unused bytes after each string are clean:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; void serialize_row(Row* source, void* destination) {
&lt;span class=&quot;gd&quot;&gt;-  memcpy(destination + ID_OFFSET, &amp;amp;(source-&amp;gt;id), ID_SIZE);
-  memcpy(destination + USERNAME_OFFSET, &amp;amp;(source-&amp;gt;username), USERNAME_SIZE);
-  memcpy(destination + EMAIL_OFFSET, &amp;amp;(source-&amp;gt;email), EMAIL_SIZE);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+  memset(destination, 0, ROW_SIZE);
+  memcpy(destination + ID_OFFSET, &amp;amp;(source-&amp;gt;id), ID_SIZE);
+  uint32_t username_len = strlen(source-&amp;gt;username);
+  memcpy(destination + USERNAME_LEN_OFFSET, &amp;amp;username_len, VARCHAR_LEN_SIZE);
+  memcpy(destination + USERNAME_OFFSET, source-&amp;gt;username, username_len);
+  uint32_t email_len = strlen(source-&amp;gt;email);
+  memcpy(destination + EMAIL_LEN_OFFSET, &amp;amp;email_len, VARCHAR_LEN_SIZE);
+  memcpy(destination + EMAIL_OFFSET, source-&amp;gt;email, email_len);
&lt;/span&gt; }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deserialize_row()&lt;/code&gt; reads the length first, then copies only that many bytes:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; void deserialize_row(void* source, Row* destination) {
&lt;span class=&quot;gd&quot;&gt;-  memcpy(&amp;amp;(destination-&amp;gt;id), source + ID_OFFSET, ID_SIZE);
-  memcpy(&amp;amp;(destination-&amp;gt;username), source + USERNAME_OFFSET, USERNAME_SIZE);
-  memcpy(&amp;amp;(destination-&amp;gt;email), source + EMAIL_OFFSET, EMAIL_SIZE);
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+  memcpy(&amp;amp;(destination-&amp;gt;id), source + ID_OFFSET, ID_SIZE);
+  uint32_t username_len;
+  memcpy(&amp;amp;username_len, source + USERNAME_LEN_OFFSET, VARCHAR_LEN_SIZE);
+  memset(destination-&amp;gt;username, 0, COLUMN_USERNAME_SIZE + 1);
+  memcpy(destination-&amp;gt;username, source + USERNAME_OFFSET, username_len);
+  uint32_t email_len;
+  memcpy(&amp;amp;email_len, source + EMAIL_LEN_OFFSET, VARCHAR_LEN_SIZE);
+  memset(destination-&amp;gt;email, 0, COLUMN_EMAIL_SIZE + 1);
+  memcpy(destination-&amp;gt;email, source + EMAIL_OFFSET, email_len);
&lt;/span&gt; }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memset&lt;/code&gt; to zero before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcpy&lt;/code&gt; ensures the destination string is properly null-terminated. This is important because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strlen()&lt;/code&gt; relies on that null byte.&lt;/p&gt;

&lt;h2 id=&quot;actual-vs-allocated-space&quot;&gt;Actual vs Allocated Space&lt;/h2&gt;

&lt;p&gt;We can compute how much space a row actually uses versus how much it’s allocated:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+uint32_t row_data_size(Row* row) {
+  return ID_SIZE + VARCHAR_LEN_SIZE + strlen(row-&amp;gt;username) + VARCHAR_LEN_SIZE +
+         strlen(row-&amp;gt;email);
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For a row like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(1, a, b@c.com)&lt;/code&gt;, the actual data is only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4 + 4 + 1 + 4 + 7 = 20 bytes&lt;/code&gt;. But we allocate 299 bytes for it. That’s 279 bytes of wasted space!&lt;/p&gt;

&lt;h2 id=&quot;the-elephant-in-the-room-slotted-pages&quot;&gt;The Elephant in the Room: Slotted Pages&lt;/h2&gt;

&lt;p&gt;Our cells still occupy fixed-size slots in the leaf node. Even though we serialize the data more efficiently, we pad the slot to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROW_SIZE&lt;/code&gt;. A real database solves this with a &lt;strong&gt;slotted page&lt;/strong&gt; format:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;+--------------------------------------------+
| Header | Ptr1 | Ptr2 | Ptr3 | ...          |
+--------------------------------------------+
|                 Free Space                 |
+--------------------------------------------+
| ... | Cell3 data | Cell2 data | Cell1 data |
+--------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The cell pointer directory grows downward from the header. Actual cell data is packed upward from the bottom of the page. Each cell takes only as much space as it needs. The free space in the middle shrinks as you add cells.&lt;/p&gt;

&lt;p&gt;This is how SQLite, PostgreSQL, and most real databases lay out their pages. We could implement this, but it would require rewriting every function that accesses leaf node cells. For now, our length-prefixed format gives us the serialization story without that complexity.&lt;/p&gt;
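&lt;p&gt;To make the layout concrete, here is a minimal sketch of a slotted page in C. It is not part of our database: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SlottedHeader&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;slotted_init&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;slotted_insert&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;slotted_get&lt;/code&gt; are names invented for this illustration, and the 16-bit offsets assume a page no larger than 64 KB. Real formats also record each cell’s length and handle deletion, which this sketch leaves out.&lt;/p&gt;

<![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical header stored at the start of the page. */
typedef struct {
  uint16_t num_cells;   /* entries in the pointer directory */
  uint16_t data_start;  /* offset of the lowest byte used by cell data */
} SlottedHeader;

void slotted_init(uint8_t* page) {
  SlottedHeader h = {0, PAGE_SIZE};
  memcpy(page, &h, sizeof h);
}

/* Free bytes between the end of the directory and the start of cell data. */
uint32_t slotted_free_space(const uint8_t* page) {
  SlottedHeader h;
  memcpy(&h, page, sizeof h);
  uint32_t dir_end = (uint32_t)(sizeof h + h.num_cells * sizeof(uint16_t));
  return h.data_start - dir_end;
}

/* Stores a cell of `len` bytes; returns its slot number, or -1 if it does not fit. */
int slotted_insert(uint8_t* page, const void* cell, uint16_t len) {
  assert(page != NULL && cell != NULL);
  SlottedHeader h;
  memcpy(&h, page, sizeof h);
  if (slotted_free_space(page) < len + sizeof(uint16_t)) return -1;

  /* Cell data is packed from the end of the page toward the header. */
  h.data_start = (uint16_t)(h.data_start - len);
  memcpy(page + h.data_start, cell, len);

  /* The directory grows from the header toward the cell data. */
  uint16_t offset = h.data_start;
  memcpy(page + sizeof h + h.num_cells * sizeof(uint16_t), &offset, sizeof offset);

  int slot = h.num_cells++;
  memcpy(page, &h, sizeof h);
  return slot;
}

/* Returns a pointer to the first byte of the cell stored in `slot`. */
const uint8_t* slotted_get(const uint8_t* page, uint16_t slot) {
  uint16_t offset;
  memcpy(&offset, page + sizeof(SlottedHeader) + slot * sizeof(uint16_t), sizeof offset);
  return page + offset;
}
```
]]>

&lt;p&gt;Each insert costs exactly the cell’s length plus one 2-byte directory entry, so short rows no longer pay for the longest possible row.&lt;/p&gt;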

&lt;h2 id=&quot;when-strings-outgrow-a-page-overflow-pages&quot;&gt;When Strings Outgrow a Page: Overflow Pages&lt;/h2&gt;

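&lt;p&gt;What about a TEXT column that holds a 10 KB blog post? It doesn’t fit in a single 4 KB page. Real databases handle this with &lt;strong&gt;overflow pages&lt;/strong&gt;: the cell stores a pointer to a separate chain of pages that hold the part of the value that doesn’t fit. SQLite spills a table row to overflow pages once its payload no longer fits in the usable page size minus 35 bytes; index entries spill much earlier, at roughly a quarter of a page. PostgreSQL solves the same problem with TOAST, which moves large values into a separate table.&lt;/p&gt;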

&lt;p&gt;We don’t need overflow pages because our column sizes are bounded (32 and 255 bytes), but it’s worth knowing the pattern.&lt;/p&gt;

&lt;h2 id=&quot;testing&quot;&gt;Testing&lt;/h2&gt;

&lt;p&gt;Short strings, long strings – they all work:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;handles variable-length strings correctly&apos; do
+    script = [
+      &quot;insert 1 a b@c.com&quot;,
+      &quot;insert 2 longername longemail@example.com&quot;,
+      &quot;select&quot;,
+      &quot;.exit&quot;,
+    ]
+    result = run_script(script)
+    expect(result).to include(&quot;(1, a, b@c.com)&quot;)
+    expect(result).to include(&quot;(2, longername, longemail@example.com)&quot;)
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And our updated constants:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Constants:
ROW_SIZE: 299
COMMON_NODE_HEADER_SIZE: 6
LEAF_NODE_HEADER_SIZE: 14
LEAF_NODE_CELL_SIZE: 303
LEAF_NODE_SPACE_FOR_CELLS: 4082
LEAF_NODE_MAX_CELLS: 13
VARCHAR_LEN_SIZE: 4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next time we’ll add secondary indexes – separate B-trees that let you look up rows by columns other than the primary key.&lt;/p&gt;
</description>
        <pubDate>Sat, 01 Jun 2024 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part19.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part19.html</guid>
      </item>
    
      <item>
        <title>Part 20 - Secondary Indexes</title>
        <description>&lt;p&gt;Our WHERE clause can find rows by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt; efficiently because &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt; is the primary key – it’s the key in our B-tree. But what if you want to find a user by their username? Right now, that means scanning every row. For a table with millions of rows, that’s unacceptable.&lt;/p&gt;

&lt;p&gt;The answer is a &lt;strong&gt;secondary index&lt;/strong&gt;: a separate data structure that maps a non-primary column to the primary key. Instead of scanning, you look up the username in the index, get back the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt;, then use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt; to fetch the full row from the primary B-tree. Two lookups instead of a million.&lt;/p&gt;

&lt;h2 id=&quot;how-a-secondary-index-works&quot;&gt;How a Secondary Index Works&lt;/h2&gt;

&lt;p&gt;The primary B-tree stores rows keyed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Primary B-tree: id -&amp;gt; (id, username, email)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A secondary index on username maps usernames to primary keys:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Username index: hash(username) -&amp;gt; id
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To look up &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;username = &quot;bob&quot;&lt;/code&gt;:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Hash “bob” to get a key&lt;/li&gt;
  &lt;li&gt;Search the index for that key -&amp;gt; get &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id = 2&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Search the primary B-tree for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;id = 2&lt;/code&gt; -&amp;gt; get the full row&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;the-hash-function&quot;&gt;The Hash Function&lt;/h2&gt;

&lt;p&gt;We need a hash function to turn strings into integer keys. We’ll use &lt;a href=&quot;http://www.cse.yorku.ca/~oz/hash.html&quot;&gt;djb2&lt;/a&gt;, a simple and effective string hash:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+uint32_t hash_string(const char* str) {
+  uint32_t hash = 5381;
+  int c;
+  while ((c = (unsigned char)*str++)) {
+    hash = ((hash &amp;lt;&amp;lt; 5) + hash) + c;
+  }
+  return hash;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;index-page-format&quot;&gt;Index Page Format&lt;/h2&gt;

&lt;p&gt;For simplicity, we’ll store our index as a sorted array of (hash, primary_key) pairs packed into a single page. Each entry is 8 bytes, giving us room for 511 entries:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+const uint32_t INDEX_ENTRY_SIZE = 2 * sizeof(uint32_t);
+const uint32_t INDEX_HEADER_SIZE = sizeof(uint32_t);
+const uint32_t INDEX_MAX_ENTRIES =
+    (PAGE_SIZE - INDEX_HEADER_SIZE) / INDEX_ENTRY_SIZE;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A real database would use a full B-tree for the index (just like the primary table). We’re using a flat sorted array for clarity, but the concept is the same: a separate data structure that maps column values to primary keys.&lt;/p&gt;
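&lt;p&gt;The accessors used in the next section (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index_num_entries&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index_entry_hash&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index_entry_pk&lt;/code&gt;) follow from this layout. One plausible way to define them is sketched below on a bare page buffer rather than through the Pager. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index_page_insert&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index_page_find&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INDEX_NOT_FOUND&lt;/code&gt; are made up for this sketch, and the buffer is assumed to be 4-byte aligned.&lt;/p&gt;

<![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define INDEX_HEADER_SIZE sizeof(uint32_t)       /* entry count */
#define INDEX_ENTRY_SIZE (2 * sizeof(uint32_t))  /* hash + primary key */
#define INDEX_MAX_ENTRIES ((PAGE_SIZE - INDEX_HEADER_SIZE) / INDEX_ENTRY_SIZE)
#define INDEX_NOT_FOUND UINT32_MAX               /* hypothetical sentinel */

/* Page layout: one uint32_t count, then (hash, primary_key) pairs. */
uint32_t* index_num_entries(void* page) { return (uint32_t*)page; }
uint32_t* index_entry_hash(void* page, uint32_t i) {
  return (uint32_t*)page + 1 + 2 * i;
}
uint32_t* index_entry_pk(void* page, uint32_t i) {
  return (uint32_t*)page + 2 + 2 * i;
}

/* First position whose hash is >= hash_key (a lower-bound binary search). */
uint32_t index_lower_bound(void* page, uint32_t hash_key) {
  uint32_t lo = 0, hi = *index_num_entries(page);
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (*index_entry_hash(page, mid) < hash_key) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;
}

/* Inserts in sorted order. Returns 0 on success, -1 if the page is full. */
int index_page_insert(void* page, uint32_t hash_key, uint32_t pk) {
  assert(page != NULL);
  uint32_t num = *index_num_entries(page);
  if (num >= INDEX_MAX_ENTRIES) return -1;
  uint32_t pos = index_lower_bound(page, hash_key);
  /* Shift the tail one entry to the right to open a gap at pos. */
  memmove(index_entry_hash(page, pos + 1), index_entry_hash(page, pos),
          (num - pos) * INDEX_ENTRY_SIZE);
  *index_entry_hash(page, pos) = hash_key;
  *index_entry_pk(page, pos) = pk;
  *index_num_entries(page) = num + 1;
  return 0;
}

/* Returns the primary key of the first entry with this hash, if any. */
uint32_t index_page_find(void* page, uint32_t hash_key) {
  uint32_t pos = index_lower_bound(page, hash_key);
  if (pos < *index_num_entries(page) && *index_entry_hash(page, pos) == hash_key) {
    return *index_entry_pk(page, pos);
  }
  return INDEX_NOT_FOUND;
}
```
]]>

&lt;p&gt;Returning an error when the page is full is a deliberate choice here: with one page there is simply nowhere to put entry number 512.&lt;/p&gt;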

&lt;h2 id=&quot;index-operations&quot;&gt;Index Operations&lt;/h2&gt;

&lt;p&gt;Inserting into the index uses binary search to find the right position, then shifts entries right:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void index_insert(Table* table, uint32_t hash_key, uint32_t primary_key) {
+  if (!table-&amp;gt;has_index) return;
+  void* page = get_page(table-&amp;gt;pager, table-&amp;gt;index_page_num);
+  pager_mark_dirty(table-&amp;gt;pager, table-&amp;gt;index_page_num);
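+  uint32_t num = *index_num_entries(page);
+  if (num &amp;gt;= INDEX_MAX_ENTRIES) {
+    printf(&quot;Error: index page is full.\n&quot;);
+    exit(EXIT_FAILURE);
+  }
+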
+  uint32_t lo = 0, hi = num;
+  while (lo &amp;lt; hi) {
+    uint32_t mid = (lo + hi) / 2;
+    if (*index_entry_hash(page, mid) &amp;lt; hash_key) {
+      lo = mid + 1;
+    } else {
+      hi = mid;
+    }
+  }
+
+  for (uint32_t i = num; i &amp;gt; lo; i--) {
+    *index_entry_hash(page, i) = *index_entry_hash(page, i - 1);
+    *index_entry_pk(page, i) = *index_entry_pk(page, i - 1);
+  }
+
+  *index_entry_hash(page, lo) = hash_key;
+  *index_entry_pk(page, lo) = primary_key;
+  *index_num_entries(page) = num + 1;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Lookup also uses binary search – O(log n):&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+uint32_t index_find(Table* table, uint32_t hash_key) {
+  if (!table-&amp;gt;has_index) return INVALID_PAGE_NUM;
+  void* page = get_page(table-&amp;gt;pager, table-&amp;gt;index_page_num);
+  uint32_t num = *index_num_entries(page);
+
+  uint32_t lo = 0, hi = num;
+  while (lo &amp;lt; hi) {
+    uint32_t mid = (lo + hi) / 2;
+    if (*index_entry_hash(page, mid) &amp;lt; hash_key) {
+      lo = mid + 1;
+    } else {
+      hi = mid;
+    }
+  }
+
+  if (lo &amp;lt; num &amp;amp;&amp;amp; *index_entry_hash(page, lo) == hash_key) {
+    return *index_entry_pk(page, lo);
+  }
+  return INVALID_PAGE_NUM;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;creating-the-index&quot;&gt;Creating the Index&lt;/h2&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create index on username&lt;/code&gt; command allocates a new page, scans the entire table, and populates the index:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  } else if (strcmp(input_buffer-&amp;gt;buffer, &quot;create index on username&quot;) == 0) {
+    table-&amp;gt;index_page_num = get_unused_page_num(table-&amp;gt;pager);
+    void* index_page = get_page(table-&amp;gt;pager, table-&amp;gt;index_page_num);
+    memset(index_page, 0, PAGE_SIZE);
+    pager_mark_dirty(table-&amp;gt;pager, table-&amp;gt;index_page_num);
+    table-&amp;gt;has_index = true;
+
+    Cursor* cursor = table_start(table);
+    Row row;
+    while (!(cursor-&amp;gt;end_of_table)) {
+      deserialize_row(cursor_value(cursor), &amp;amp;row);
+      index_insert(table, hash_string(row.username), row.id);
+      cursor_advance(cursor);
+    }
+    free(cursor);
+    printf(&quot;Index created on username.\n&quot;);
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;using-the-index&quot;&gt;Using the Index&lt;/h2&gt;

&lt;p&gt;When we see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select where username = bob&lt;/code&gt;, we hash “bob”, look it up in the index, and use the returned primary key to fetch the full row:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+    case WHERE_USERNAME_EQ: {
+      uint32_t hash = hash_string(statement-&amp;gt;where_username);
+      uint32_t pk = index_find(table, hash);
+      if (pk != INVALID_PAGE_NUM) {
+        cursor = table_find(table, pk);
+        void* node = get_page(table-&amp;gt;pager, cursor-&amp;gt;page_num);
+        uint32_t num_cells = *leaf_node_num_cells(node);
+        if (cursor-&amp;gt;cell_num &amp;lt; num_cells) {
+          deserialize_row(cursor_value(cursor), &amp;amp;row);
+          if (strcmp(row.username, statement-&amp;gt;where_username) == 0) {
+            print_row(&amp;amp;row);
+          }
+        }
+        free(cursor);
+      }
+      return EXECUTE_SUCCESS;
+    }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

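&lt;p&gt;Notice the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strcmp&lt;/code&gt; check after the index lookup. Hash collisions are possible – two different usernames could hash to the same value. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strcmp&lt;/code&gt; confirms we actually found the right row. It does not rescue the lookup, though: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index_find&lt;/code&gt; returns only the first entry with a matching hash, so when two usernames collide, one of them cannot be found through the index. A production index would chain colliding entries and check all of them.&lt;/p&gt;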
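&lt;p&gt;Because entries are sorted by hash, colliding entries end up next to each other, so checking all of them can be a short scan. The sketch below shows that idea on a plain in-memory array. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IndexEntry&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DemoRow&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;demo_username&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index_find_exact&lt;/code&gt; are hypothetical stand-ins; in our database the row fetch would go through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_find&lt;/code&gt; instead.&lt;/p&gt;

<![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  uint32_t hash;
  uint32_t pk;
} IndexEntry;

/* Stand-in for "fetch the row by primary key": rows live in a small array. */
typedef struct {
  uint32_t id;
  const char* username;
} DemoRow;

const char* demo_username(const DemoRow* rows, uint32_t num_rows, uint32_t pk) {
  for (uint32_t i = 0; i < num_rows; i++) {
    if (rows[i].id == pk) return rows[i].username;
  }
  return NULL;
}

/* Looks up `username` in a hash-sorted index, checking every entry that
 * shares the hash. Returns the primary key, or UINT32_MAX if absent. */
uint32_t index_find_exact(const IndexEntry* idx, uint32_t num, uint32_t hash_key,
                          const char* username, const DemoRow* rows,
                          uint32_t num_rows) {
  assert(username != NULL);
  /* Binary search for the first entry whose hash is >= hash_key. */
  uint32_t lo = 0, hi = num;
  while (lo < hi) {
    uint32_t mid = lo + (hi - lo) / 2;
    if (idx[mid].hash < hash_key) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  /* Walk the run of equal hashes and verify each candidate row. */
  for (uint32_t i = lo; i < num && idx[i].hash == hash_key; i++) {
    const char* candidate = demo_username(rows, num_rows, idx[i].pk);
    if (candidate != NULL && strcmp(candidate, username) == 0) {
      return idx[i].pk;
    }
  }
  return UINT32_MAX;
}
```
]]>

&lt;p&gt;The cost is one extra row fetch per colliding entry, which stays small as long as the hash function spreads usernames well.&lt;/p&gt;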

&lt;h2 id=&quot;maintaining-the-index&quot;&gt;Maintaining the Index&lt;/h2&gt;

&lt;p&gt;Every insert also inserts into the index:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;   leaf_node_insert(cursor, row_to_insert-&amp;gt;id, row_to_insert);
&lt;span class=&quot;gi&quot;&gt;+  index_insert(table, hash_string(row_to_insert-&amp;gt;username), row_to_insert-&amp;gt;id);
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The same applies for delete. Any write to the primary table must be reflected in all secondary indexes – this is the maintenance cost of indexes. More indexes mean faster reads but slower writes.&lt;/p&gt;

&lt;h2 id=&quot;testing&quot;&gt;Testing&lt;/h2&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;creates an index and looks up by username&apos; do
+    script = [
+      &quot;insert 1 alice alice@example.com&quot;,
+      &quot;insert 2 bob bob@example.com&quot;,
+      &quot;insert 3 charlie charlie@example.com&quot;,
+      &quot;create index on username&quot;,
+      &quot;select where username = bob&quot;,
+      &quot;.exit&quot;,
+    ]
+    result = run_script(script)
+    expect(result).to include(&quot;Index created on username.&quot;)
+    expect(result).to include(&quot;(2, bob, bob@example.com)&quot;)
+    expect(result).not_to include(&quot;(1, alice, alice@example.com)&quot;)
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

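&lt;p&gt;Without the index, finding “bob” requires scanning every row. With the index, it’s two O(log n) lookups: one in the index, one in the primary B-tree. For a million rows, each binary search needs only about 20 comparisons (the base-2 logarithm of 1,000,000 is roughly 20) and a handful of page reads, instead of a read of every leaf page. Our single-page index tops out at 511 entries, though, so reaching that scale would need the B-tree index mentioned earlier.&lt;/p&gt;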

&lt;p&gt;Next time we’ll add transactions so that a sequence of changes can be committed atomically or rolled back entirely.&lt;/p&gt;
</description>
        <pubDate>Sat, 15 Jun 2024 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part20.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part20.html</guid>
      </item>
    
      <item>
        <title>Part 21 - Transactions</title>
        <description>&lt;blockquote&gt;
  &lt;p&gt;“Either all of it happens, or none of it does.” – the essence of atomicity&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Until now, every statement we execute takes effect immediately and permanently. If you insert three rows, they’re all committed right away. If your program crashes halfway through a batch of inserts, you get a partially-updated database. That’s not great.&lt;/p&gt;

&lt;p&gt;Real databases support &lt;strong&gt;transactions&lt;/strong&gt;: a group of operations that either all succeed (commit) or all fail (rollback). This is the “A” in ACID – Atomicity.&lt;/p&gt;

&lt;h2 id=&quot;the-commands&quot;&gt;The Commands&lt;/h2&gt;

&lt;p&gt;We’ll add three new statements:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;begin&lt;/code&gt; – start a transaction&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;commit&lt;/code&gt; – make all changes since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;begin&lt;/code&gt; permanent&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rollback&lt;/code&gt; – undo all changes since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;begin&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; typedef enum {
   STATEMENT_INSERT,
   STATEMENT_SELECT,
&lt;span class=&quot;gd&quot;&gt;-  STATEMENT_DELETE
&lt;/span&gt;&lt;span class=&quot;gi&quot;&gt;+  STATEMENT_DELETE,
+  STATEMENT_BEGIN,
+  STATEMENT_COMMIT,
+  STATEMENT_ROLLBACK
&lt;/span&gt; } StatementType;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;the-undo-log&quot;&gt;The Undo Log&lt;/h2&gt;

&lt;p&gt;Our approach is an in-memory &lt;strong&gt;undo log&lt;/strong&gt; of page snapshots, a simplified cousin of shadow paging and of SQLite’s rollback journal: before modifying a page during a transaction, we save a copy of its original state. On commit, we throw away those copies (the changes are already in memory and will be flushed). On rollback, we restore the copies, effectively rewinding time.&lt;/p&gt;

&lt;p&gt;We add an undo log to the Pager:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; typedef struct {
   int file_descriptor;
   uint32_t file_length;
   uint32_t num_pages;
   void* pages[TABLE_MAX_PAGES];
   bool dirty[TABLE_MAX_PAGES];
   uint32_t access_time[TABLE_MAX_PAGES];
   uint32_t clock;
&lt;span class=&quot;gi&quot;&gt;+  bool in_transaction;
+  #define MAX_UNDO_PAGES 64
+  uint32_t undo_page_nums[MAX_UNDO_PAGES];
+  void* undo_pages[MAX_UNDO_PAGES];
+  uint32_t num_undo_pages;
&lt;/span&gt; } Pager;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The key insight: we hook into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_mark_dirty()&lt;/code&gt;. This function is already called before every page modification. We piggyback on it to save the undo copy:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; void pager_mark_dirty(Pager* pager, uint32_t page_num) {
&lt;span class=&quot;gi&quot;&gt;+  if (pager-&amp;gt;in_transaction) {
+    bool already_saved = false;
+    for (uint32_t i = 0; i &amp;lt; pager-&amp;gt;num_undo_pages; i++) {
+      if (pager-&amp;gt;undo_page_nums[i] == page_num) {
+        already_saved = true;
+        break;
+      }
+    }
+    if (!already_saved) {
+      if (pager-&amp;gt;num_undo_pages &amp;gt;= MAX_UNDO_PAGES) {
+        printf(&quot;Error: transaction modified more than %d pages.\n&quot;, MAX_UNDO_PAGES);
+        exit(EXIT_FAILURE);
+      }
+      void* copy = malloc(PAGE_SIZE);
+      memcpy(copy, pager-&amp;gt;pages[page_num], PAGE_SIZE);
+      uint32_t idx = pager-&amp;gt;num_undo_pages++;
+      pager-&amp;gt;undo_page_nums[idx] = page_num;
+      pager-&amp;gt;undo_pages[idx] = copy;
+    }
+  }
&lt;/span&gt;   pager-&amp;gt;dirty[page_num] = true;
 }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The first time a page is marked dirty during a transaction, we save a snapshot of its current (pre-modification) state. If the same page is modified again, we skip it – we already have the original saved.&lt;/p&gt;

&lt;h2 id=&quot;commit-and-rollback&quot;&gt;Commit and Rollback&lt;/h2&gt;

&lt;p&gt;Commit is easy – just free the undo copies and exit the transaction. The modified pages are already in the buffer pool and will be flushed to disk when the database closes:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+ExecuteResult execute_commit(Statement* statement, Table* table) {
+  Pager* pager = table-&amp;gt;pager;
+  for (uint32_t i = 0; i &amp;lt; pager-&amp;gt;num_undo_pages; i++) {
+    free(pager-&amp;gt;undo_pages[i]);
+  }
+  pager-&amp;gt;num_undo_pages = 0;
+  pager-&amp;gt;in_transaction = false;
+  return EXECUTE_SUCCESS;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Rollback is the reverse – restore each undo copy. We leave the dirty flag set: the page may hold unflushed changes from before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;begin&lt;/code&gt;, and writing the restored contents back is always safe:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+ExecuteResult execute_rollback(Statement* statement, Table* table) {
+  Pager* pager = table-&amp;gt;pager;
+  for (uint32_t i = 0; i &amp;lt; pager-&amp;gt;num_undo_pages; i++) {
+    uint32_t page_num = pager-&amp;gt;undo_page_nums[i];
+    memcpy(pager-&amp;gt;pages[page_num], pager-&amp;gt;undo_pages[i], PAGE_SIZE);
+    /* Keep the page dirty: it may hold unflushed pre-transaction changes. */
+    free(pager-&amp;gt;undo_pages[i]);
+  }
+  pager-&amp;gt;num_undo_pages = 0;
+  pager-&amp;gt;in_transaction = false;
+  return EXECUTE_SUCCESS;
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After rollback, the in-memory pages are exactly as they were before the transaction started. The restored pages stay marked dirty, so changes made before &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;begin&lt;/code&gt; still reach disk when the database closes. Clearing the flag instead would silently drop those earlier changes.&lt;/p&gt;
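&lt;p&gt;The whole mechanism is small enough to try in isolation. The sketch below is a stripped-down model, not our real Pager: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MiniPager&lt;/code&gt; and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mini_*&lt;/code&gt; functions are invented for the illustration, and every page is assumed to be resident in memory. As a refinement, it also records whether each page was already dirty when the transaction first touched it, so rollback can put the flag back to its earlier value.&lt;/p&gt;

<![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 4
#define MAX_UNDO_PAGES 4

/* Hypothetical, stripped-down pager: every page is always in memory. */
typedef struct {
  uint8_t pages[NUM_PAGES][PAGE_SIZE];
  bool dirty[NUM_PAGES];
  bool in_transaction;
  uint32_t undo_page_nums[MAX_UNDO_PAGES];
  bool undo_was_dirty[MAX_UNDO_PAGES];  /* dirty flag before the transaction */
  uint8_t* undo_pages[MAX_UNDO_PAGES];
  uint32_t num_undo_pages;
} MiniPager;

/* Call BEFORE modifying a page. Returns false if no undo copy could be
 * saved, in which case the caller must not modify the page. */
bool mini_mark_dirty(MiniPager* p, uint32_t page_num) {
  assert(page_num < NUM_PAGES);
  if (p->in_transaction) {
    bool saved = false;
    for (uint32_t i = 0; i < p->num_undo_pages; i++) {
      if (p->undo_page_nums[i] == page_num) {
        saved = true;
        break;
      }
    }
    if (!saved) {
      if (p->num_undo_pages >= MAX_UNDO_PAGES) return false;
      uint8_t* copy = malloc(PAGE_SIZE);
      if (copy == NULL) return false;
      memcpy(copy, p->pages[page_num], PAGE_SIZE);
      uint32_t idx = p->num_undo_pages++;
      p->undo_page_nums[idx] = page_num;
      p->undo_was_dirty[idx] = p->dirty[page_num];
      p->undo_pages[idx] = copy;
    }
  }
  p->dirty[page_num] = true;
  return true;
}

void mini_begin(MiniPager* p) { p->in_transaction = true; }

/* Commit: the new page contents stay; only the snapshots are dropped. */
void mini_commit(MiniPager* p) {
  for (uint32_t i = 0; i < p->num_undo_pages; i++) free(p->undo_pages[i]);
  p->num_undo_pages = 0;
  p->in_transaction = false;
}

/* Rollback: restore each snapshot and the dirty flag it started with. */
void mini_rollback(MiniPager* p) {
  for (uint32_t i = 0; i < p->num_undo_pages; i++) {
    uint32_t n = p->undo_page_nums[i];
    memcpy(p->pages[n], p->undo_pages[i], PAGE_SIZE);
    p->dirty[n] = p->undo_was_dirty[i];
    free(p->undo_pages[i]);
  }
  p->num_undo_pages = 0;
  p->in_transaction = false;
}
```
]]>

&lt;p&gt;A page changed before the transaction stays dirty after rollback, while a page touched only inside the transaction goes back to clean.&lt;/p&gt;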

&lt;h2 id=&quot;testing&quot;&gt;Testing&lt;/h2&gt;

&lt;p&gt;Let’s verify that rollback actually undoes changes:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;rolls back a transaction&apos; do
+    script = [
+      &quot;insert 1 user1 person1@example.com&quot;,
+      &quot;begin&quot;,
+      &quot;insert 2 user2 person2@example.com&quot;,
+      &quot;insert 3 user3 person3@example.com&quot;,
+      &quot;rollback&quot;,
+      &quot;select&quot;,
+      &quot;.exit&quot;,
+    ]
+    result = run_script(script)
+    expect(result).to include(&quot;(1, user1, person1@example.com)&quot;)
+    expect(result).not_to include(&quot;(2, user2, person2@example.com)&quot;)
+    expect(result).not_to include(&quot;(3, user3, person3@example.com)&quot;)
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Row 1 was inserted before the transaction, so it survives. Rows 2 and 3 were inserted during the transaction, so rollback erases them.&lt;/p&gt;

&lt;p&gt;And commit makes things permanent:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  it &apos;commits a transaction&apos; do
+    script = [
+      &quot;begin&quot;,
+      &quot;insert 1 user1 person1@example.com&quot;,
+      &quot;insert 2 user2 person2@example.com&quot;,
+      &quot;commit&quot;,
+      &quot;select&quot;,
+      &quot;.exit&quot;,
+    ]
+    result = run_script(script)
+    expect(result).to include(&quot;(1, user1, person1@example.com)&quot;)
+    expect(result).to include(&quot;(2, user2, person2@example.com)&quot;)
+  end
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;Our transaction implementation is minimal but teaches the core concept. A real database adds:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Durability&lt;/strong&gt;: committed transactions survive crashes (we’ll add WAL next)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: concurrent readers don’t see uncommitted changes&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Nested transactions / savepoints&lt;/strong&gt;: rolling back to intermediate points&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Write-ahead logging&lt;/strong&gt;: instead of keeping page snapshots in memory, log changes to disk before applying them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is particularly important. Copying pages works, but it duplicates an entire 4 KB page even if only a few bytes changed, and the copies live only in memory, so they are gone after a crash. A write-ahead log keeps the record of changes on disk, and production implementations can log just the bytes that changed – that’s the idea we’ll start on next.&lt;/p&gt;
</description>
        <pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part21.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part21.html</guid>
      </item>
    
      <item>
        <title>Part 22 - Write-Ahead Logging</title>
        <description>&lt;p&gt;In the last part we added transactions with rollback. But there’s still a durability problem: if the program crashes while writing a page to disk, the database file could be left in a corrupted state – half-written data where a valid page used to be.&lt;/p&gt;

&lt;p&gt;The solution is the &lt;strong&gt;write-ahead log&lt;/strong&gt; (WAL). The rule is simple: before writing a page to the main database file, first write it to a separate log file. If the program crashes mid-write, the log has a complete copy of the page that can be replayed on the next startup.&lt;/p&gt;

&lt;p&gt;This is the “D” in ACID – Durability.&lt;/p&gt;

&lt;h2 id=&quot;the-wal-file&quot;&gt;The WAL File&lt;/h2&gt;

&lt;p&gt;We store the WAL as a separate file alongside the database, named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;dbfile&amp;gt;-wal&lt;/code&gt;. Each record in the WAL is a (page_number, page_data) pair:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;| page_num (4 bytes) | page_data (4096 bytes) | page_num | page_data | ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We add a WAL file descriptor and filename to the Pager:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; typedef struct {
   ...
&lt;span class=&quot;gi&quot;&gt;+  int wal_fd;
+  char wal_filename[256];
&lt;/span&gt; } Pager;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;writing-to-the-wal&quot;&gt;Writing to the WAL&lt;/h2&gt;

&lt;p&gt;Before every write to the main database file, we append the page to the WAL:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void wal_write(Pager* pager, uint32_t page_num) {
+  if (pager-&amp;gt;wal_fd == -1) return;
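+  lseek(pager-&amp;gt;wal_fd, 0, SEEK_END);
+  if (write(pager-&amp;gt;wal_fd, &amp;amp;page_num, sizeof(uint32_t)) == -1 ||
+      write(pager-&amp;gt;wal_fd, pager-&amp;gt;pages[page_num], PAGE_SIZE) == -1) {
+    printf(&quot;Error writing WAL: %d\n&quot;, errno);
+    exit(EXIT_FAILURE);
+  }
+  /* Force the record to disk before the database file is touched */
+  fsync(pager-&amp;gt;wal_fd);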
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And we add a call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_write()&lt;/code&gt; at the beginning of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_flush()&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; void pager_flush(Pager* pager, uint32_t page_num) {
&lt;span class=&quot;gi&quot;&gt;+  wal_write(pager, page_num);
&lt;/span&gt;   ...
   // then write to the database file as before
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now the WAL contains a complete copy of every page before it hits the database file. If the database file gets corrupted, the WAL has what we need.&lt;/p&gt;

&lt;h2 id=&quot;crash-recovery&quot;&gt;Crash Recovery&lt;/h2&gt;

&lt;p&gt;On startup, we check if the WAL file has any records. If it does, we replay them – writing each page from the WAL into the correct location in the database file:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void wal_replay(Pager* pager) {
+  if (pager-&amp;gt;wal_fd == -1) return;
+  off_t wal_size = lseek(pager-&amp;gt;wal_fd, 0, SEEK_END);
+  if (wal_size &amp;lt;= 0) return;
+
+  printf(&quot;Replaying WAL (%d records)...\n&quot;,
+         (int)(wal_size / (sizeof(uint32_t) + PAGE_SIZE)));
+  lseek(pager-&amp;gt;wal_fd, 0, SEEK_SET);
+
+  while (1) {
+    uint32_t page_num;
+    ssize_t n = read(pager-&amp;gt;wal_fd, &amp;amp;page_num, sizeof(uint32_t));
+    if (n &amp;lt; (ssize_t)sizeof(uint32_t)) break;
+    void* page_data = malloc(PAGE_SIZE);
+    n = read(pager-&amp;gt;wal_fd, page_data, PAGE_SIZE);
+    if (n &amp;lt; (ssize_t)PAGE_SIZE) {
+      free(page_data);
+      break;
+    }
+    lseek(pager-&amp;gt;file_descriptor, page_num * PAGE_SIZE, SEEK_SET);
+    write(pager-&amp;gt;file_descriptor, page_data, PAGE_SIZE);
+    free(page_data);
+  }
+
+  /* Make the replayed pages durable, then clear the WAL */
+  fsync(pager-&amp;gt;file_descriptor);
+  close(pager-&amp;gt;wal_fd);
+  pager-&amp;gt;wal_fd =
+      open(pager-&amp;gt;wal_filename, O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If the WAL replay finds a truncated record (from a crash during WAL write), it stops. The partially-written WAL record is discarded, but all complete records before it are applied. This guarantees that any page write that completed in the WAL will be recovered.&lt;/p&gt;
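&lt;p&gt;The record format and the stop-at-truncation rule can be tried on plain file descriptors, without the Pager. The sketch below assumes a POSIX system; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_append&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_apply&lt;/code&gt; are illustrative names, not functions from our database. Unlike the code above, it checks every return value and calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fsync&lt;/code&gt; so that the log really is on disk before anything depends on it.&lt;/p&gt;

<![CDATA[
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Appends one (page_num, page_data) record and forces it to disk.
 * Returns 0 on success, -1 on any error. */
int wal_append(int wal_fd, uint32_t page_num, const void* page_data) {
  assert(page_data != NULL);
  if (lseek(wal_fd, 0, SEEK_END) == (off_t)-1) return -1;
  if (write(wal_fd, &page_num, sizeof(page_num)) != (ssize_t)sizeof(page_num)) {
    return -1;
  }
  if (write(wal_fd, page_data, PAGE_SIZE) != (ssize_t)PAGE_SIZE) return -1;
  return fsync(wal_fd);
}

/* Copies every complete record from the log into the database file and
 * stops at the first truncated record. Returns the number of records
 * applied, or -1 on error. */
int wal_apply(int wal_fd, int db_fd) {
  uint8_t* buf = malloc(PAGE_SIZE);
  if (buf == NULL) return -1;
  if (lseek(wal_fd, 0, SEEK_SET) == (off_t)-1) {
    free(buf);
    return -1;
  }
  int applied = 0;
  for (;;) {
    uint32_t page_num;
    /* A short read of either part means end of log or a torn record. */
    if (read(wal_fd, &page_num, sizeof(page_num)) != (ssize_t)sizeof(page_num)) break;
    if (read(wal_fd, buf, PAGE_SIZE) != (ssize_t)PAGE_SIZE) break;
    off_t offset = (off_t)page_num * PAGE_SIZE;
    if (pwrite(db_fd, buf, PAGE_SIZE, offset) != (ssize_t)PAGE_SIZE) {
      applied = -1;
      break;
    }
    applied++;
  }
  free(buf);
  /* The database file must be durable before the caller clears the log. */
  if (applied >= 0 && fsync(db_fd) != 0) applied = -1;
  return applied;
}
```
]]>

&lt;p&gt;Replaying the same log twice is harmless, because each record simply overwrites a page with the same contents. That property is what makes it safe to crash in the middle of recovery and start over.&lt;/p&gt;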

&lt;h2 id=&quot;checkpointing&quot;&gt;Checkpointing&lt;/h2&gt;

&lt;p&gt;When the database is closed cleanly, we’ve already flushed all dirty pages. The WAL is no longer needed, so we clear it:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+void wal_checkpoint(Pager* pager) {
+  if (pager-&amp;gt;wal_fd == -1) return;
+  fsync(pager-&amp;gt;file_descriptor);  /* data must be on disk before the log goes */
+  close(pager-&amp;gt;wal_fd);
+  pager-&amp;gt;wal_fd =
+      open(pager-&amp;gt;wal_filename, O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is called in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db_close()&lt;/code&gt; after all dirty pages are flushed:&lt;/p&gt;

&lt;div class=&quot;language-diff highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;gi&quot;&gt;+  wal_checkpoint(pager);
+  if (pager-&amp;gt;wal_fd != -1) {
+    close(pager-&amp;gt;wal_fd);
+  }
&lt;/span&gt;   int result = close(pager-&amp;gt;file_descriptor);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;the-write-path&quot;&gt;The Write Path&lt;/h2&gt;

&lt;p&gt;Let’s trace what happens on an insert now:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_mark_dirty()&lt;/code&gt; marks the page for write-back (and saves an undo copy if in a transaction)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;leaf_node_insert()&lt;/code&gt; modifies the page in memory&lt;/li&gt;
  &lt;li&gt;On &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db_close()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_flush()&lt;/code&gt; is called for each dirty page&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_flush()&lt;/code&gt; first calls &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_write()&lt;/code&gt; – the page goes to the WAL file&lt;/li&gt;
  &lt;li&gt;Then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_flush()&lt;/code&gt; writes the page to the database file&lt;/li&gt;
  &lt;li&gt;After all pages are flushed, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_checkpoint()&lt;/code&gt; clears the WAL&lt;/li&gt;
&lt;/ol&gt;

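&lt;p&gt;If the program crashes between steps 4 and 5, or partway through step 5, the WAL has a complete copy of the page. On the next startup, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_replay()&lt;/code&gt; writes it to the database file, and the data is not lost. Keep in mind what this does not cover: changes that have not yet reached &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_flush()&lt;/code&gt; exist only in memory, so a crash before that point still loses them.&lt;/p&gt;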

&lt;h2 id=&quot;how-sqlite-does-it&quot;&gt;How SQLite Does It&lt;/h2&gt;

&lt;p&gt;SQLite’s WAL mode is more sophisticated. Instead of writing to the WAL before the database file, it writes &lt;em&gt;only&lt;/em&gt; to the WAL during normal operation. Readers check both the WAL and the database file. Periodically, a checkpoint operation transfers WAL pages to the database file. This allows concurrent readers and writers, which our simple implementation doesn’t support.&lt;/p&gt;
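&lt;p&gt;The read path described here can be modelled in a few lines. This is only an in-memory illustration of the idea, not how SQLite implements it (SQLite uses a separate index structure to avoid scanning the log); &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WalFrame&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_read_page&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wal_checkpoint_frames&lt;/code&gt; are hypothetical names.&lt;/p&gt;

<![CDATA[
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* One logged version of a page. */
typedef struct {
  uint32_t page_num;
  uint8_t data[PAGE_SIZE];
} WalFrame;

/* Returns the newest version of a page: the latest matching frame in the
 * log if there is one, otherwise the copy in the database image `db`. */
const uint8_t* wal_read_page(const WalFrame* wal, uint32_t num_frames,
                             const uint8_t* db, uint32_t page_num) {
  assert(db != NULL);
  for (uint32_t i = num_frames; i > 0; i--) {
    if (wal[i - 1].page_num == page_num) return wal[i - 1].data;
  }
  return db + (size_t)page_num * PAGE_SIZE;
}

/* Checkpoint: copy frames into the database image, oldest first, so the
 * newest version of each page wins. Returns the new frame count (0). */
uint32_t wal_checkpoint_frames(const WalFrame* wal, uint32_t num_frames,
                               uint8_t* db) {
  for (uint32_t i = 0; i < num_frames; i++) {
    memcpy(db + (size_t)wal[i].page_num * PAGE_SIZE, wal[i].data, PAGE_SIZE);
  }
  return 0;
}
```
]]>

&lt;p&gt;Until a checkpoint runs, the database image is never modified, which is why readers of the old image and a writer appending to the log can coexist.&lt;/p&gt;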

&lt;p&gt;But the core principle is the same: write changes to a log first, ensure the log is durable, then apply the changes. If anything goes wrong, the log tells you how to recover.&lt;/p&gt;

&lt;p&gt;That wraps up our work on ACID. Of the four properties, we have implemented Atomicity (transactions) and now Durability (WAL). We’ll talk about what we’ve built and where to go from here in the next and final part.&lt;/p&gt;
</description>
        <pubDate>Mon, 15 Jul 2024 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part22.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part22.html</guid>
      </item>
    
      <item>
        <title>Part 23 - Wrapping Up</title>
        <description>&lt;blockquote&gt;
  &lt;p&gt;“What I cannot create, I do not understand.” – &lt;a href=&quot;https://en.m.wikiquote.org/wiki/Richard_Feynman&quot;&gt;Richard Feynman&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We started with a question: &lt;em&gt;how does a database work?&lt;/em&gt; And to answer it, we built one. Let’s take a step back and look at what we’ve created.&lt;/p&gt;

&lt;h2 id=&quot;what-we-built&quot;&gt;What We Built&lt;/h2&gt;

&lt;p&gt;Our database – all of it in a single C file – implements:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage engine:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;A B+ tree with leaf and internal nodes, supporting insert, delete, and search&lt;/li&gt;
  &lt;li&gt;Leaf node splitting and internal node splitting to grow the tree&lt;/li&gt;
  &lt;li&gt;Rebalancing via borrowing and merging to shrink the tree&lt;/li&gt;
  &lt;li&gt;Tree height reduction when the root becomes unnecessary&lt;/li&gt;
  &lt;li&gt;Sibling pointers for efficient sequential scans across leaves&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Persistence:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;File-backed page storage with a buffer pool&lt;/li&gt;
  &lt;li&gt;Dirty page tracking so we only write back what changed&lt;/li&gt;
  &lt;li&gt;LRU eviction to bound memory usage&lt;/li&gt;
  &lt;li&gt;Write-ahead logging for crash recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Query processing:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;A REPL that parses SQL-like commands: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;insert&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;delete&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WHERE&lt;/code&gt; clause with equality, greater-than, and less-than predicates on the primary key&lt;/li&gt;
  &lt;li&gt;Point queries that use O(log n) B-tree search instead of full table scans&lt;/li&gt;
  &lt;li&gt;Range scans that exploit the sorted key order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Indexing:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;A primary B-tree index (the table itself)&lt;/li&gt;
  &lt;li&gt;A secondary index on username using a hash-based lookup&lt;/li&gt;
  &lt;li&gt;Index maintenance on insert and delete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Transactions:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BEGIN&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COMMIT&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROLLBACK&lt;/code&gt; statements&lt;/li&gt;
  &lt;li&gt;Shadow paging for undo: page copies saved before modification&lt;/li&gt;
  &lt;li&gt;Atomic rollback by restoring saved page copies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data format:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Length-prefixed variable-length string serialization&lt;/li&gt;
  &lt;li&gt;Fixed-size cell slots in leaf nodes with zero-padded strings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s a lot of database. Not a toy, either – the fundamentals are the same ones used by SQLite, PostgreSQL, and MySQL. B-trees, page caches, WAL, secondary indexes – these aren’t academic curiosities. They’re what makes your favorite database tick.&lt;/p&gt;

&lt;h2 id=&quot;the-architecture&quot;&gt;The Architecture&lt;/h2&gt;

&lt;p&gt;Here’s how our components map to the &lt;a href=&quot;https://www.sqlite.org/arch.html&quot;&gt;SQLite architecture&lt;/a&gt; we looked at in Part 1:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;SQLite Component&lt;/th&gt;
      &lt;th&gt;Our Implementation&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Tokenizer / Parser&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prepare_statement()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prepare_insert()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prepare_delete()&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Code Generator&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_statement()&lt;/code&gt; switch&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Virtual Machine&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_insert()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_select()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;execute_delete()&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;B-Tree&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;leaf_node_*&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;internal_node_*&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;table_find()&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Pager&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get_page()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pager_flush()&lt;/code&gt;, LRU eviction&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;OS Interface&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;open()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;write()&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lseek()&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;We skipped the bytecode layer (our “VM” calls functions directly), but the layering is the same.&lt;/p&gt;

&lt;h2 id=&quot;what-a-real-database-adds&quot;&gt;What a Real Database Adds&lt;/h2&gt;

&lt;p&gt;There’s always more to build. Here are the biggest things a production database has that we don’t:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple tables and joins.&lt;/strong&gt; We have one hardcoded table. A real database has a schema catalog, multiple B-trees (one per table), and join algorithms (nested loop, hash join, sort-merge) for combining data across tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A query planner.&lt;/strong&gt; We always use the B-tree index for primary key lookups and do a full scan otherwise. A real database estimates the cost of different access paths and picks the cheapest one. Sometimes a full scan beats an index scan (e.g., when selecting most of the table).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrency control.&lt;/strong&gt; We support one connection at a time. Real databases handle many concurrent readers and writers using locks, multiversion concurrency control (MVCC), or both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A proper SQL parser.&lt;/strong&gt; Our parser uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strcmp&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strtok&lt;/code&gt;. A real parser uses a grammar (often generated by tools like Lemon or Bison) to handle the full SQL syntax.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Page compaction and free space management.&lt;/strong&gt; When we delete rows, the space isn’t reclaimed for reuse. A real database maintains a free page list and compacts pages to avoid fragmentation.&lt;/p&gt;

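&lt;p&gt;&lt;strong&gt;Recovery beyond WAL.&lt;/strong&gt; Our WAL is simple redo logging of whole pages, and we only checkpoint after every dirty page has been flushed. Real databases often combine redo and undo logging in one log (as in the ARIES algorithm), take checkpoints during normal operation to bound recovery time, and handle partial page writes.&lt;/p&gt;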

&lt;h2 id=&quot;what-we-learned&quot;&gt;What We Learned&lt;/h2&gt;

&lt;p&gt;Building a database from scratch taught us:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Why B-trees?&lt;/strong&gt; Because disk I/O is expensive, and B-trees minimize it. A tree with a branching factor of 500 can index 500&lt;sup&gt;3&lt;/sup&gt; = 125 million rows in 3 levels, and a billion rows fit comfortably in 4 – at most 4 page reads to find any row.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Why pages?&lt;/strong&gt; Because disks read in fixed-size blocks. By aligning our data structures to page boundaries, we make every I/O operation useful.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Why write-ahead logging?&lt;/strong&gt; Because writes can fail. By logging before applying, we ensure that committed data survives crashes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Why indexes?&lt;/strong&gt; Because scanning every row is O(n). An index makes point queries O(log n) – the difference between milliseconds and minutes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Why transactions?&lt;/strong&gt; Because partial updates are worse than no update. Atomicity ensures all-or-nothing semantics.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These aren’t just database concepts. They’re fundamental computer science – the trade-offs between memory and disk, consistency and performance, simplicity and scalability.&lt;/p&gt;

&lt;h2 id=&quot;thank-you&quot;&gt;Thank You&lt;/h2&gt;

&lt;p&gt;If you’ve followed along this far, you’ve done something remarkable. You’ve read thousands of lines of C, understood B-tree splitting and merging, implemented crash recovery, and built something that actually stores and retrieves data reliably. That’s not trivial.&lt;/p&gt;

&lt;p&gt;The source code is yours to explore, extend, and break. Add multiple tables. Implement joins. Build a proper parser. Or just read through the code and see how it all fits together. The best way to learn is to build, and now you have a foundation to build on.&lt;/p&gt;

&lt;p&gt;Happy building!&lt;/p&gt;
</description>
        <pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate>
        <link>https://ibra.github.io/db_tutorial/parts/part23.html</link>
        <guid isPermaLink="true">https://ibra.github.io/db_tutorial/parts/part23.html</guid>
      </item>
    
  </channel>
</rss>
