Let's Build a Simple Database

Writing a sqlite clone from scratch in C

Overview

View on GitHub (pull requests welcome)

Part 16 - Rebalancing the B-Tree After Deletion



Last time we added a simple delete command. It works, but it leaves the tree in a potentially invalid state. In a B+ tree, every non-root node must maintain a minimum number of keys. When deletion causes a node to drop below that minimum, the node “underflows” and the tree needs to be rebalanced.

This is the deletion counterpart to the splitting we implemented for insertion. Splitting handles overflow; rebalancing handles underflow.

Minimum Occupancy

First, let’s define how few cells a node is allowed to have. The standard rule is half the maximum:

+/*
+ * Minimum occupancy for non-root nodes
+ */
+const uint32_t LEAF_NODE_MIN_CELLS = LEAF_NODE_MAX_CELLS / 2;
+const uint32_t INTERNAL_NODE_MIN_KEYS = INTERNAL_NODE_MAX_KEYS / 2;

With LEAF_NODE_MAX_CELLS at 13, LEAF_NODE_MIN_CELLS is 6. With INTERNAL_NODE_MAX_KEYS at 3, INTERNAL_NODE_MIN_KEYS is 1. The root is exempt from this rule – it can have as few as zero cells.

The Strategy

When a leaf underflows, we have two strategies:

  1. Borrow from a sibling that has more than the minimum. Shift one cell over.
  2. Merge with a sibling if neither has cells to spare. Combine both nodes into one and remove the separator key from the parent.

If merging causes the parent to underflow, the same logic applies recursively up the tree. If the root ends up with zero keys, we promote its only child to be the new root, reducing the tree’s height.

Finding a Child’s Position in its Parent

To rebalance, we need to know which position a node occupies among its parent’s children. This helper scans the parent to find it:

+uint32_t find_child_index(void* parent, uint32_t child_page_num) {
+  uint32_t num_keys = *internal_node_num_keys(parent);
+  for (uint32_t i = 0; i <= num_keys; i++) {
+    if (*internal_node_child(parent, i) == child_page_num) {
+      return i;
+    }
+  }
+  printf("Could not find child in parent node.\n");
+  exit(EXIT_FAILURE);
+}

This iterates through all children (including the right child at index num_keys) until it finds a match.

Leaf Node Rebalancing

Here’s the main rebalancing function for leaf nodes. It checks for underflow, then tries borrowing before falling back to merging:

+void leaf_node_rebalance(Table* table, uint32_t page_num) {
+  void* node = get_page(table->pager, page_num);
+
+  if (is_node_root(node)) {
+    return;
+  }
+
+  uint32_t num_cells = *leaf_node_num_cells(node);
+  if (num_cells >= LEAF_NODE_MIN_CELLS) {
+    return;
+  }
+
+  uint32_t parent_page_num = *node_parent(node);
+  void* parent = get_page(table->pager, parent_page_num);
+  uint32_t child_index = find_child_index(parent, page_num);
+  uint32_t parent_num_keys = *internal_node_num_keys(parent);

The root can have any number of cells, and if the leaf has at least the minimum, there’s nothing to do.

Borrowing from the Right Sibling

If the right sibling has more than the minimum, we take its first cell:

+  if (child_index < parent_num_keys) {
+    uint32_t right_page = *internal_node_child(parent, child_index + 1);
+    void* right_sibling = get_page(table->pager, right_page);
+
+    if (*leaf_node_num_cells(right_sibling) > LEAF_NODE_MIN_CELLS) {
+      memcpy(leaf_node_cell(node, num_cells),
+             leaf_node_cell(right_sibling, 0), LEAF_NODE_CELL_SIZE);
+      *(leaf_node_num_cells(node)) += 1;
+
+      uint32_t right_cells = *leaf_node_num_cells(right_sibling);
+      for (uint32_t i = 0; i < right_cells - 1; i++) {
+        memcpy(leaf_node_cell(right_sibling, i),
+               leaf_node_cell(right_sibling, i + 1), LEAF_NODE_CELL_SIZE);
+      }
+      *(leaf_node_num_cells(right_sibling)) -= 1;
+
+      *internal_node_key(parent, child_index) =
+          get_node_max_key(table->pager, node);
+      return;
+    }
+  }

The borrowed cell goes at the end of the current node (it has a higher key). Then we shift the right sibling’s remaining cells left and update the parent’s key for this node.

Borrowing from the Left Sibling

If there’s no right sibling to borrow from, try the left:

+  if (child_index > 0) {
+    uint32_t left_page = *internal_node_child(parent, child_index - 1);
+    void* left_sibling = get_page(table->pager, left_page);
+
+    if (*leaf_node_num_cells(left_sibling) > LEAF_NODE_MIN_CELLS) {
+      for (uint32_t i = num_cells; i > 0; i--) {
+        memcpy(leaf_node_cell(node, i), leaf_node_cell(node, i - 1),
+               LEAF_NODE_CELL_SIZE);
+      }
+
+      uint32_t left_cells = *leaf_node_num_cells(left_sibling);
+      memcpy(leaf_node_cell(node, 0),
+             leaf_node_cell(left_sibling, left_cells - 1),
+             LEAF_NODE_CELL_SIZE);
+      *(leaf_node_num_cells(node)) += 1;
+      *(leaf_node_num_cells(left_sibling)) -= 1;
+
+      *internal_node_key(parent, child_index - 1) =
+          get_node_max_key(table->pager, left_sibling);
+      return;
+    }
+  }

This time the borrowed cell goes at the beginning of the current node (it has a lower key), so we have to shift our existing cells right first. Then update the parent’s key for the left sibling, since its max key has changed.

Merging

If neither sibling can lend a cell, we merge. If we can merge with the right sibling, we absorb its cells into the current node:

+  if (child_index < parent_num_keys) {
+    uint32_t right_page = *internal_node_child(parent, child_index + 1);
+    void* right_sibling = get_page(table->pager, right_page);
+    uint32_t right_cells = *leaf_node_num_cells(right_sibling);
+
+    for (uint32_t i = 0; i < right_cells; i++) {
+      memcpy(leaf_node_cell(node, num_cells + i),
+             leaf_node_cell(right_sibling, i), LEAF_NODE_CELL_SIZE);
+    }
+    *(leaf_node_num_cells(node)) = num_cells + right_cells;
+    *leaf_node_next_leaf(node) = *leaf_node_next_leaf(right_sibling);
+
+    *internal_node_key(parent, child_index) =
+        get_node_max_key(table->pager, node);
+    internal_node_remove_child(table, parent_page_num, child_index + 1);

We copy all cells from the right sibling, fix the next_leaf pointer chain, update the parent key, and remove the right sibling’s entry from the parent. If the current node is the rightmost child, we merge into the left sibling instead, using the same logic in reverse.

Removing a Child from an Internal Node

When two leaves merge, one of them disappears and we need to remove its entry from the parent. This function handles removal of a child at a given index:

+void internal_node_remove_child(Table* table, uint32_t page_num,
+                                uint32_t child_index) {
+  void* node = get_page(table->pager, page_num);
+  uint32_t num_keys = *internal_node_num_keys(node);
+
+  if (child_index == num_keys) {
+    if (num_keys > 0) {
+      *internal_node_right_child(node) =
+          *internal_node_child(node, num_keys - 1);
+      *(internal_node_num_keys(node)) = num_keys - 1;
+    } else {
+      *internal_node_right_child(node) = INVALID_PAGE_NUM;
+    }
+  } else {
+    for (uint32_t i = child_index; i < num_keys - 1; i++) {
+      memcpy(internal_node_cell(node, i), internal_node_cell(node, i + 1),
+             INTERNAL_NODE_CELL_SIZE);
+    }
+    *(internal_node_num_keys(node)) = num_keys - 1;
+  }

If we’re removing the rightmost child, the last regular cell’s child gets promoted to right child. Otherwise, we shift cells left to fill the gap.

After removing, this function also checks whether the internal node underflows, and if so, kicks off internal_node_rebalance().

Tree Height Reduction

The most satisfying part: when merges cascade up to the root and the root has zero keys left, its only child is promoted to be the new root. The tree gets shorter:

+  if (is_node_root(node) && *internal_node_num_keys(node) == 0) {
+    uint32_t child_page = *internal_node_right_child(node);
+    if (child_page == INVALID_PAGE_NUM) {
+      return;
+    }
+    void* child = get_page(table->pager, child_page);
+    memcpy(node, child, PAGE_SIZE);
+    set_node_root(node, true);
+
+    if (get_node_type(node) == NODE_INTERNAL) {
+      uint32_t promoted_keys = *internal_node_num_keys(node);
+      for (uint32_t i = 0; i < promoted_keys; i++) {
+        void* c = get_page(table->pager, *internal_node_child(node, i));
+        *node_parent(c) = table->root_page_num;
+      }
+      void* rc = get_page(table->pager, *internal_node_right_child(node));
+      *node_parent(rc) = table->root_page_num;
+    }
+  }

We copy the child’s contents into the root page (keeping page 0 as the root), mark it as the root, and update the parent pointers of the promoted node’s children. This is the mirror image of create_new_root() – one grows the tree, the other shrinks it.

Triggering Rebalancing

Finally, we update execute_delete() to update the parent key when the max key changes and to call leaf_node_rebalance() after every deletion:

+  uint32_t leaf_page_num = cursor->page_num;
+  uint32_t old_max = get_node_max_key(table->pager, node);
+
   leaf_node_delete(cursor);
-
-  free(cursor);
+  free(cursor);
+
+  node = get_page(table->pager, leaf_page_num);
+  if (!is_node_root(node) && *leaf_node_num_cells(node) > 0) {
+    uint32_t new_max = get_node_max_key(table->pager, node);
+    if (new_max != old_max) {
+      uint32_t parent_page = *node_parent(node);
+      void* parent = get_page(table->pager, parent_page);
+      update_internal_node_key(parent, old_max, new_max);
+    }
+  }
+
+  leaf_node_rebalance(table, leaf_page_num);

   return EXECUTE_SUCCESS;

Testing

Let’s make sure we can delete from multi-level trees:

+  it 'deletes rows from a multi-level tree' do
+    script = (1..15).map do |i|
+      "insert #{i} user#{i} person#{i}@example.com"
+    end
+    script << "delete 7"
+    script << ".btree"
+    script << "select"
+    script << ".exit"
+    result = run_script(script)
+
+    expect(result).to include("Executed.")
+    expect(result).not_to include("(7, user7, person7@example.com)")
+    expect(result).to include("(1, user1, person1@example.com)")
+    expect(result).to include("(15, user15, person15@example.com)")
+  end

And verify that we can delete every row without crashing:

+  it 'handles deleting all rows' do
+    script = [
+      "insert 1 user1 person1@example.com",
+      "insert 2 user2 person2@example.com",
+      "insert 3 user3 person3@example.com",
+      "delete 1",
+      "delete 2",
+      "delete 3",
+      "select",
+      ".exit",
+    ]
+    result = run_script(script)
+    expect(result).to match_array([
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > Executed.",
+      "db > ",
+    ])
+  end

The tree grows when we insert enough rows, and now it can shrink back down when we delete them. That’s the full B-tree lifecycle.

Next time we’ll add the ability to search for specific rows with a WHERE clause, putting our B-tree index to real use.