Writing a sqlite clone from scratch in C
Part 18 - A Page Cache and Buffer Pool
Up until now, every row in our database takes the same amount of space on disk: 293 bytes. A username of “a” takes 33 bytes. A username of “abcdefghijklmnopqrstuvwxyz012345” also takes 33 bytes. That’s a lot of wasted space for short strings.
Real databases store strings using a variable-length format. Instead of allocating the maximum possible size for every string, they store the actual length followed by only the bytes that are used. Let’s implement that.
The standard approach for serializing variable-length data is length-prefixing: write the number of bytes first, then the actual data. Our new serialized row format looks like this:
| field | size |
|---|---|
| id | 4 bytes |
| username_len | 4 bytes |
| username | 32 bytes (max) |
| email_len | 4 bytes |
| 255 bytes (max) |
We’re adding 8 bytes of overhead for the two length fields. That changes our constants:
const uint32_t ID_SIZE = size_of_attribute(Row, id);
-const uint32_t USERNAME_SIZE = size_of_attribute(Row, username);
-const uint32_t EMAIL_SIZE = size_of_attribute(Row, email);
+const uint32_t VARCHAR_LEN_SIZE = sizeof(uint32_t);
+
+/*
+ * Serialized Row Layout (length-prefixed strings)
+ *
+ * | id (4) | username_len (4) | username (32) | email_len (4) | email (255) |
+ */
const uint32_t ID_OFFSET = 0;
-const uint32_t USERNAME_OFFSET = ID_OFFSET + ID_SIZE;
-const uint32_t EMAIL_OFFSET = USERNAME_OFFSET + USERNAME_SIZE;
-const uint32_t ROW_SIZE = ID_SIZE + USERNAME_SIZE + EMAIL_SIZE;
+const uint32_t USERNAME_LEN_OFFSET = ID_OFFSET + ID_SIZE;
+const uint32_t USERNAME_OFFSET = USERNAME_LEN_OFFSET + VARCHAR_LEN_SIZE;
+const uint32_t EMAIL_LEN_OFFSET = USERNAME_OFFSET + COLUMN_USERNAME_SIZE;
+const uint32_t EMAIL_OFFSET = EMAIL_LEN_OFFSET + VARCHAR_LEN_SIZE;
+const uint32_t ROW_SIZE =
+ ID_SIZE + VARCHAR_LEN_SIZE + COLUMN_USERNAME_SIZE + VARCHAR_LEN_SIZE +
+ COLUMN_EMAIL_SIZE;
ROW_SIZE goes from 293 to 299 bytes. LEAF_NODE_CELL_SIZE goes from 297 to 303. But LEAF_NODE_MAX_CELLS stays at 13 because 4082 / 303 = 13.
Here’s the new serialize_row(). We memset the entire destination to zero first – this ensures the unused bytes after each string are clean:
void serialize_row(Row* source, void* destination) {
- memcpy(destination + ID_OFFSET, &(source->id), ID_SIZE);
- memcpy(destination + USERNAME_OFFSET, &(source->username), USERNAME_SIZE);
- memcpy(destination + EMAIL_OFFSET, &(source->email), EMAIL_SIZE);
+ memset(destination, 0, ROW_SIZE);
+ memcpy(destination + ID_OFFSET, &(source->id), ID_SIZE);
+ uint32_t username_len = strlen(source->username);
+ memcpy(destination + USERNAME_LEN_OFFSET, &username_len, VARCHAR_LEN_SIZE);
+ memcpy(destination + USERNAME_OFFSET, source->username, username_len);
+ uint32_t email_len = strlen(source->email);
+ memcpy(destination + EMAIL_LEN_OFFSET, &email_len, VARCHAR_LEN_SIZE);
+ memcpy(destination + EMAIL_OFFSET, source->email, email_len);
}
And deserialize_row() reads the length first, then copies only that many bytes:
void deserialize_row(void* source, Row* destination) {
- memcpy(&(destination->id), source + ID_OFFSET, ID_SIZE);
- memcpy(&(destination->username), source + USERNAME_OFFSET, USERNAME_SIZE);
- memcpy(&(destination->email), source + EMAIL_OFFSET, EMAIL_SIZE);
+ memcpy(&(destination->id), source + ID_OFFSET, ID_SIZE);
+ uint32_t username_len;
+ memcpy(&username_len, source + USERNAME_LEN_OFFSET, VARCHAR_LEN_SIZE);
+ memset(destination->username, 0, COLUMN_USERNAME_SIZE + 1);
+ memcpy(destination->username, source + USERNAME_OFFSET, username_len);
+ uint32_t email_len;
+ memcpy(&email_len, source + EMAIL_LEN_OFFSET, VARCHAR_LEN_SIZE);
+ memset(destination->email, 0, COLUMN_EMAIL_SIZE + 1);
+ memcpy(destination->email, source + EMAIL_OFFSET, email_len);
}
The memset to zero before memcpy ensures the destination string is properly null-terminated. This is important because strlen() relies on that null byte.
We can compute how much space a row actually uses versus how much it’s allocated:
+uint32_t row_data_size(Row* row) {
+ return ID_SIZE + VARCHAR_LEN_SIZE + strlen(row->username) + VARCHAR_LEN_SIZE +
+ strlen(row->email);
+}
For a row like (1, a, b@c.com), the actual data is only 4 + 4 + 1 + 4 + 6 = 19 bytes. But we allocate 299 bytes for it. That’s 280 bytes of wasted space!
Our cells still occupy fixed-size slots in the leaf node. Even though we serialize the data more efficiently, we pad the slot to ROW_SIZE. A real database solves this with a slotted page format:
+-------------------------------------------+
| Header | Ptr1 | Ptr2 | Ptr3 | ... |
+-------------------------------------------+
| Free Space |
+-------------------------------------------+
| ... | Cell3 data | Cell2 data | Cell1 data |
+-------------------------------------------+
The cell pointer directory grows downward from the header. Actual cell data is packed upward from the bottom of the page. Each cell takes only as much space as it needs. The free space in the middle shrinks as you add cells.
This is how SQLite, PostgreSQL, and most real databases lay out their pages. We could implement this, but it would require rewriting every function that accesses leaf node cells. For now, our length-prefixed format gives us the serialization story without that complexity.
What about a TEXT column that holds a 10 KB blog post? It doesn’t fit in a single 4 KB page. Real databases use overflow pages (sometimes called TOAST in PostgreSQL). The cell stores a pointer to a separate chain of pages that hold the large value. SQLite uses overflow pages when a record exceeds about 25% of the page size.
We don’t need overflow pages because our column sizes are bounded (32 and 255 bytes), but it’s worth knowing the pattern.
Short strings, long strings – they all work:
+ it 'handles variable-length strings correctly' do
+ script = [
+ "insert 1 a b@c.com",
+ "insert 2 longername longemail@example.com",
+ "select",
+ ".exit",
+ ]
+ result = run_script(script)
+ expect(result).to include("(1, a, b@c.com)")
+ expect(result).to include("(2, longername, longemail@example.com)")
+ end
And our updated constants:
Constants:
ROW_SIZE: 299
COMMON_NODE_HEADER_SIZE: 6
LEAF_NODE_HEADER_SIZE: 14
LEAF_NODE_CELL_SIZE: 303
LEAF_NODE_SPACE_FOR_CELLS: 4082
LEAF_NODE_MAX_CELLS: 13
VARCHAR_LEN_SIZE: 4
Next time we’ll add secondary indexes – separate B-trees that let you look up rows by columns other than the primary key.