Let's Build a Simple Database

Writing a sqlite clone from scratch in C

Overview

View on GitHub (pull requests welcome)

Part 22 - Write-Ahead Logging



In the last part we added transactions with rollback. But there’s still a durability problem: if the program crashes while writing a page to disk, the database file could be left in a corrupted state – half-written data where a valid page used to be.

The solution is the write-ahead log (WAL). The rule is simple: before writing a page to the main database file, first write it to a separate log file. If the program crashes mid-write, the log has a complete copy of the page that can be replayed on the next startup.

This is the “D” in ACID – Durability.

The WAL File

We store the WAL as a separate file alongside the database, named <dbfile>-wal. Each record in the WAL is a (page_number, page_data) pair:

| page_num (4 bytes) | page_data (4096 bytes) | page_num | page_data | ...

We add a WAL file descriptor and filename to the Pager:

 typedef struct {
   ...
+  int wal_fd;
+  char wal_filename[256];
 } Pager;

Writing to the WAL

Before every write to the main database file, we append the page to the WAL:

+void wal_write(Pager* pager, uint32_t page_num) {
+  if (pager->wal_fd == -1) return;
+  lseek(pager->wal_fd, 0, SEEK_END);
+  write(pager->wal_fd, &page_num, sizeof(uint32_t));
+  write(pager->wal_fd, pager->pages[page_num], PAGE_SIZE);
+}

And we add a call to wal_write() at the beginning of pager_flush():

 void pager_flush(Pager* pager, uint32_t page_num) {
+  wal_write(pager, page_num);
   ...
   // then write to the database file as before

Now the WAL contains a complete copy of every page before it hits the database file. If the database file gets corrupted, the WAL has what we need.

Crash Recovery

On startup, we check if the WAL file has any records. If it does, we replay them – writing each page from the WAL into the correct location in the database file:

+void wal_replay(Pager* pager) {
+  if (pager->wal_fd == -1) return;
+  off_t wal_size = lseek(pager->wal_fd, 0, SEEK_END);
+  if (wal_size <= 0) return;
+
+  printf("Replaying WAL (%d records)...\n",
+         (int)(wal_size / (sizeof(uint32_t) + PAGE_SIZE)));
+  lseek(pager->wal_fd, 0, SEEK_SET);
+
+  while (1) {
+    uint32_t page_num;
+    ssize_t n = read(pager->wal_fd, &page_num, sizeof(uint32_t));
+    if (n <= 0) break;
+    void* page_data = malloc(PAGE_SIZE);
+    n = read(pager->wal_fd, page_data, PAGE_SIZE);
+    if (n < (ssize_t)PAGE_SIZE) {
+      free(page_data);
+      break;
+    }
+    lseek(pager->file_descriptor, page_num * PAGE_SIZE, SEEK_SET);
+    write(pager->file_descriptor, page_data, PAGE_SIZE);
+    free(page_data);
+  }
+
+  /* Clear the WAL */
+  close(pager->wal_fd);
+  pager->wal_fd =
+      open(pager->wal_filename, O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+}

If the WAL replay finds a truncated record (from a crash during WAL write), it stops. The partially-written WAL record is discarded, but all complete records before it are applied. This guarantees that any page write that completed in the WAL will be recovered.

Checkpointing

When the database is closed cleanly, we’ve already flushed all dirty pages. The WAL is no longer needed, so we clear it:

+void wal_checkpoint(Pager* pager) {
+  if (pager->wal_fd == -1) return;
+  close(pager->wal_fd);
+  pager->wal_fd =
+      open(pager->wal_filename, O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+}

This is called in db_close() after all dirty pages are flushed:

+  wal_checkpoint(pager);
+  if (pager->wal_fd != -1) {
+    close(pager->wal_fd);
+  }
   int result = close(pager->file_descriptor);

The Write Path

Let’s trace what happens on an insert now:

  1. leaf_node_insert() modifies the page in memory
  2. pager_mark_dirty() marks it for write-back (and saves an undo copy if in a transaction)
  3. On db_close(), pager_flush() is called for each dirty page
  4. pager_flush() first calls wal_write() – the page goes to the WAL file
  5. Then pager_flush() writes the page to the database file
  6. After all pages are flushed, wal_checkpoint() clears the WAL

If the program crashes between steps 4 and 5, the WAL has the page. On the next startup, wal_replay() writes it to the database file, and the data is not lost.

How SQLite Does It

SQLite’s WAL mode is more sophisticated. Instead of writing to the WAL before the database file, it writes only to the WAL during normal operation. Readers check both the WAL and the database file. Periodically, a checkpoint operation transfers WAL pages to the database file. This allows concurrent readers and writers, which our simple implementation doesn’t support.

But the core principle is the same: write changes to a log first, ensure the log is durable, then apply the changes. If anything goes wrong, the log tells you how to recover.

That wraps up our implementation of the ACID properties. We have Atomicity (transactions), and now Durability (WAL). We’ll talk about what we’ve built and where to go from here in the next and final part.