Writing a sqlite clone from scratch in C
In the last part we added transactions with rollback. But there’s still a durability problem: if the program crashes while writing a page to disk, the database file could be left in a corrupted state – half-written data where a valid page used to be.
The solution is the write-ahead log (WAL). The rule is simple: before writing a page to the main database file, first write it to a separate log file. If the program crashes mid-write, the log has a complete copy of the page that can be replayed on the next startup.
This is the “D” in ACID – Durability.
We store the WAL as a separate file alongside the database, named <dbfile>-wal. Each record in the WAL is a (page_number, page_data) pair:
| page_num (4 bytes) | page_data (4096 bytes) | page_num | page_data | ...
We add a WAL file descriptor and filename to the Pager:
typedef struct {
...
+ int wal_fd;
+ char wal_filename[256];
} Pager;
Before every write to the main database file, we append the page to the WAL:
+void wal_write(Pager* pager, uint32_t page_num) {
+ if (pager->wal_fd == -1) return;
+ lseek(pager->wal_fd, 0, SEEK_END);
+ write(pager->wal_fd, &page_num, sizeof(uint32_t));
+ write(pager->wal_fd, pager->pages[page_num], PAGE_SIZE);
+}
And we add a call to wal_write() at the beginning of pager_flush():
void pager_flush(Pager* pager, uint32_t page_num) {
+ wal_write(pager, page_num);
...
// then write to the database file as before
Now the WAL contains a complete copy of every page before it hits the database file. If the database file gets corrupted, the WAL has what we need.
On startup, we check if the WAL file has any records. If it does, we replay them – writing each page from the WAL into the correct location in the database file:
+void wal_replay(Pager* pager) {
+ if (pager->wal_fd == -1) return;
+ off_t wal_size = lseek(pager->wal_fd, 0, SEEK_END);
+ if (wal_size <= 0) return;
+
+ printf("Replaying WAL (%d records)...\n",
+ (int)(wal_size / (sizeof(uint32_t) + PAGE_SIZE)));
+ lseek(pager->wal_fd, 0, SEEK_SET);
+
+ while (1) {
+ uint32_t page_num;
+ ssize_t n = read(pager->wal_fd, &page_num, sizeof(uint32_t));
+ if (n <= 0) break;
+ void* page_data = malloc(PAGE_SIZE);
+ n = read(pager->wal_fd, page_data, PAGE_SIZE);
+ if (n < (ssize_t)PAGE_SIZE) {
+ free(page_data);
+ break;
+ }
+ lseek(pager->file_descriptor, page_num * PAGE_SIZE, SEEK_SET);
+ write(pager->file_descriptor, page_data, PAGE_SIZE);
+ free(page_data);
+ }
+
+ /* Clear the WAL */
+ close(pager->wal_fd);
+ pager->wal_fd =
+ open(pager->wal_filename, O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+}
If the WAL replay finds a truncated record (from a crash during WAL write), it stops. The partially-written WAL record is discarded, but all complete records before it are applied. This guarantees that any page write that completed in the WAL will be recovered.
When the database is closed cleanly, we’ve already flushed all dirty pages. The WAL is no longer needed, so we clear it:
+void wal_checkpoint(Pager* pager) {
+ if (pager->wal_fd == -1) return;
+ close(pager->wal_fd);
+ pager->wal_fd =
+ open(pager->wal_filename, O_RDWR | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
+}
This is called in db_close() after all dirty pages are flushed:
+ wal_checkpoint(pager);
+ if (pager->wal_fd != -1) {
+ close(pager->wal_fd);
+ }
int result = close(pager->file_descriptor);
Let’s trace what happens on an insert now:
leaf_node_insert() modifies the page in memorypager_mark_dirty() marks it for write-back (and saves an undo copy if in a transaction)db_close(), pager_flush() is called for each dirty pagepager_flush() first calls wal_write() – the page goes to the WAL filepager_flush() writes the page to the database filewal_checkpoint() clears the WALIf the program crashes between steps 4 and 5, the WAL has the page. On the next startup, wal_replay() writes it to the database file, and the data is not lost.
SQLite’s WAL mode is more sophisticated. Instead of writing to the WAL before the database file, it writes only to the WAL during normal operation. Readers check both the WAL and the database file. Periodically, a checkpoint operation transfers WAL pages to the database file. This allows concurrent readers and writers, which our simple implementation doesn’t support.
But the core principle is the same: write changes to a log first, ensure the log is durable, then apply the changes. If anything goes wrong, the log tells you how to recover.
That wraps up our implementation of the ACID properties. We have Atomicity (transactions), and now Durability (WAL). We’ll talk about what we’ve built and where to go from here in the next and final part.