Balancing SQLite’s WAL, SYNCHRONOUS=OFF, and fsync for fast rqlite recovery

rqlite is a lightweight, user-friendly, distributed relational database. It’s written in Go, employs Raft for distributed consensus, and uses SQLite as its storage engine.

The newly released rqlite 9.2 introduces a major improvement to startup performance – nodes can now resume from where they left off, instead of rebuilding their state from scratch on every restart. This change means that even if a node manages gigabytes of SQLite data, it can come back online almost instantly, with startup time no longer proportional to dataset size.

In this post, I’ll explore why this change matters, how it was implemented on top of the existing Raft system and SQLite WAL, and what it says about rqlite’s evolution.

From rebuilding to resuming

Since its inception over a decade ago, rqlite has taken a very conservative approach to correctness on restart. Each restart discarded the local SQLite database (both the main file and any WAL file) and rebuilt state by restoring any SQLite snapshot previously triggered by the Raft system and then replaying the Raft log. In other words, since the Raft system is the source of truth for rqlite, the node used the information contained in the Raft system, in its entirety, to rebuild its state.

This “always rebuild from scratch” strategy ensured the node started in a guaranteed-correct state, free from any potential corruption that might have occurred during a previous run. It also meant rqlite could run SQLite with SYNCHRONOUS=OFF – normally a risky setting, but one that delivers great write performance.

It was simple and robust – but as rqlite users’ clusters managed more data over the years, the cost of that simplicity became apparent. Restarting a node with hundreds of megabytes or even gigabytes of data could take minutes, as the node laboriously reconstructed the SQLite database from Raft.

Well, no longer. A node no longer rebuilds from scratch – it resumes. This is the most significant change to rqlite’s architecture since it switched to WAL mode almost 2 years ago. It fundamentally alters how a node comes back online after a shutdown or crash. Instead of blindly throwing away any existing SQLite file, rqlite will try to pick up right where it left off. The result is dramatic – startup times drop from minutes to seconds when working with multi-gigabyte databases. A rqlite node has essentially learned to wake up remembering its state, rather than reconstructing it from the ground up.

How rqlite ensures a safe resume

You might wonder: how can rqlite skip the rebuild safely, given that it historically ran SQLite in a mode that doesn’t fully guarantee on-disk durability? How can we ensure the SQLite database is consistent and safe on disk — but without flushing SQLite to disk on every write? The answer lies in a careful balance between performance and safety, implemented through dynamically changing SQLite’s synchronization settings and some additional metadata.

High-Speed WAL Mode with Periodic fsync: Normally rqlite runs SQLite in WAL mode with SYNCHRONOUS=OFF for maximum write throughput. This speeds up inserts and updates by avoiding a blocking fsync call on each transaction, but it carries a risk – if the OS crashes, the SQLite database might not be fully flushed to disk, potentially leaving it in an inconsistent state. It was this issue that prevented the SQLite database from being a source of truth across restarts — until now.
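To make the trade-off concrete, here is a minimal sketch – not rqlite’s actual code – of opening a SQLite database with this configuration, using Go’s database/sql package and the mattn/go-sqlite3 driver:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "data.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// PRAGMAs apply per connection, so for this sketch limit the pool to one.
	db.SetMaxOpenConns(1)

	// Enable WAL mode. The PRAGMA returns the resulting journal mode,
	// so read it back to confirm the switch took effect.
	var mode string
	if err := db.QueryRow("PRAGMA journal_mode=WAL").Scan(&mode); err != nil {
		log.Fatal(err)
	}
	fmt.Println("journal mode:", mode) // expect "wal"

	// Disable per-transaction syncing for maximum write throughput.
	// Durability of committed writes comes from the Raft log instead.
	if _, err := db.Exec("PRAGMA synchronous=OFF"); err != nil {
		log.Fatal(err)
	}
}
```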

With the release of rqlite 9.2, rqlite periodically performs a full fsync of the SQLite database. Specifically, whenever rqlite takes a snapshot of the SQLite database for storage in Raft, it temporarily switches SQLite to SYNCHRONOUS=FULL, checkpoints the WAL, and flushes all data to disk. This ensures that the SQLite file on disk represents a fully consistent checkpoint of the database at that point in time. After the snapshot is done, rqlite immediately switches SQLite back to SYNCHRONOUS=OFF mode to keep write performance high.
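The sequence looks roughly like the following sketch – illustrative rather than rqlite’s actual implementation, and the function name is hypothetical:

```go
package example

import (
	"database/sql"
	"fmt"
)

// syncForSnapshot shows the snapshot-time sync sequence: raise the synchronous
// level, checkpoint the WAL into the main database file, then drop back to OFF.
func syncForSnapshot(db *sql.DB) error {
	// Force subsequent writes, including the checkpoint, to be fully synced.
	if _, err := db.Exec("PRAGMA synchronous=FULL"); err != nil {
		return err
	}

	// Checkpoint and truncate the WAL so the main file holds all committed
	// data. The PRAGMA returns (busy, WAL frames, frames checkpointed).
	var busy, logFrames, checkpointed int
	if err := db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE)").
		Scan(&busy, &logFrames, &checkpointed); err != nil {
		return err
	}
	if busy != 0 {
		return fmt.Errorf("checkpoint blocked by another connection")
	}

	// Restore high-speed writes; until the next snapshot, durability of
	// committed writes is guaranteed by the Raft log.
	_, err := db.Exec("PRAGMA synchronous=OFF")
	return err
}
```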

Each successful snapshot now guarantees a durable point-in-time state on disk. In fact, during a clean shutdown, rqlite will proactively trigger a final snapshot so that the database file is completely synced before the process exits – meaning startup time will be at a minimum on the next restart. In short, rqlite’s use of WAL + SYNCHRONOUS=OFF writes gives great performance, and periodic sync points (the snapshots) provide a fully consistent copy on disk every so often. Because minutes normally pass between snapshots (though it depends on write load), this occasional flush-to-disk is effectively amortized over many, many writes. In this way write performance is not materially impacted.

Writes alter the WAL, never the main database: It’s important to understand that once snapshotting completes, write requests can be serviced again. As usual, once write requests reach consensus via Raft (and are stored safely in the on-disk Raft log), they are written to the SQLite WAL. The main SQLite file, which has been safely fsynced to disk, is never altered until the next snapshot, so it’s ready for use by any subsequent restart operation. On restart rqlite always deletes any WAL file it finds, as that data will always be rebuilt from the Raft log.
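A minimal sketch of that startup clean-up, assuming SQLite’s standard -wal and -shm sidecar file naming (the function name is hypothetical):

```go
package example

import "os"

// removeWAL discards any leftover WAL (and shared-memory) file next to the
// database at startup. Any writes the WAL contained are re-applied from the
// Raft log, so nothing is lost by deleting it.
func removeWAL(dbPath string) error {
	for _, suffix := range []string{"-wal", "-shm"} {
		if err := os.Remove(dbPath + suffix); err != nil && !os.IsNotExist(err) {
			return err
		}
	}
	return nil
}
```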

Recording a “Clean Snapshot” Fingerprint: How does rqlite know on restart that an existing SQLite file is safe to use? The trick is a small metadata file called a clean snapshot marker. Whenever a Raft snapshot completes, rqlite writes out a fingerprint of the SQLite database at that moment. This fingerprint (stored as a JSON file on disk) contains the database file’s last modification timestamp, size, and its CRC32. Writing this file is reliable – it’s fsync’ed to disk as well, so rqlite knows that if the file exists, it accurately reflects a synced state of the DB. (This marker file is deleted as the very first operation of the snapshotting process, so any interrupted snapshot will be detected the next time the node restarts.)
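To illustrate, here is a sketch of writing such a fingerprint. The struct fields, JSON layout, and function name are illustrative assumptions, not rqlite’s actual on-disk format:

```go
package example

import (
	"encoding/json"
	"hash/crc32"
	"io"
	"os"
	"time"
)

// cleanSnapshotMarker is a hypothetical representation of the fingerprint.
type cleanSnapshotMarker struct {
	ModTime time.Time `json:"mod_time"`
	Size    int64     `json:"size"`
	CRC32   uint32    `json:"crc32"`
}

// writeMarker records the database file's modification time, size, and CRC32,
// then writes and fsyncs the marker so its existence implies durable contents.
func writeMarker(dbPath, markerPath string) error {
	fi, err := os.Stat(dbPath)
	if err != nil {
		return err
	}

	// Hash the main database file exactly as it sits on disk.
	f, err := os.Open(dbPath)
	if err != nil {
		return err
	}
	defer f.Close()
	h := crc32.NewIEEE()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}

	b, err := json.Marshal(cleanSnapshotMarker{
		ModTime: fi.ModTime(),
		Size:    fi.Size(),
		CRC32:   h.Sum32(),
	})
	if err != nil {
		return err
	}

	out, err := os.Create(markerPath)
	if err != nil {
		return err
	}
	defer out.Close()
	if _, err := out.Write(b); err != nil {
		return err
	}
	return out.Sync() // fsync the marker itself
}
```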

On node startup, rqlite 9.2 performs a check: if there are any snapshots available, and this clean snapshot marker file is found, rqlite reads the expected modification time, size, and CRC32 from the file. It then checks those same attributes of the SQLite database file on disk. If they all match exactly, then it knows that the SQLite file was the one produced by the last successful snapshot. In this case, the node can trust the on-disk SQLite database and skip the usual restore from the Raft Snapshot.

rqlite then simply opens the existing SQLite file and resumes normal operations. Any Raft log entries from after that snapshot are then applied to the database (in practice they are written to the WAL). The end result is a node that is ready much faster than before.

If for some reason this check fails – perhaps the SQLite database was checkpointed but the Raft Snapshot operation failed to complete – then rqlite falls back to the old behavior. It will treat the existing file as suspect, delete all existing SQLite state, and restore it by applying the latest known-good Raft snapshot or replaying the log from scratch.

This approach means rqlite never risks starting from a potentially inconsistent database. It will only resume from the existing SQLite database when it’s confident the file is intact and current; otherwise, correctness takes priority and it rebuilds the state the slow way. In practice, with 9.2’s changes, the slow path will very rarely be needed, unless an abrupt crash occurs at a very specific point during rqlite’s operation.

But doesn’t a larger file mean more time to check the CRC32?

There’s one small but important detail. When a node starts up, it synchronously checks the SQLite file’s modification time and size. The CRC32 check, however, runs in a separate goroutine. If the modification time and size look right, rqlite assumes the database is good and starts serving reads and writes right away. A few seconds later, the CRC32 result comes in. If it matches the value stored in the marker file, nothing more happens. If it doesn’t, the process exits, alerting the operator to a problem.

This is safe because any new writes during those few seconds while the CRC is being calculated live in the Raft log and the SQLite WAL, not in the main database file. And since rqlite always deletes any WAL file at startup, exiting here is fine — the node can replay those writes from the Raft log when it restarts, probably in combination with a full restore from Raft next time round.
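Here is a sketch of that startup check, using hypothetical names and explicit expected values rather than rqlite’s actual types. The modification time and size are compared synchronously; the CRC32 is verified in a background goroutine that exits the process on a mismatch:

```go
package example

import (
	"hash/crc32"
	"io"
	"log"
	"os"
	"time"
)

// canResume reports whether the on-disk database matches the clean snapshot
// marker closely enough to skip a rebuild. The CRC32 check runs asynchronously.
func canResume(dbPath string, wantModTime time.Time, wantSize int64, wantCRC uint32) bool {
	fi, err := os.Stat(dbPath)
	if err != nil || fi.Size() != wantSize || !fi.ModTime().Equal(wantModTime) {
		return false // fall back to a full rebuild from the Raft system.
	}

	// Fast path looks good; verify the CRC32 in the background. This is safe
	// because new writes only touch the Raft log and the WAL, never the main
	// database file being hashed here.
	go func() {
		f, err := os.Open(dbPath)
		if err != nil {
			log.Fatalf("CRC check: cannot open database: %v", err)
		}
		defer f.Close()
		h := crc32.NewIEEE()
		if _, err := io.Copy(h, f); err != nil {
			log.Fatalf("CRC check: cannot read database: %v", err)
		}
		if h.Sum32() != wantCRC {
			// Exiting alerts the operator; the next start rebuilds from Raft.
			log.Fatalf("CRC check: database CRC mismatch, exiting")
		}
	}()
	return true
}
```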

What is the result?

The impact of this change is immediately noticeable. Startup times are now independent of data size – whether your node has 10 MB or 10 GB of data, a restart will be on the order of a second or two (basically the time to open the SQLite file), rather than scaling with the amount of data. Previously, a multi-gigabyte dataset probably meant many thousands of Raft log entries to replay and a huge snapshot to restore, leading to start times that could stretch into minutes. Now, as long as the node has a recent fsynced SQLite file, it just opens it and is ready to serve requests immediately. This reduces downtime for maintenance restarts or node reboots. Your rqlite cluster can come back from upgrades or reconfiguration much more quickly, improving overall availability.

To give a sense of the improvement: in one test, a rqlite node managing ~5 GB of SQLite tables used to take over a minute to fully come online after a restart. With rqlite 9.2, that same node restarts and begins serving reads in under a second. The only delay is opening the database file and verifying the snapshot modification time and file size – a constant-time operation that doesn’t grow with the data. Smaller datasets that might have taken, say, 30 seconds to recover now feel almost instantaneous.

It’s an obvious change, right?

A fair question at this point is, why only now? If the solution is to periodically fsync and check a file’s state on restart, why did it take over a decade to implement what sounds like a straightforward optimization? The answer, like rqlite itself, is about prioritizing correctness and the long, slow evolution of a stable system.

For years, the “always rebuild” strategy was the right one. It was simple, robust, and provably correct. The Raft log was the single source of truth, and the on-disk SQLite file was just a disposable cache. This approach eliminated an entire class of potential bugs, and in a database, correctness is the one thing you can never compromise.

Introducing a “fast resume” path meant fundamentally changing that model—it meant trusting the SQLite file. That’s a change I don’t take lightly. As rqlite isn’t driven by commercial pressures, I had the freedom to let this idea percolate for months, even years. I could think through every edge case, often ruling out entire design ideas before a single line of code was written.

Part of the delay was also a search for a more general, “perfect” solution—one that would work in all circumstances, regardless of the nature of the crash. It took years to truly understand how the system was evolving, learning the deep nuances of SQLite’s WAL behavior under real-world conditions. It also took time to become convinced that a pragmatic solution—one that optimizes for the common case (deliberate restarts) while falling back to the old, safe method for the exceptional case (an abrupt crash)—was the right trade-off.

But the most important reason: such a fundamental change could only be made once the rest of the system was rock-solid. This new feature rests on a foundation of 12 years of work. It depends on a battle-tested Raft implementation, a trusted snapshotting and log-truncation process, and a reliable recovery mechanism. Waiting this long meant that when this new logic was finally added, it was landing on mature, deeply stable software. It’s an evolution, not a revolution, and that’s what gives me confidence that it’s just as correct as the old way, only much, much faster.

This change feels like another turning point for rqlite. The project began as an experiment in distributed systems – essentially a demonstration that you could add Raft consensus to SQLite and get a fault-tolerant, consistent database. Over the years it grew in features and stability, but that original “always rebuild” approach was a vestige of its experimental origins. Now rqlite has matured to the point where it can wake up remembering its state rather than reconstructing it from scratch. It’s a small conceptual step, but one that signals a new level of practicality for the system.

The most obvious ideas sometimes arrive late, but when they do, they highlight how much groundwork was needed to make them possible.

Next steps – and the possible path to rqlite 10.0

With rqlite 9.2, operational life gets a bit easier for anyone running rqlite – especially those with large datasets. Restarts are no longer an ordeal or something to dread in your maintenance window. You can upgrade nodes or bounce a cluster member and have confidence it will be back online almost immediately. All this comes without compromising rqlite’s core promise of correctness. If a node can’t guarantee its on-disk state is perfect, it will simply do what it always did and recover from the canonical Raft log. But in the common case, rqlite now combines the performance of in-place restarts with the safety of Raft’s consistency guarantees.

If you’re upgrading from an earlier version, the transition is seamless – just update to 9.2, and the next time each node restarts, you’ll notice the difference. There’s nothing new to configure; the feature is automatic.

A closing thought. This work points the way to a time when only the SQLite file, a WAL file, and a Raft log exist — there will be no need for a second copy of the database in the Raft Snapshot store, and disk requirements will drop by half. Like all major changes to rqlite, that design will take time — time to mature, time to validate, and time to develop. But the way ahead is much clearer now.
