## Zero-Staleness: Like using your primary, but faster.

In this post we'll talk about how a new feature in Materialize can make working with data as fresh as if you were using your primary database.
And perhaps counter-intuitively, Materialize's reaction time, from data change to outcome reported, can even be *faster* than when using your primary.

Let's start by unpacking freshness, responsiveness, and reaction time.
As you ask questions of your data, there are three consequential moments to look at:
1. C: The moment you issue the command,
2. R: The moment you receive a response,
3. V: The moment reflected by the response.

You probably feel the C and R moments most viscerally.
They are respectively when you press return, and when the answer shows up in front of you.
The V moment is also critical for reaction time, though, and it's not always related to C and R.

I break the gaps between these moments down three ways (a small worked example follows the list):
1. **Replication lag** (from V to C). How long does it take data to reach your system?
2. **Response time** (from C to R). How long does it take to answer your question?
3. **Reaction time** (from V to R). How long does it take for data to influence an answer?

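To make these intervals concrete, here is a toy calculation with made-up timestamps (plain seconds on a shared clock; nothing here is Materialize-specific). When V comes before C, the reaction time is simply the sum of the replication lag and the response time.

```python
# Made-up timestamps, in seconds on a shared clock (illustrative only).
V = 100.0   # the moment reflected by the response
C = 102.0   # you issue the command two seconds later
R = 102.5   # the answer arrives half a second after that

replication_lag = C - V   # V to C: 2.0 s
response_time = R - C     # C to R: 0.5 s
reaction_time = R - V     # V to R: 2.5 s, the sum of the two when V comes before C
```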
When you use a [strict serializable](https://jepsen.io/consistency/models/strict-serializable) system, V will lie somewhere between C and R.
This means that if the response time (C to R) is small, your reaction time (V to R) is great too.
If you aren't using your primary database, you probably aren't getting strict serializability.

When you use a (non-strict) [serializable](https://jepsen.io/consistency/models/serializable) system, V may come before C and R.
In this case a fast response time may *not* indicate a fast reaction time.
You may get results quickly, but if they don't reflect reality you'll need to ask again.
And of course, by the time you get those answers they are already out of date.
The time you need in order to *react* to input changes is large, even in a responsive system.

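Here is a hypothetical pair of interactions, again with invented numbers, showing why the position of V matters: when V is pinned between C and R, the reaction time cannot exceed the response time, but when V trails C, a snappy response can still reflect stale data.

```python
def times(V, C, R):
    """The three intervals, in seconds, for one interaction (toy numbers only)."""
    return {"replication lag (V to C)": round(C - V, 3),
            "response time (C to R)": round(R - C, 3),
            "reaction time (V to R)": round(R - V, 3)}

# Strict serializable: V lands between C and R, so the reaction time is
# bounded by the response time. (A negative "lag" here just means the
# response reflects writes from after you pressed return.)
print(times(V=10.1, C=10.0, R=10.3))   # reaction 0.2 s, response 0.3 s

# Plain serializable: V may trail C by a lot, so the same snappy response
# time hides a much longer reaction time.
print(times(V=4.0, C=10.0, R=10.3))    # reaction 6.3 s, response 0.3 s
```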
When you use your primary database you may have the option of strict serializability, serializability, or even weaker isolation levels.
Most other solutions provide non-strict serializability.
The classic example is a read replica, which uses the replication log of the primary to populate and maintain a secondary, with some amount of replication lag.
Farther afield, you could replicate data out to a data warehouse, which usually introduces enough replication lag that the concept of "reaction time" shows up mostly in post-mortems: times are in hours, or days.

Materialize also replicates your data off of the primary's replication log, but it has a few tricks.

Materialize's first trick is the subject of this post.
Materialize is able to get the replication lag down to zero.
It does this by ensuring that V comes after C.
When you issue a command at C, Materialize can transact against the upstream primary to learn the current state of the replication log V, and then ensure that its response at R reflects at least everything through V.

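Here is a minimal sketch of that idea, not Materialize's actual implementation or API: the `primary` and `replica` handles and their methods below are invented for illustration. The point is only that the answer is held until local state covers everything the primary had committed at the moment you asked.

```python
import time

def answer_with_zero_staleness(query, primary, replica):
    """Conceptual sketch only: `primary` and `replica` are hypothetical handles,
    not a real client library."""
    # C: ask the upstream primary where its replication log currently ends; call that V.
    v = primary.current_log_position()

    # Wait until local state has ingested the log at least through V ...
    while replica.applied_log_position() < v:
        time.sleep(0.001)

    # ... and only then respond, at R, reflecting everything through V.
    return replica.execute(query)
```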
It's a surprisingly simple strategy to remove replication lag: just ... wait out the lag.

It's not as popular a strategy as you might think.
In most systems you first wait out the replication lag (V to C) and then the response time (C to R), meaning you end up with no better reaction time, and a worse response time to boot.
You do see a form of this approach with "read your writes" and "causal" consistency: you can use a moment in the replication log you have heard of to insist that your reads reflect at least that moment.

Materialize's second trick is what turns things on their head.
Materialize computes and then *incrementally maintains* query results.
It does not have to *first* wait out the replication lag and then start query processing.
Materialize can start the query immediately with the data it has, and update the results as the necessary bits of the replication log stream in.
The time taken is roughly the *maximum* of the time from V to C and the time from C to R, rather than their sum.

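A small timing simulation makes the claim concrete, with invented numbers and `asyncio` sleeps standing in for the real work: waiting out the lag and then processing the query costs their sum, while overlapping the two costs roughly their maximum.

```python
import asyncio
import time

REPLICATION_LAG = 2.0   # V to C, in seconds (invented)
QUERY_WORK = 0.5        # C to R if all the data were already on hand (invented)

async def wait_out_lag():
    # Stand-in for the relevant log entries streaming in from the primary.
    await asyncio.sleep(REPLICATION_LAG)

async def process_query():
    # Stand-in for computing the answer over data already present.
    await asyncio.sleep(QUERY_WORK)

async def sequential():
    # Wait out the lag, then compute: roughly lag + work.
    await wait_out_lag()
    await process_query()

async def overlapped():
    # Compute while the log streams in: roughly max(lag, work).
    await asyncio.gather(wait_out_lag(), process_query())

for strategy in (sequential, overlapped):
    start = time.perf_counter()
    asyncio.run(strategy())
    print(f"{strategy.__name__}: {time.perf_counter() - start:.2f}s")
```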
This change becomes more dramatic the more of your business logic you move to SQL views.
As with queries, Materialize can compute and incrementally maintain views.
However, it does this continually, without waiting for the queries that use those views.
When those queries arrive, so much of the work is already done that often it's just a matter of waiting out the replication lag.
This gets the reaction time down to as little as the time it takes to confer with the primary and confirm the maintained result is correct.

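To illustrate what "incrementally maintains" buys you at query time, here is a toy stand-in, not Materialize's implementation: a running sum per key that absorbs each change as it arrives, so reading the result later is just a lookup plus the freshness check sketched above.

```python
from collections import defaultdict

class ToyMaintainedView:
    """Toy stand-in for an incrementally maintained view: a running sum per key,
    updated as changes arrive rather than recomputed at query time."""
    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, key, delta):
        # Called once for every change in the replication stream.
        self.totals[key] += delta

    def read(self, key):
        # At query time the answer is already sitting there.
        return self.totals[key]

view = ToyMaintainedView()
for key, delta in [("widgets", 10.0), ("gadgets", 4.0), ("widgets", -2.5)]:
    view.apply(key, delta)

print(view.read("widgets"))   # 7.5, with no recomputation when you ask
```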
This brings us to a perhaps surprising conclusion: Materialize can provide a faster reaction time than the primary itself.
While the primary has no replication lag, the response time of OLTP databases is not always great, especially for complex queries.
Although Materialize has some replication lag, its response time can be so much better that, while the primary is still working on the query, Materialize has already updated its maintained results and responded.