Skip to content

Commit

Permalink
Update MZ post
Browse files Browse the repository at this point in the history
  • Loading branch information
frankmcsherry committed Jan 13, 2025
1 parent 67af48b commit cfb63c6
Showing 1 changed file with 46 additions and 29 deletions.
75 changes: 46 additions & 29 deletions posts/2024-11-25.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,27 @@
# Understanding Consistency in Materialize
# Strong Consistency in Materialize

[Materialize](https://materialize.com) draws data in from multiple external transactional sources of data, and provides a "consistent" view over the ensemble of data.
In fact, one of its primary contributions is the introduction of structure such that it both:
1. faithfully reflects the transactional state of each input, and
2. brings together multiple transactional inputs into a single timeline.
[Materialize]((https://materialize.com)) is a system that makes it easier to work with continually changing data.

The most common challenge with continually changing data is the continual change.
It's hard to be certain that the output you are looking at reflects the current reality, or even *any* reality.
Many other systems provide [eventual consistency](https://en.wikipedia.org/wiki/Eventual_consistency), the promise that if the changes stop you'll settle at the right answer, but until that happens no guarantees.
That's bad news when the change is continual: the outputs may be always nonsense.

Materialize provides a much clearer experience.
Every output Materialize produces is correct for some recent time, and we can even tell you what that time is.
These times are usually "within a second", and you can dial around your requirements here, but the results are always correct.

Many folks have been surprised about this guarantee when working with multiple upstream sources of data.
Indeed, one of Materialize's primary contributions is the introduction of structure such that it:
1. faithfully reflects the transactional transitions of each input source,
2. brings together multiple transactional inputs into one common timeline,
3. produces results that are exactly correct for each moment on that timeline.

How Materialize pulls this off is both subtle and at the same time surprisingly straight-forward.
We'll unpack how this unfolds, starting from your transactional source of truth, on through integrations with other sources, and across many independently authored and maintained SQL views.
It is not magic beans that violate fundamental theorems of distributed systems, but a relatively direct and potentially unsurprising combination of [virtual time](https://dl.acm.org/doi/10.1145/3916.3988) and [incremental computation](https://en.wikipedia.org/wiki/Incremental_computing).

By the end of the post, you should have a clear understanding of how [Virtual Time](https://dl.acm.org/doi/10.1145/3916.3988) can provide concurrency control mechanisms that compose with each other, and how Materialize uses Virtual Time to align its input transactional data, and provide always consistent outputs.
We'll unpack how this unfolds, starting from your transactional source of truth, extended to other upstream sources, and across many independently authored and maintained SQL views.
By the end of the post, you should have a clear understanding of how Materialize aligns its input transactional data, and provides always consistent outputs.

## Consistency and Change Data Capture (CDC)

Expand All @@ -27,8 +40,8 @@ product 0----------*----*-----**-------->
What we've drawn here for each is a line going from left to right.
Each starts at some initial moment `0--`, experiences updates at each `-*-`, up to its current state indicated by `-->`.

The vertical stacking of the lines means to suggest transactional consistency: tables may update at exactly the same time.
A "serializable" database is one where there is such a linear timeline: each transaction occurs atomically, in some total order.
The vertical stacking of the lines means to suggest transactional consistency: tables that update at exactly the same time.
A "serializable" database is one where there is such a linear timeline: each transaction appears to occurs instantaneously, in some total order.
Moreover, anyone looking at the data sees it at some moment in this timeline.
Let's represent this with a vertical line to indicate a transactionally consistent view.

Expand All @@ -42,8 +55,8 @@ product 0----------*----*--|--**-------->
I've dropped a vertical line at an arbitrary aesthetically appealing location, but everything we'll discuss checks out for *any* vertical line.
The vertical lines will define what it means to be transactionally consistent, for this post at least.

One of the most appealing properties of a database is that it masks the complexity of continually updating data, and presents as if your data moves through a sequence of consistent states.
If you were to drop in to the OLTP database and issue a query, the answer would be as if we stopped the world for long enough to get the precise answer at some moment.
One of the most appealing properties of a database is that it masks the complexity of continually and concurrently updating data, and presents as if your data moves through a sequence of consistent states.
If you were to drop in to a serializable OLTP database and issue a query, the answer would be as if we stopped the world for long enough to get the precise answer at some moment.

Consider for example, a reporting query like so:
```sql
Expand All @@ -54,18 +67,19 @@ Consider for example, a reporting query like so:
AND sales.amount > 100;
GROUP BY client.name
```
Although this brings together information from `client` and `sales`, with each record potentially altering some result, the output would be as if executed atomically at some vertical line dropped through the timelines of the tables.
Although this brings together information from `client` and `sales`, with each record potentially altering some result, the output would be as if executed instantly at some vertical line dropped through the timelines of the tables.
If every `sales.c_id` has a corresponding `client.c_id`, we will be sure to incorporate each of them.
If multiple sales were part of the same transaction, we'll see either all of them or none of them.

However, providing the appearance of transactional updates is taxing for an OLTP database.
Ad-hoc query processing interferes with the continual updates to the source tables, and the longer a query needs to run the greater the skew between its results and reality.
This is where Materialize comes in.
And we haven't even gotten to the multiple OLTP sources that don't know how to talk to each other.
This is where Materialize steps in.

## Differential Dataflow and Virtual Time

The timelines we've drawn are not only a helpful way of thinking about transactional systems, they are also a tool for ensuring consistency.
Specifically, [Virtual Time](https://dl.acm.org/doi/10.1145/3916.3988) is a concurrency control mechanism that asks for all updates to be explicitly timestamped, where the stamped times fully spell out the order in which commands are applied.
Specifically, [virtual time](https://dl.acm.org/doi/10.1145/3916.3988) is a concurrency control mechanism that asks for all updates to be explicitly timestamped, where the stamped times fully spell out the order in which commands are applied.
In Materialize, and [Differential Dataflow](https://github.com/TimelyDataflow/differential-dataflow) (the engine on which it builds), these timestamps are *the* tool for ensuring consistency.

Recall our sparkline from above, annotated with `OLTP` to remind us where that comes from.
Expand All @@ -76,13 +90,14 @@ product 0----------*----*-----**--------> \
sales 0----------*----*----***--------> /
```
Although not necessarily the case, imagine that each update `-*-` happens at an explicitly recorded moment in time.
Databases do not necessarily record updates by time, perhaps instead using say sequence numbers, but we will.
Materialize will assign explicit times to each update to ensure transactional consistency: all updates for any one transaction get an identical timestamp.
Databases do not necessarily record updates by time, perhaps instead using say sequence numbers, or no numbers at all, but we will use times.
Materialize will assign explicit times to each inbound update to ensure transactional consistency: all updates for any one transaction get an identical timestamp.

Concretely, Materialize represents all updates as triples `(data, time, diff)`.
The `data` component is the row that experiences a change.
The `time` component is the timeline position of the moment the update occurs.
The `diff` copmonent is best thought of as either "insert" or "delete".
* The `data` component is the row that experiences a change.
* The `time` component is the moment on the timeline when the update occurs.
* The `diff` copmonent is best thought of as either "insert" or "delete".

Transactional consistency is provided by having updates in a transaction use identical `time` coordinates.

These times are not just a helpful consistency idiom, but they tell us *exactly what we need to compute* to respond to a query at a time.
Expand Down Expand Up @@ -120,7 +135,7 @@ This is the "subtle, but also simple" moment.

Materialize sets up a framework that tells us what the correct answer needs to be for every time.
It then uses distributed, streaming, scale-out infrastructure to determine these correct answers.
And although the system internals are fanscinating and nuanced, the user experience and outcomes are meant to be simple and clear.
Although the system internals are fascinating and nuanced, the user experience and outcomes are meant to be simple and clear.
Your query results will be as if we stopped the world to compute them for you, and we'll shoulder the burden of doing it more efficiently than that.

---
Expand All @@ -147,8 +162,10 @@ As before, each timeline is exactly determined from its input timelines and the
Also as before, the exact correspondence is a basis for consistency.
If we drop a vertical line, we are able to align a consistent view over the inputs and their corresponding outputs.
This consistency comes despite the OLTP inputs and the SQL view computation being on two potentially independent systems.
The explicit timelines are the only mechanism coordinating the two systems.
They are nonetheless powerful enough to exactly correlate input data and output results.
The explicit timelines are the only mechanism coordinating the two systems, but they are nonetheless powerful enough to exactly line up input data and output results.

Virtual time (and SQL's semantics) tells us exactly what outputs we need to produce at each time.
Differential dataflow is the tool we use to computate and maintain these outputs.

## Materialize

Expand All @@ -158,8 +175,8 @@ Stepping back, there are several tasks Materialize performs that we'll want to c
1. Ingest each OLTP input as transitions on a common timeline.

Our examples above used a single OLTP input, with multiple tables, but you may have tables from multiple independent sources you are bringing together.
Materialize cannot make independent sources become consistent (a very hard distributed systems problem), but it can place all of them on a **common timeline**.
Each input will be internally consistent (i.e., transactions respected by Materialize), with an opinionated take about how their timelines interleave.
Materialize cannot make independent sources become mutually consistent (a very hard, perhaps ill-specified distributed systems problem), but it can place all of them on a *common timeline*.
Each input will be internally consistent (i.e., its transactions respected by Materialize), with an opinionated but invented take about how their timelines interleave.

2. Maintain the consistent timelines for any composition of derived views.

Expand All @@ -184,7 +201,7 @@ Let's unpack these tasks.

Materialize's [`CREATE SOURCE`](https://materialize.com/docs/sql/create-source/) command allows you to bring in a collection of transactionally consistent tables from an external upstream source.
The source is Materialize's unit of internal consistency: all tables from the same source will update in lock-step with the transitions of their input tables, always consistent with one another.
Updates to tables from different sources will be put in *an* order, by virtue of being put in a timeline, but that order may not reflect external causal constraints.
Updates to tables from different sources will be put in *an* order, by virtue of being put in a timeline, but that interleaving is something Materialize invents for you.

```text
<- consistent view ->
Expand Down Expand Up @@ -380,12 +397,12 @@ Nonetheless, Materialize uses change data capture to present the data as if you
Updates are always consistent, and the state of the system moves continually forward.

Your SQL business logic is potentially highly complex, and may rely on multiple sources of data.
Materialize uses the structure of Virtual Time to get a head start on your queries, precomputing their results and keeping them up to date as time advances.
Virtual Time also allows the integration of multiple upstream sources: once brought on to the same timeline, SQL queries across multiple inputs have specific answers Materialize can compute and incrementally maintain.
Materialize uses the structure of virtual time to get a head start on your queries, precomputing their results and keeping them up to date as time advances.
Virtual time also allows the integration of multiple upstream sources: once brought on to the same timeline, SQL queries across multiple inputs have specific answers Materialize can compute and incrementally maintain.

Your interactions with Materialize, queries specifically, also inhabit the same timeline, and result in precisely correct answers at the chosen times.
The way in which Materialize choose query times reflects the isolation guarantees you've requested, trading off responsiveness, freshness, and consistency.
The way in which Materialize choose query times reflects the isolation guarantees you've requested, trading off responsiveness and freshness, while maintaining consistency.
The timeline also provides a useful idiom for Materialize to report progress back to you, as a sequence of tasks that "complete" as they pass the query timestamp.

Although Materialize is complex under the hood, fascinatingly complex, it fundamentally aims to provide simplicity back to you.
Virtual Time and the consistent timelines it produces are the backbone of this simplicity.
Virtual time and the consistent timelines it produces are the backbone of this simplicity.

0 comments on commit cfb63c6

Please sign in to comment.