Update 2024-11-25.md
Minor patches as I read through it.
morsapaes authored Nov 26, 2024
1 parent 61703cc commit 171c575
Showing 1 changed file with 21 additions and 21 deletions.
posts/2024-11-25.md
@@ -10,7 +10,7 @@ We'll unpack how this unfolds, starting from your transactional source of truth,

## Consistency and Change Data Capture (CDC)

-Let's start with a hypothetical transactional source of business data.
+Let's start with a hypothetical transactional source of business data (e.g., PostgreSQL).
It will contain three tables, `product`, `client`, and `sales`, each containing the current state of the relevant information.
As time passes, these tables may change.
Let's draw a sparkline indicating the moments at which these tables change.
@@ -35,11 +35,11 @@ product 0----------*----*--|--**-------->
client 0----------*----*--|-*-*-------->
sales 0----------*----*--|-***-------->
```
-I've dropped an vertical line at an arbitrary aesthetically appealing location, but everything we'll discuss checks out for *any* vertical line.
+I've dropped a vertical line at an arbitrary aesthetically appealing location, but everything we'll discuss checks out for *any* vertical line.
The vertical lines will define what it means to be transactionally consistent, for this post at least.

One of the most appealing properties of a database is that it masks the complexity of continually updating data, and presents as if it goes through a sequence of consistent states.
-If you were to drop in to the OLTP database and issue a query, the answer would be as if we stopped the world stopped for long enough to get the precise answer at some moment.
+If you were to drop in to the OLTP database and issue a query, the answer would be as if we stopped the world for long enough to get the precise answer at some moment.

Consider, for example, a reporting query like so:
```sql
@@ -61,7 +61,7 @@ This is where Materialize comes in.
## Differential Dataflow and Virtual Time

The timelines we've drawn are not only a helpful way of thinking about transactional systems, they are also a tool for ensuring consistency.
-In Materialize, and Differential Dataflow on which it builds, they are *the* tool for ensuring consistency.
+In Materialize, and Differential Dataflow (the engine on which it builds), they are *the* tool for ensuring consistency.

Recall our sparkline from above, annotated with `OLTP` to remind us where that comes from.

@@ -83,7 +83,7 @@ product 0----------*----*-----**--------> \
WHERE sales.amount > 100 \ Differential
0----------*-----------*--------> / Dataflow (DD)
```
-The `WHERE` term gets its own timeline, consistent with all the other timelines.
+The `WHERE` clause gets its own timeline, consistent with all the other timelines.
This timeline is *exactly determined from* the timeline of the `sales` table.
Each `-*-` update in `sales` may (or may not) result in a corresponding update in the result.
We can determine the exact timeline, conceptually at least, by moving through time moment by moment, and observing how the output must change as a function of the input and the query logic.
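To make this concrete, here is a minimal Python sketch of that moment-by-moment derivation. This is not DD's actual implementation (which is a Rust library); the `(time, row, diff)` update shape and the `filter_timeline` name are illustrative assumptions.

```python
# Hypothetical sketch: deriving the output timeline of a WHERE filter
# from an input timeline. Each update is (time, row, diff), where diff
# is +1 for an insertion and -1 for a deletion.

def filter_timeline(updates, predicate):
    """Updates whose rows fail the predicate produce no output, so the
    output timeline is exactly determined by the input timeline."""
    return [(t, row, diff) for (t, row, diff) in updates if predicate(row)]

# Input timeline for `sales`: each `-*-` is one update.
sales_updates = [
    (1, {"amount": 50}, +1),
    (2, {"amount": 150}, +1),
    (3, {"amount": 50}, -1),   # the small sale is retracted
]

# Output timeline for `WHERE sales.amount > 100`: only the second
# input update survives; the others leave the output unchanged.
out = filter_timeline(sales_updates, lambda row: row["amount"] > 100)
```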
@@ -99,10 +99,10 @@ product 0----------*----*--|--**--------> \
WHERE sales.amount > 100 | \ Differential
0----------*-------|---*--------> / Dataflow (DD)
```
-Although the OLTP database and differential dataflow are not even running on the same system, we can still make specific statements about the consistency of the results.
+Although the OLTP database and Differential Dataflow are not even running on the same system, we can still make specific statements about the consistency of the results.
Each moment in the output timeline corresponds to a specific moment in the input timelines.

-Differential dataflow is fundamentally a tool for transforming input timelines to the exactly corresponding output timelines, for a small set of building-block operators.
+Differential Dataflow is fundamentally a tool for transforming input timelines to the exactly corresponding output timelines, for a small set of building-block operators.
In addition to `WHERE` (filtering), there are operators for `JOIN`, `GROUP BY`, and other primitives out of which one can build SQL.
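As a sketch of one more building block, here is a hypothetical `GROUP BY` (count) operator over an update timeline. Again an illustrative assumption, not DD's API: when an aggregate changes, the operator retracts the old value and inserts the new one, so the output is itself a valid timeline.

```python
# Hypothetical sketch: maintaining COUNT(*) GROUP BY key over a timeline
# of (time, key, diff) updates.
from collections import defaultdict

def count_by_key(updates):
    """Returns the output timeline of (time, (key, count), diff) updates
    for the maintained counts, with retractions of superseded values."""
    counts = defaultdict(int)
    output = []
    for t, key, diff in updates:
        old = counts[key]
        counts[key] += diff
        if old != 0:
            output.append((t, (key, old), -1))          # retract old count
        if counts[key] != 0:
            output.append((t, (key, counts[key]), +1))  # insert new count
    return output

sales = [(1, "widget", +1), (2, "widget", +1), (3, "gadget", +1)]
out = count_by_key(sales)
# The count for "widget" changes at time 2: its old value is retracted
# and the new value inserted, at the same moment on the timeline.
```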

Let's add the operators that correspond to our SQL view into the stack of timelines:
@@ -144,14 +144,14 @@ They are nonetheless powerful enough to exactly correlate input data and output

## Materialize

-Differential dataflow provides the building blocks for transforming timelines, but Materialize is what assembles those blocks into a full SQL experience.
+Differential Dataflow provides the building blocks for transforming timelines, but Materialize is what assembles those blocks into a full SQL experience.

Stepping back, there are three tasks Materialize performs that we'll want to call out in order to build a fuller system.
1. Ingest each OLTP input as transitions on a common timeline.

Our examples above used a single OLTP input, with multiple tables, but you may have tables from multiple independent sources you are bringing together.
-Materialize cannot make independent sources become consistent, a very hard distributed systems problem, but it can place all of them on a common timeline.
-Each input will be internally consistent (transactions respected by MZ), with an opinionated take about how their timelines interleave.
+Materialize cannot make independent sources become consistent (a very hard distributed systems problem), but it can place all of them on a **common timeline**.
+Each input will be internally consistent (i.e., transactions respected by Materialize), with an opinionated take about how their timelines interleave.

2. Maintain the consistent timelines for any composition of derived views.

@@ -190,12 +190,12 @@ Materialize resolves and locks down one source of ambiguity, so that all downstr

The problem of putting multiple unrelated sources in a consistent order is fundamentally hard.
While you may know that you update your MySQL before your PostgreSQL, no one else knows this.
-Database systems don't yet have great hooks for exposing these levels of cross-system constraints, and most solutions are bespoke (e.g. causality tokens).
+Database systems don't yet have great hooks for exposing these levels of cross-system constraints, and most solutions are bespoke (e.g., causality tokens).
Materialize's common timelines are one way to *introduce* this structure, and make it available going forward.
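One way to picture the common timeline is as a deterministic merge of per-source streams. The sketch below is a hypothetical Python illustration, not Materialize's ingestion machinery: each source is internally ordered, and the merge fixes one opinionated interleaving (here, by an assumed shared ingestion time).

```python
# Hypothetical sketch: placing updates from independent sources on a
# common timeline. Each update is (ingestion_time, source, payload), and
# each source's stream is already ordered by its own ingestion times.
import heapq

def common_timeline(*sources):
    """Merge per-source update streams into one stream ordered on the
    shared timeline, preserving each source's internal order."""
    return list(heapq.merge(*sources, key=lambda update: update[0]))

mysql = [(1, "mysql", "tx A"), (4, "mysql", "tx B")]
postgres = [(2, "pg", "tx X"), (3, "pg", "tx Y")]

merged = common_timeline(mysql, postgres)
# Neither source's transactions are reordered; the interleaving between
# the two sources is the part Materialize gets to choose.
```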

### View Maintenance

-Materialize maintains views using Differential Dataflow (DD), which as sketched above translates input timelines to output timelines.
+Materialize maintains views using Differential Dataflow (DD), which - as sketched above - translates input timelines to output timelines.
While DD ensures that the input and output timelines align perfectly, this comes at a cost: the output timelines are likely not immediately available.

Let's return to our example from before, but pay attention to the arrowheads `-->` indicating the extent of completed work.
@@ -222,7 +222,7 @@ At the same time, these are also where the lag, however slight, prevents you fro
Materialize, and DD underlying it, are optimized around reducing the lag of these arrowheads.
As much work is done ahead of time as is possible, so that the moment you say "show me now!" we are ideally just a few confirmations away from having the correct answer in hand.
Importantly, when we say "correct answer" we mean it.
-The dropped vertical means we will show you a result that corresponds exactly to your inputs at the same moment.
+The dropped vertical means we will show you a result that corresponds **exactly** to your inputs at the same moment.

Everything we've said about DD operators generalizes to SQL views.
```text
@@ -257,7 +257,7 @@ Each dropped vertical line corresponds to a "timestamp" on the common timeline.
How we choose timestamps reflects the core product principles behind Materialize: responsiveness, freshness, and consistency.
These three are often in tension, but let's see what each corresponds to in isolation:
1. **Responsiveness**: Always choose a timestamp to the left of (before) the arrowhead of the query.
-   This ensures that Materialize are always able to immediately answer your question; no waiting!
+   This ensures that Materialize is always able to immediately answer your question; no waiting!
2. **Freshness**: Always choose a timestamp to the right of (after) all input arrowheads.
This ensures that Materialize only responds with results that reflect the most recent input.
3. **Consistency**: Always choose a timestamp to the right of (after) all previously chosen timestamps.
@@ -266,7 +266,7 @@ These three are often in tension, but let's see what each corresponds to in isol
You can now see how these might be in tension.
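A tiny sketch may help show where the tension comes from. This is a hypothetical policy in Python, not Materialize's actual timestamp oracle: "arrowheads" are frontiers, and work is complete only for times strictly before them.

```python
# Hypothetical sketch: one timestamp-selection policy that prioritizes
# consistency and freshness, paying for them with responsiveness.

def choose_timestamp(query_frontier, input_frontiers, last_chosen):
    """Pick a timestamp that never goes backwards (consistency) and is at
    or beyond every input frontier (freshness). If the chosen timestamp
    has not yet been passed by the query's own frontier, the answer is
    not ready and the caller must wait (responsiveness lost)."""
    ts = max([last_chosen] + input_frontiers)
    must_wait = ts >= query_frontier
    return ts, must_wait

# The query's arrowhead is at 5, but one input has already reached 7:
# freshness forces a timestamp of 7, so the query must wait.
ts, must_wait = choose_timestamp(query_frontier=5,
                                 input_frontiers=[3, 7],
                                 last_chosen=2)
```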

Let's look more closely at the potential interactions of three chosen timestamps.
-Recall that multiple people are using Materialize at the same time, and they may have different goals.
+Recall that multiple people may be using Materialize at the same time, and they may have different goals.
```text
T0 T1 T2
| | |
@@ -296,10 +296,10 @@ While also not immediately up to date, it is available at a relatively recent ti
Combined with the `T0` use case, it should be clear how ensuring consistency (always go right) puts `T1`'s freshness in conflict with `T0`'s responsiveness.
They can't both get what they want at the same time, without some give.

-The `T2` timestamp is for a freshness absolutist, who needs to be sure that they are seeing results that reflect reality as of when the query was submitted.
+The `T2` timestamp is for a freshness absolutist, who needs to be sure that they are seeing results that reflect reality _as of_ when the query was submitted.
Imagine presenting a bank balance back to a customer, or checking inventory levels before confirming a purchase.
While the freshness is great, as good as it gets, there are significant responsiveness limitations.
-This level of freshness can be ensured by the "real time recency" or "zero staleness" feature.
+This level of freshness can be ensured by the ["zero-staleness"](https://materialize.com/blog/zero-staleness-faster-primary/) feature, which provides "real-time recency" guarantees.

### The Query Lifecycle

@@ -312,10 +312,10 @@ The timestamp corresponds to the vertical line, and its choice is a reflection o
There is some explaining to do about how your timestamp is chosen, which you can consult as you wait for your results.

But why are you waiting?
-We've chosen a timestamp; what prevents the immediate presentation of that information.
-The information you are looking for is essentially the progress bar of which arrowheads have passed your dropped vertical.
+We've chosen a timestamp; what prevents the immediate presentation of that information?
+The information you are looking for is essentially the progress bar for which arrowheads have passed the dropped vertical line.

-Let's return to the example above, and experience of a user assigned the `T1` timestamp.
+Let's return to the example above, and the experience of a user assigned the `T1` timestamp.
```text
T1
|
@@ -348,4 +348,4 @@ view big_sales refreshing
view analysis pending
```
As time advances, and arrowheads move rightwards, the arrowhead of `big_sales` will pass `T1`, changing to `ready` and moving `analysis` to the `refreshing` state, until it too advances to the right.
-As time advances, more and more of the query steps transition to `ready`, until they are all ready and you should have your response imminently.
+As time advances, more and more of the query steps transition to `ready`, until they are all ready - you should have your response imminently.
