Materialize is a SQL database that uses Differential Dataflow for its computational layer.
When Differential Dataflow got invented, it introduced one fundamental novelty: incrementally updated iterative computation.
You haven't been able to use this in Materialize yet though, for various reasons not the least of which is that SQL's WITH RECURSIVE
clause is a bit of a mess.
The good news is that as of quite recently, Materialize has preliminary (behind the --unsafe-mode
flag) support for a tentatively named WITH MUTUALLY RECURSIVE
clause.
This clause differs from SQL's WITH RECURSIVE
in some important ways, and I'll explain what those are and why I'm excited about them.
UPDATE (2024): Materialize's recursive SQL support landed and has now been live for all users for over a year. The WITH MUTUALLY RECURSIVE
syntax has worked out great, users have responded positively, and you can read more about it and try it out yourself, both in the cloud and locally, as you prefer.
SQL99 introduced the very useful common table expressions (CTEs), and with them the RECURSIVE
modifier that allowed recursive common table expressions.
A common table expression allows you to use the WITH
clause to name some expressions and then use them multiple times in your query, without resorting to copy/paste:
-- Form the triangles (a, b, c) in a graph.
WITH
-- symmetrize directed edges
symm (a, b) AS (
SELECT a, b FROM edges
UNION
SELECT b, a FROM edges
),
-- use `symm` to find length-two paths.
path2 (a, b, c) AS (
SELECT DISTINCT e1.a, e1.b, e2.b as c
FROM symm e1, symm e2
WHERE e1.b = e2.a
)
-- Produce triples (a, b, c) where symm(a, c) and path2(a, b, c) exist.
SELECT DISTINCT path2.a, path2.b, path2.c
FROM path2, symm
WHERE path2.a = symm.a
AND path2.c = symm.b;
You can even use the bindings in subsequent expressions, as we did with symm
in path2
.
Excitingly, the SQL folks realized that something really neat happens if you allow a binding to refer to itself.
Hearts full of excitement (one imagines) they introduced the RECURSIVE
modifier that allows this.
WITH RECURSIVE
-- symmetrize directed edges
symm (a, b) AS (
SELECT a, b FROM edges
UNION
SELECT b, a FROM edges
),
-- LOOK THIS IS RECURSIVE!!!
reach (a, b) AS (
SELECT * FROM symm
UNION
SELECT symm.a, reach.b
FROM symm, reach
WHERE symm.b = reach.a
)
SELECT * FROM reach;
This is the classic example of recursion that you see in languages like Datalog, and StackOverflow pages discussing WITH RECURSIVE
, but relatively rarely in actual SQL queries.
Why is that?
As it turns out, WTIH RECURSIVE
has a bevy of limitations and mysterious semantics (four pages of limitations in the version of the standard I have, and I still haven't found the semantics yet).
I certainly cannot enumerate, or even understand the full list, and will defer to the likes of @teggy to expound upon the issues.
Fortunately, @teggy
does provide a worked example that encapsulates my confusion, that (in PostgreSQL at least)
mcsherry=# WITH RECURSIVE
t(n) AS (
VALUES (1)
UNION ALL
(
WITH t AS (SELECT * FROM t)
SELECT t1.n + t2.n AS n
FROM t AS t1, t AS t2
WHERE t1.n < 256
)
)
SELECT * FROM t;
n
-----
1
2
4
8
16
32
64
128
256
(9 rows)
mcsherry=#
There are so many things I don't understand here.
Why only powers of two rather than any of their sums?
Why no requirement that t2.n
be bounded?
Why isn't the result a fixed-point of the query that defines t
?
The above is an example of "non-linear recursion" (t
is used twice in the recursive term), which is both defined and forbidden in the SQL standard.
Except that the SQL standard defines linear recursion to be a query that references the recursive term only once, which is a syntactic rather than semantic constraint.
They seemed to forget that this was in the part of the standard (WITH
clauses) used to rebind names.
So according to the SQL standard the above query should be accepted as "linear recursion", and just has the crazy-pants semantics of "evaluate as if linearly recursive".
Materialize doesn't support SQL's WITH RECURSIVE
and based on the complexity of the spec may never support it.
Instead, it supports what I (naively?) think is a simpler, and yet more expressive fragment.
I'm a bit worried that I don't understand thet rationale behind the complexity of WITH RECURSIVE
, and I both expect and will be delighted to have holes poked in what Materialize does instead.
Materialize's WITH MUTUALLY RECURSIVE
clause allows a sequence of bindings, each of which can reference any binding in their body, followed by a body that can also reference any binding.
WITH MUTUALLY RECURSIVE
-- A sequence of bindings, all of which are in scope for all.
name1 (col1a type1a, col1b type1b, ..) AS ( select_clause1 ),
name2 (col2a type2a, col2b type2b, ..) AS ( select_clause2 ),
...
body_select_clause
The results of the clause are as if you start each binding from an empty collection, then update the definition of each binding in sequence, repeating the list of all bindings until no changes remain, and then evaluate the body with these final bindings. The computation may never stop, in which case .. there is no result and your computer will probably be busy for a while determining that. But if it does stop, the configuration of bindings will be a fixed point, and the clause returns some query over that fixed point.
The mystifying-to-me WITH RECURSIVE
query above can also be expressed using MUTUALLY
, as
materialize => materialize=> WITH MUTUALLY RECURSIVE
t (n int) AS (
VALUES (1)
UNION ALL
(
WITH t AS (SELECT * FROM t)
SELECT DISTINCT t1.n + t2.n
FROM t AS t1, t AS t2
WHERE t1.n < 256 AND t2.n < 256
)
)
SELECT * FROM t ORDER by n;
n
-----
1
2
3
4
[...]
507
508
509
510
(510 rows)
materialize=>
This produces what is in my opinion the expected fixed point of the query above: all values from 1 through 510. Rather than just the powers of two strictly less than 512. Which isn't even a fixed point of the update rule.
Let's discuss a few differences from SQL's WITH RECURSIVE
:
- We had to specify the type of the column of
numbers
. We require this to make the SQL type resolution substantially easier, and not involve a recursive fixed-point problem when coercable types are used. I can imagine we could relax this in the future, bit it isn't meant to be the most important difference. - We had to add the constraint
t2.n < 256
. The absence of this constraint from the SQL version, and its termination nonetheless, still blows my mind. Of course you have to bound this, otherwise we would continue increasingnumbers
through the contributions oft2
even with a boundedt1
. - We had to type
MUTUALLY
. We aren't implementingWITH RECURSIVE
correctly, so we have to call it something else. MySQL has a flag you can set to step away from SQL's semantics, but adding a new keyword seems easier for us at the moment.
The main other difference is in the limitations.
Whereas SQL has some four pages of restrictions, Materialize has none.
Put whatever query you want in the definition of a recursive thing.
Don't want to use a UNION
or UNION ALL
? Don't.
Don't want to use linear recursion? Me neither!
Want to put another WITH MUTUALLY RECURSIVE
clause in definition of a binding? Go right ahead, you devious villain!
Materialize having no restrictions has the comic potential to be a massive dumpster fire once we learn the very important reasons why SQL introduced the constraints. However, it seems the best way to elicit that information is with this sort of post.
Yes.
Maybe not to you, maybe not to people you work with, or whose work you follow, and that is fine. But yes.
Recursion or iteration are fundamental to programming languages. Languages without them are hobbled in their expressive power. Languages with restricted implementations of them can prevent the description of efficient computation. Languages either without, or with only limited forms, prevent their users from applying the full force of computer science.
I spent a fair few years needling folks in the Big Data and Dabatases spaces, pitting my laptop against their large and powerful computers. The secret (shhh!) was that I had access to more computer science than they did. Differential dataflow could express algorithms that they could not (or did not, because of pain). Perhaps their systems could, with human effort, effect the same computation, but why use a system or language that makes computer science hard?
Let's take a first example from the recent and readable A Fix for the Fixation on Fixpoints: undirected connectivity.
The algorithm they use is "label propagation": each graph node tracks the smallest identifier it knows of, starting with its own identifier and repeatedly consulting with its neighbors.
You can write this in SQL using WITH RECURSIVE
the same way we did reach
above, followed by a MIN
over the reachable nodes.
WITH RECURSIVE
-- symmetrize directed edges
symm (a, b) AS (
SELECT a, b FROM edges
UNION
SELECT b, a FROM edges
),
-- LOOK THIS IS RECURSIVE!!!
reach (a, b) AS (
SELECT * FROM symm
UNION
SELECT symm.a, reach.b
FROM symm, reach
WHERE symm.b = reach.a
)
-- Report the smallest reachable node.
SELECT a, MIN(b) FROM reach GROUP BY a
The paper observes that this query is frustrating because you cannot clearly communicate that as you develop reach
you can discard all but the smallest b
for each a
.
You could rely on a sophisticated query optimizer to determine that it can push the MIN
into the recursive definition.
However, if you and that optimizer disagree on what passes for "sophisticated", you are out of luck.
The paper proposes a WITH ITERATIVE
construct that makes some different choices than we did, but it also allows you to communicate what data are not required.
In Materialize we can write label propagation as
WITH MUTUALLY RECURSIVE
-- symmetrize edges
symm (a int, b int) AS (
SELECT a, b FROM edges
UNION
SELECT b, a FROM edges
),
-- iteratively improve all labels
label (a int, comp int) AS (
SELECT a, MIN(comp)
FROM (
SELECT a, a AS comp FROM symm
UNION ALL
SELECT symm.a, label.comp
FROM symm, label
WHERE symm.b = label.a
)
GROUP BY a
)
SELECT * FROM label;
You just describe how you should update label
each iteration, in this case by grouped by a
keeping the smallest comp
.
You don't need to end the definition with a UNION
especially if that isn't what you want cc
to have each iteration.
And indeed, in Materialize the memory footprint of this query will stay bounded as the iterations proceed.
A proponent of declarative languages might prefer the WITH RECURSIVE
version as "more declarative": you say what you want rather than how to get it.
A proponent of imperative languages might counter that at the end of the day someone has to implement this efficiently, and if you won't do it at least don't prevent me.
Fortunately, you can just write whichever you prefer.
A second example from A Fix for the Fixation on Fixpoints is the CYK algorithm for parsing context-free grammars. There they make the point that it (like other dynamic programming algorithms) are great examples where non-linear recursion is crucial.
-- Symbols, and literals each produces.
CREATE TABLE grammar_terms (lhs int, lit int);
-- Symbols, and two symbols each produces.
CREATE TABLE grammar_nonts (lhs int, rhs1 int, rhs2 int);
-- An input string with literals at positions.
CREATE TABLE input (pos int, lit int);
WITH MUTUALLY RECURSIVE
-- Ranges `[lower, upper)` that can be produced by `symbol`.
parses (lower int, upper int, symbol int) AS (
-- Base case: each literal is produced by some symbols.
SELECT pos, pos+1, lhs
FROM input, grammar_terms
WHERE input.lit = grammar_terms.lit
UNION ALL
-- Recursive case: two adjacent parses that follow the grammar.
SELECT p1.lower, p2.upper, lhs
FROM parses p1, parses p2, grammar_nonts
WHERE p1.upper = p2.lower
AND p1.symbol = grammar_nonts.rhs1
AND p2.symbol = grammar_nonts.rhs2
)
SELECT * FROM parses;
We use parses
twice in the recursive branch, and it is important for correctness that we do so.
It sounds like the "Fix" authors think you can get SQL's WITH RECURSIVE
to implement this with some head-balancing, but neither they nor I think that is a good idea.
For bonus points, imagine you want to know how to parse the input, rather than only if it parses.
You'd have to tweak the query to add to parses
breadcrumb columns about how to find derivations for each parses
row, for example columns via
, rhs1
, and rhs2
for the columns equated in the join.
However, you don't need to keep all the derivations for each row in parses
; one will suffice.
This is again a data reduction we could explain in the language, as with undirected connectivity, without which we risk a much less efficient implementation.
Turing completeness is the property of a language, framework, or system that it can simulate a Turing machine, the standard for "things a computer could possibly do". If your platform is Turing complete you can do all the things a computer can do, and if it is not Turing complete there is some class of things your platform just cannot do. This is usually worrying because if you end up needing to do any of those things, you are just out of luck.
Datalog, for example, is a recursion-friendly language that is not Turing complete.
SQL is Turing complete via WITH RECURSIVE
, but woe betide the casual person who needs to understand this (start reading here about cyclic tag systems).
Materialize is Turing complete via WITH MUTUALLY RECURSIVE
because you can just implement a Turing machine.
Let's implement a Turing machine!
We'll start with the configuration of the machine, its tape, and its transitions.
-- The head will hold the read position and machine state.
CREATE TABLE initial_head (pos int, state int);
CREATE TABLE initial_tape (pos int, symb int);
-- Halting states are encoded by setting `motion` to zero and `new_symb` to `old_symb`.
CREATE TABLE transitions (old_symb int, old_state int, new_symb int, new_state int, motion int);
If you want to try things, or see an example for the above, here are some inputs that accept input strings indicating the parity of their length.
-- Optionally, initial values that check parity of the input.
INSERT INTO initial_head VALUES (0, 0);
INSERT INTO initial_tape VALUES (0, 1), (1, 1), (2, 1), (3, 1), (4, 1);
-- We are checking even or oddness of the input tape.
INSERT INTO transitions VALUES
(0, 0, 0, 0, 0), -- on a blank, halt
(0, 1, 0, 1, 0), -- on a blank, halt
(1, 0, 1, 1, 1), -- on a symbol, toggle state
(1, 1, 1, 0, 1); -- on a symbel, toggle state
With these input tables, we can get the final head
position and state with the following query:
WITH MUTUALLY RECURSIVE
-- Track the machine's head and state.
head (pos int, state int) AS (
-- In the first round use `initial_head`; in later rounds use `head`.
SELECT * FROM head
UNION ALL SELECT * FROM initial_head
EXCEPT ALL SELECT * FROM initial_head_delay
-- Apply the movement of the machine
UNION ALL SELECT new_pos, new_state FROM action
EXCEPT ALL SELECT old_pos, old_state FROM action
),
-- Track the tape's contents; absent positions are read as blank.
tape (pos int, symb int) AS (
-- In the first round use `initial_tape`; in later rounds use `tape`.
SELECT * FROM tape
UNION ALL SELECT * FROM initial_tape
EXCEPT ALL SELECT * FROM initial_tape_delay
-- Apply the modification the head makes
UNION ALL SELECT old_pos, new_symb FROM action
EXCEPT ALL SELECT old_pos, old_symb FROM action
),
-- Determine what sort of transition to take.
action (
old_pos int, old_state int, old_symb int,
new_pos int, new_state int, new_symb int
) AS (
WITH
-- Read the symbol under the head from the tape.
-- Rewrite absent tape locations as blanks (`0`).
read (pos, state, symb) AS (
SELECT
head.pos,
head.state,
CASE
WHEN tape.symb IS NULL THEN 0
ELSE tape.symb
END
FROM
head LEFT JOIN tape ON (head.pos = tape.pos)
)
SELECT
read.pos, read.state, read.symb,
read.pos + t.motion, t.new_state, t.new_symb
FROM read, transitions t
WHERE read.symb = t.old_symb
AND read.state = t.old_state
),
-- Delayed versions of the input, to retract in the second iteration.
initial_head_delay(pos int, state int) AS (SELECT * FROM initial_head),
initial_tape_delay(pos int, symb int) AS (SELECT * FROM initial_tape)
SELECT * FROM head;
There is an awkward _delay
idiom used to present input only in the first round, but otherwise the update rules are probably just what you'd write with WITH RECURSIVE
if you were allowed to.
It even keeps tape
indexed by pos
and takes time linear in the number of machine actions taken before it halts.
How cool is that?
Recursion is important, and doing recursion well is important. If recursion is too complicated or too confusing, you miss out on the opportunity to express valuable things about the intent of your query. That's a pity, because many useful tasks require artful use of recursion to work effectively.
Fortunately, we are well-positioned to make recursion delightful. You don't need to take thing away from SQL other than the restrictions on recursion.
Also go read A Fix for the Fixation on FixPoints.
Materialize's WITH MUTUALLY RECURSIVE
is currently experimental, and may never land.
It certainly isn't optimized yet (the optimizer literally nopes out of any optimizations other than some normalization).
If we learn amazing things about why WITH RECURSIVE
is the way that it is, we may need to change what we have done!
We also have a late-arriving proposal from Jamie Brandon to use ACTUALLY
instead of MUTUALLY
.