ArrNF
frankmcsherry committed Oct 10, 2024
1 parent 12d3d12 commit 902df9c
Showing 1 changed file with 93 additions and 0 deletions.
93 changes: 93 additions & 0 deletions posts/2024-10-10.md
@@ -0,0 +1,93 @@

## Arrangement Normal Form (ArrNF)

Arrangement Normal Form (ArrNF) is a way to represent Materialize and Differential Dataflow computations.
Each term in ArrNF is headed by an optional arrangement-forming operator, and the body of the term forms no other arrangements.

The intent of ArrNF is to align the presentation of the planned work with the primary costs: arrangement formation and maintenance.
ArrNF is meant to simplify the questions of "where", "what", and potentially "why" a computation requires arrangement resources.
* **Where** does each arrangement occur?
* **What** does each arrangement contain?
* **Why** does each arrangement exist?

The quality of the answers may vary, but ArrNF aims to make arrangements, the subjects of these questions, the subjects of its representation.

ArrNF describes dataflows, which are composed of a sequence of basic blocks. Each block is an optional arrangement-forming operator atop any number of arrangement-free operators.
```
Dataflow ::= [ Block ]*
Block    ::= [ Arr ] Str
Arr      ::= ArrangeBy | Reduce | TopK | Threshold
Str      ::= Constant | Get
         ::= [ Map | Filter | Project | FlatMap | Negate ] Str
         ::= Join [ Get ]*
         ::= Union [ Str ]+
```
This grammar leaves various details unexplored, for example how `Get` references specific collections, how we determine whether a term is well-typed, and exactly what sort of behavior various operators have.
Subtly buried in here is that `Join` must not form arrangements, which means that "multi-way differential" joins are not covered, and they must be disassembled into their intermediate arrangement-forming parts.
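To make the grammar concrete, here is a minimal sketch of a checker for the `Block` and `Str` productions. The operator names come from the grammar above; the `Node` class and the `is_str` / `is_block` helpers are hypothetical names for illustration, and the sketch ignores typing and operator arguments entirely.

```python
from dataclasses import dataclass

# Operator names are taken from the grammar; everything else here is a
# hypothetical illustration, not Materialize's plan representation.
ARR_OPS = {"ArrangeBy", "Reduce", "TopK", "Threshold"}
UNARY_OPS = {"Map", "Filter", "Project", "FlatMap", "Negate"}
LEAF_OPS = {"Constant", "Get"}


@dataclass(frozen=True)
class Node:
    op: str
    inputs: tuple = ()


def is_str(node: Node) -> bool:
    """True if `node` matches the `Str` production (forms no arrangements)."""
    if node.op in LEAF_OPS:
        return len(node.inputs) == 0
    if node.op in UNARY_OPS:
        return len(node.inputs) == 1 and is_str(node.inputs[0])
    if node.op == "Join":
        # Join inputs must be plain references to existing collections.
        return all(i.op == "Get" and not i.inputs for i in node.inputs)
    if node.op == "Union":
        return len(node.inputs) >= 1 and all(is_str(i) for i in node.inputs)
    return False


def is_block(node: Node) -> bool:
    """True if `node` matches `Block ::= [ Arr ] Str`."""
    if node.op in ARR_OPS:
        return len(node.inputs) == 1 and is_str(node.inputs[0])
    return is_str(node)
```

A `Reduce` atop a `Project` atop a `Get` passes `is_block`, while a `Project` atop a `Reduce` does not, because the arrangement-forming operator is no longer at the head of the term.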

The `Str` term has four productions, corresponding to 0-ary "leaves", 1-ary linear operators, and variadic product (`Join`) and sum (`Union`) operators.
These terms admit further normalization, if we are interested: as all the terms are linear, they can be re-arranged to form
```
Str   ::= Union [ [ Negate ] Unary Leaf ]+
Unary ::= [ Map | Filter | Project | FlatMap ]*
Leaf  ::= Constant | Get | Join [ Get ]+
```
Said aloud: all stream terms are `Union`s of (optionally negated) unary operators, applied to either a constant, another collection, or a join of other (arranged) collections.
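The re-arrangement can be sketched as a small recursive flattening, relying only on linearity: unary operators distribute over `Union`, and `Negate` commutes with the other unary operators (two of them cancel). The `Node` class and `normalize` function below are hypothetical names, operator arguments are kept as opaque strings, and the sketch does not fuse adjacent operators (two stacked `Project`s stay as two).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    op: str            # operator name from the grammar, e.g. "Project"
    args: str = ""     # opaque operator arguments, e.g. "#0" or "l3"
    inputs: tuple = ()


LINEAR = {"Map", "Filter", "Project", "FlatMap"}


def normalize(node, negated=False, unary=()):
    """Flatten a Str term into a list of (negated, unary_ops, leaf) summands.

    `unary` holds the operators seen on the way down, outermost first.
    """
    if node.op == "Union":
        # Linear operators distribute over Union, so push the pending
        # negation and unary operators into every branch.
        out = []
        for child in node.inputs:
            out.extend(normalize(child, negated, unary))
        return out
    if node.op == "Negate":
        # Negate commutes with the other unary operators; two cancel.
        return normalize(node.inputs[0], not negated, unary)
    if node.op in LINEAR:
        return normalize(node.inputs[0], negated, unary + ((node.op, node.args),))
    # Constant, Get, or Join: a leaf of the normal form.
    return [(negated, unary, node)]
```

Each returned triple is one `[ Negate ] Unary Leaf` summand of the top-level `Union`.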

### An example

Consider the following fragment, taken from a recent incident reproduction:
```
cte l4 =
  Project (#1..=#19)
    Union
      Negate
        Project (#1..=#20)
          Join on=(#0 = #21) type=differential
            Get l2
            ArrangeBy keys=[[#0]]
              Distinct project=[#0]
                Project (#0)
                  Get l3
      Project (#1..=#20)
        Get l1
      Project (#1..=#20)
        Get l3
We can rewrite this in ArrNF as two blocks, with a division forced at the `Distinct` operator.
```
[l4.tmp0]
    Distinct Project(#0) Get l3
[l4]
    Union Negate Project(#2..=#20) Join on=(#0 = #21) Get(l2) l4.tmp0
                 Project(#2..=#20) Get l1
                 Project(#2..=#20) Get l3
The same information is present in both plans, but where the first provides nested scoping information about how the `Distinct` is used, the second calls out the moments where resources are used and what data must be captured.
Arguably cosmetic differences, like putting sequences of unary operators on one line, reduce the cognitive load compared with the same information in a tree representation.
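The slicing step itself can be sketched as a bottom-up walk that hoists every arrangement-forming operator not already at the head of a block into its own named block, leaving a `Get` behind. This is a sketch under assumptions: `Node`, `slice_blocks`, and the `.tmpN` naming scheme are hypothetical, `Distinct` is treated as arrangement-forming (as in the example), and no attempt is made to merge redundant arrangements or to compose adjacent `Project`s.

```python
from dataclasses import dataclass, replace

# Assumption: Distinct is listed alongside the grammar's Arr operators,
# since the example forces a block division at Distinct.
ARR_OPS = {"ArrangeBy", "Reduce", "TopK", "Threshold", "Distinct"}


@dataclass(frozen=True)
class Node:
    op: str
    args: str = ""
    inputs: tuple = ()


def slice_blocks(name, root):
    """Return [(block_name, term)] in dependency order; the last is `name`."""
    blocks = []

    def hoist(node, is_head):
        # Rewrite children first so inner arrangements get earlier blocks.
        children = tuple(hoist(child, False) for child in node.inputs)
        node = replace(node, inputs=children)
        if node.op in ARR_OPS and not is_head:
            # Move the arrangement into its own block and refer to it by name.
            tmp = f"{name}.tmp{len(blocks)}"
            blocks.append((tmp, node))
            return Node("Get", tmp)
        return node

    blocks.append((name, hoist(root, True)))
    return blocks
```

Applied to the `l4` fragment, this sketch produces three blocks rather than two: it keeps the `ArrangeBy` as `l4.tmp1` over `Get l4.tmp0`, where the hand-written rewrite above has no separate `ArrangeBy` block.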

### Reversing out SQL

One potential benefit of ArrNF is the ability to translate back to SQL.

Each `Str` term, in the second and more opinionated form, can be reversed out to SQL without much work.
The `Union` and optional `Negate` terms translate to `UNION ALL` and `EXCEPT ALL` (put the negative terms last).
The `Unary Leaf` pair becomes a mostly standard "select-project-join" term, with some caveats around `FlatMap` and whether it fits in SQL at all.

For example, the fragments above would become the following SQL:
```sql
CREATE VIEW l4 AS
WITH l4.tmp0 AS (SELECT DISTINCT #0 FROM l3)
-- Definition of l4 using l4.tmp0
SELECT #2 ..= #20 FROM l1
UNION ALL
SELECT #2 ..= #20 FROM l3
EXCEPT ALL
SELECT #2 ..= #20 FROM l2, l4.tmp0 WHERE #0 = #21;
```
Although this SQL likely has little to do with the source SQL, it can be understood by most readers.
It is a query that can potentially be run, modulo column referencing syntax, and its results counted and otherwise assessed by the user.
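As a rough illustration of how mechanical this translation is, the sketch below renders a list of normalized summands as one query, positive terms first and negated terms last. The `to_sql` function and its tuple layout `(negated, columns, sources, predicate)` are hypothetical, and the column syntax is passed through as opaque strings, so the output inherits the same "modulo column referencing syntax" caveat.

```python
def to_sql(summands):
    """Render (negated, columns, sources, predicate) summands as one query.

    Positive terms are combined with UNION ALL; negated terms follow,
    each introduced by EXCEPT ALL.
    """
    def select(columns, sources, predicate):
        sql = f"SELECT {columns} FROM {', '.join(sources)}"
        return sql + (f" WHERE {predicate}" if predicate else "")

    positive = [select(*s[1:]) for s in summands if not s[0]]
    negative = [select(*s[1:]) for s in summands if s[0]]
    if not positive:
        raise ValueError("need at least one non-negated term to start from")
    sql = "\nUNION ALL\n".join(positive)
    for term in negative:
        sql += "\nEXCEPT ALL\n" + term
    return sql


print(to_sql([
    (True, "#2 ..= #20", ["l2", "l4.tmp0"], "#0 = #21"),
    (False, "#2 ..= #20", ["l1"], None),
    (False, "#2 ..= #20", ["l3"], None),
]))
```

Putting the negated terms last matters because `UNION ALL` and `EXCEPT ALL` share the same precedence and associate left to right, so each `EXCEPT ALL` subtracts from everything accumulated before it.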

In principle, one could go as far as rewriting input SQL into ArrNF SQL, as a SQL-centric form of plan pinning.
ArrNF slices larger plans into basic blocks that are sufficiently simple that there is little to no further intra-block action for an optimizer, at least not that influences the arrangement resources used.
