Key(anti)chains
This week I was supposed to write about chapters 5 and 6 of the feedback control book but I had a busy week and so wanted to talk about something a bit more familiar. Next week will be the feedback control chapters.
I wanted to revisit the idea we talked about before of bounded queries. Queries that, to execute, only use O(1)
space and O(n)
time, with n
being the number of rows being processed.
I'm specifically talking about a query processing model based on Volcano, where every operator is an iterator of rows:
type Operator interface {
Start()
Next() (Row, bool)
Close()
}
Lots of useful queries can be expressed subject to this constraint:
- Selection, or filters, only use constant memory because they only need to look at a single row to determine if it matches the predicate or not.
- Projections, which rearrange or discard columns.
- Maps, which render new columns out of old ones.
These are the “linear” operators which are naturally constant memory since they can be computed on each row in isolation.
Some non-linear operators can also be implemented this way under certain conditions. An aggregation (like GROUP BY) in SQL can be done in constant memory so long as the index we’re scanning is ordered by the columns we’re grouping on.
Joins are an interesting example because they, in general, are not possible to implement with this restriction. A cross join for example (assuming we don't have the ability to restart our iterators) can't use constant memory because it has to buffer all of one side of the join.
Let's be more precise about what we mean. A JOIN B ON a_1 = b_1 AND ... AND a_n = b_n
is the join between A
and B
. It consists of all pairs of rows in A
and B
which satisfy the predicate. We can generalize this notion to multiway joins, but we can get by with just two relations for now. Since the predicate is all equalities between columns, this is called an equijoin, and (a_1, ..., a_n)
is A
's join key, and (b_1, ..., b_n)
is B
's join key.
A join is called 1:1
if for each row in A
, there is at most one matching row in B
. A join is 1:N
if for each row in A
there is potentially many matching rows in B
, but for each row in B
there is at most one matching row in A
. And M:N
if there are no restrictions, each row on each side might have many matches on the other side.
Of these, the only ones that potentially require nonconstant memory and superlinear time are M:N
joins. For a 1:N
join the size of the output is bounded by the size of B
, and we only need to buffer a row from the 1
side, and can stream rows from the N
side.
So: if we were to construct a query language that only disallows M:N
joins, it would have to have some way to statically determine if a join was M:N
. We're in luck, because the semantics of this are already well understood from the query planning world, where they're used by the internals of a query planner to detect opportunities for optimization.
A key is a set of columns in a relation which is guaranteed to uniquely identify a row. If we take any row and project it to the columns of the key, no other row will have the same set of values. This is usually endowed externally on a table by giving it a unique constraint, but it can also arise naturally via things like aggregation: all the grouping columns of an aggregation form a key.
When joining against a key, we are guaranteed that we won't encounter multiple matches. So if a join key is also a key, then we are guaranteed that we are not living in an M:N
join.
This suggests that we need a good understanding of how to talk about keys, and what kind of structure they form so that we can do analysis on them.
Notably, keys imply other keys: any superset of a key is also a key. So we probably don't need to store separately {x,y}
and {x,y,z}
as keys since the latter is implied by the former.
The set of sets of columns form a partially ordered set under set containment. This is not a big revelation: it's just the power set of a set which is a classic partially ordered set. Only slightly more surprising is that the set of keys (which is a subset of the set of sets of columns), is what's called an upwards closed set, or an upset: if X
is in the set, everything above X
is also in the set.
When we have an upwards-closed set in a set like this, we might have the idea that we can represent it more succinctly than "all of the sets," and this is true: we can represent it by only tracking all the minimal elements, those which have nothing beneath them. This is an antichain: a set of elements in a partially ordered set within which no two objects are comparable, and we can check if a set of columns is a key by checking if it's a superset of any elements of our antichain.
So, how would we represent this in a query language, so that only 1:N
joins are allowed? I'm not sure: I presume there is a way to set up a relational type system that makes it work out. I don't know enough about type systems to say.