The Relational Derivative Pt. 2: Groups are Good
Last week we talked about the notion of the relational derivative and its application to enforcing integrity constraints in databases.
The relational derivative is defined by
And we can compute it by rearranging:
We can use this to do things like incrementally update a query in response to new data, or to tell whether or not a query will change when we make a mutation, like in the case of integrity constraints.
However! This definition is problematic in the world of SQL. A reader wrote in:
brother your definition only works for monotonic queries and is embarrassed by a simple
a NOT IN B
sql querypce out
pittsford t.
Pittsford is correct! This definition is inadequate to describe nonmonotonic queries in SQL. A monotonic query is one that satisfies
That is, if we have a over a database, and we add new rows to that database, only extra rows exist in the new database:
Monotonic queries have been the subject of much study in the world of distributed systems, since they can be used to model computations where you never have to “take back” an answer you gave in response to new information.
As Pittsford says, the following query is nonmonotonic in B:
because it violates our requirement above:
Pittsford’s point is that in SQL, there isn’t a coherent way to take the derivative of
Because we get the following:
Now, in SQL semantics, we can’t actually reassociate the subtractions of B like we might like to. Remember:
D select * from (VALUES (1)) EXCEPT (VALUES (1), (2));
┌────────┐
│ col0 │
│ int32 │
├────────┤
│ 0 rows │
└────────┘
So we wind up not really being able to simplify the derivative here in a meaningful way. At least in a way that is useful.
There is a way to resolve this somewhat, though, which is to work in a different kind of relational semantics than SQL provides. SQL semantics are such that every relation is a multiset of rows. A row may appear multiple times in a given relation:
D values (1), (1), (1);
┌───────┐
│ col0 │
│ int32 │
├───────┤
│ 1 │
│ 1 │
│ 1 │
└───────┘
You might also think of such a relation as a function which maps a row to the number of times it occurs, that is, maps from rows to {0, 1, …} (and you probably want them to be 0 almost everywhere).
From this perspective, we can modify our definition of what a relation is to have a more robust notion of a derivative by simply changing the codomain of this function from the natural numbers (0, 1, 2, …) to the integers (0, 1, -1, 2, -2, …).
That is, a relation is a map from rows to integers which is 0 almost everywhere. Each row can occur any positive or negative number of times in a relation.
dbsp calls these “Z-multisets,” some of my friends call them “free abelian groups.” You can mostly just think of them as some data structure like HashMap[Row, int]
. This solves Pittsford’s complaint though, because now our relations look like this:
And the derivative of
is simply
That is, d with its multiplicities inverted. Of course, this has some different semantics in many cases: before, adding an unrelated row like {4} would have no impact on the result of our query, but now we wind up with a {4: -1} term.
But we get a lot of nice properties by changing this! Being able to commute around various unions and negations is very nice and admits a lot of additional types of incremental computation that SQL’s model makes more difficult.
The trick here is that we changed our model of relations into a group, which is a very strong structure that admits a lot of extra reasoning not possible with mere multisets.
This is probably the last time I’ll write about this topic for a little bit at least so I will leave you with some fun exercises if you are tickled by the relational derivative and want to spend some more time thinking about it:
Write out the relational derivative for a
CHECK
constraint.Write out the relational derivative of an inner join, for each side of it.
Write out the relational derivative of an outer join, for each side of it.
Write out the relational derivative of a Distinct operator, which maps each nonzero multiplicity to 1.
Write out the relational derivative of a
SUM
aggregation.