Integrity Constraints and the Relational Derivative
In a SQL database, you can set up a foreign key with REFERENCES
:
nullbitmap=# CREATE TABLE ab (a INT PRIMARY KEY, b INT);
CREATE TABLE
nullbitmap=# INSERT INTO ab VALUES (1, 10), (2, 20), (3, 30);
INSERT 0 3
nullbitmap=# SELECT * FROM ab;
a | b
---+----
1 | 10
2 | 20
3 | 30
(3 rows)
nullbitmap=# CREATE TABLE xa (x INT PRIMARY KEY, a INT REFERENCES ab (a));
CREATE TABLE
nullbitmap=# INSERT INTO xa VALUES (100, 2);
INSERT 0 1
nullbitmap=# INSERT INTO xa VALUES (100, 4);
ERROR: insert or update on table "xa" violates foreign key constraint "xa_a_fkey"
DETAIL: Key (a)=(4) is not present in table "ab".
In this example, we create two tables, ab
and xa
, where the a
column in xa
references the a
column in ab
. This means that for every row in xa
, we want the value of a
to be a valid reference to a row in ab
.
Foreign keys are basically relational pointers: we don't have stable memory addresses in a relational database so we need to use a more logical way to refer to other data. If we want to be sure that we don't point to any invalid "memory," we ask the database to ensure our foreign keys are always valid. This is called referential integrity.
How might we provide referential integrity, if we are implementing foreign keys in our database? Typically we add a check to every mutation to make sure they don't do anything bad. Let's just talk about INSERT
for the moment. The most obvious thing to do is to do a check right before we commit (if you know what DEFERRABLE INITIALLY DEFERRED
means then stay quiet, not today), maybe we write some code like this:
def insert_txn(table, rows):
txn = begin_transaction()
for fk in table.fks():
for row in rows:
value = row[fk.referencing_column()]
if !txn.get(
table=fk.referenced_table(),
index=PRIMARY_INDEX,
column=fk.referenced_column(),
value=value,
):
txn.insert(table, txn)
else:
throw new Error("inserted row doesn't reference a valid row")
txn.commit()
For each foreign key constraint on the table we're inserting into, for each row that we're inserting, we check that the foreign key is valid by doing a lookup into the other table.
This is sort of fine, it's a little manual, a little overly explicit, a little brittle (what if I don't want to look up against the primary index? What if I insert a row and reference that row in the same transaction?). It adds an entirely new way of accessing data in the database that might need to be accounted for down the line. The worst thing, though, I think, from a query planning philosophy perspective, is that it completely sidesteps the query planner's authority as the primary decisionmaker regarding data access. We're saying "I have this question about my database that I want answered," namely, "do all of these values exist in this table," and then computing it manually. We have a tool for figuring out how to answer questions! It's the query planner!
Some databases have a general-purpose mechanism for expressing this sort of thing that also works for other kinds of constraints, called postqueries. Before we get to that, though, let's take a detour.
The thing about foreign keys is that really, the natural way to define them is not the way we described enforcing them. It's not that "for every row that I insert, I want to check if the things it references are valid." That's very algorithmic and mechanical and not really what we want as a declarative definition. And it's probably not even what you'd use! Probably you'd say something more like, "every row in table A should reference a row in table B." One way to express such a thing is with, well, a query. We could change that definition to be expressed as, "the following query must return no results:"
Okay, that's kind of ugly. There's actually a fairly common relational operator that expresses more directly what we want, called antijoin, often written
which says "return all the rows in A that do not have a matching row in B." This is the companion operator to semijoin which is its complement, all the rows in A that do have a match in B:
(Imagine I used \ltimes
instead of \overline\times
but I think that does not render in this Latex dialect)
So, we can express our foreign key constraint by specifying the query
So how do we turn this into something actionable we can do on a particular insert? Well, there is a special operation in relational algebra which is the relational derivative, which is defined over a starting dataset D and a relational query f and written fD which satisfies:
This says "the output of the operation when applied to database D, union the output of the derivative applied to change d, gives the output of the operation at D union d" (I made up this notation although maybe more standard notation exists, this is a commonly understood idea, though. Excuse some liberties I take with formality).
Basically, the relational derivative of a query lets us see how the output of the query will change if we change the input. This is useful for us as an integrity constraint! Because we want to see how any given mutation will impact the status of the integrity constraint. Let's start from our constraint:
We can compute the derivative of this (f = A antijoin B, with respect to A):
So it turns out that the derivative is just again the antijoin (this is not surprising, since antijoin is linear). But it tells us something useful! It tells us that in order to verify that adding new rows to A will not invalidate our integrity constraint, indeed, we just need to check that all our rows reference a row in B, which we can do by executing the antijoin.
The cool thing is that this technique can generalize to more complicated integrity constraint and to the other kinds of mutations. It's also fun to take the derivative of relational operators. As Alex Miller pointed out to me once, you do have to be careful about using this technique at lower isolation levels, it only obviously works at serializable.
Anyway, back to our original problem, how we check for integrity constraints. The way we solved this in CockroachDB was that we would emit a set of "postqueries" to be checked after the execution of the query, like an antijoin (for insertions) or a semijoin (for deletions) or both (for updates). This approach is really good: it lets you use the full machinery of the query optimizer to optimize these checks. You already have a tool that tells you the best way to answer a particular question about your database, you best use it.