Deriving Functional Dependencies for Selections
Over the last two weeks you were babysat by the pre-scheduling feature of Buttondown, since I was away in Japan without my laptop. Here's me in Japan:
I had a nice trip. To hear more about what I thought, ask your nearest "white guy who likes video games" how he felt about going to Japan. Anyway, I'm back and writing these weekly again! Back to ol' reliable: query planning esoterica.
In query optimization it's important to be able to derive facts about relations prior to execution. One useful fact to derive is the set of functional dependencies (FDs) that any result of a given expression must obey. These let you do all kinds of optimizations, like simplifying grouping keys.
Recall that a relation satisfies a functional dependency X->Y if any two rows that agree on X also agree on Y (named such because the projection to XY is "functional"). A relational expression satisfies X->Y if any result of evaluating that expression satisfies X->Y.

For example, the result of selecting from a table with a set of columns X having a unique constraint will always have the functional dependency X->Y for every Y.
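To make the definition concrete, here's a minimal sketch of checking whether a set of rows satisfies X->Y. Rows are represented as dicts from column names to values; the representation and names are purely illustrative.

```python
# Check whether a set of rows satisfies the functional dependency X -> Y:
# any two rows that agree on X must also agree on Y.

def satisfies_fd(rows, X, Y):
    """Return True if any two rows that agree on X also agree on Y."""
    seen = {}  # projection onto X -> projection onto Y
    for row in rows:
        x_vals = tuple(row[c] for c in X)
        y_vals = tuple(row[c] for c in Y)
        if x_vals in seen and seen[x_vals] != y_vals:
            return False
        seen[x_vals] = y_vals
    return True

rows = [
    {"id": 1, "city": "Tokyo", "country": "Japan"},
    {"id": 2, "city": "Kyoto", "country": "Japan"},
    {"id": 3, "city": "Tokyo", "country": "Japan"},
]
print(satisfies_fd(rows, ["city"], ["country"]))  # True: city determines country here
print(satisfies_fd(rows, ["country"], ["city"]))  # False: "Japan" maps to two cities
```

Since "id" takes a distinct value in every row here, it behaves like a unique constraint, and {id}->Y holds for every Y, matching the example above.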
There are rules for deriving new FDs from existing expressions. Notationally, say the set of FDs satisfied by an expression R is FD(R).
Then, we can write various rules to derive FDs about more complex expressions. For example, the join of two relations has all the FDs of the two input relations:

    FD(R ⋈ S) ⊇ FD(R) ∪ FD(S)
The intersection of two relations also has all the FDs of the inputs:

    FD(R ∩ S) ⊇ FD(R) ∪ FD(S)
This shouldn't be that surprising since intersection is really just a special case of join.
This is because FDs are conjunctive: they're claims that all of the rows of a relation, or all combinations of its rows, satisfy a property. If a relation satisfies some FD F, and then we remove some rows from that relation, the result must still satisfy F. Since the intersection of R and S is both "R but smaller" and "S but smaller," it must satisfy all of the FDs of each of them.
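The "FDs survive removing rows" argument is easy to check empirically on small examples. Here's a sketch that verifies an FD continues to hold on every subset of a relation's rows (and an intersection is a subset of each of its inputs):

```python
# Empirical check of the claim above: if a relation satisfies X -> Y,
# then every subset of its rows does too.

from itertools import combinations

def satisfies_fd(rows, X, Y):
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in X)
        val = tuple(row[c] for c in Y)
        if seen.setdefault(key, val) != val:
            return False
    return True

rows = [
    {"a": 1, "b": 10},
    {"a": 2, "b": 20},
    {"a": 3, "b": 10},
]
assert satisfies_fd(rows, ["a"], ["b"])  # a -> b holds on the full relation

# ...and on every subset of its rows
for n in range(len(rows) + 1):
    for subset in combinations(rows, n):
        assert satisfies_fd(list(subset), ["a"], ["b"])
print("a -> b holds on all subsets")
```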
Some of these are easier to convince yourself of than others. One I think is particularly interesting is the selection, or filter, operator:

    σ_p(R)
This says: take all the rows in R and restrict them to the rows that satisfy the predicate p, where a "predicate" is an expression over columns that evaluates to true or false, like x = y. If we know the FDs of R, what do we know about the FDs of this expression?
There are a couple of things we can guess right off the bat. As with our intersection example, this must satisfy at least the FDs of R. Those just get passed through. But it probably also satisfies some additional FDs! Which ones, though?
Functional dependencies are fundamentally a property of a relation, or of a relational expression. To answer this question, though, we can introduce the notion of the functional dependencies of a predicate.
For a predicate p, let the relation induced by p be the largest, possibly infinite, relation whose rows all satisfy p, and write it p*.
Cautious readers might feel like we're crossing some weird wires here. There's a distinction to be drawn between a "Codd-style" relation used in databases, which is subject to relational algebra and is almost always finite, and a "math-style" relation, which is significantly more abstract, often infinite, and not really treated in the context of relational algebra. In math, someone might refer to the expression a = b and the relation with two columns which are always equal as the same object, for example. We're not going to do that, since we're trying to bring this all back to computer systems that actually need to process data, so to us the expression and the relation it induces are distinct objects.
There are some natural properties of p*. Notably, conjunction turns into intersection:

    (p ∧ q)* = p* ∩ q*
Let the functional dependencies satisfied by p be the functional dependencies satisfied by p*:

    FD(p) = FD(p*)
Okay, so now we have a reasonable notion of "the functional dependencies of a predicate." This is useful for the following reason: filtering a relation on a predicate p is equivalent to intersecting that relation with p*:

    σ_p(R) = R ∩ p*
Since we already know how to determine the functional dependencies of an intersection, this lets us derive the functional dependencies of a selection:

    FD(σ_p(R)) = FD(R ∩ p*) ⊇ FD(R) ∪ FD(p*) = FD(R) ∪ FD(p)
So filtering on a predicate adds its set of functional dependencies to the input.
In practice, it's easy to derive some of the simple FDs from a given predicate. The two most useful rules are that a column-column equality determines each column from the other:

    FD(x = y) ⊇ {x->y, y->x}

and that comparing a column to a constant c makes it, well, constant:

    FD(x = c) ⊇ {{}->x}
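As a sketch of what this looks like in an optimizer, here's a best-effort FD extractor over a made-up conjunctive predicate representation. The tuple encoding, FD representation, and function names are illustrative assumptions, not taken from any real system.

```python
# Best-effort extraction of FDs from a predicate given as a list of
# conjuncts. An FD X -> Y is a pair of frozensets (X, Y). Conjuncts are
# encoded as ("eq_col", x, y) for x = y and ("eq_const", x, c) for x = c.

def fds_of_predicate(conjuncts):
    fds = set()
    for c in conjuncts:
        if c[0] == "eq_col":       # x = y gives x -> y and y -> x
            _, x, y = c
            fds.add((frozenset([x]), frozenset([y])))
            fds.add((frozenset([y]), frozenset([x])))
        elif c[0] == "eq_const":   # x = constant gives {} -> x
            _, x, _val = c
            fds.add((frozenset(), frozenset([x])))
        # anything we don't recognize is safely skipped: best-effort
    return fds

def fds_of_selection(input_fds, conjuncts):
    # FD(sigma_p(R)) ⊇ FD(R) ∪ FD(p)
    return input_fds | fds_of_predicate(conjuncts)

# filtering on "a = b AND c = 4" adds a -> b, b -> a, and {} -> c
derived = fds_of_selection(set(), [("eq_col", "a", "b"), ("eq_const", "c", 4)])
for lhs, rhs in sorted(derived, key=lambda fd: (sorted(fd[0]), sorted(fd[1]))):
    print(set(lhs) or "{}", "->", set(rhs))
```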
Remember that FDs, like all analyses in the context of query optimization, are generally "best-effort": it's okay for correctness if we miss some FDs; it just means we might miss some rewrites that we could otherwise have done.
I think this is a cool way of thinking about things, even though it's sort of weird to go through infinite relations, which aren't really a part of the relational model. You could probably convince yourself this result holds without jumping through all these hoops, but personally I found this justification helpful: each step feels well motivated and doesn't require many leaps of logic.