Heath's Theorem
We don't have all that much in the world of relational query planning that could be considered a "fundamental theorem," as in some central idea that everything else rests on. This is partly because we don't really have a lot of "theorems" in the first place: outside of more involved stuff like join bounds and join ordering algorithms, there aren't that many cold hard FACTS that get discussed. At least, not things that wouldn't be more accurately classified as an offshoot of statistics than as part of relational query planning.
The only thing I'm aware of that has a shot at being classified as such is Heath's theorem. And that's perhaps more of a tool for designing relational schemas in the first place, but I think it's fun, and pretty easy to understand, and one of those theorems that just kind of formalizes something you probably already understood, so let's talk about it.
A functional dependency is an invariant on a relation that states what values can appear in what contexts.
If a functional dependency X -> Y holds in R (where X and Y are sets of columns), that means that any two rows in R which agree on their values of X must also agree on their values of Y.
Some natural situations this kind of thing arises in are places where Y is derived directly from X, like if we computed Y as a function of X (this is sort of where the name "functional" comes from), or where Y is a natural consequence of X, like where Y is a state and X is a city.
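If you want to see that definition operationally, here's a minimal sketch in Python of checking whether an FD holds, assuming rows are represented as dicts from column names to values (the representation and the fd_holds name are mine, just for illustration):

```python
def fd_holds(rows, X, Y):
    """Check whether the functional dependency X -> Y holds in `rows`.

    `rows` is a list of dicts mapping column names to values; X and Y
    are sets of column names.
    """
    seen = {}  # each X-value, mapped to the Y-value it determines
    for row in rows:
        x = tuple(sorted((col, row[col]) for col in X))
        y = tuple(sorted((col, row[col]) for col in Y))
        if seen.setdefault(x, y) != y:
            return False  # two rows agree on X but disagree on Y
    return True
```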
I've written a bunch about functional dependencies in the past, so I'm not going to go into too much detail on them. We're just going to talk about Heath.
Heath's theorem states the following:
If R is a relation having column set U and having X -> Y as a functional dependency, then

R = Project(R, X + Y) JOIN Project(R, U - Y)
You better consider yourself lucky that I don't have access to LaTeX on this thing. As for notation here, JOIN is a "natural join," where we join on all like-named columns. + is set union, and - is set difference. Project restricts its relational first argument to the columns in its set second argument.
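To pin those operators down, here's a rough Python sketch of Project and the natural join over the same rows-as-dicts representation as before (a toy, not a real query engine):

```python
def project(rows, cols):
    """Project(R, cols): restrict each row to `cols`. Relations are
    sets, so we deduplicate the resulting rows."""
    deduped = {tuple(sorted((c, row[c]) for c in cols)) for row in rows}
    return [dict(items) for items in deduped]

def natural_join(r, s):
    """Natural join: keep every combination of a row from r and a row
    from s that agrees on all like-named columns."""
    out = []
    for a in r:
        for b in s:
            shared = a.keys() & b.keys()
            if all(a[c] == b[c] for c in shared):
                out.append({**a, **b})
    return out
```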
The main thing to do here is to untangle what exactly this claim is saying, because once you do, the rest falls out pretty naturally, so let's look at an example.
Here is R, a table of people, the towns they live in, and the regions those towns sit in:

| name  | region | town    |
|-------|--------|---------|
| Alice | West   | Bend    |
| Bob   | West   | Bend    |
| Carol | East   | Salem   |
| Dave  | East   | Concord |
In R, we have a functional dependency between region and town: town determines region, so we write {town} -> {region}.
So, the relevant sets for the statement of the theorem are:
U = {name, region, town}
X = {town}
Y = {region}
X + Y = {town, region}
U - Y = {name, town}
Computing the various projections, we get

| town    | region |
|---------|--------|
| Bend    | West   |
| Salem   | East   |
| Concord | East   |

and

| name  | town    |
|-------|---------|
| Alice | Bend    |
| Bob   | Bend    |
| Carol | Salem   |
| Dave  | Concord |
If you compute the join of these two relations (on town, of course), you'll find that they do indeed join to produce the relation we started with.
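With the project and natural_join sketches from above, you can check the identity mechanically; the rows here are just the example data from the table:

```python
R = [
    {"name": "Alice", "region": "West", "town": "Bend"},
    {"name": "Bob",   "region": "West", "town": "Bend"},
    {"name": "Carol", "region": "East", "town": "Salem"},
    {"name": "Dave",  "region": "East", "town": "Concord"},
]

U = {"name", "region", "town"}
X = {"town"}
Y = {"region"}

rebuilt = natural_join(project(R, X | Y), project(R, U - Y))

# Compare as sets of rows, since relations are unordered.
def as_set(rel):
    return {tuple(sorted(row.items())) for row in rel}

assert as_set(rebuilt) == as_set(R)
```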
The core trick here is in what X -> Y means. We're saying that if you know X, you know Y. Conceptually, you might think of that as meaning that there is a function which can take a value of X and produce the corresponding value of Y.
The simplest way to represent such a function is to just write out all the elements of its domain and the values in the codomain that they map to. If we are trying to represent the "squares to" relation for the first 5 integers, we could just write out the pairs (0,0), (1,1), (2,4), (3,9), (4,16), and that would be an adequate representation of that function.
It turns out we have a way to generate such a table when we have a functional dependency, and it is to just project to the relevant columns. This is what we're doing with Project(R, X + Y): we're writing out the function that brings a value in X to a value in Y.
Once we have this table, since the result is "functional in X," i.e., it has only one occurrence of each value in X, to "apply" it as a function is to simply join it with another relation. The other relation in our identity, Project(R, U - Y), is just throwing away these "redundant" columns in Y. Redundant because they are totally determined by X, and can be recovered by looking up the relevant value of X in the function table, which we do when we join them.
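Here's that idea in code, reusing the natural_join sketch from earlier on the "squares to" table from a moment ago:

```python
# The "squares to" function for the first 5 integers, written out as a relation.
squares = [{"n": n, "square": n * n} for n in range(5)]

# To "apply" it, natural-join it with any relation that has an `n` column.
inputs = [{"n": 2}, {"n": 4}]
print(natural_join(inputs, squares))
# [{'n': 2, 'square': 4}, {'n': 4, 'square': 16}]
```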
This is the essence of database normalization: decomposing tables like this to eliminate redundancy is basically the whole game, and Heath's theorem is a nice, formal way to talk about how to do it.