A Query Planning Guideline
A number of readers last week reached out to direct my attention to What if a SQL Statement Returned a Database, which is a much more thorough and well-articulated explanation of the core idea—thank you!
Say I want to construct a plan for the following query:
SELECT a, sum(b) FROM
(SELECT a, a + 1 AS x, b FROM ab)
GROUP BY a, x
Perhaps we have parsed this query into a syntax tree and resolved all the names and types, giving us something like this:
Select {
    selections: [a, sum(b)],
    group_by: [a, x],
    from: Select {
        selections: [a, a + 1, b],
        names: [a, x, b],
        from: Table { name: ab },
    },
}
At a very rough level, our planning code might look something like this (note
that this is simplified and the rules for the semantics of SQL are a fair
bit more complicated than what's expressed here):
fn plan_from_source(src):
    if src.is_select():
        return plan_select(src)
    if src.is_table():
        return plan_scan(src)
    ...

fn plan_scan(table):
    return Scan { table_name: table.name }

fn plan_select(select):
    let input, scope = plan_from_source(select.from)
    let is_aggregation = any_aggregations(select.selections) or select.grouping_cols.is_some()
    let result = input
    result = Filter {
        input: result,
        predicates: plan_predicates(select.where),
    }
    let passthrough, computed = partition_projections(select.selections)
    result = Projection {
        passthrough: passthrough,
        computed: computed,
        input: result,
    }
    if is_aggregation:
        result = Aggregation {
            grouping_cols: select.grouping_cols,
            aggregations: extract_aggregations(select.selections),
            input: result,
        }
    return result
A natural translation of this syntax tree into a query plan might be something like this:
Projection {
    passthrough: [a, sum(b)],
    input: Aggregation {
        grouping_cols: [a, x],
        aggregations: [sum(b)],
        input: Projection {
            passthrough: [a, b],
            computed: [x => a + 1],
            input: Scan {
                table_name: ab
            }
        }
    }
}
Cool, seems pretty reasonable. Now, looking at this, we might observe an optimization: the way an aggregation works in SQL is that any two rows with the same values for the grouping columns go in the same bucket. In our query here, we're saying that any two rows with the same values of [a, x] should go in the same bucket. But we might cleverly observe that we don't need to look at both of those columns: it's sufficient to just look at one of them, say a, because any two rows that agree on a will also agree on x (x is just a + 1).
The way we make this kind of derivation mechanically is via functional dependencies. We have some code in our query planner that will look at the projection operator and say "aha, we actually have a {a} -> {x} functional dependency here."
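As a sketch of what that code might do, suppose (hypothetically; this representation is mine, not any particular planner's) that a projection records which input columns each computed expression references; each computed column is then functionally determined by exactly those columns:

def fds_from_projection(computed):
    # `computed` maps an output column to the set of input columns its
    # defining expression reads, e.g. {"x": {"a"}} for x => a + 1.
    fds = []
    for out_col, referenced in computed.items():
        # Rows that agree on `referenced` must agree on `out_col`, since
        # out_col is computed from those columns alone (assuming the
        # expression is deterministic, as a + 1 is).
        fds.append((frozenset(referenced), out_col))
    return fds

# Recovers the {a} -> {x} dependency from our example projection:
assert fds_from_projection({"x": {"a"}}) == [(frozenset({"a"}), "x")]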
You might think of these kinds of dependencies as "redundancies": in the set of columns {a, x}, x is "redundant" if we have a.
This suggests some kind of way to reduce a set of columns to a minimal set that doesn't contain any redundant columns. This process is not canonical (there can be more than one minimal set), but we can imagine a function which takes a set of columns and a set of functional dependencies and spits out a smallest such set: simplify({a, x}, {a} -> {x}) = {a}.
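Here's a minimal sketch of such a simplify in Python, reusing the made-up FD representation from the sketch above. The greedy drop order is an arbitrary choice, which is exactly where the non-canonicity comes from:

def closure(cols, fds):
    # All columns derivable from `cols` under the given FDs.
    result = set(cols)
    changed = True
    while changed:
        changed = False
        for determinant, dependent in fds:
            if determinant <= result and dependent not in result:
                result.add(dependent)
                changed = True
    return result

def simplify(cols, fds):
    # Greedily drop any column that the remaining columns still determine.
    result = set(cols)
    for col in sorted(cols):
        if col in closure(result - {col}, fds):
            result.discard(col)
    return result

assert simplify({"a", "x"}, [(frozenset({"a"}), "x")]) == {"a"}

With the dependencies {a} -> {x} and {x} -> {a}, for example, either {a} or {x} alone would be a valid answer, and the drop order decides which one we get.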
This would let us get to the simpler query plan:
Aggregation {
    grouping_cols: [a],
    aggregations: [sum(b)],
    input: Scan {
        table_name: ab
    }
}
A natural inclination, to me, is to insert this optimization into the code where we plan:
fn plan_select(select):
    ...
    if is_aggregation:
        let fds = compute_fds(result)
        let grouping_cols = simplify(select.grouping_cols, fds)
        result = Aggregation {
            grouping_cols: grouping_cols,
            aggregations: extract_aggregations(select.selections),
            input: result,
        }
    ...
This sort of makes sense, right? Instead of constructing the more complex thing, we should construct the simpler thing. But this is, I think, Bad Design:
Separation of Concerns
It might not look like it because this all seems in service of "query planning," but this design is actually a pretty clear violation of separation of concerns:
it entwines the building phase, where you translate a syntax tree into a query plan, with the optimization phase, where you make improvements to the plan.
These stages don't have to be temporally separated; this step can actually happen at this point in time, using techniques like Simplifying Expressions Bottom-Up. But teaching this part of the code how to simplify the set of columns would violate a boundary: this part of the code is concerned with translating the semantics of a SQL AST into the semantics of a query plan, and nothing else.
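To make the boundary concrete, here's one hypothetical shape for the separated version, in Python: plan_select always builds the fully general Aggregation, and the optimization phase owns a standalone rewrite rule. The Aggregation dataclass and the compute_fds parameter here are stand-ins I'm making up, not any particular planner's API:

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Aggregation:
    grouping_cols: frozenset   # columns to group by
    aggregations: tuple        # e.g. ("sum(b)",)
    input: object              # the child operator

def simplify_grouping_cols(op, compute_fds):
    # A rewrite rule owned by the optimization phase: shrink an
    # Aggregation's grouping columns using the FDs that hold on its
    # input. The building phase never needs to know this rule exists.
    if not isinstance(op, Aggregation):
        return op
    fds = compute_fds(op.input)
    smaller = frozenset(simplify(op.grouping_cols, fds))  # simplify() from earlier
    if smaller == op.grouping_cols:
        return op
    return replace(op, grouping_cols=smaller)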
Missed Future Opportunities
If, in the future, we construct an aggregation, we want this optimization to apply to it too. This is what bottom-up optimization gives us, where the optimizations are built into the constructors for the operators. First, because the initial translation of the syntax tree is not the only time we will build the operator: future transformations will also construct new aggregation operators, and we want this optimization to apply equally to them. Second, because we might not know all the FDs an expression holds until later. If our particular optimization framework permits, say, fully optimizing a subtree only later on, we might learn new facts about the subtrees that let us optimize them better then.
If we were to apply optimizations at each place we build expressions, they'd have to be duplicated across the various locations where we do that. Having an explicit "rewrite the tree" module (or a smart constructor, as sketched below) lets us avoid that.
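Here's the constructor-based variant, where the same centralization lives in a (hypothetical) make_aggregation helper that every construction site calls, so any aggregation built at any point during planning gets the simplification for free:

def make_aggregation(grouping_cols, aggregations, input, compute_fds):
    # Smart constructor: every site that builds an aggregation goes
    # through this helper, so the FD-based simplification applies no
    # matter when, or by which transformation, the operator is built.
    fds = compute_fds(input)
    return Aggregation(
        grouping_cols=frozenset(simplify(grouping_cols, fds)),
        aggregations=tuple(aggregations),
        input=input,
    )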
Indulgent Generalization
This one is not all that indulgent, because I want to go into it with the caveat that I don't really like it when people make connections between things that are merely aesthetically similar and attempt to draw practical conclusions that go beyond the aesthetic. Like when a problem has a "fixing one problem just introduces a new one" quality, and people who read the first 40% of Gödel, Escher, Bach feel compelled to call it "Gödelian," or something.
So I want to be clear that I'm only really drawing an aesthetic link between these two things. But I think this has a connection to a more general kind of design principle. I guess people call this "first make it work, then make it right, then make it fast," but I mean something more specific, which is that it's better to go from the general to the specific when it comes to solutions to certain kinds of problems. I'd call it "start general, become fast." It's better, architecturally here, to have the code that builds the query plan originally build a fully general, always-correct implementation, which we later specialize to the specifics of the problem at hand, rather than having it attempt to generate the optimized version from the get-go, which would violate separation of concerns.
I was in a conversation recently where someone was trying to get an answer for why it was necessary for a particular kind of service to provide linearizability. I didn't really have a particularly strong opinion in this case, but I felt like I had difficulty even imagining the shape of a compelling argument for such a thing. And I think it comes down to a similar thing: it's hard to justify that something should be stronger, absent some particular use case. I think the burden of proof for such an argument probably needs to go the other way: you start with a fully correct, general-purpose solution, and then, based on how you're using it, argue that you can remove some of the guarantees it has. Or, I don't know, maybe I'm just coming up with excuses for why I couldn't figure out how to navigate this conversation.