Physical Properties #4
Previous parts of this series:
Relational query planning makes a distinction between "logical properties" and "physical properties." Logical properties are the attributes of a result set that are essential to it being considered correct, such as "which rows are present." Physical properties are those that are artifacts of the way we computed a result set, like the order of the rows in a stream.
One way I've defined physical properties in the past is that they are able to be "enforced." A property is physical if there's an idempotent operator that can bring a result set that doesn't satisfy a given property into one that does. An unordered relation can be sorted to get an ordered one, for example. Since these operators only change physical properties, they are a sort of logical no-op.
One thing about this physical/logical distinction is that it's particularly as clear cut as it first sounds. It's sort of fake. Or at least kind of hazy. There are aspects of even specific query languages, like SQL, that can be modelled either way.
When we declare something to be a physical property rather than a logical property, what we're doing in effect is constructing an equivalence relation on the set of result sets where we say result sets that differ only in physical properties are equivalent.
Ordering being a physical property rather than a logical one means things like the following equivalence hold:
[1 2 3]
=
[2 3 1]
There isn't really a way to turn ordering into a logical property, at least in SQL, since the language itself is so ignorant to ordering semantically. Consider another property though, like "does this result set need to be deduped?"
If I run the SQL query
SELECT DISTINCT * FROM abc
We might represent this as the tree
DistinctNode
^
|
Scan(abc)
But we might also just represent it as
Scan(abc)
With the required physical property that "this result set requires deduplication," which can be enforced with a Distinct enforcer. The difference here is that we wouldn't treat the distinct operation as an operator that can be pushed around and rewritten with algebraic rules, but rather as a property that at optimization time we would try inserting at various places to find the best spot for it.
Why would we pick one over the other? Well, there are tradeoffs for both.
In Cascades, the algorithm for optimization is something like this. We want an expression having a given set of physical properties, so we enumerate over all (non-strictly) weaker physical properties and see if the expression can give them to us. If it can, we insert an enforcer to go from those weaker physical properties to the ones we actually want. If we do this for every weaker properties, then we must find the optimal combination of "natural" properties and enforced according to our cost function.
fn optimize(group, desired):
let best = None
let best_cost = Infinity
for props in weaker(desired):
let enforced = enforce(request(group, props), props, desired)
if enforced.cost < best_cost:
best = Some(enforced)
return best
Of course, in practice, there are too many sets of physical properties weaker than any given set, since they're generally the product of a bunch of different individual physical properties, so we generally use some simpler heuristic algorithm to try them one at a time. You can see a real-world implementation of this idea in the Cockroach optimizer.
However, this means that the more things we make physical properties, either the more our heuristic algorithms will stray away from optimality, or the more we will succumb to combinatorial explosion when we try to descend down the poset of physical properties.
Lots and lots of things can be modelled this way, for instance, when a join occurs can be seen as a physical property
rather than a logical one. In general things being physical properties sort of descend into a kind of brute force search
that gives up a lot of the flexibility that Cascades offers and so I think generally when possible
it makes more sense to keep things as logical properties.
It's also easier to debug rewrite rules than the optimize
procedure since
you're tracking things in the memo as rewritten expressions rather than a
complicated recursive procedure.
I don't think anyone has done a full exploration of the design space here but I'm sure there are some clever tricks used to model various aspects of relational expressions in commercial optimizers. It's hard to get a sense for that stuff from the outside! If you work on such a tool that pulls these sorts of tricks I'd love to hear from you about it.