Physical Properties #3
Last week we walked through how a query optimizer might use Physical Properties to optimize a query plan. This week, I want to talk through one surprising application of physical properties: using them to solve The Halloween Problem.
Some time ago, I watched this video, which was one of the CMU Quarantine lectures, given by Nico Bruno and Cesar Galindo-Legaria, where they talk about how they use Cascades in SQL Server.
They mention at one point that they use physical properties to provide protection from the Halloween Problem. As a lover of both physical properties and the Halloween Problem, you can imagine how my pupils dilated. Andy's reaction is similar—he's bewildered to learn about this. I was glad, because that was my thought as well: I'm pretty familiar with both of these things but I've never heard of a connection between them. They explain the general idea, but I didn't fully understand how it fit together at the time. After some noodling I figured it out, and I'm going to explain it to you today.
The Halloween Safety of an expression is the set of columns on which an update will not encounter the Halloween Problem.
The bottom element of Halloween Safety is the empty set: we provide no guarantees, you are not allowed to update any columns at all.
The enforcer for Halloween Safety is a "spool" operator. This is simply an operator which slurps up all of its input before it emits anything at all. A spool always provides Halloween Safety on every column, because before anyone can start processing its output, its input can no longer change.
Let's work with an example from my other post about the Halloween Problem.
Recall our table looks like this:
| employee | salary |
+------------------+--------+
| Don Chamberlin | 11000 |
| Pat Selinger | 16500 |
| Morton Astrahan | 21000 |
And we have a query plan that looks like this:
Update [ (employee, salary * 1.1) ]
^
|
IndexScan [ employees_salary_index, [0, 20000) ]
If we execute this naively, each time we process a row from the index scan, it will get inserted back into the index in a location after where our scan currently is located in the index, causing it to get processed again. This is the basis of the Halloween Problem.
The root of the problem is that our index is ordered on salary
. If our index was on employee
, or, if we were updating the employee
row instead, it would be safe to execute this operation. Since it would be safe to update employee
, the Halloween Safety of IndexScan
is {employee}
.
Since we do not have Halloween Safety on {salary}
, which is the column we're updating, we need to either read from a different index, or insert an enforcer. The algorithm for evaluating both of these options was discussed in part two of this series. But the key thing that makes this expressible as a physical property is fundamentally that we can solve it with an enforcer. Namely, a spool.
Update [ (employee, salary * 1.1) ]
^
|
Spool
^
|
IndexScan [ employees_salary_index, [0, 20000) ]
The Halloween Safety of Spool
is always the full set of columns.
One other reason this works so well being expressed as a physical property is that plenty of intermediate operators can provide Halloween Safety "naturally," like how an OrderedDistinct
will provide ordering naturally. The example they use in the video about SQL Server is a hash join where the table in question is on the build side of the hash table: all the input will have to spool into the hash table before any rows can be emitted from the hash join at all, and so it's safe to do whatever updates on any column. Since we know that this operators naturally provides Halloween Safety on every column, we can avoid inserting a spool operator where we might otherwise try to with a less sophisticated approach.
They mention in the SQL Server video that there's lots of stuff like this they manage to cast as a physical property. I'm curious what they're referring to, if anyone has any insight please let me know!
There's a lot of nonobvious examples of things that can be expressed this way, and lots of things that can reasonably go either way, either being considered logical or physical. For instance, whether a result set which should be distinct has actually been physically deduped yet. This can be logical, in which case we'd figure out where to place a logical distinct operation via rewrite rules, or it could be physical, in which case we'd use physical properties to figure out where to place a physical distinct operation via enforcers. There isn't a hard-and-fast rule for which way to go with properties, as far as I can tell, it seems like a matter of taste.