SQL Scoping is Surprisingly Subtle and Semantic
It is not my intention with NULL BITMAP to dwell forever on somewhat obscure, unintuitive SQL-isms, but...with that said, I know a lot of you out there are SQL weirdos...so let's fire up the ol' SQL database and run some queries!!
> CREATE TABLE aa (a INT);
> INSERT INTO aa VALUES (1), (2), (3);
> CREATE TABLE xx (x INT);
> INSERT INTO xx VALUES (10), (20), (30);
Here's the question, what are the results of the following queries?
> SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
> SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
> SELECT (SELECT sum(x) FROM xx LIMIT 1) FROM aa;
> SELECT (SELECT sum(a+x) FROM xx LIMIT 1) FROM aa;
I will wait while you try to figure them each out. I will write some extra stuff in here to pad things out so you do not accidentally spoil yourself and look at the answers.
To fill some space I will do some housekeeping: We are now 12 posts in. If I had stuck to my original designs of one post per month we'd have gone a year together by now. It's been a lot of work to write one of these per week but it's been satisfying to hear that people enjoy them, so if you read them thanks for spending your time with me!
Ok, that's enough space that you won't accidentally spoil yourself. Here are the answers:
> SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
3
3
3
> SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
6
> SELECT (SELECT sum(x) FROM xx LIMIT 1) FROM aa;
60
60
60
> SELECT (SELECT sum(a+x) FROM xx LIMIT 1) FROM aa;
63
66
69
This actually blew my mind the first time I saw it. Like, it seemed to me like one thing you might reasonably expect of SQL's semantics is that the world of scalar expressions like a
or x
or 37
or whatever, are "downstream" of the relational world. That is to say, if you were to swap out any given scalar expression for another, the relational operators in the plan would be unchanged (modulo any, like, optimizations that would imply).
It turns out none of that is true though! As this example shows, you can actually completely change the structure of a SQL query by swapping around some scalar values.
Let's start from the beginning, though, since the weirdness here can only be understood by first understanding the parts that the queries are made up of.
SELECT
is the basic building block of SQL queries, we use it to get data out of a table:
> SELECT a FROM aa;
1
2
3
Fine, fine. Easy. Next is sum
. This is an "aggregate." If there's an aggregate in a SELECT
list, it turns the SELECT
into an aggregation:
> SELECT sum(a) FROM aa;
6
Which is, syntactically, a bit weird, admittedly. You want to take the sum of the set of a
s, but it sort of looks like you're taking...the set of the sums? Or something? Anyway, this is how it works. The designers of SQL presumably wanted aggregations to be very syntactically quiet, and this is how we ended up here.
The next weird thing here is what's called a "subquery." A subquery is a relational expression in a scalar position:
> SELECT (SELECT 1), (SELECT 2) FROM aa;
1,2
1,2
1,2
Conceptually, each subquery will be rerun every time the scalar expression is evaluated (though databases have lots of clever tricks to avoid actually doing that).
Notably, subqueries need to return one row and one column. More rows, and they error:
> SELECT (SELECT * FROM aa)
ERROR: more than one row returned by a subquery used as an expression
This is why I threw a LIMIT 1
on each of the queries we opened up with, in case some of them returned more than one row.
Ok, so, we're ready to tackle the original queries. Let's start with the easiest one:
> SELECT (SELECT sum(x) FROM xx LIMIT 1) FROM aa;
60
60
60
This query will turn into a plan that looks something like this:
for a in aa:
sum = 0
for x in xx:
sum += x
emit(sum)
Which is probably what you expected. The sum(x)
gets aggregated at the level of the SELECT ... FROM xx
.
I'd say the next simplest is this one:
> SELECT (SELECT sum(a+x) FROM xx LIMIT 1) FROM aa;
63
66
69
The plan here looks like this:
# SELECT (SELECT sum(a+x) FROM xx LIMIT 1) FROM aa;
for a in aa:
sum = 0
for x in xx:
sum += a+x
emit(sum)
This is not too weird either, the first row gets computed as (10 + 1) + (20 + 1) + (30 + 1)
, the second as (10 + 2) + (20 + 2) + (30 + 2)
, and the third as (10 + 3) + (20 + 3) + (30 + 3)
.
The third one is where things get a bit weird:
Its execution looks something like this:
> SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
6
# SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
sum = 0
for a in aa:
sum += a
for x in xx:
emit(sum)
break
But why! You might reasonably expect that the aggregation for the sum(a)
would happen at the level of the xx
SELECT
, after all, that's how the previous one worked. If that was how it worked, we'd expect to get 3
, 6
, 9
as the result.
It turns out that aggregations climb up the tree of SELECT
s until they find one that provides a column they reference. Since this one references a
, it creeps up the tree until it finds the source of a
, and then that's the level that aggregation happens at. Since a+x
referenced x
as well, it was already at an appropriate level, and could aggregate there.
What about the last one?
> SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
3
3
3
This appears to, again, aggregate at the xx
level. Well, it turns out that the rule I described doesn't work if the aggregate references no columns at all, so that's just a special case, and it aggregates at the level of the nearest SELECT
. For completeness, its plan would look something like:
# SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
for a in aa:
sum = 0
for x in xx:
sum += 1
emit(sum)
To summarize, the rule is:
Aggregation happens at the level of the closest
SELECT
that contains at least one of the free variables of the aggregation, OR the nearest one if it references no columns at all (as in the case ofsum(1)
, for example). [1]
There's two reasonable possible ways this could work, I guess. The first would be to keep aggregations as low as possible, and have them bind to the nearest SELECT
, and this is the alternative, which is having them climb as high as possible (they can't go any higher because then they'd reference an out-of-scope column). I guess this one is more flexible since it lets you put the aggregations wherever you want, since you can make them stop by doing like, sum(a+x*0)
, or something.
I say this is "obscure," although basically anyone who has had to implement aggregations in SQL has probably run into this. It's a little wonky, though, I suspect the fact that there is subtlety or ambiguity at all here might be a fact exclusively known to people who...have had to implement aggregations in SQL.
I think the source of this weirdness is a direct consequence of, like I said, wanting aggregations to be very syntactically quiet, thus introducing some ambiguity of where they actually live.
1: Reference.