Multiple Returns in SQL
Podcast Appearance
I was on the Thinking About Computers podcast recently! We had a fun chat on the topic of writing technical content and research. Personally I can't stand hearing my own voice so I haven't listened to it but I enjoyed participating. They're also on podcast platforms and they have lots of good episodes!
Multiple Returns in SQL
One thing that I think is annoying about SQL is that a SQL query is forced to return only a single relation.
This isn't an original thought, Jamie Brandon brought it up in his well-known post Against SQL,
but it's been on my mind thinking about Datalog recently and I wanted to explain some of the things that this would enable.
First of all, lots of queries build up intermediate relations as part of their computations. Sometimes these intermediate relations are useful for more than one thing.
Like if I want to compute X
and Y
, and both of those require A
, it would be nice if I could write a query that:
- Computes
A
, - uses
A
to computeX
, - uses
A
to computeY
, then - returns both
X
andY
.
As opposed to writing two queries that each have to recompute A
.
That's sort of the obvious use case, that we want to reuse some work within a single query.
But I think there are a couple less obvious but more concrete examples of where this would be a useful feature.
Returning Normalized Data
It's generally agreed that we should store data within a database normalized. That is, rather than embed data directly within rows, they should contain IDs that point to some other relation. It's of course more complicated than that but that's the main idea.
This has a number of benefits: we don't have to update multiple places if the entity they refer to changes, if the entity they refer to has some large bit of data, we don't have to duplicate it, and if we don't want that data in a particular query, we don't have to scan it.
This generally doesn't cost us in terms of expressivity because we can recover the relations we "mean" by just joining them all together,
and Heath's Theorem gives us a tool to understand when exactly it is when
can do this kind of "decomposition" to get normalized data.
When returning data from a query, since we can only return a single relation, we're forced to denormalize our data set: any relations we
return have to be fully joined and fully inline all the data that they're returning.
This means that a lot of data is going to be duplicated if some rows join against the same other row.
Compression will probably alleviate the problems from this somewhat but it still means clients will have to use more memory to store the data, and if they want to recover the constituent tables they'll have to do so themselves.
Practically what this would look like generally is the ability to return lookup tables alongside interned data, which is something
APIs often like to do anyway. It would just be cool if you could do it all in one query.
Faking Sum Types
For my money, the most annoying deficiency in SQL is how abysmal modelling any kind of sum type, or fat enum, is.
People point you at a bunch of ways to do it...having exactly one of a set of FKs non-NULL, having the union of all the columns for all the variants you want to represent... I think they all suck.
The only one that I've been remotely happy with is just having a single-column JSONB
table that has a discriminating field.
But that also kind of sucks.
If you could return multiple relations, there would be another only sort-of-bad
option, which is to return a different relation for each of the variants you
want to support. This would at least let us stay "relational," as opposed to adding an entirely new language construct that represents sum types, which would probably be my actual preference.
How would it work?
A SQL query, by default, is basically an expression in relational algebra that evaluates to the result that gets returned.
I think if you were to add "multiple returns" to SQL the way you'd do it would be to modify the WITH
clause, which is SQL's version of a let
binding.
WITH x (a, b, c) AS (
SELECT * FROM abc WHERE b = 3
)
SELECT x NATURAL JOIN auv
Can be read something like
let x = SELECT * FROM abc WHERE b = 3;
return x NATURAL JOIN auv
WITH
takes a list of let bindings and then a "body" that gets evaluated with those let bindings in scope, which is the value that gets returned.
The way I'd change this is to add a new top-level only WITH RETURNING MULTIPLE
expression that doesn't have a body at the end, but has a list of the let bindings that should be returned:
WITH RETURNING MULTIPLE (x, y)
a (a, b, c) AS (SELECT ... ),
x (x) AS (SELECT ... FROM a),
y (y) AS (SELECT ... FROM a)
Which would resolve to the pair of named tables x
and y
.
This is sort of similar to the way this would work in Datalog.
Datalog is nice because it lets (forces) you to name every intermediate relation that gets constructed and isn't limited to tree-shaped queries.
a(a, b, c) :- ...
x(x) :- a(...), ...
y(y) :- a(...), ...
And then as metadata to the query you can include which relations should be exported once the query is fully evaluated.
Difficulties
This is, of course, not free. Most existing SQL engines are notably terrible with DAGgy query plans and pushing users towards writing such queries might be asking for trouble. Not to mention that although it's fairly easy to graft on new grammar to SQL that is backwards compatible, making this kind of change would generally require drivers to change their mode of interaction since they expect to just stream rows from a singular relation. Which raises other questions: how do you return rows to the client? Each relation one at a time? Or interleaved as the rows are available? Something else? Lots of questions, none of the problems being insurmountable, but certainly not devoid of design questions.