Some of My Favourite Query Planning Papers
Something I've learned about programmers is that for some reason they absolutely love being recommended papers. They lose their minds for it. So here is an incomplete list of some of my favourite papers on the topic of query planning and some brief commentary on them.
A number of these have had NULL BITMAP issues dedicated to them in the past.
This list is obviously a bit biased to my own interests: I prefer simple, even if more limited expressions of ideas, which means sometimes a more sophisticated paper will be omitted if I find it less expressive of the underlying idea. There's also certain topics I don't have much interest in, like the use of AI for cardinality estimation. I also strongly favour things that are at least shaped like survey papers, aimed at distilling a couple years of good ideas, rather than papers that present a single innovation. Anyway. Here's the list.
An Overview of Query Optimization in Relational Systems
If you have or are attempting to cultivate any kind of interest in query planning, this is 100% where you should start. There are two kinds of old papers: the ones that use outdated terminology and ideas ("XML is an exciting new technology,"-type stuff), and the ones that predate a time where readers could be expected to have vast realms of background knowledge. This one is one of the latter: near everything is well explained from first principles, and it aggregates a lot of fundamental ideas that you'd otherwise have to piece together from the literature by yourself. Absolutely a masterpiece, in my opinion.
Despite all that, I don't think this one is all that well-known, at least compared to some others on this list. It most definitely deserves to be.
How Good Are Query Optimizers, Really?
In contrast, this one is very popular. This is an evaluation of a number of optimizers in relational database (the commercial of which are unnamed, thank you DeWitt clause) on the newly-introduced Join-Order Benchmark, the point to show that there are some consistent patterns in the way that query optimizers fail to provide good plans.
How I Learned to Stop Worrying and Love Re-optimization
There's a lot going on in this paper, and I sort of think (no offense meant) that some of it is noise. I'm not that interested in the ideas of re-optimization, personally. But I think this paper provides a really compelling argument as to the hopelessness of a "perfect" query optimizer, and the need for mitigating strategies to fend off catastrophically bad query plans.
Access Path Selection in a Relational Database Management System
The OG. The classic. I absolutely love this one—what a piece of history. It's really remarkable how fully-formed a lot of the ideas that underpin query optimization were right at the outset. Pat Selinger, please let me interview you for NULL BITMAP.
Not much to say, just a really great, essential read if you care about understanding the history of the problem space. I will say that one thing that really hammered home to me some intuition about the problem space was this table:
I'd always assumed that there must be some deeper representation that we could assume of our predicates to do this kind of analysis, but seeing these things laid out here really communicated to me how much of cardinality estimation is guesswork and trying to make safe bets. Something about this page has always really spoken to me, I have wanted to get it printed out poster-size and put on my wall.
Leapfrog Triejoin: a worst-case optimal join algorithm
Jumping ahead in time, this is a more modern take on what it means to do joins. This isn't the first paper on the topic, but it is, I think, the most accessible one. There's a couple others that are also worth reading, but as far as getting a taste for the flavour of what worst case optimal joins are all about, I think this is a good one.
The impact on query planning is that in some circles, there's a belief that having these join algorithms that have good worst-case guarantees might be the path forward to freeing us from the danger of catastrophically bad binary join plans. There's still estimation work to be done to pick things like variable orderings, but the thought is that the danger of a bad decision is much less. These ideas still haven't made their way particularly into more mainstream databases but I think that could be less true in another decade or so.
hugejoins.pdf
The real title of this paper is "Adaptive Optimization of Very Large Join Queries," but they named the file hugejoins.pdf and that's the only way I can ever think to find it again.
This is a paper that attempts to answer the question "what if you had to optimize a query with several hundred joins?" The idea of doing any kind of exact algorithm is definitely infeasible here, so it's something of a tour around a handful of complementary optimization techniques.
end
I have more, but I had more to say on these than I thought, so I will save the rest of the list for another time. If you have any other papers you think belong in my personal canon please send them my way!