A Review of Extensible Query Optimizers in Practice
My roommate and I are in a war with UPS. I came home the other day to find she’d left this note on our front door:
It would appear she also has some kind of additional beef with DHL ("an even more evil company than UPS"). I don’t know anything about that and I’m not getting involved. I think there’s some problem with the app that my landlord installed in lieu of a buzzer, which is causing UPS to think we're not home when we are.
I don’t know what she’s ordered, but I ordered, and consequently had to pay six dollars to reroute, my copy of Extensible Query Optimizers in Practice:
I was made aware of this book by being directed to its publisher's page, which has a completely bizarre pricing scheme:
I of course shrewdly googled around to find a discount code so I only paid something like $70 (plus my six dollars to prevent the UPS demons from telling me they're "sorry they missed me").
If you don't want to be quite so shrewd but want to read this yourself it turns out there is a free pdf available on the Microsoft website. It is also the topic of an as-of-writing, ongoing CMU course.
The existence of this book is exciting, if, like me, you like query planning and have ever tried to find canonical resources on it. It's even more exciting if you recognize Surajit Chaudhuri as the author of probably the best query planning paper ever written.
Before I get into each chapter: I think the book is good. If you like this stuff, or you think you might like this stuff, or you have a job-related need to learn about this stuff, you'd probably be hard-pressed to find a better way to understand the lay of the land. The book is a fantastic achievement that probably only this set of authors was capable of putting together in such a lucid way, and the world is better for its existence.
I feel compelled to say this before I get to any criticisms because while it's convenient for me to me to think of this mailing list as "my mother and twenty people I worked with at one time or another" (which is broadly true) the world of databases is small enough that there is some nonzero chance that this makes it back to the authors. This book is great. If you enjoy reading NULL BITMAP, you will enjoy reading Extensible Query Optimizers In Practice.
My high-level complaint is that I think the book suffers from trying very hard to not be about engineering. I feel that the authors are trying to avoid too many concrete details in a hope to make the material more timeless and I think the work is less good for it. And this is basically true of everything that happens when people write about query optimization because the interesting engineering problems in query optimization are not really the sort of thing you can easily write a paper about, and in academia people tend to just modify existing query planners or write hacky one-offs to test their hypotheses.
The problem is that this is intrinsically a topic about engineering. The authors even use the word "extensible" in the title to describe the kind of software they're interested in talking about. Query planners being a pattern at all is a question of engineering: I think anyone talking about query planners needs to justify, regularly and continually, proportional to their complexity, their position in the stack (and "people don't know how to or don't want to write query plans" is not a valid reason).
You can write about how to do query decorrelation, but I can look that up in a paper, and a lot of the difficulties of doing it come from fitting it into an existing framework (or more likely, designing a framework to support it) that doesn't mangle your plans in a near-unrecoverable way.
Chapter 1: Introduction
The authors introduce the problem solved by a query optimizer (finding an acceptable way to run a declarative query) and introduce the topic of the book briefly. They make a point of being clear about what they mean by extensible in the title which I think is important: we are chiefly interested in the general purpose systems that lets us optimize queries.
I love that the authors made a point of calling out System R because I think we are blessed in the query planning world by having a canonical “first system” that got it, broadly, right.
Chapter 2: Extensible Query Optimizers
This is the meat of the book. This is where the gold is: an explanation of Cascades.
I have two broad quibbles with the presentation put forward here. The first is what I said before: they are uninterested in getting into any of the gritty details around what kind of techniques actually work well for implementing and organizing the ideas they talk about (and some techniques work better than others). While reading this might qualify you to talk about query planning, and it might qualify you to do some engineering explorations, I don’t think I could sit down and write a quality implementation based only on what I learned from this.
My other quibble is that the authors did not take the opportunity to improve some of the presentation of the material. What they’ve done here is largely synthesis of existing treatments, which is of course valuable, but I think there is room for improvement in a lot of traditional explanations (for instance, making more explicit the distinction between a logical and physical property).
The focus on examples is lovely, with very detailed traces and pretty diagrams. I hate pseudocode though, at least the kind that shows up in Latex documents. I'd rather no code at all.
Chapter 3: Other Extensible Optimizers in the Industry
This chapter is an exploration of a number of systems that are similar to the archetypal system described in chapter 2. In what I think is a sort of brilliant organizational move, everything is described in relation to that platonic ideal: by spending so much care in chapter 2 to teach the reader how Cascades works, they get to explain each of these systems as a sort of delta. This is great stuff. This is also part of what I would call the heart of the book.
Chapter 4: Key Transformations
I get the feeling that when a lot of people imagine a book on query optimization they picture a big table of relational algebra identities that say "if you apply these, you're good." I get so frustrated when people talk about things in this way because that big book would be, for the most part, trivial. In volume, most query transformations are extremely obvious and only require a bit of thought to come up with.
The important ideas in query planning are:
- the general framework that applies the rules, and the data structures that undergird it, and
- the handful of actually very complex transformations that are nonetheless important.
The book up to now has focused on the former, and this chapter focuses on the latter, talking about some of the gnarliest transformations you need to implement, things like commuting aggregations and joins and decorrelation. They are bound of course by the length of the book: a whole separate volume of this length could be written on decorrelation alone.
Chapter 5: Cost Estimation
Hard to come up with much to critique here, it's a very solid tour of the world of various estimation techniques. I would like more practical coverage of how to think about cost models from the perspective of evaluation, myself.
Chapter 6: Plan Management
This is what I'm TALKING ABOUT man! We are down in the muck here. Plan caching, plan reuse, all that nasty, unsexy stuff. These are problems I've worked on and they are not fun, nobody likes dealing with them, and they are super important. This is the shit people are afraid to talk about. Good stuff.
Chapter 7: Open Problems
I'll admit I'm biased, if you couldn't already tell. I think there is so much to be done in terms of clarifying, hardening, and better understanding the problems we've already "solved" that I find "open problems" a little unexciting. That said, I understand why a book like this has to include a section like this, and it is certainly interesting.
In Conclusion
The world needed a holistic treatment of query planning. Most writing on query planning is devoid of any actual information that would ever help anyone implement a query planner. The prevailing attitude among academics is that the mechanisms and engineering questions are largely solved and from where I stand, that is not true at all. There's a common adage that says you have to read a bunch of papers on query optimization to learn about it, but most papers pay only lip service to the actual structure of an industrial query optimizer.
Extensible Query Optimizers in Practice is a breath of fresh air and was fully worth the $75.99 I paid for it in total. If not the emotional distress of dealing with UPS.