Databases Have a Lot of Data
A recurring theme in the ongoing coverage of Sam Bankman-Fried's trial has been his apparent recklessness. One particularly popular anecdote is his bizarre coin-flip hypothetical. From Inner City Press:
AUSA: What did the defendant [say to] you about risk?
Ellison: He said it was OK if positive EV, expected value. He said he was willing to take large coin flips - he talked about being willing to flip a coin & destroy the world, as long as a win would make it twice as good
Michael Lewis's book says the following about SBF's time at Jane Street:
A Jane Street intern had what amounted to a professional obligation to take any bet with a positive expected value.
Much has been said about like, ah, geez, man. I don't know if you really understand risk as much as your résumé suggests. Because it's really not as simple as "positive EV = good." I don't bring this up to be that guy who is always trying to talk about the FTX collapse. I mean I am that guy. I guess. But I promise I'm going somewhere.
This whole saga has had me thinking about the nature of the risk inherent in changing programs, and I think at the core, these things are very similar. I think this is an interesting angle because programmers are forced to make decisions like this quite often. Changes to code are often not Pareto improvements, and so maybe you take some kind of stance like:
- I care more about the cases that this change will improve than the cases it will make worse, or maybe
- I think in aggregate the cases this will improve are more numerous than the cases it will make worse,
or some combination of those.
But it's actually more complicated than that, because you might not care about these speedups in a linear way. A program getting x% faster in some number of cases is not necessarily balanced out by it getting x% slower in as many cases, or even by a distribution that's biased towards improvement. If you have big, world-ending coin flips, maybe you'd rather opt to keep performance more predictable, even if it would get better in expectation.
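As a toy illustration (all the numbers here are invented), consider a change whose effect on any given query is a coin flip: most queries get a bit faster, but a small fraction regress badly. The expected runtime goes down, while the tail gets much worse:

```python
import random

random.seed(0)

# Toy model: a change that makes 95% of queries 30% faster,
# but makes the remaining 5% three times slower.
BASELINE_MS = 100.0

def runtime_after_change():
    if random.random() < 0.95:
        return BASELINE_MS * 0.7   # the common case improves
    return BASELINE_MS * 3.0       # the unlucky case regresses hard

samples = sorted(runtime_after_change() for _ in range(100_000))

mean = sum(samples) / len(samples)
p99 = samples[int(0.99 * len(samples))]

print(f"baseline:         {BASELINE_MS:.0f} ms everywhere")
print(f"expected runtime: {mean:.1f} ms  (better on average)")
print(f"p99 runtime:      {p99:.0f} ms  (much worse tail)")
```

In expectation you're better off; at the 99th percentile you're three times worse off.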
This is why people are often surprised that query optimizers don't do a particular optimization. For example, Postgres is a lot less aggressive than some other databases when it comes to, say, decorrelating subqueries, and I think this is partly due to risk assessment. Decorrelating a correlated subquery moves the structure of the query plan further away from the structure of the query, increasing the risk of a nasty surprise. Because fundamentally, if your query runs faster than you expect, or than it did on the last release, you might not even notice. If it runs slower, you're sending a sternly worded letter to your vendor.
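To make "decorrelating a subquery" concrete, here's a toy Python sketch (table and column names are made up, and Python is standing in for the two query plans) of the query "find orders larger than that customer's average order", executed two ways:

```python
from collections import defaultdict

# Toy data standing in for a table; names are invented.
orders = [
    {"customer": "alice", "amount": 10},
    {"customer": "alice", "amount": 30},
    {"customer": "bob",   "amount": 5},
    {"customer": "bob",   "amount": 50},
]

# "Correlated" plan: mirrors the query's structure. For each outer row,
# re-run the subquery (a scan of the table) to get that customer's average.
def correlated():
    result = []
    for o in orders:
        amounts = [x["amount"] for x in orders if x["customer"] == o["customer"]]
        if o["amount"] > sum(amounts) / len(amounts):
            result.append(o)
    return result

# "Decorrelated" plan: compute every customer's average up front
# (a group-by), then check each row against it.
def decorrelated():
    totals = defaultdict(lambda: [0, 0])
    for o in orders:
        totals[o["customer"]][0] += o["amount"]
        totals[o["customer"]][1] += 1
    avg = {c: s / n for c, (s, n) in totals.items()}
    return [o for o in orders if o["amount"] > avg[o["customer"]]]

assert correlated() == decorrelated()
```

Both plans return the same rows, but the second one's shape no longer matches the query you wrote, and whether it's actually a win depends on estimates the optimizer has to get right.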
This is at the heart of what's different between query planners and, say, optimizing compilers. Because they're similar in a lot of ways; in fact, functionally, it's sort of hard to find a distinction. They both take an input higher-level program and output a lower-level, "more executable" program, using a lot of the same tools.
The big difference is in the possible scale of an optimization. The impact of an optimization, good or bad, is a lot smaller in a compiler than it is in a query planner. This is inherent in the fact that a query planner is shuffling around expressions that might represent huge amounts of data and computation. This means that if you get, say, 100 opportunities to make a positive expected value optimization in a compiler, then you can be fairly certain that the sum of those effects is going to be positive, because the distribution of their impact is not too wide. You have far fewer such opportunities when optimizing a SQL query, and each individual optimization you might apply has a much higher chance to be "the big one" that has a large impact on the total runtime of the query.
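Here's a hand-wavy simulation of that claim (the distributions are invented): a "compiler" gets 100 opportunities whose individual payoffs are small either way, while a "planner" gets 3 opportunities whose payoffs can swing the whole query. Both kinds of optimization are good in expectation.

```python
import random
import statistics

random.seed(0)

def total_runtime_factor(num_opportunities, payoff):
    """Multiply together each optimization's effect on runtime
    (1.0 = no change, <1.0 = faster, >1.0 = slower)."""
    total = 1.0
    for _ in range(num_opportunities):
        total *= payoff()
    return total

# Compiler-ish: 100 opportunities, each a small positive-EV nudge.
compiler = [
    total_runtime_factor(100, lambda: random.choice([0.97, 1.02]))
    for _ in range(10_000)
]

# Planner-ish: 3 opportunities, each also positive EV, but one time
# in ten an opportunity makes the query 5x slower.
planner = [
    total_runtime_factor(3, lambda: random.choice([0.5] * 9 + [5.0]))
    for _ in range(10_000)
]

for name, xs in [("compiler", compiler), ("planner", planner)]:
    p99 = sorted(xs)[int(0.99 * len(xs))]
    print(f"{name}: mean {statistics.mean(xs):.2f}, p99 {p99:.2f}")
```

The compiler's total effect concentrates tightly around its (good) expectation; the planner's expectation is also good, but its 99th percentile is a query that got roughly an order of magnitude slower.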
How do we think about this? Well, I don't know if there are actually a lot of pivot points that allow people to act on this distinction, outside of, like, being more conservative in what kinds of optimizations they do in their query planner. There are some fundamental spectra of decision-making, however.
One of my favourite papers is Towards a Robust Query Optimizer: A Principled and Practical Approach by Babcock and Chaudhuri, which summarizes one of these ideas beautifully:
Frequently, the query optimizer has a choice of multiple query plans which differ in the degree to which their execution time depends on query selectivity. An example is the choice of the access method used to retrieve records from relation R that satisfy the predicate (A = a) ∧ (B = b), where A and B are two indexed attributes of R. An index intersection plan that identifies the qualifying records based on the indexes and then retrieves just those records will perform well if the number of records to be retrieved is low. However, since the index intersection plan requires one random disk read per record, it fares poorly when the selectivity is high. The cost of a sequential scan plan, on the other hand, is essentially independent of the query selectivity.
I'll admit that I'm not entirely convinced of the ideas they put forth in this paper, but this fundamental framing of "risky choice that depends on selectivity" vs. "safe choice that doesn't depend on selectivity" has been a really good mental model for me personally when thinking about queries. I think of it as a graph that looks like this:
Where the red line is the "risky" choice, and the blue line is the "safe" choice. Of course, the difficulty of living with this graph is that in a transactional SQL workload (one that will typically involve at least some joins), almost all of the queries live at the far left. The implications of this problem are explored a bit in How Good Are Query Optimizers, Really?.
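For a rough sense of where those two lines come from, here's a back-of-the-envelope cost model for the Babcock and Chaudhuri example (the constants are invented, not taken from any real system): the index intersection plan pays a random read per qualifying row, while the sequential scan pays a streaming read per page regardless of selectivity.

```python
# Back-of-the-envelope cost model; all constants are made up.
TABLE_ROWS = 1_000_000
ROWS_PER_PAGE = 100
RANDOM_READ_COST = 4.0   # cost units per random page read
SEQ_READ_COST = 1.0      # cost units per sequential page read

def index_intersection_cost(selectivity):
    # Roughly one random read per qualifying row: the "risky" red line.
    return selectivity * TABLE_ROWS * RANDOM_READ_COST

def seq_scan_cost(selectivity):
    # Read every page no matter what: the flat "safe" blue line.
    return (TABLE_ROWS / ROWS_PER_PAGE) * SEQ_READ_COST

for sel in (0.0001, 0.001, 0.01, 0.1):
    print(
        f"selectivity {sel:>6}: "
        f"index {index_intersection_cost(sel):>9.0f}  "
        f"seqscan {seq_scan_cost(sel):>7.0f}"
    )
```

Below some selectivity the index plan wins by a lot; above it, the scan's flat cost is what protects you from a bad estimate.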
I think a lot of software has to deal with this tradeoff between predictability and performance, but the volume of data and the variety of workloads mean this is much more prominent in the world of databases than it is in some other fields.