Lightweight Cardinality Estimation with Density

Last week we talked about the different ways we can decompose a predicate that is a conjunction of two simpler predicates.
Given the query:
SELECT * FROM ab WHERE a = 7 AND b = 100
We can:
- Not decompose the filter at all, and scan our base data applying the predicate a = 7 AND b = 100,
- push down the a = 7 filter and translate the scan over the primary index into a scan on a secondary index, then apply the b = 100 filter on the results of that, or
- do the same with the b = 100 filter and use a = 7 as our residual filter.
Algebraically these options might look something like:
Option 1
(filter (and (= a 7) (= b 100)) (scan ab PRIMARY))
Option 2
(filter (= b 100) (seek ab A_IDX 7))
Option 3
(filter (= a 7) (seek ab B_IDX 100))
Recall from last week that we generated the data like this:
use rand::random_range; // assuming the rand crate (0.9+), where random_range is a free function

// One million rows: a drawn uniformly from 0..10, b drawn uniformly from 0..1000.
let data: Vec<(u64, u64)> = (0..1000000)
    .map(|_| (random_range(0..10), random_range(0..1000)))
    .collect();
And each of the access methods had different execution times:
Scan Primary Index : 0.302s
Use A Index : 0.126s
Use B Index : 0.001s
Why the large difference between A and B? If we take a look at how many rows each query plan actually examines, we'll see that the A Index plan looks at about 100000 rows, while the B Index plan only looks at about 1000.
The reason is that there are simply more rows with a=7 than with b=100, so the initial filter is less selective when we use the a index first, while using the b index lets us make better use of seeks.
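One quick way to verify this is to count how many rows match each pushed-down predicate in the generated data. Here's a small sketch reusing the data-generation snippet from above:

```rust
use rand::random_range;

fn main() {
    // Same data as above: a uniform over 10 values, b uniform over 1000 values.
    let data: Vec<(u64, u64)> = (0..1_000_000)
        .map(|_| (random_range(0..10), random_range(0..1000)))
        .collect();

    // The rows matching each pushed-down predicate are the rows the
    // corresponding index plan has to examine before the residual filter runs.
    let rows_a_eq_7 = data.iter().filter(|(a, _)| *a == 7).count();
    let rows_b_eq_100 = data.iter().filter(|(_, b)| *b == 100).count();

    println!("rows with a = 7:   {rows_a_eq_7}"); // roughly 100,000
    println!("rows with b = 100: {rows_b_eq_100}"); // roughly 1,000
}
```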
Now, if you were to write this program by hand, you'd probably know this about your data. You'd know, say, that there are fewer people with a particular first name than there are people who don't have a pet. And so if you had an index on those two properties, you'd know, intuitively, that you should seek in your first_name index rather than your has_pet index. A computer program like a query planner doesn't have the luxury of this intuitive domain knowledge of its dataset, though, and in many ways the job of a query planner is fundamentally to approximate such domain knowledge. So we need to have analytical tools at our disposal that will let us make such determinations cheaply.
One such tool is the number of distinct values that the column has. This statistic is fundamental, and goes all the way back to Patricia Selinger's seminal paper on query optimization, which called it ICARD:

Nowadays it's common to instead talk about the reciprocal of this value, 1/ICARD, as the density of a set of columns in a relation.
From Extensible Query Optimizers in Practice:

The density of a set of columns tells us a couple interesting things.
First, it gives us a guess at how selective an equality predicate over that set of columns is going to be. If our data is roughly uniform (which, admittedly, it seldom will be), the density of a set of columns is itself an estimate of the selectivity of any equality predicate over those columns, and we can estimate the number of rows a selection returns as num_rows * density.
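As a minimal sketch, assuming we know the distinct-value counts from the generated data above (10 for a, 1000 for b):

```rust
fn main() {
    let num_rows = 1_000_000.0_f64;

    // density = 1 / number of distinct values
    let density_a = 1.0 / 10.0;   // a has 10 distinct values
    let density_b = 1.0 / 1000.0; // b has 1000 distinct values

    // Under the uniformity assumption, an equality predicate is expected to
    // keep num_rows * density rows.
    let est_a_eq_7 = num_rows * density_a;   // ~100,000
    let est_b_eq_100 = num_rows * density_b; // ~1,000

    println!("estimated rows for a = 7:   {est_a_eq_7}");
    println!("estimated rows for b = 100: {est_b_eq_100}");
}
```

These estimates line up with the row counts we saw the two index plans examining earlier, which is exactly the signal the planner needs in order to prefer the B index.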
Next, it gives us a limited view of the correlation between columns. If the density of (a, b) is sufficiently higher than the product of the densities of a and b, then there's clearly some correlation between the two columns.
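Here's a sketch of that check, using hypothetical densities (not measured from the data above). If a and b were independent, we'd expect density(a, b) to be roughly density(a) * density(b), so a ratio well above 1 suggests correlation:

```rust
fn main() {
    // Hypothetical single-column densities and a measured multi-column density.
    let density_a = 1.0 / 10.0;    // 10 distinct values of a
    let density_b = 1.0 / 1000.0;  // 1000 distinct values of b
    let density_ab = 1.0 / 2000.0; // only 2000 distinct (a, b) pairs observed

    // Independent columns would give roughly 10 * 1000 = 10,000 distinct pairs,
    // i.e. density_ab ~= density_a * density_b = 1/10,000.
    let independence_ratio = density_ab / (density_a * density_b);

    if independence_ratio > 1.5 {
        println!("a and b look correlated (ratio = {independence_ratio})");
    } else {
        println!("no strong evidence of correlation (ratio = {independence_ratio})");
    }
}
```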
Both of these views are limited, though: there are many questions we can't answer with density alone. In fact there are only a few we can answer; it's just that density is very powerful relative to how lightweight it is.
Density is very easy to track and to compute over large amounts of data. There are many algorithms for computing the number of distinct values in a set, and it can even be done incrementally with something like HyperLogLog. Of course, in full generality we'd like to keep histograms or samples over our data, which let us estimate the selectivity of more complicated predicates like inequalities, but these are more expensive and it can be difficult to extend them to multiple dimensions effectively. For its cost, density is a very valuable tool, and it's much more practical to track it for many sets of columns than it is to keep histograms across all of them.
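As a rough sketch of what tracking this looks like, here's an exact computation with a HashSet, purely for illustration; a real system would likely use sampling or an approximate, mergeable sketch like HyperLogLog so the counts can be maintained incrementally:

```rust
use std::collections::HashSet;

// density = 1 / (number of distinct values). Computed exactly with a HashSet
// here; an approximate sketch (e.g. HyperLogLog) would let this be maintained
// incrementally instead of recomputed from scratch.
fn density<T: std::hash::Hash + Eq>(values: impl IntoIterator<Item = T>) -> f64 {
    let distinct: HashSet<T> = values.into_iter().collect();
    1.0 / distinct.len() as f64
}

fn main() {
    // Tiny example table of (a, b) pairs.
    let data: Vec<(u64, u64)> = vec![(7, 100), (7, 200), (3, 100), (7, 100)];

    let density_a = density(data.iter().map(|(a, _)| *a)); // 2 distinct values -> 0.5
    let density_b = density(data.iter().map(|(_, b)| *b)); // 2 distinct values -> 0.5
    let density_ab = density(data.iter().copied());        // 3 distinct pairs  -> ~0.33

    println!("density(a) = {density_a}");
    println!("density(b) = {density_b}");
    println!("density(a, b) = {density_ab}");
}
```

The same function works for a single column or for a tuple of columns, which is part of what makes it practical to keep densities for many column sets.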