
NULL BITMAP by Justin Jaffray

December 8, 2025

Predicate Decomposition


One of the fundamental things that query planners have to do is decompose predicates. What do I mean by that? Well, it's actually something that comes up quite often in everyday programming.

Imagine we have a database and a relation with two columns, a and b, along with two indexes: one on a, and one on b. We can write some code to simulate this.

First, we can create a dataset of as and bs:

use rand::random_range; // rand 0.9's top-level convenience function

let data: Vec<(u64, u64)> = (0..1000000)
    .map(|_| (random_range(0..10), random_range(0..1000)))
    .collect();

Then, we can write a function to create a secondary index, which will let us query a dataset on a particular column. In a real database, this would be something like a BTree, but for our purposes we can just use something simpler like a sorted array we binary search on:

struct SecondaryIndex<T: Ord> {
    data: Vec<(T, usize)>,
}

impl<T: Ord + std::fmt::Debug> SecondaryIndex<T> {
    fn new<U>(input: &[U], f: impl Fn(&U) -> T) -> Self {
        let mut data: Vec<_> = input.iter().enumerate().map(|(i, u)| (f(u), i)).collect();

        // Will sort lexicographically by the first coordinate, then the second.
        data.sort();

        SecondaryIndex { data }
    }

    fn query(&self, t: &T) -> impl Iterator<Item = usize> {
        let start = match self.data.binary_search_by(|(k, _)| match k.cmp(t) {
            // We want to make sure we get the first matching entry.
            std::cmp::Ordering::Equal => std::cmp::Ordering::Greater,
            other => other,
        }) {
            Ok(x) => x,
            Err(x) => x,
        };

        (start..)
            .take_while(|i| *i < self.data.len() && self.data[*i].0 == *t)
            .map(|i| self.data[i].1)
    }
}
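As an aside (this isn't part of the code above), the lower-bound trick in query — returning Greater on Equal so the binary search lands on the first matching entry — is exactly what the standard library's partition_point computes:

```rust
// partition_point returns the index of the first element for which the
// predicate is false, i.e. the first entry with key >= t. That's the same
// "first matching entry" position the binary_search_by trick finds.
fn lower_bound<T: Ord>(data: &[(T, usize)], t: &T) -> usize {
    data.partition_point(|(k, _)| k < t)
}

fn main() {
    let data = vec![(1u64, 0usize), (2, 1), (2, 2), (3, 3)];
    println!("{}", lower_bound(&data, &2)); // prints 1: the first entry with key 2
}
```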

Now, we can use SecondaryIndex to efficiently find records with a given value for a column:

let b_idx = SecondaryIndex::new(&data, |(_, b)| *b);

for idx in b_idx.query(&43) {
    println!("{} -> {:?}", idx, data[idx])
}

This outputs something like:

199 -> (99, 43)
1118 -> (502, 43)
1810 -> (47, 43)
4565 -> (939, 43)
4779 -> (555, 43)
5556 -> (261, 43)
5750 -> (393, 43)
6019 -> (389, 43)
6976 -> (183, 43)
...

Now we can imagine that we have two indexes, one on a and one on b:

let a_idx = SecondaryIndex::new(&data, |(a, _)| *a);
let b_idx = SecondaryIndex::new(&data, |(_, b)| *b);

And imagine further that we want to perform a query of the form a = <x> AND b = <y>.

Well, we have three options:

Option 1: scan the primary index:

let mut count = 0;
for (a, b) in &data {
    // Evaluate the predicate for each row.
    if *a == x && *b == y {
        count += 1;
    }
}

Option 2: use the secondary index on a:

let mut count = 0;
for i in a_idx.query(&x) {
    // Our use of the index has implicitly evaluated a = x for us already.
    if data[i].1 == y {
        count += 1;
    }
}

Option 3: use the secondary index on b:

let mut count = 0;
for i in b_idx.query(&y) {
    // Our use of the index has implicitly evaluated b = y for us already.
    if data[i].0 == x {
        count += 1;
    }
}

It is the query planner's job to identify these three alternatives and then decide between them. The task of identification requires us to decompose the predicate a = x AND b = y.
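The decomposition step itself is small. Here's a sketch assuming a toy predicate type (Pred and conjuncts are names I've invented for illustration, not from any real planner):

```rust
// A toy predicate AST: equality of a named column against a value, or a
// conjunction of two predicates.
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    Eq(String, u64),
    And(Box<Pred>, Box<Pred>),
}

// Flatten a predicate into its list of conjuncts. The planner can then
// consider serving any one conjunct via an index and keeping the rest
// as a residual filter.
fn conjuncts(p: &Pred) -> Vec<&Pred> {
    match p {
        Pred::And(l, r) => {
            let mut out = conjuncts(l);
            out.extend(conjuncts(r));
            out
        }
        leaf => vec![leaf],
    }
}

fn main() {
    // a = 7 AND b = 100
    let p = Pred::And(
        Box::new(Pred::Eq("a".into(), 7)),
        Box::new(Pred::Eq("b".into(), 100)),
    );
    for c in conjuncts(&p) {
        println!("{:?}", c);
    }
}
```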

One way to think about this is as algebraic transformations over a base query plan.

The initial query is:

SELECT count(*) FROM ab WHERE a = 7 AND b = 100

Option 1 might have a query plan that looks like this:

(filter
  (and (= a 7) (= b 100))
  (scan "ab" primary-index))

To "prove" that the other two plans are equivalent, we can invent some axioms for ourselves.

Axiom 1

(filter (and p q) $input)
=
(filter p (filter q $input))

and Axiom 2:

(filter (= $col $val) (scan $table))
=
(seek $table INDEX $val)
(if INDEX is over $col)

Then we can successively apply these axioms starting from our initial plan to get to our alternative query plans:

(filter
  (and (= a 7) (= b 100))
  (scan "ab" primary-index))
=> axiom 1
(filter (= a 7)
  (filter (= b 100)
    (scan "ab" primary-index)))
=> axiom 2
(filter (= a 7)
  (seek "ab" b-idx 100))

Now, I want to posit that this is actually a somewhat fundamental "trick," this idea of breaking up a complex predicate into a simpler predicate and a "seek."

It comes up a lot whenever we have some kind of index that can serve simple filters, but we need to be able to deal with more complex filters. The natural way to express queries over such a dataset is to have a homogeneous form for the more complex filters, but specialized ways of querying that can "suck" down the parts of those filters they're capable of computing efficiently. In this case, we might implement this by teaching the scan how to pull down predicates over columns that have an index, transforming the scan into a seek.
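A sketch of what that "pull down" step might look like, using names I've made up for illustration (choose_access_path and Plan are not a real planner's API): given the equality conjuncts of a predicate and the set of indexed columns, move one servable conjunct into the access path and keep the rest as a residual filter.

```rust
// The two access paths a toy planner can choose between.
#[derive(Debug, PartialEq)]
enum Plan {
    Scan,              // scan the primary index, filtering everything
    Seek(String, u64), // seek the secondary index on this column for this value
}

// Pick the first equality conjunct that an index can serve, "sucking" it
// down into a seek; everything else stays behind as a residual filter.
fn choose_access_path(
    conjuncts: &[(String, u64)], // (column, value) equality predicates
    indexed: &[&str],            // columns that have a secondary index
) -> (Plan, Vec<(String, u64)>) {
    for (i, (col, val)) in conjuncts.iter().enumerate() {
        if indexed.contains(&col.as_str()) {
            let residual = conjuncts
                .iter()
                .enumerate()
                .filter(|(j, _)| *j != i)
                .map(|(_, c)| c.clone())
                .collect();
            return (Plan::Seek(col.clone(), *val), residual);
        }
    }
    // No usable index: fall back to a full scan with the whole predicate.
    (Plan::Scan, conjuncts.to_vec())
}

fn main() {
    let conjs = vec![("a".to_string(), 7), ("b".to_string(), 100)];
    let (plan, residual) = choose_access_path(&conjs, &["b"]);
    println!("{:?}, residual = {:?}", plan, residual);
}
```

For a = 7 AND b = 100 with only b indexed, this yields a seek on the b index with a = 7 left as the residual filter — exactly Option 3 above.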

The next task is, of course, choosing which of these plans is going to be best. And there is, usually, a "best" plan: here, a takes only 10 distinct values while b takes 1000, so the b index is far more selective. Look at the runtimes for the options we've outlined here on my laptop:

Scan Primary Index : 0.30204479s
Use A Index        : 0.12665179s
Use B Index        : 0.00104525s

But this is a different job.
