NULL BITMAP by Justin Jaffray logo

NULL BITMAP by Justin Jaffray

Archives
Subscribe
June 29, 2026

There are Databases Everywhere for Those with the Eyes to See

NULL BITMAP.png

Programming note; this is the 150th issue of NULL BITMAP, and NULL BITMAP will be going on a hiatus for a while, maybe just for the Summer, maybe indefinitely. I think I need a bit of a break from it and I don't think the weekly cadence is serving me like it once did.

I have been poking around the Jujutsu repo recently, and I was very pleased to realize that this thing has a whole query language and query planner baked inside of it. This was expected, I guess, if you spent five seconds interacting with the revset language, which the docs call a "functional language," but is clearly really more of a query language.

For example (lifted from their docs), if you have this history:

  D
  |\
  B C
  |/
  A
  |
root()

You can use the revset language to express queries like "find me commits which are parents of D and ancestors of A (inclusive)," via A::D, and get out {A, B, C, D}.

And these operations compose: they all take revsets as input (which are "revision sets," or, sets of commits) and emit revsets as output.

So, while this sort of looks like a cute DSL for specifying commits in a version control system, what you have is actually a fully general query language over DAGs, where the fundamental operations are traversing edges and taking upper and lower closures. And inspecting the source gives some clues as to how this thing is constructed, for example the IR gives some good insight into how this thing is represented internally.

They do some classic database tricks to rewrite what look like an expensive set intersection into a filter.

In the semantics of the JJ language, for instance, author("Justin Jaffray") is logically the set of all commits authored by Justin Jaffray, and visible_heads() & author("Justin Jaffray") is the set of visible heads (visible commits without any children) authored by Justin Jaffray (& is the set intersection operator). If you've done any query optimization, alarm bells might be going off in your head: even though logically, we're creating the entire set of commits having a given author, this kind of intersection can be done more efficiently as a filter: we'd probably want to execute it like visible_head().filter(author == "Justin Jaffray").

And that is what Jujutsu does, if we look into their IR, they turn members of such an intersection into a wrapper AsFilter, and rewrite expressions involving them into filters:

RevsetExpression::Intersection(expression1, expression2) => {
    match expression2.as_ref() {
        RevsetExpression::Filter(_) | RevsetExpression::AsFilter(_) => {
            ResolvedExpression::FilterWithin {
                candidates: self.resolve(expression1).into(),
                predicate: self.resolve_predicate(expression2),
            }
        }
        _ => ResolvedExpression::Intersection(
            self.resolve(expression1).into(),
            self.resolve(expression2).into(),
        ),
    }
}

This is a classic kind of database optimization. It makes use of an important relational model concept, which is the equivalence between filters, intersections, and joins.

My usual way to explore these kinds of ideas is to re-implement them a bit on my own, so I made my own little implementation of this query language. It ignores many of the problems inherent in this, such as having to provide efficient access over a version control repo from short-lived processes. I really just wanted to come up with a few rewrite rules, because that sounded fun.

My little IR is much smaller than theirs, because it doesn't have many features:

pub enum Expr {
    All,
    Constant(Vec<Vx>),
    Up {
        input: Box<Expr>,
        // Only include results if they appeared in the lo-th round.
        lo: usize,
        // Don't include results appearing later than the hi-th round.
        hi: Option<usize>,
    },
    Down {
        input: Box<Expr>,
        // Only include results if they appeared in the lo-th round.
        lo: usize,
        // Don't include results appearing later than the hi-th round.
        hi: Option<usize>,
    },
    Range {
        lo: Box<Expr>,
        hi: Box<Expr>,
    },
    Union(Vec<Expr>),
    Intersection(Vec<Expr>),
    Filter {
        input: Box<Expr>,
        label: String,
        value: String,
    },
}

Then, my representation of data is simply an ordered list of vertex IDs, with additional relations for annotations like author names.

Now, let's say we want to filter for a particular author in some set of commits, like we had before:

Given the query c:: & author("Justin Jaffray"), meaning, the upper set of c which was authored by "Justin Jaffray," a natural translation of the query might be something like this:

let query = Expr::Constant(vec![c])
    .up(0, None)
    .intersection(Expr::All.filter("author", "Justin Jaffray"));

Of course we'd rather perform that filter directly on the upper set, rather than going through the Expr::All. So we can do a fairly normal rewrite here:

// In intersection's optimize
if self.rules.intersection_to_filter {
    if inputs.iter().any(|input| {
        matches!(
            input,
            Expr::Filter { input, .. }
                if matches!(input.as_ref(), Expr::All)
        )
    }) {
        let (hoisted_filters, base_inputs): (Vec<_>, Vec<_>) =
            inputs.into_iter().partition(|input| {
                matches!(
                    input,
                    Expr::Filter { input, .. }
                        if matches!(input.as_ref(), Expr::All)
                )
            });

        let mut expr = Expr::Intersection(base_inputs).optimize(self);

        for filter in hoisted_filters {
            let Expr::Filter { label, value, .. } = filter else {
                unreachable!();
            };
            expr = expr.filter(label, value);
        }

        return expr;
    }
}

Now, if we print out the resulting query plans, we can see the improvement from the unoptimized to the optimized version:

Unoptimized:

intersection(
  constant([Vx(3)])
    .up(0, *),
  all()
    .filter("author", "Justin Jaffray"),
)

Optimized:

constant([Vx(3)])
  .up(0, *)
  .filter("author", "Justin Jaffray")

In my implementation right now, a filter is still just an intersection, but the core idea still simplifies the query tree.

It's always fun to be poking around a piece of software and recognize: hey, there's a query planner here. Let's go take a look at that bad boy.

Don't miss what's next. Subscribe to NULL BITMAP by Justin Jaffray:

Add a comment:

You're not signed in. Posting this comment will subscribe you to this newsletter with the email address you enter below.
GitHub
justinjaffray.com
Bluesky
Twitter
Powered by Buttondown, the easiest way to start and grow your newsletter.