NULL BITMAP by Justin Jaffray
May 11, 2026

Has Merge Join's Time Gone?

I have long thought of the hash join/merge join duopoly as reflective of the fundamental truth that basically all data-processing algorithms are based on either hashing or sorting. Yes, we also have nested loop join, but that's sort of just hash join in disguise.

Aggregation? Your options are sorting and grouping, or hash bucketing. Deduplication? Same deal. Union? Intersection? Difference? Same thing. And to me at least, the two halves of this dichotomy feel genuinely orthogonal: the way that you hash a piece of data bears no real resemblance to the way that you put a total ordering over it; one doesn't follow from the other.
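To make the dichotomy concrete, here's a minimal sketch of deduplication done both ways, over integer rows for simplicity (toy code, not any engine's actual operator):

```rust
use std::collections::HashSet;

// Sort-based dedup: impose a total order, then drop adjacent repeats.
fn dedup_by_sorting(mut rows: Vec<i64>) -> Vec<i64> {
    rows.sort_unstable();
    rows.dedup();
    rows
}

// Hash-based dedup: bucket by hash, keeping the first row seen per value.
fn dedup_by_hashing(rows: Vec<i64>) -> Vec<i64> {
    let mut seen = HashSet::new();
    rows.into_iter().filter(|r| seen.insert(*r)).collect()
}

fn main() {
    let rows = vec![3, 1, 3, 2, 1];
    assert_eq!(dedup_by_sorting(rows.clone()), vec![1, 2, 3]);
    assert_eq!(dedup_by_hashing(rows).len(), 3);
}
```

Note that neither implementation shares anything with the other: the sort-based one needs `Ord`, the hash-based one needs `Hash`, and having one capability buys you nothing toward the other.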

Historically, the holy trinity of join algorithms has been hash join, merge join, and nested loop join. You use merge join when you can lean heavily on the existing sortedness in your database, nested loop join when one side of your join is much larger than the other and indexed (index lookups are ~O(1), so runtime scales mostly with the small side), and hash join when you have no other choice.
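For reference, here's a simplified, single-threaded sketch of the first two, as equi-joins on integer keys (illustrative only, not any real engine's implementation):

```rust
use std::collections::HashMap;

// Merge join: assumes BOTH inputs are already sorted on the join key.
// Walk the two streams in lockstep, emitting the cross product of each
// pair of equal-key runs.
fn merge_join(left: &[(i64, &str)], right: &[(i64, &str)]) -> Vec<(i64, String, String)> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < left.len() && j < right.len() {
        let key = left[i].0;
        if key < right[j].0 {
            i += 1;
        } else if key > right[j].0 {
            j += 1;
        } else {
            let i_end = i + left[i..].iter().take_while(|r| r.0 == key).count();
            let j_end = j + right[j..].iter().take_while(|r| r.0 == key).count();
            for &(_, lv) in &left[i..i_end] {
                for &(_, rv) in &right[j..j_end] {
                    out.push((key, lv.to_string(), rv.to_string()));
                }
            }
            i = i_end;
            j = j_end;
        }
    }
    out
}

// Hash join: no ordering assumptions at all. Build a table on one side,
// probe it with the other.
fn hash_join(left: &[(i64, &str)], right: &[(i64, &str)]) -> Vec<(i64, String, String)> {
    let mut table: HashMap<i64, Vec<&str>> = HashMap::new();
    for &(k, v) in left {
        table.entry(k).or_default().push(v);
    }
    let mut out = Vec::new();
    for &(k, rv) in right {
        if let Some(lvs) = table.get(&k) {
            for &lv in lvs {
                out.push((k, lv.to_string(), rv.to_string()));
            }
        }
    }
    out
}

fn main() {
    let left = [(1, "a"), (2, "b"), (2, "c")];
    let mut right = [(2, "x"), (1, "y")];
    let hashed = hash_join(&left, &right);
    right.sort(); // merge join needs sorted inputs; hash join does not
    assert_eq!(merge_join(&left, &right).len(), hashed.len());
}
```

The asymmetry is the whole story: merge join is cheap only if the sortedness is already there, while hash join pays its build cost no matter what, but asks nothing of its inputs.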

Increasingly, though, we've been seeing merge join fall out of favour. Many modern, carefully considered databases don't even ship with an equi-merge join operator; DuckDB (an analytical database) and CedarDB (a transactional database) are two notable examples.

One of the reasons we as a field have historically been able to make good use of merge joins is that databases have long been very closed systems. Not in the sense of "closed source," but in the sense of "I have extremely tight encapsulation boundaries over my data, and my blessed APIs are the only way that you may interact with me." This monolithic view of a database meant that algorithms and interfaces could be designed very explicitly to be deeply intermeshed. Postgres is like this: it has the final say over how all its data gets inserted, and so its architecture can be designed around having consistent ordered access to that data.

Compare this to the requirements of the modern analytical database: DuckDB supports a vast array of file formats, including local files, unstructured blob storage, and more explicit data lakes. And it has to support all of these things in the same query. Once you need to query data that was created outside your walled garden, you very quickly lose the ability to enforce the kind of structure that makes things like merge join easy to get "for free." Once the universal interface you have is simply "a stream of unordered rows," hashing starts to be the only option.

Hash-based algorithms also parallelize much better than sort-based ones. Merge join, requiring sorted data, is hard to parallelize cleanly, since there's fundamentally a single coordination point at the head of the streams.
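Here's a toy illustration of why hashing parallelizes so cleanly: partition both inputs by a hash of the join key, and each partition can be joined by an independent worker with no coordination at all. (A sketch under simplifying assumptions, not any particular engine's scheme.)

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::thread;

const WORKERS: u64 = 4;

// Route a join key to a partition; equal keys always land together.
fn partition_of(key: i64) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() % WORKERS) as usize
}

fn main() {
    // Hypothetical inputs: (join_key, payload) pairs in no particular order.
    let left: Vec<(i64, i64)> = (0..1000).map(|v| (v % 100, v)).collect();
    let right: Vec<(i64, i64)> = (0..500).map(|v| (v % 100, v)).collect();

    // Phase 1: scatter both inputs into per-worker partitions by hash(key).
    let mut left_parts: Vec<Vec<(i64, i64)>> = vec![Vec::new(); WORKERS as usize];
    let mut right_parts: Vec<Vec<(i64, i64)>> = vec![Vec::new(); WORKERS as usize];
    for row in left {
        left_parts[partition_of(row.0)].push(row);
    }
    for row in right {
        right_parts[partition_of(row.0)].push(row);
    }

    // Phase 2: each worker runs an ordinary hash join on its own partition,
    // with no communication between workers.
    let handles: Vec<_> = left_parts
        .into_iter()
        .zip(right_parts)
        .map(|(l, r)| {
            thread::spawn(move || {
                let mut table: HashMap<i64, Vec<i64>> = HashMap::new();
                for (k, v) in l {
                    table.entry(k).or_default().push(v);
                }
                let mut matches = 0usize;
                for (k, _) in r {
                    matches += table.get(&k).map_or(0, |vs| vs.len());
                }
                matches
            })
        })
        .collect();

    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{total} matches across {WORKERS} workers");
}
```

You can parallelize a merge join by range-partitioning the sorted streams instead, but picking good ranges requires a coordinated look at the data distribution first; hash partitioning needs to know nothing.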

The calculus has simply changed here. Twenty years ago:

  • Memory was much more expensive (even with, you know...), so spending extra time to save a bit of memory was an okay tradeoff.
  • Beyond price, memory is also simply less valuable now: it used to be that disks were so slow that spilling your hash table to disk would grind things to a complete halt. That's still undesirable, but disks (and the caching algorithms around them) are fast enough that it's no longer the end of the world.
  • Hashing algorithms were much lower quality. The field of non-cryptographic hashing was basically born and came to maturity in the last two decades.
  • Research on parallel hash tables was much less mature. They're just a lot better now.

Now compare the developments that have improved hashing-based algorithms to those that have improved merge-based ones. Sure, sorting has gotten better, perhaps, but it was rare that we sorted our data just so we could merge it; we wanted to merge data we already had sorted access to.

Tracking sortedness has also historically placed a large burden on query planners. Tracking interesting orders efficiently was a large part of why Cascades became so popular, despite its deficiencies with some other paradigms, like DAG optimization. In an ordering-agnostic future, we can give up a lot of the infrastructure that we've relied on to handle those things well.

If I were building a new query engine today, I'd think long and hard about the set of primitive operators I wanted to support, and about the constraints that providing them puts on the rest of the system. Maybe the conclusion is that merge join is no longer worth it in a lot of cases.
