The Two Machines

                September 23, 2024

There’s a joke in my friend circle that asks “is it a database?” A startup, a program, a syscall, a person good with numbers, a person with a good memory. It’s all very “is it a sandwich.” But it really is weird that you can look at RocksDB and Snowflake and say “these are the same class of thing,” because they have very little functionality in common and exist at wildly different levels of abstraction.
As someone interested in the idea of “database” broadly speaking (as you might be, if you are reading this), this is something I’ve had to reckon with when saying what it is I’m actually interested in. “Patterns for writing to disk efficiently” and “optimizing join orders” are both of interest to me, but they are so different that I’m forced to conclude there are two different categories of “thing” here. I think of them as “fsync machines” and “join machines.”
An fsync machine is concerned primarily with things like ACID, reliability, failover, and being able to trust that the data is correct. If you're dealing with disk write throughput, you've likely got an fsync machine on your hands. If you're an fsync machine, you don't tend to care too much about complicated computational procedures outside of "shove that data into that buffer as quickly as possible." Storage engines are fsync machines. Most transactional systems are primarily fsync machines.
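To make the "fsync" part concrete, here is a minimal sketch of the smallest fsync machine imaginable: an append-only log that only returns once the operating system has been asked to flush the write to the device. The names `durable_append` and `read_records` are hypothetical, made up for this illustration. A real storage engine would also batch writes, add checksums, and fsync the containing directory when the file is first created.

```python
import os


def durable_append(path: str, record: bytes) -> None:
    """Append one record and return only after fsync has completed."""
    with open(path, "ab") as f:
        f.write(record + b"\n")
        f.flush()             # move data from Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to flush its cache to the device


def read_records(path: str) -> list[bytes]:
    """Read back every record; assumes records contain no newlines."""
    with open(path, "rb") as f:
        return [line.rstrip(b"\n") for line in f]
```

Note how little computation there is here. Nearly all of the cost sits in the `os.fsync` call, which is why a system built around it ends up caring mostly about how often and how cheaply it can make that call.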
A join machine wants to evaluate queries. It's concerned with optimizing query operators for a CPU, approximating the optimal order to evaluate joins in, and reading large quantities of data from storage (or memory) as quickly as possible. A join machine often doesn't have to think that much about durability because that's provided by some other system, be it object storage or Kafka, or because the data is just static in the first place, like a CSV you got emailed. Data warehouses are join machines. Excel is a join machine.
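As a rough illustration of the join-ordering problem, the sketch below brute-forces every left-deep join order over three tables and picks the one with the smallest estimated intermediate results. Everything in it is an assumption for the sake of the example: the table names, row counts, and selectivities are invented, and the cost model (sum of intermediate result sizes) is a toy. Real optimizers typically rely on dynamic programming or heuristics rather than trying every permutation.

```python
from itertools import permutations

# Hypothetical row counts and join selectivities, invented for illustration.
ROWS = {"users": 1_000, "orders": 100_000, "items": 1_000_000}
SELECTIVITY = {
    frozenset({"users", "orders"}): 1 / 1_000,
    frozenset({"orders", "items"}): 1 / 100_000,
}


def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep plan."""
    joined = {order[0]}
    size = ROWS[order[0]]
    cost = 0
    for table in order[1:]:
        size *= ROWS[table]
        # Apply every join predicate linking the new table to what we have.
        # With no predicate the selectivity is 1.0, i.e. a cross product.
        for other in joined:
            size *= SELECTIVITY.get(frozenset({table, other}), 1.0)
        joined.add(table)
        cost += size
    return cost


def best_order(tables):
    """Try every permutation and return the cheapest one (toy approach)."""
    return min(permutations(tables), key=plan_cost)
```

With these made-up numbers, `best_order(["users", "orders", "items"])` picks a plan that joins `users` and `orders` first, since starting with `users` and `items` would mean materializing a cross product. The number of permutations grows factorially with the number of tables, which is why real systems settle for approximating the optimal order.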
The conflation of these two things into “database” occurs because it’s so often necessary to colocate them that they wind up being sort of the same thing in effect. Plus, it’s often easy enough for an fsync machine to provide modest join machine functionality (see Postgres), or for a join machine to provide modest fsync machine functionality (see DuckDB), that it is usually coherent to view them as the same kind of object existing along a spectrum. Sometimes this spectrum is described as running from On-line Transaction Processing (OLTP) to On-line Analytical Processing (OLAP), with Hybrid Transactional/Analytical Processing (HTAP) somewhere in the middle. I think those terms work best when describing workloads rather than objects, so I think that spectrum is somewhat distinct from the “fsync machine/join machine” dichotomy.
It makes sense that these two things would orbit each other as a binary star of the word “databases”: one says you can safely get data into your system, the other says you can suitably get it out. It’s just that those are very different tasks, and they wind up developing entirely different sets of theories.
So why are there people interested in both of these things despite their lack of similarities? My guess would be that it’s just proximity. I had very little interest in databases before I started working at a database company; I was more into, like, programming language design and stuff, and saw SQL as a way to work on something roughly similar to a programming language but more useful. But being at a database company, surrounded by people who were interested in databases as a general concept (some in some aspects, some in others), got me into it.
Do not push on this model too hard—it might crumble.

NULL BITMAP by Justin Jaffray
