Thinking about Closure and LSMs
Math, specifically algebra, has some very nice interfaces. The idea that to perform operations on a group you simply take two of its elements and bash them together to get a new one is actually pretty beautiful. Compare that to, like, pread, or something. Of course, there's more to it than that: we require that this bashing together behaves in a particular way, but whatever. Details.
SICP calls this idea "the closure property," and it's been a fundamental tool for me for ages to explain (to myself, mostly) why some implementations of an idea feel clunky and contrived and some feel clean and inspired.
It's often easier than you might think to represent an idea in a way that respects this. Say I have a notion of a "database" which is the collection of "stuff" that can be queried at any given time. It's a set of rows. In addition to that, we have "mutations": I might have a notion of an "insertion" which can add rows to this database, and a notion of a "deletion" which can remove rows.
This is a pretty natural way of defining these concepts. And it's not obviously wrong. Or wrong at all, in many contexts.
But consider what happens if we squint a little and generalize our notion of a "database" from "a set of rows" to something like "a map from rows to integers" (thus allowing for negative numbers of rows, "multiplicities"). Now all three of these ideas, a "database," an "insertion," and a "deletion," can be rolled into one: we can apply a "mutation" by simply adding all the multiplicities of two databases together pointwise, and represent insertions as databases with positive multiplicities and deletions as databases with negative multiplicities. Now our databases actually form a group.
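Here is a minimal sketch of that representation in Python (the names are mine, nothing standard): a "database" is a dict from rows to integer multiplicities, and applying a mutation is just pointwise addition.

def add(a, b):
    # pointwise addition of multiplicities; zero counts are kept for simplicity
    return {row: a.get(row, 0) + b.get(row, 0) for row in set(a) | set(b)}

db = {"alice": 1, "bob": 1}
insertion = {"carol": 1}   # positive multiplicities
deletion = {"bob": -1}     # negative multiplicities
db = add(add(db, insertion), deletion)
# {"alice": 1, "bob": 0, "carol": 1}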
This particular generalization doesn't work in all contexts, of course, like, say, if you want to do deletions predicated on the rows actually existing in the database, or something. But sometimes these limitations are fine and let us derive lots of nice optimizations, like batching, algebraically:
(((D + d_1) + ...) + d_n) = D + (d_1 + ... + d_n)
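In code, the batching trick is just folding the little deltas together before ever touching the (presumably much larger) database. A sketch, with the same pointwise add as above:

from functools import reduce

def add(a, b):
    return {row: a.get(row, 0) + b.get(row, 0) for row in set(a) | set(b)}

D = {"alice": 1, "bob": 1}
deltas = [{"carol": 1}, {"bob": -1}, {"dave": 1}]

# fold the deltas into one batched mutation, then apply it to D just once
batched = reduce(add, deltas)
assert add(D, batched) == reduce(add, deltas, D)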
Ok, so, on NULL BITMAP we are a bit skeptical regarding the application of abstract theory like this to actual everyday working programmer stuff. This might surprise you given the sort of theoretical tone I like to have these discussions in, but that's because I think theory is fun and a worthy pursuit independent of any actual applications to anything.
However! This framing, I think, if you're inclined that way, is very useful for intuition in some settings. Thinking about things this way was how I managed to really understand what is going on with LSM trees as a concept.
All write-optimized data structures are alike; each read-optimized data structure is optimized in its own way.
— shachaf (@shachaf) September 15, 2023
To process a write request, the absolute minimum that you'll have to do is write out the data in the request itself. This is also sufficient -- just append the request to a log (and possibly process it asynchronously). Many write-optimized data structures like LSM trees do this.
— shachaf (@shachaf) November 6, 2023
As Shachaf notes, if you want to process an update as quickly as possible in a database, it's both necessary and sufficient to simply write down that update somewhere like a log and then ensure you reference it later if you ever do a read. If you take this idea to its logical conclusion you end up inventing LSMs. Or at least, a simplified, idealized version of them.
That way of describing writes always sounded kind of...misleadingly simplified to me. Like, "reference it if you ever do a read" sounds very nice and clean, but in practice this is, like, a whole bunch of if-elses and ugly bookkeeping to implement what is actually a sort of complicated policy. So I guess...maybe it feels a little...disingenuous to describe it in such simple terms?
But I was wrong! If you design your "data sources" such that everything obeys the closure property and you are able to happily bash together collections of data, it really is as simple as "write it down and reference it if you ever do a read."
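To convince myself of that, here is a toy sketch (the class and its layout are my invention, nothing like a real LSM's on-disk format): writing really is just appending a hunk, and reading really is just folding over every hunk.

class TinyLSM:
    # a toy LSM: the database is D + d_1 + ... + d_n, stored as a list of hunks
    def __init__(self, base):
        self.hunks = [base]

    def write(self, delta):
        # "write it down": just append the new hunk
        self.hunks.append(delta)

    def read(self, row):
        # "reference it if you ever do a read": sum the row's
        # multiplicity across every hunk
        return sum(h.get(row, 0) for h in self.hunks)

db = TinyLSM({"alice": 1, "bob": 1})
db.write({"carol": 1})
db.write({"bob": -1})   # a deletion
assert db.read("bob") == 0
assert db.read("carol") == 1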
The way I visualize it is that the initial state of my database is just some kind of hunk of data, say, called D, in my group of databases as described before:

D
When I do a write, I write down the value I want to write:

D + d_1

and now any reads must read from D + d_1, performing the addition as part of the read.
Over time, we build up a bunch of these writes:
D + d_1 + ... + d_n
and performing this addition for each read becomes expensive. So we invent a new tool, compaction, which asynchronously performs this addition and then atomically swaps out all the little writes for one new big write:
D'_1 = d_1 + ... + d_n
D + d_1 + ... + d_n = D + D'_1
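In the toy model, compaction is a fold plus a swap (a sketch only; a real LSM does this on disk, concurrently, with actual synchronization):

from functools import reduce

def add(a, b):   # pointwise addition, as before
    return {row: a.get(row, 0) + b.get(row, 0) for row in set(a) | set(b)}

# D followed by a pile of little writes
hunks = [{"alice": 1, "bob": 1}, {"carol": 1}, {"bob": -1}, {"dave": 1}]

# asynchronously compute D'_1 = d_1 + ... + d_n...
big = reduce(add, hunks[1:])

# ...then atomically swap the little writes for the one big one: D + D'_1
hunks = [hunks[0], big]
# reads now sum two hunks instead of n + 1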
And this process continues. We have a balancing act to perform, now: if our hunks of data are too small (and therefore too numerous), we'll have to do a big complicated sum for every read. But keeping our hunks of data very large means we're doing a lot of compaction work to keep them that way.
It turns out that at least one point on the Pareto frontier of "good ways to organize this data" is to keep the hunks organized into levels whose sizes differ by some multiplicative factor. This is where we exploit another nice algebraic fact about this "group of databases" we have constructed, which is that combining them is associative:
(A + B) + C = A + (B + C)
This means that when our database is a big complex sum:
D_1 + D_2 + D_3 + D_4
and, in order to get everything into the right shape, we might like to merge D_2 and D_3, we are allowed to do that, because of associativity:

D_1 + (D_2 + D_3) + D_4

and thereby maintain the shape of database that we would like.
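In the toy model, that merge is just compaction applied to a middle slice of the hunk list (again, the names are mine):

from functools import reduce

def add(a, b):
    return {row: a.get(row, 0) + b.get(row, 0) for row in set(a) | set(b)}

def merge_adjacent(hunks, i, j):
    # collapse hunks[i:j] into one; associativity says the total sum is unchanged
    return hunks[:i] + [reduce(add, hunks[i:j])] + hunks[j:]

hunks = [{"a": 1}, {"b": 1}, {"b": -1, "c": 1}, {"d": 1}]
hunks = merge_adjacent(hunks, 1, 3)   # D_1 + (D_2 + D_3) + D_4
# [{"a": 1}, {"b": 0, "c": 1}, {"d": 1}]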
I think this is a very nice mental model, and thinking about associativity in this way makes it very clear what is and isn't possible in, say, RocksDB merge operators. And this abstract way of thinking about a database as a big opaque hunk of data is very satisfying to me, and I don't think it sacrifices all that much in terms of the fidelity of the mental model.
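To gesture at the RocksDB connection without reproducing its actual C++ API: a merge operator has to be associative precisely because compaction is free to pre-combine any adjacent run of operands early. A hypothetical counter-style operator, sketched in Python:

from functools import reduce

def merge_op(a, b):   # counter-style merge: associative, so safe
    return a + b

ops = [1, 5, -2, 3]
eager = reduce(merge_op, ops)
# compaction might combine a middle run of operands first; associativity
# guarantees the final answer comes out the same
partial = merge_op(ops[0], merge_op(reduce(merge_op, ops[1:3]), ops[3]))
assert eager == partial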
NULL BITMAP is free. If you earn a software engineering salary and think you've gotten some value out of these posts, please consider tossing a couple dollars to the Palestine Children's Relief Fund.