Tiering vs. Leveling
Last week we talked about how to think about "what is going on with LSMs." Today, we are going slightly deeper and a little bit sideways to discuss a design junction in them that I think is interesting but have not ever really seen much mention of, which is how to organize the data in the various levels of our LSM. To borrow the terminology of Sarkar et al., I want to talk about tiering vs. leveling.
First, a quick review of the basics of LSMs. We have hunks of immutable data arranged at various "levels," and when we do a read, we need to take the union of every level in order to determine the actual value of a particular key. As a note on terminology, these hunks of data are often called "sorted-string tables," or "SSTs."
Typically, the expectation is that the amount of data that lives at each level in the structure is greater than the one above it. The upper levels are constantly getting rewritten as writes come in, which is relatively cheap, since they're small, and the lower levels only get rewritten when they must, since they're big and expensive to rewrite.
In a traditional LevelDB-style LSM, there is an invariant that within a particular level (with one notable exception), none of the SSTs overlap. This means that any read must touch at most #levels
files. I think of a single-key read in an LSM as a vertical line that crosses through all the files that it has to look at:
This guarantee on bounded file-lookups is not free, and this invariant requires work and concessions to maintain. We need at least a somewhat thoughtful policy of when and where to push things down into lower levels. Pushing down a single small file that causes a larger file to be rewritten is no good, for example.
Another way I think of this is that if you were to write an expression that represented the totality of data here, it would look something like:
data = Merge(
Concat(all the files in L0),
Concat(all the files in L1),
...
)
This design is referring to by Sarkar as "leveling."
An alternative design, referred to by Sarkar as "tiering," is to forego the invariant of no overlap altogether, and instead structure our LSM more like this:
In this design, we lose the guarantee that we can bound the number of files to be read from, but in exchange, organizing the data becomes much simpler. We don't need to do any extra work to keep our data in a valid shape. When a level gets too big, we simply merge all the files in it and push that down as a file in the level below (thus, if writes are roughly uniform, the levels should grow in size roughly geometrically).
Now, the expression that represents all the data is something more simply:
Merge(all files)
But hang on, now, what are the levels even doing for us? They no longer appear to bear any semantic meaning at all. This is basically true, they don't. In full generality, they're just sort of aesthetic, now. One nice property they do have is that if the volume of writes the LSM is processing is relatively constant over time, the levels a file is in roughly corresponds to how large it is, which can suggest some nice and simple compaction strategies.
I regret to say I don't have some deep insight regarding when one of these is more appropriate than the other. I think it probably depends largely on the nature of the workload. It seems to me that write-heavy workloads probably benefit from tiering, given that we need to spend less time doing bookkeeping and reorganizing of data to maintain our invariants, and the main benefit of leveling comes at read time.
Sarkar observes that RocksDB is actually a hybrid approach: the first level (L0) of the LSM is tiered, and the rest is leveled. This guarantees that they can always flush their in-memory memtable to disk without having to rewrite any existing data. I've heard reasonable people disagree over whether this was a good idea or not, but I don't know any of the nuances of that discussion. If you have opinions I would love to hear them.
NULL BITMAP is free. If you earn a software engineering salary and think you've gotten some value out of these posts, please consider tossing a couple dollars to the Palestine Children's Relief Fund.