Building One to Throw Away
I've been thinking lately about the nature of bespoke data structures for bespoke purposes. You learn fairly early on in most programming contexts that for most tasks, the best thing to do is to use an off-the-shelf data structure, because it will probably be good enough, and it's been tuned much better than anything you could make yourself. And like, let's be real: how likely is it that your use case is unique enough that you'd benefit from a fresh data structure?
Here's a pattern that shows up a decent amount in a databases context: grow-only data structures.
The classic example of a grow-only data structure is an arena allocator: an allocation strategy that doesn't support freeing an individual allocation, only freeing the entire batch of allocations at once. The idea is that allocations become very cheap (just a pointer increment) and deallocations become nearly free: you deallocate everything at once by either resetting the allocation pointer or by freeing the whole chunk of memory.
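Here's a minimal sketch of the idea in Rust. The names are mine, and a real arena would also handle alignment and grow by chaining chunks; this just shows the core trick:

```rust
// A minimal bump-arena sketch (mine, not any particular production
// allocator): it ignores alignment and never grows, but it shows the
// core trick. Allocation is an offset bump; "freeing" resets the offset.
struct Arena {
    buf: Vec<u8>, // one big chunk of memory
    used: usize,  // the bump pointer: everything below this is allocated
}

impl Arena {
    fn with_capacity(cap: usize) -> Arena {
        Arena { buf: vec![0; cap], used: 0 }
    }

    // Hand out `n` bytes by bumping the offset: no free lists, no metadata.
    fn alloc(&mut self, n: usize) -> Option<&mut [u8]> {
        if self.used + n > self.buf.len() {
            return None; // out of space; a real arena would chain a new chunk
        }
        let start = self.used;
        self.used += n;
        Some(&mut self.buf[start..self.used])
    }

    // Deallocate *everything* at once by resetting the bump pointer.
    fn reset(&mut self) {
        self.used = 0;
    }
}
```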
Arena allocators are really nice to work with and dramatically improve the performance of code when you have the ability to use them. Of course, you have to be solving a particular kind of problem: one where you need to do a bunch of allocations that can all be freed at the same time. The classic example of this is game engines. It's commonly considered a bad idea to hold pointers across the computation of a single frame in a game, so engines often make all of their per-frame allocations in an arena, which gets cleared and reused for the next frame.
In general, we're looking for things that have a well-defined lifespan that they share with a bunch of other things.
One obvious thing in the world of databases that has this shape is an individual query. Once a query is serviced, we're done with it and all of its state. This means many allocations, like query optimizer state, can be stuffed into an arena allocator.
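As a toy illustration of that shape, reusing the Arena sketch from above (serve_query and the sizes are invented, not any real engine's API):

```rust
// Hypothetical per-query usage of the Arena above: all of the query's
// scratch state (optimizer nodes, expression trees, ...) comes out of one
// arena whose lifetime is exactly the query's lifetime.
fn serve_query(arena: &mut Arena /* reused across queries */) {
    let _optimizer_scratch = arena.alloc(4096); // cheap: just a bump
    // ... plan and execute the query out of arena-backed state ...
    arena.reset(); // the query is done; every allocation dies at once
}
```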
These sorts of lifetime things show up in other, larger ways too. When I wrote Fine! I'll Play With Skiplists, there was a comment:
Skiplist discussions that don't talk about heap effects are probably incomplete. The big limitation with skiplists for anything that isn't a "mostly-write-once" workload (something that, it should be noted, doesn't get much benefit from the atomic implementation discussed) is that the metadata record is variable sized, which does bad things to heap fragmentation and cache locality (two things you probably care deeply about if you're going so far as to junk your AVL tree for a lockless data structure) and isn't amenable to the slab-style optimizations used elsewhere.
I love skiplists for their simplicity, especially compared with the balanced trees against which they compete. But the truth is they're very one-trick-pony data structures and don't bring a lot to the table as a general choice. Their home is pretty limited.
This is a reasonable objection in general, but remember that the situation described is an LSM memtable, which, by design, gets filled up, flushed to disk, and then thrown away. This means we actually don't have to worry about anything like heap fragmentation: something that would be a real concern for a general-purpose data structure is not a problem at all for a data structure that only ever gets inserted into and read from, since it won't live for arbitrarily long.
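To make that concrete, here's a hypothetical node allocation on top of the Arena sketch from above. The layout is invented, but it shows why a variable-height tower stops being a fragmentation worry when it lives and dies with the memtable:

```rust
use std::mem::size_of;

// Sketch (layout and names mine): carving a variable-height skiplist node
// out of the Arena from above. The variable-sized metadata never fragments
// the general heap, because the entire arena is dropped at flush time.
fn alloc_node(arena: &mut Arena, height: usize, key_len: usize) -> Option<&mut [u8]> {
    // hypothetical node layout: [tower of `height` next-pointers][key bytes]
    let size = height * size_of::<usize>() + key_len;
    arena.alloc(size)
}
```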
These were the things on my mind when I was reading a fun paper, Global Hash Tables Strike Back! An Analysis of Parallel GROUP BY Aggregation (whose name, as far as I know, is a reference to a series of join ordering papers like Dynamic Programming Strikes Back), from Daniel Xue and Ryan Marcus, in which the authors revisit some fundamental assumptions about hash table design for aggregation. The problem of aggregation is to answer queries that look like:
SELECT sum(x) FROM xy GROUP BY y
which requires getting all the rows from xy that have the same value of y into the same place. The most obvious way to do this is with a hash table keyed by y: for each row in xy, look up an aggregation cell by y and add x to it.
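Sketched in Rust with toy (x, y) pairs standing in for rows of xy (types and names mine, not anything from the paper), the obvious single-threaded version is:

```rust
use std::collections::HashMap;

// The obvious single-threaded aggregation: one hash table keyed by y,
// where each cell holds a running sum of x. (Toy tuples; a real engine
// would aggregate over columns of data, not individual rows.)
fn group_by_sum(rows: &[(i64, i64)]) -> HashMap<i64, i64> {
    let mut cells: HashMap<i64, i64> = HashMap::new();
    for &(x, y) in rows {
        // look up (or create) the aggregation cell for this row's y,
        // then fold x into it
        *cells.entry(y).or_insert(0) += x;
    }
    cells
}
```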
One of the modern approaches to solving this problem has been partitioning: if you shuffle all the data based on y before you aggregate, then as long as you can guarantee that any two rows with the same value of y land in the same partition, you can avoid any kind of locking, since each partition's hash table can be owned by a single thread. This has been the basis of a lot of thread-per-core, data-parallel approaches to analytics queries, and is the basis of the exchange operator.
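A sketch of that shuffle step (names mine; real engines partition vectors of column data, not individual tuples):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Route each row by a hash of y, so every row with a given y lands in
// exactly one partition. Each partition can then be aggregated by a single
// thread using a private hash table, with no locking at all.
fn partition_by_y(rows: &[(i64, i64)], num_partitions: usize) -> Vec<Vec<(i64, i64)>> {
    let mut partitions = vec![Vec::new(); num_partitions];
    for &(x, y) in rows {
        let mut hasher = DefaultHasher::new();
        y.hash(&mut hasher);
        let p = (hasher.finish() as usize) % num_partitions;
        partitions[p].push((x, y));
    }
    partitions
}
```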
That approach has some problems, though: it's very vulnerable to skew, where some values of y have dramatically more rows than others, or maybe all of the rows share a single value, so one partition ends up with all the work. Morsel-Driven Parallelism is a modern technique that attempts to resolve this by thinking about parallelism in a different way, one that reduces the problems of skew but reintroduces an older problem: having to deal with a shared-memory hash table.
This is the context of the Xue/Marcus paper: they're discussing the design of exactly such a shared-memory hash table, which, as they observe, can exploit some properties of an aggregation query, namely that it is write-only until the very end and need not support deletions at all. That observation has some interesting consequences:
An interesting finding from the experiment results is that for our lookup and insert-only workload, even a very simple linear probing design (Folklore*) achieves excellent performance. A contributing factor to this surprising fact is that the typical downside to linear probing, deletions, is a non-issue given our workload. Thus, linear probing’s cache-conscious forward scan does not really have a downside. We conclude that to perform efficient ticketing, fancy hash tables are not required: linear probing is all you need.
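To make the probing discipline concrete, here's an insert-only linear-probing table sketched single-threaded in Rust. The paper's tables are concurrent and hash their keys properly; the fixed capacity, pre-hashed keys, and everything else here are my simplifications, not the paper's design:

```rust
// An insert/lookup-only linear-probing table, sketched single-threaded.
// With no deletions there are no tombstones, so a probe is a pure forward
// scan over adjacent slots, which is exactly the cache-friendly access
// pattern the paper is praising.
struct ProbingTable {
    slots: Vec<Option<(u64, i64)>>, // (key, running sum of x)
}

impl ProbingTable {
    // Fixed capacity for the sketch; assume it's sized so it never fills.
    fn new(capacity: usize) -> ProbingTable {
        ProbingTable { slots: vec![None; capacity] }
    }

    // Fold `x` into the cell for key `y`, creating the cell if needed.
    fn add(&mut self, y: u64, x: i64) {
        let n = self.slots.len();
        let mut i = (y as usize) % n; // pretend y is already well-hashed
        loop {
            match &mut self.slots[i] {
                Some((k, sum)) if *k == y => {
                    *sum += x; // found our group's cell
                    return;
                }
                slot @ None => {
                    *slot = Some((y, x)); // claim the first empty slot
                    return;
                }
                _ => i = (i + 1) % n, // someone else's key: scan forward
            }
        }
    }
}
```

Note that add never has to reason about tombstones or shifted entries; deletion is the thing that makes open addressing complicated, and this workload simply doesn't have it.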
The paper goes into much more detail and there's a bunch more to it, but I think the takeaway is clear: there are big wins, in both performance and simplicity, to be found when you notice the places where your state isn't going to live past some specific, predefined point in time.