What if SQL Was Informed by LSMs?
Programming Note: Feedback Control Study Group
I'm hosting what I am calling a "study group" about feedback control, loosely backed by Feedback Control for Computer Systems: Introducing Control Theory to Enterprise Programmers. I will post my thoughts on the designated chapters on scheduled weeks and I would like you (yes you) to email* me with your thoughts on those chapters too so I can either include or link to them (and I love linking to blogs with one post). For more real-time discussion there is now a NULL BITMAP Discord server.
I don't want to call it a "book club" although it will proceed with discussion of the book. I'm more interested in it being a space for a bunch of people interested in studying this topic to have other people to bounce ideas off of and discuss simulations and thoughts.
To give people some time to acquire the book (I found a beat up copy on thriftbooks for fifteen bucks) I will be posting about the first two chapters on March 3rd. Please send in your own commentary before then. The next two chapters will be discussed two weeks later.
The kind of commentary I'm interested in is things like:
- if this jogs any kinds of problems you've encountered in the past and how you solved those,
- if you ran his simulations (which I encourage!), whether you could reproduce his results, and any variations you tried and how they worked,
- visualizations you did and if they gave you any insight or improved intuition,
- other problems you invented for yourself to simulate.
Or anything else that seems appropriate. It's an open invitation.
* firstname.lastname at gmail, or if you're an email subscriber you can also reply to this email.
SQL and LSMs
I was linked to a database project this week. I’m not particularly informed on it, although by default I’m:
- bullish on more Calvin-descended databases, and
- bearish on such databases being SQL-based (a topic for another time).
They have one cool feature in particular (taken from this docs page):
--
-- using generated stored and resolved columns to
-- group inserts by last 5 seconds per device_id and aggregate hits
--
create table example (
    time timestamp as ( current_timestamp::date_bin(interval '5 sec') ) stored,
    device_id int,
    hits int default 1 as ( hits + 1 ) resolved,
    primary key(time, device_id)
);
insert into example (device_id) values (1);
insert into example (device_id) values (1);
insert into example (device_id) values (1);
insert into example (device_id) values (1);
select * from example;
[["2024-12-11 17:03:30+02", 1, 4]]
This gives the ability to resolve primary key conflicts by computing a function of the existing row instead of failing with an error.
You can do something like this on the query side in Postgres via ON CONFLICT DO UPDATE, but I actually think embedding this logic into the schema is much more interesting.
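For comparison, here is a rough sketch of what the query-side version might look like in Postgres (the shape mirrors the example above; date_bin requires Postgres 14+, and the exact names are just for illustration):

create table example (
    time timestamptz,
    device_id int,
    hits int default 1,
    primary key (time, device_id)
);

-- the resolution logic has to be restated by every writer,
-- rather than living in the schema
insert into example (time, device_id)
values (date_bin(interval '5 sec', now(), timestamptz 'epoch'), 1)
on conflict (time, device_id)
do update set hits = example.hits + 1;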
I think there are two ways to view what this is doing.
The first is operationally, which is how it's expressed in Amelie. You can think of this as doing a read-modify-write (RMW) on every insert. Every insert is turned into:
read the old value of the row with that primary key
if it didn't exist already:
    just insert it
else:
    compute any resolved columns and update with the new value
This is a good way to present the feature. It's clear what's happening and it has clear analogues to things people do with imperative programming languages.
Another way to look at this feature is that it performs an implicit aggregation. Every insert creates a whole new row, and performing a read does an aggregation over all the rows sharing a primary key. In this way, it looks more like this:
create table example (
    time timestamp as ( current_timestamp::date_bin(interval '5 sec') ) stored,
    device_id int,
    hits int default 1,
    primary key(time, device_id)
);
insert into example (device_id) values (1);
insert into example (device_id) values (1);
insert into example (device_id) values (1);
insert into example (device_id) values (1);
-- select * from example;
-- A read like this above actually gets executed as:
select time, device_id, sum(hits) from example group by time, device_id;
One thing that's interesting about the latter framing is that it's a bit less prescriptive about what's actually going on under the hood. We could, in theory, represent the actual table however we wanted, and if we had a mechanism to amortize actually performing those aggregations we could get some nice performance benefits along a couple of axes.
A Mechanism to Amortize Actually Performing Those Aggregations
Recall the way an LSM works: when we do a write of, say, x = 5, we simply append to our database the operation "set x to 5." Then, when we go to read the value of x, we traverse the database in reverse order to see what the last operation that set x was. We then periodically compact segments of the log into index files to make this traversal easier.
The cool thing is that if we accumulate a bunch of such operations, we can clean them up as we do compaction. If we start with
[(x = 1), (x = 2), (x = 3), (x = 4)] -> a read gives x = 4
and we compact the first three elements of this, we can elide the x = 1 and x = 2 commands because they are shadowed by the x = 3 command:
[(x = 3), (x = 4)]
Now, one cool trick we often pull with LSMs is to observe that we can actually be more general than "set/get" operations: subject to some restrictions, any way of "combining" values works. Say, instead of our commands being "set x to y," they were "increment x." Then, when we do a read, instead of stepping backwards until we find a write for the key x, we find all the entries that touch x and combine them to get the true value.
[(add x 1) (add x 1) (add x 1) (add x 1)] -> a read gives x = 4
This is nicer than writing something like UPDATE x SET x = x + 1 in SQL, because the latter requires an RMW: we have to fetch the old value of x, increment it, then write out the new value of x. In contrast, the LSM way allows us to do a blind write: we can increment the value without actually reading the old value. You might think of this as evaluating the increment lazily. We don't actually compute the incremented value until we go to read it (in that sense, this is a form of write-optimization). RocksDB (and many other LSMs) expose this kind of behaviour via merge operators.
It's remarkably hard to convince a SQL database to do something logically equivalent to a blind write since in most cases it tries to make sure you haven't violated any kind of uniqueness constraint. You'd need to do some kind of careful upsert that doesn't read the old values at all (but even then I'm not sure whether many SQL databases take advantage of that).
It seems to me the Amelie approach to this doesn't go far enough: to actually realize the benefit of this in a tangible way (i.e., enabling blind writes) you'd need to make sure that the operation that merges two values is associative: (ab)c = a(bc). Not just any old function works. The reason is that if your LSM is arranged like so:
File1 File2 File3 File4
You might want to perform compaction in any order, and merge either File1 with File2, File2 with File3, or File3 with File4, depending on the needs of your LSM structure at that moment.
Another benefit to having blind writes be an option in a transactional workload is that they permit more serializable transaction histories than read-modify-writes do. Consider two transactions T1 and T2 that otherwise don't conflict but both write to a row r. If they both read the original value of r to update it, then they will conflict and one must abort. If they each do a blind write to r, then the database is free to reorder them.
The Point
I've only seen one execution of this idea, which is in ClickHouse (which I was linked to while musing about this).
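As a sketch of what I have in mind (assuming the relevant feature is something in the family of ClickHouse's SummingMergeTree engines; the table and column names are just for illustration), the aggregation is declared in the schema and then actually carried out incrementally as parts get merged in the background:

create table example (
    time DateTime,
    device_id UInt32,
    hits UInt64
) engine = SummingMergeTree()
order by (time, device_id);

-- inserts are blind appends; rows sharing a sorting key get summed
-- whenever the engine merges parts. Until that happens a key can
-- still appear in several parts, so reads aggregate explicitly:
select time, device_id, sum(hits) as hits
from example
group by time, device_id;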
My big picture point with this whole observation is that there are not that many features in SQL that seem informed by enabling blind writes in this way. Any? I think it's pretty clear that the evolution of SQL was mostly driven by B-tree based storage engines where there's not much cost to doing a read prior to a write.
A very coarse way of describing the history is that SQL has been largely informed by B-trees and NoSQL has been largely informed by LSMs. I'm sure someone can point me to a NoSQL database that takes advantage of this in this way, but I think there's a ripe space for innovation to think about how to embed this kind of thing into relational languages.
From the comments:
- "TokuDB supported blind writes. The SQL standard has yet to support this because the modified row count can't be returned in such a case." https://www.percona.com/blog/fast-updates-with-tokudb
- "Long ago (200x) there was a DBMS startup and one of their primary features was commutative writes for high-throughput OLTP. They were in the Bay Area but I can't remember their name."