NULL BITMAP Builds a Database #1: The Log is Literally the Database
It is time to end the tyranny of people becoming interested in database implementation and building a BTree. Let us turn to the succor of immutable storage.
Today we are starting a new series. I know I have several ongoing series. But writing for different series requires different states of mind and I make the rules here.
We are going to make an LSM-based storage engine piece by piece. A really lovely thing about LSMs is how well they lend themselves to this kind of implementation: they're fundamentally a bunch of little components that fit together in intricate ways, each of which can be improved or swapped out independently, so I think it's the perfect project to build piecemeal. I think it's important to set expectations, so:
What this project won't be:
- Won't be subject to the kind of correctness and performance testing that a production-grade database should be. I am most interested in communicating the core ideas! Not to say we won't do a bit of this, but it's not my primary interest.
- Won't be on any kind of strict schedule. I will work on this, like every other topic, when I feel like it. But you are welcome to discuss with other people and work on your own implementation. If you are working on your own implementation let me know.
- Won't be a place to follow along with every line of code. You can follow along with the code at the repo if you want, since I will omit details that are not part of the problem domain. The primary intended mode of consumption is to just read the post (entertainment mode). The secondary intended mode of consumption is to implement along in your language of choice (educational mode). Using this series to learn Rust is not an intended or supported mode.
- Won't be a progression you would follow if you were actually building a database from scratch. The focus is on introducing ideas in a logical way for understanding and not for ease of engineering.
What this project will be:
- Will be a functional database at the conclusion of every issue. Even this first one. It might not be a database you would ever want to use for any purpose, but it will, technically, fulfill that role.
- Will be written in Rust. Sorry. I don't think it's a particularly good language for this sort of thing (because we are trying to talk about general ideas and not Rust-specific ideas) but I only have space for one programming language in my brain at a time.
- Will be minimal on dependencies. I can’t stand posts that set out to teach a thing and then start off by installing a million dependencies. This headache is, to some extent, unavoidable in Rust, but I will do my best to keep it to a dull throb.
With that out of the way, let's get started. We are going to build an LSM, but we will get there somewhat slowly. We're going to start from the very beginning of "what is the minimal work that a database can do and still be called a database?"
A key-value database is the most fundamental kind of database there is. Most other data models, like the beloved relational model and the reviled document model, can be built on top of it, and that is what we shall do.
The first two operations we shall support are put and get.
put(key, value) writes key into the database, mapping it to value, while get(key) returns the most recent write for a key. Distributed-systems-minded readers might take issue with the term "most recent" being used so flippantly, but we will not concern ourselves with such things. Be chill. We're single-threaded for now.
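Ignoring durability entirely, this contract is exactly that of an in-memory map: the last write for a key wins. A sketch of the behavior we're signing up for (MemDb is a made-up stand-in, not part of our database):

```rust
use std::collections::HashMap;

// A hypothetical in-memory stand-in for the put/get contract:
// get returns the most recent value written for a key, or None
// if the key was never written.
struct MemDb {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

impl MemDb {
    fn new() -> Self {
        MemDb { map: HashMap::new() }
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        // A later put for the same key overwrites the earlier one.
        self.map.insert(key.to_vec(), value.to_vec());
    }

    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }
}
```

The rest of the post is, in a sense, about making this exact behavior survive a power cut.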
We need a couple of things for now:
$ cargo add serde -F derive
$ cargo add serde_json
$ cargo add tokio -F full
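The snippets below all return Result<_, NdbError>, a type we never spell out. The definition here is my assumption of the laziest thing that works; a boxed error trait object lets ? convert both std::io::Error and serde_json::Error for free:

```rust
// A hypothetical definition for the NdbError type used throughout:
// any error type implementing std::error::Error converts into a
// Box<dyn Error> via `?`, so io and serde_json errors both work.
type NdbError = Box<dyn std::error::Error>;
```

A dedicated enum with From impls would be more idiomatic for a real library, but this keeps the noise down.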
We shall begin in a most natural place by creating a Db struct to represent our database. It holds the log we are about to define:

struct Db {
    log: Log,
}
We're going to keep this pretty low-abstraction where we can help it. No generics or anything on the datatypes in the database, just a string-to-string map. Also: we're going to be fairly cavalier with our allocations. Writing this in a more alloc-friendly way would add some conceptual overhead that I'm not interested in pulling in yet, but perhaps we will get there someday.
In order to acknowledge a put, we must write it down somewhere: doing some kind of durable write is necessary to accept a put at all. Our implementation solves this by keeping a record of every put we've ever seen in a file we only ever append to. Such a file is called a log.
use std::path::{Path, PathBuf};
use tokio::fs::{File, OpenOptions};

struct Log {
    path: PathBuf,
    log: File,
}

impl Log {
    async fn open(path: impl AsRef<Path>) -> Result<Log, NdbError> {
        let log = OpenOptions::new()
            .append(true)
            .create(true)
            .open(&path)
            .await?;
        Ok(Log {
            path: path.as_ref().to_path_buf(),
            log,
        })
    }
}
We represent a Put with this struct:
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Put {
    key: Vec<u8>,
    value: Vec<u8>,
}
Our log is going to be a series of line-delimited JSON objects:
use tokio::io::AsyncWriteExt;

impl Log {
    async fn put(&mut self, key: &[u8], value: &[u8]) -> Result<(), NdbError> {
        let put = Put {
            key: key.into(),
            value: value.into(),
        };
        let serialized = serde_json::to_string(&put)?;
        self.log.write_all(serialized.as_bytes()).await?;
        self.log.write_all(b"\n").await?;
        self.log.sync_all().await?;
        Ok(())
    }
}
Let's take a quick aside to talk about what's happening here. The way your computer writes to a file, when you call the write syscall, is that writes are stored in buffers in memory, and then at some point in the future, flushed to disk. This is good: most applications don't really care that much if the writes they do actually make it to disk in some specific timeframe, and writes to disk are much much slower than just storing things in memory, so they'll happily take the tradeoff of less durability for better latency. We, however, are a database. We care. We have a contract with our users that if we acknowledge their write, they can pull the plug on their server and the write will still be there when it comes back up.
Unix provides a syscall called fsync that forces the operating system to flush these buffers to disk. There are further complications involved with this (see fsyncgate), but we will ignore those for now. The call to sync_all bottoms out in fsync, so when it returns successfully, our write has made it to disk.
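To make the write-then-fsync flow concrete, here's the same dance written against the blocking std APIs (std's File::sync_all also bottoms out in fsync; the function name here is made up):

```rust
use std::fs::OpenOptions;
use std::io::Write;

// A sketch of a durable append: nothing may be acknowledged
// until fsync has returned successfully.
fn append_durably(path: &str, line: &[u8]) -> std::io::Result<()> {
    let mut f = OpenOptions::new().append(true).create(true).open(path)?;
    // write(2): the bytes land in OS buffers, not necessarily on disk.
    f.write_all(line)?;
    f.write_all(b"\n")?;
    // fsync(2): only once this returns can we safely tell the
    // caller their write is durable.
    f.sync_all()?;
    Ok(())
}
```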
It turns out that writing down our put in this way is not only necessary, it is also sufficient: we can implement get in terms of it by scanning the whole log to see what the final value of a given key is.
use tokio::io::{AsyncBufReadExt, BufReader};

trait Queryable {
    async fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, NdbError>;
}

impl Queryable for Log {
    async fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, NdbError> {
        let reader = File::open(&self.path).await?;
        let reader = BufReader::new(reader);
        let mut lines = reader.lines();
        let mut result = None;
        while let Some(line) = lines.next_line().await? {
            let put: Put = serde_json::from_str(&line)?;
            if put.key == key {
                result = Some(put.value);
            }
        }
        Ok(result)
    }
}
Now, we have a functional database, albeit a slow one that does a lot of unnecessary memory allocations:
impl Db {
    async fn new(db_dir: impl AsRef<Path>) -> Result<Db, NdbError> {
        if !db_dir.as_ref().exists() {
            tokio::fs::create_dir_all(&db_dir).await?;
        }
        let log = Log::open(db_dir.as_ref().join("log")).await?;
        Ok(Db { log })
    }

    async fn put(&mut self, key: &[u8], value: &[u8]) -> Result<(), NdbError> {
        self.log.put(key, value).await?;
        Ok(())
    }

    async fn get(&mut self, key: &[u8]) -> Result<Option<Vec<u8>>, NdbError> {
        Ok(self.log.get(key).await?)
    }
}
#[tokio::main]
async fn main() -> Result<(), NdbError> {
    let mut db = Db::new("db").await?;
    db.put("foo".as_bytes(), "bar".as_bytes()).await?;
    db.put("baz".as_bytes(), "qux".as_bytes()).await?;
    db.put("foo".as_bytes(), "goo".as_bytes()).await?;
    println!(
        "{:?}",
        String::from_utf8(db.get("foo".as_bytes()).await?.unwrap()).unwrap()
    );
    Ok(())
}
This prints what we expect:

"goo"
Inspecting our log file, we can see what we expect:
{"key":[102,111,111],"value":[98,97,114]}
{"key":[98,97,122],"value":[113,117,120]}
{"key":[102,111,111],"value":[103,111,111]}
And if we remove our calls to put and run our program again, we will still get the expected value.
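That restart behavior is the whole point: no in-memory state survives, so "recovering" the database is nothing more than rescanning the file. Here's the idea in miniature, with blocking I/O and a tab-separated encoding purely to keep the sketch dependency-free (our real log uses JSON):

```rust
use std::fs;

// A miniature of recovery: reconstruct the latest value for a key
// by rescanning a log of "key\tvalue" lines, exactly the way
// Log::get scans JSON lines. The file *is* the database.
fn last_value(path: &str, key: &str) -> Option<String> {
    let contents = fs::read_to_string(path).ok()?;
    let mut result = None;
    for line in contents.lines() {
        if let Some((k, v)) = line.split_once('\t') {
            if k == key {
                // Later writes for the same key win.
                result = Some(v.to_string());
            }
        }
    }
    result
}
```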
And there we have it: a functional database. This might be unsatisfying to you. There are several glaring problems with it that we will address soon:
- Our reads are O(n) in the number of writes that have ever occurred. In fact, it's O(n) disk reads. This is obviously untenable, and it doesn't take very many writes for it to become unworkably slow.
- Our log will grow without bound. If I write the same key over and over, we will never reclaim any of our space.
There are also some other obvious problems that we won't address in the near-term:
- Our serialization format is JSON, which is verbose and slow.
- We're not defensive against any kind of log corruption: if only a prefix of a write makes it to disk, our reads will simply fail when they hit the truncated line.
- Surely lots of other stuff.
We will begin to address these problems next time. We're already far past my usual allotted word count. Peace.