
NULL BITMAP by Justin Jaffray

June 8, 2026

SmithDB


LangChain recently posted about a database they built, SmithDB. I liked the post quite a bit: I thought it was well written and did a really good job of explaining their architecture. It highlighted for me some of the interesting database challenges and workloads that are consequences of AI.

This is an "observability database," which sits sort of outside the traditional OLTP/OLAP dichotomy but leans a bit toward the OLAP side. It exists to collect data from a bunch of different sources (in LangChain's case, agents performing long-running tasks) and make it queryable as quickly as possible. This isn't a "source of truth" transactional database with isolation or particularly strong consistency requirements; it's more "here's a firehose of event data, and people want to query it right away."

This need for low latency causes problems for a lot of traditional analytics pipelines, which generally assume that users are only really interested in aggregate data, and that it's fine for that data to be anywhere from a couple of minutes to hours late. For observability data, users might be watching the data come in live as they run a job, so waiting several hours to see those events is a non-starter.

Generally, if you're being pragmatic, the canonical way you'll go about building any sort of analytics database today is to lean on off-the-shelf parts as aggressively as you can. This mostly means leaning on DataFusion as your query engine and something like Vortex or Parquet as your file format. SmithDB uses DataFusion and Vortex.

The obvious architecture if you're doing some kind of analytics here is: you have ingest nodes that consume from the stream of data that you care about. Those nodes:

  • buffer up data for some period of time,
  • poop out a big file into object storage, then
  • tell a metadata store about that file so it's queryable.

Query nodes then simply ask the metadata store which files they should read from object storage.

Then you generally want to have some kind of background process that compacts those files into fewer, larger files so you're not left managing a gigantic number of files in the fullness of time.
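
To make the moving parts concrete, here's a toy, in-memory sketch of that pipeline. Everything in it is an assumption for illustration: `ObjectStore`, `MetadataStore`, `IngestNode`, `query`, and `compact` are names I made up, a dict stands in for object storage, and none of this is taken from SmithDB's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Stand-in for S3-style object storage: a dict from key to file contents."""
    files: dict = field(default_factory=dict)

    def put(self, key, rows):
        self.files[key] = list(rows)

    def get(self, key):
        return self.files[key]


@dataclass
class MetadataStore:
    """Tracks which files are currently visible to queries."""
    live_files: list = field(default_factory=list)


@dataclass
class IngestNode:
    store: ObjectStore
    meta: MetadataStore
    flush_threshold: int = 3  # hypothetical knob: flush after this many rows
    name: str = "ingest"
    buffer: list = field(default_factory=list)
    next_file_id: int = 0

    def ingest(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        key = f"{self.name}-{self.next_file_id:04d}"
        self.next_file_id += 1
        self.store.put(key, self.buffer)   # write the file first...
        self.meta.live_files.append(key)   # ...then make it visible to queries
        self.buffer = []


def query(store, meta, predicate):
    """Query node: ask the metadata store which files exist, then scan them."""
    return [row
            for key in meta.live_files
            for row in store.get(key)
            if predicate(row)]


def compact(store, meta, new_key):
    """Background job: merge all live files into one larger file."""
    old_keys = list(meta.live_files)
    merged = [row for key in old_keys for row in store.get(key)]
    store.put(new_key, merged)
    meta.live_files = [new_key]            # swap the visible file list in one step
    for key in old_keys:
        # A real system would delay this until no in-flight query needs the file.
        del store.files[key]
```

The ordering in `flush` is the part worth noticing: the file is written before the metadata store hears about it, so a query node never gets pointed at a file that doesn't exist yet. Rows still sitting in `buffer` are invisible to `query`, which is exactly the gap the next paragraph is about.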

This suggests a fundamental tradeoff, which is: how long do you let ingest nodes consume data for before they spit it out to object storage? Too short, and you'll produce a lot of small files that have lots of overhead to read, more API calls to object storage, and worse compression. Too long, and data sits in your ingest nodes for a long time before anyone can actually see it. This is the kind of question you answer with tuning and customer requirements.

But SmithDB also uses a clever trick to circumvent it, which is to allow the ingest nodes to serve queries for data they've read but have not yet flushed. This means you can basically eliminate the period of time where data is only buffered and non-queryable, at the cost of your query workload interfering with your ingest workload, which might be a tradeoff you're willing to make.
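
Here's a small sketch of what that trick buys you, under my own assumptions rather than anything from SmithDB itself. `IngestNode`, `flush_interval_s`, `query_flushed_only`, and `query_with_buffer` are hypothetical names, a list of lists stands in for files in object storage, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class IngestNode:
    flush_interval_s: float                      # hypothetical tuning knob
    buffer: list = field(default_factory=list)   # rows read but not yet flushed
    flushed: list = field(default_factory=list)  # stand-in for files in object storage
    last_flush_at: float = 0.0

    def ingest(self, row, now):
        self.buffer.append(row)
        if now - self.last_flush_at >= self.flush_interval_s:
            self.flushed.append(list(self.buffer))  # one "file" per flush
            self.buffer.clear()
            self.last_flush_at = now

    def scan_buffer(self, predicate):
        # Serving this scan costs CPU on the node that is also doing ingest.
        return [row for row in self.buffer if predicate(row)]


def query_flushed_only(node, predicate):
    """The plain architecture: only flushed files are visible."""
    return [row for f in node.flushed for row in f if predicate(row)]


def query_with_buffer(node, predicate):
    """The trick: also ask the ingest node for rows it has not flushed yet."""
    return query_flushed_only(node, predicate) + node.scan_buffer(predicate)
```

With a 10-second flush interval and events arriving at t = 0, 5, 10, and 15, the first three land in a file at t = 10 and the fourth is still buffered. `query_flushed_only` sees three rows, while `query_with_buffer` sees all four. A shorter interval would close that gap too, but only by producing more, smaller files.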

I would say, based on my read of the post, that SmithDB seems like a fairly clean implementation of this object-store-backed analytics engine.

Now, something they highlight in the post is that agent trace workloads are fundamentally different from a lot of observability data in the past. Traditionally, this kind of thing might be metric data or traces of calls through microservices, which will generally finish in a few milliseconds, or a couple of seconds at most. Agents can go off and do stuff for minutes to hours, which means that the resulting traces will be much larger and cover a much longer range of time.

I would say this is the most novel problem LangChain was dealing with when they were building SmithDB. Unfortunately, we only get a few nuggets of what they did to deal with it. They mostly talk about their inverted index (which also seems pretty traditional) and splitting out large fields from their payloads and keeping them out of the hot path.
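
Since the post is light on details, the following is only a generic illustration of those two ideas, not a reconstruction of what SmithDB does. The function names, the `LARGE_FIELD_BYTES` cutoff, and the `blob_ref` convention are all assumptions of mine.

```python
from collections import defaultdict

LARGE_FIELD_BYTES = 64  # hypothetical cutoff for what counts as a "large" field


def split_event(event_id, event, blob_store):
    """Return a slim row; large string fields move to blob_store, leaving a reference."""
    slim = {}
    for name, value in event.items():
        if isinstance(value, str) and len(value.encode("utf-8")) > LARGE_FIELD_BYTES:
            blob_key = f"{event_id}/{name}"
            blob_store[blob_key] = value          # kept out of the hot path
            slim[name] = {"blob_ref": blob_key}   # fetched only when actually needed
        else:
            slim[name] = value
    return slim


def build_inverted_index(rows, field_name):
    """Map each value of a (small, hashable) field to the positions of matching rows."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[field_name]].append(position)
    return index
```

The idea is that scans and index lookups only ever touch the slim rows, so a trace with a huge model output doesn't slow down a query that only filters on something like a trace ID or a status. The large payload is read through its reference only when someone asks for it.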

Overall I liked reading about SmithDB quite a bit. I'd love to see some more detailed posts in the future.
