
NULL BITMAP by Justin Jaffray

July 14, 2025

NULLBITMAPGPT


I don't know if you've all been paying attention, but this is the ONE HUNDREDTH NULL BITMAP??

To celebrate, I've heard people are really into "next token prediction," so we're going to do...that, I guess, to write our one-hundredth NULL BITMAP.

A Markov Chain is a process where the probabilities of each next state ("word") depend only on the previous state ("word"). If we "train" a Markov chain on some corpus of text, that is, we compute the frequencies with which each word follows some previous word, then we can use that data to generate plausible-looking text.
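To make that concrete before we do it in SQL, here's a tiny Python sketch of the same idea (the toy corpus and names are mine, not from the post):

```python
import random
from collections import Counter, defaultdict

# Toy bigram model: count, for each word, how often each next word follows it.
corpus = "the cat sat on the mat and the cat slept".split()
freq = defaultdict(Counter)
for x, y in zip(corpus, corpus[1:]):
    freq[x][y] += 1

# freq["the"] is now Counter({"cat": 2, "mat": 1})

def next_word(word):
    candidates = freq[word]
    # Pick a follower weighted by how often it appeared after `word`.
    return random.choices(list(candidates), weights=list(candidates.values()))[0]
```

The rest of the post is just this: build the `freq` table, then repeatedly call something like `next_word`, only relationally.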

One of the fundamental theses of NULL BITMAP is that understanding database ways of thinking about the world is valuable because they're restrictive. Oftentimes, databases only let you do certain things specifically because those things have nice implementations, or implementations of them have nice properties, or it's easy to leverage existing tools to do them.

In this post, we're going to see a nice way to represent a Markov Chain in a clean, database-minded way that will lead us to an implementation with some nice properties. We'll start by actually implementing it in a database, and then do something a little more hands-on.

For this I'll be using DuckDB, but you could get away with using most modern SQL databases, I think. I've downloaded markdown files of all 99 previous NULL BITMAPs (and I cleaned them up slightly).

First, we need to read all of our files, which is easy:

D select * from read_text('emails/*.md');
┌──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────┬───────┬──────────────────────────┐
│       filename       │                                       content                                       │ size  │      last_modified       │
│       varchar        │                                       varchar                                       │ int64 │ timestamp with time zone │
├──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────┼───────┼──────────────────────────┤
│ emails/a-card-coun…  │ I used to call this thing a “game” but my friend Kevin (who is not the same Kevin…  │  6263 │ 2025-07-12 11:24:06-04   │
│ emails/a-left-to-r…  │ Note: see the end of this post for some housekeeping on last week.\n\nI get the s…  │  6631 │ 2025-07-12 11:23:02-04   │
│ emails/a-query-pla…  │ A number of readers [last week](https://buttondown.email/jaffray/archive/multiple…  │  8164 │ 2025-07-12 11:23:08-04   │
│ emails/a-review-of…  │ My roommate and I are in a war with UPS. I came home the other day to find she’d …  │  9474 │ 2025-07-12 11:23:09-04   │
│ emails/a-sniff-tes…  │ One important part of query planning is performing transformations over queries. …  │ 23144 │ 2025-07-12 11:23:09-04   │
│ emails/a-trick-tha…  │ ## Ancestry\n\nI came across a [cool algorithm](https://gist.github.com/pervognse…  │  5801 │ 2025-07-12 11:23:10-04   │
...

Then we need to extract word-pairs from each one. We can do this with a couple of fancy builtins by splitting the content on a regex (adding special beginning- and end-of-file tokens):

with files as (
  select
    string_split_regex('BOFBOF ' || trim(content) || ' EOFEOF', '\W+') as words
    from read_text('emails/*.md')
  )
select * from files
...
[BOFBOF, I, got, enough, positive, interest, in, some, variety, of, book, club, ... , EOFEOF]
...

Then pulling that into a table and doing some popping, we can compute all the adjacent pairs of words:

with files as (
  select
    string_split_regex('BOFBOF ' || trim(content) || ' EOFEOF', '\W+') as words
    from read_text('emails/*.md')
)
select * from
    (select
        unnest(words) as x, unnest(array_pop_front(words)) as y
    from files)
where y is not null
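Unnesting the list alongside its own popped tail is the relational analogue of zipping a list with itself shifted by one. In Python it would look something like this (example list is mine):

```python
words = ["BOFBOF", "I", "got", "enough", "EOFEOF"]

# array_pop_front(words) drops the first element; unnesting both lists
# side by side pairs each word with its successor. zip stops at the
# shorter list, which is what the `where y is not null` filter
# accomplishes in the SQL version.
pairs = list(zip(words, words[1:]))
# → [("BOFBOF", "I"), ("I", "got"), ("got", "enough"), ("enough", "EOFEOF")]
```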

Finally we can complete our computation of the frequency table with an aggregation:

with files as (
  select
    string_split_regex('BOFBOF ' || trim(content) || ' EOFEOF', '\W+') as words
    from read_text('emails/*.md')
)
select x, y, count(*) as freq from
    (select
        unnest(words) as x, unnest(array_pop_front(words)) as y
    from files)
where y is not null
group by x, y
┌────────────────┬─────────────────────┬───────┐
│       x        │          y          │ freq  │
│    varchar     │       varchar       │ int64 │
├────────────────┼─────────────────────┼───────┤
│ BOFBOF         │ I                   │    12 │
│ but            │ my                  │     2 │
│ but            │ who                 │     1 │
│ mathematician  │ kept                │     1 │
│ to             │ a                   │    79 │
│ many           │ fun                 │     1 │
│ some           │ set                 │     8 │
│ a              │ funny               │     1 │
│ discussed      │ it                  │     1 │
│ each           │ card                │     4 │
│ you            │ can                 │   109 │
...

Cool! Now we just need to pull a sequence of words out according to these frequencies. We can do the "sequence of words" with a recursive CTE.

A recursive CTE is basically a way to iterate on a result set until it stops changing. In our case, we want to look up in the table some token that follows the current one and return it; once we hit our end-of-file token (EOFEOF), we won't find any next token, and we'll stop.

t as (
    select 1 AS idx, 'BOFBOF' AS tok
union all 
    (select idx + 1, y as tok
        from
    t, frequencies as words
    where
        t.tok = words.x
    and t.tok != 'EOFEOF'
    limit 1
    )
)

Everything before the union all is our initial state: we begin from word 1, which is our "beginning of email" token. Next, we take the most recently accumulated row, increment its index, and join it against the frequencies table to give us a next token, stopping once we have the EOF token. Then, since we only want one next token, we limit 1.
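If the recursive CTE machinery is unfamiliar, the equivalent imperative loop might help. Here's a Python sketch with a made-up, trivially small frequencies table (the real one is the SQL aggregation above):

```python
import random

# Hypothetical frequency table: maps each token to its possible successors.
# In the real query this comes from the `frequencies` CTE.
frequencies = {"BOFBOF": ["I"], "I": ["got"], "got": ["EOFEOF"]}

tokens = ["BOFBOF"]                   # the part before `union all`: initial state
while tokens[-1] != "EOFEOF":         # the `t.tok != 'EOFEOF'` condition
    candidates = frequencies.get(tokens[-1])
    if not candidates:                # the join found no match: iteration stops
        break
    tokens.append(random.choice(candidates))  # `limit 1` picks one successor
```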

The output of this looks sort of plausible, but it's not quite right (you'll note our token-extracting technique...needs improvement):

Podcast Appearance I love this pretty quickly so long we still sort at most on for such ranges are other Great right Instead of memory I d 1 CROSS_PRODUCT CROSS_PRODUCT CROSS_PRODUCT CROSS_PRODUCT UNGROUPED_AGGREGATE UNGROUPED_AGGREGATE UNGROUPED_AGGREGATE UNGROUPED_AGGREGATE PROJECTION SEQ_SCAN ab 1 card 1 accomplishes this way Chapter 1 But we still only on If only really about behaved this as our version Or BlueBoxIsFalse false WhiteBoxHasGems Which...

It's not right yet because we need to pick the next word relative to the frequency with which it appears in the original text. To do that, we need a clever trick courtesy of Tom Moertel (this is one of my all-time favourite blog posts). It turns out that, through the magic of the exponential distribution, if we order by -ln(1-random())/words.freq and take the first row, each candidate is chosen with probability proportional to its freq, which is exactly the weighting we want. Thus, our final query is:

with recursive
files as (
  select
    string_split_regex('BOFBOF ' || trim(content) || ' EOFEOF', '\W+') as words
    from read_text('emails/*.md')
),
frequencies as (
select x, y, count(*) as freq from
    (select
        unnest(words) as x, unnest(array_pop_front(words)) as y
    from files)
where y is not null
group by x, y
),
t as (
    select 1 AS idx, 'BOFBOF' AS tok
union all 
    (select idx + 1, y as tok
        from
    t, frequencies as words
    where
        t.tok = words.x
    and t.tok != 'EOFEOF'
    order by -ln(1-random())/words.freq
    limit 1
    )
)
select string_agg(tok, ' ' order by idx) from t
    where tok != 'BOFBOF' and tok != 'EOFEOF';

Which lets us generate sort-of-plausible-looking text:

It s pretty fast reads register y in SQL were already tried https ottertune com understanding cost of out BNF grammar is actually more complicated to anything about correctness whereas this is simply never think is things like so x x is a setpoint which I think there s several runs red if their reads the filter p blockquote p subject to these posts are executed Which doesn t of the Query Execution https assets buttondown email images de4e1454 02dd 48ee ae39 d06c1500da27 png w 960 amp fit max We can change d AND reason and the value prev and up generating a whole tree algorithm that we can do you can push t InternTable struct or not have these things that would like x 2 s also biased to me firstname lastname at Lots of Worst Case Optimal Joins It is related They only place

Well, maybe not that plausible. I wouldn't ask it how many R's are in the word strawberry.
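Incidentally, you can sanity-check the Moertel ordering trick outside the database. In this quick Python simulation (my own check, not from the post), each item draws an exponential variate with rate equal to its weight, and the smallest draw wins with probability proportional to that weight:

```python
import math
import random
from collections import Counter

weights = {"a": 1, "b": 2, "c": 7}

def weighted_pick(weights):
    # min over -ln(1-u)/w is the same ordering the SQL `order by ... limit 1`
    # uses; item i wins with probability w_i / sum(w).
    return min(weights, key=lambda k: -math.log(1 - random.random()) / weights[k])

random.seed(0)
counts = Counter(weighted_pick(weights) for _ in range(10_000))
# counts["c"] should be roughly 7000, "b" roughly 2000, "a" roughly 1000.
```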

Taking a step back, what I think we learned here is that this problem is pretty easy. At least the training part. Since counting is a commutative, associative operation, we have a lot of leeway in how we construct the frequency table. Looking at the query, we can see that we could build the frequency tables individually for each input file and then merge them afterwards. We could build them on separate threads. We could build them on separate machines. This is not obvious to me from the structure of the problem, but it becomes very clear when the problem is written out relationally.
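In Python terms, the observation is that per-file frequency tables are just Counters, and Counter addition can be regrouped and reordered freely (a sketch with toy files of my own; the SQL version gets this property for free):

```python
from collections import Counter

def pair_counts(words):
    # Frequency table for one file: counts of adjacent (x, y) word pairs.
    return Counter(zip(words, words[1:]))

file_a = "the cat sat".split()
file_b = "the cat slept".split()

# Because Counter addition is commutative and associative, building
# per-file tables and merging them, in any order and on any machine,
# gives the same table as aggregating over all the files at once.
merged = pair_counts(file_a) + pair_counts(file_b)
```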

Anyway. Thank you all for reading 100 NULL BITMAPS.
