
NULL BITMAP by Justin Jaffray

June 16, 2025

Search Engine Retrieval


I want to talk about the high level concept of what a search engine does. And I don’t mean Google, or Kagi, “the product,” but the technical object that is a search engine.

A search engine really does two almost entirely conceptually separate tasks that are simply colocated frequently enough that it’s coherent to bundle them together as the single concept of “search engine.” These two tasks are retrieval and ranking.

Retrieval is finding all the documents which satisfy some query, and ranking is picking, among those, which satisfy it “the best.” Of the two, I want to talk today about retrieval, which sounds kind of fancy, but once we figure out the right way to express the problem, we’ll see it is served by classic, standard algorithms.

Max Bernstein recently published a post implementing a ranking system for a blog (of which the number of posts is going to be sufficiently small that there's no need for retrieval) so if that aspect of this interests you more you should check that one out.

Our dataset in search is a set of “documents,” each of which is a set of words. Documents in reality, of course, have more structure than that, but from the perspective of retrieval, they are the combination of a document ID and a set of words.

Maybe our database has the following structure, with document IDs 1, 2, 3, and possible "words" a, b, c, d.

1 -> {a, b, c}
2 -> {b, c, d}
3 -> {a, b, d}

This structure of the data is called the forward index, because it goes from document to word.

The kinds of queries retrieval is chiefly concerned with (although it needs to handle others, in many cases) are conjunctions: "I want to find all the documents that have the word a and the word b."

You might observe that the forward index doesn't really help us answer queries like this. In order to answer it we'd have to iterate over every document and check if it satisfies the query or not. To solve this we construct the inverted index, which goes the other way, from words to the set of document IDs that contain them:

a -> {1, 3}
b -> {1, 2, 3}
c -> {1, 2}
d -> {2, 3}
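To make concrete why the forward index alone falls short, here's a minimal sketch (my own illustration, using the example data above) of answering "a AND b" with nothing but the forward index, by scanning every document:

```python
# Sketch: answering a conjunction query with only the forward index.
# The data mirrors the example above.
forward_index = {
    1: {'a', 'b', 'c'},
    2: {'b', 'c', 'd'},
    3: {'a', 'b', 'd'},
}

def scan_conjunction(index, words):
    # O(number of documents), no matter how rare the words are.
    return [doc_id for doc_id, tokens in index.items()
            if all(w in tokens for w in words)]

print(scan_conjunction(forward_index, {'a', 'b'}))  # => [1, 3]
```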

Now for any given word, we can easily find the set of documents that contain it. To do a conjunction query, then, the problem is reduced to finding the intersection of these sets (the set of documents that contain a and b is the intersection of the documents that contain a and the documents that contain b).
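With the inverted index in hand, the conjunction query really is just a set intersection. A minimal sketch, with the example index above transcribed into a dict of posting sets:

```python
# The inverted index from above, as a dict of posting sets.
inverted_index = {
    'a': {1, 3},
    'b': {1, 2, 3},
    'c': {1, 2},
    'd': {2, 3},
}

def conjunction(index, words):
    # Look up each word's posting set and intersect them all;
    # set.intersection accepts any number of sets.
    postings = [index[w] for w in words]
    return set.intersection(*postings)

print(conjunction(inverted_index, ['a', 'b']))  # => {1, 3}
```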

An interesting observation about the inverted index is that inversion really is, in some sense, an "inverse" (in fact, an involution): the operation that constructs the inverted index from the forward index also constructs the forward index from the inverted index:

from collections import defaultdict

def invert(dataset):
    # Map each value back to the list of keys it appears under.
    result = defaultdict(list)
    for (x, ys) in dataset:
        for y in ys:
            result[y].append(x)
    return sorted((y, set(xs)) for y, xs in result.items())

data = [
    (1, {'a', 'b', 'c'}),
    (2, {'b', 'c', 'd'}),
    (3, {'a', 'b', 'd'}),
]

print(invert(data))
# [
#     ('a', {1, 3}),
#     ('b', {1, 2, 3}),
#     ('c', {1, 2}),
#     ('d', {2, 3}),
# ]
print(invert(invert(data)))
# [
#     (1, {'a', 'b', 'c'}),
#     (2, {'b', 'c', 'd'}),
#     (3, {'a', 'b', 'd'})
# ]

The representation of choice for the inverted index is typically a sorted list, called a posting list, for two main reasons.

First is that it allows for very fast intersections (assuming we have the ability to skip ahead) by doing something like a Leapfrog Join (sorry, all the Python I know I learned from conducting coding interviews (I do know slicing will copy the list. That's not my fault)):

from bisect import bisect_left

t1 = [1,4,5,9,10,16,30]
t2 = [4,5,10,20,25,30]
t3 = [2,4,10,20,21,22,45]

def intersect(vs):
    output = []

    while True:
        minval = None
        maxval = None

        # Find the smallest and largest values at the heads of the lists.
        # If any list is exhausted, no further matches are possible.
        for v in vs:
            if len(v) == 0:
                return output
            if minval is None or v[0] < minval:
                minval = v[0]
            if maxval is None or v[0] > maxval:
                maxval = v[0]

        if minval == maxval:
            # Every list starts with the same value: it's in the intersection.
            output.append(minval)
            for i in range(len(vs)):
                vs[i] = vs[i][1:]
        else:
            for i in range(len(vs)):
                if vs[i][0] < maxval:
                    # Binary search (or use some other index) to skip ahead to
                    # the right value.
                    vs[i] = vs[i][bisect_left(vs[i], maxval):]

print(intersect([t1,t2,t3])) # => [4, 10]

The other reason is that it's amenable to delta encoding, where we represent each document ID as the difference from the previous one. This is desirable because traditional compression algorithms compress lots of distinct values, like a lot of different document IDs, very poorly, but lots of small, similar values compress quite well, or can be efficiently packed into varints, since the deltas will generally be small.

[1,4,5,9,10,16,30]
=>
[1,3,1,4,1,6,14]
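As a sketch of what that transformation looks like in code (my own illustration, not any real posting-list format): encoding is a pairwise difference, and decoding is a running sum.

```python
def delta_encode(postings):
    # Store each document ID as the gap from its predecessor.
    # Assumes postings is sorted, so every gap is positive.
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def delta_decode(deltas):
    # A running sum recovers the original IDs.
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out

postings = [1, 4, 5, 9, 10, 16, 30]
deltas = delta_encode(postings)
print(deltas)                # => [1, 3, 1, 4, 1, 6, 14]
print(delta_decode(deltas))  # => [1, 4, 5, 9, 10, 16, 30]
```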

But the takeaway I want to highlight here is that retrieval is literally set intersection. Of course, if you've taken an algorithms class, once you have that insight you can certainly come up with the implementation, but the insight that it's set intersection is the key.
