Internal Affairs
There is a trick called string interning that is well known to compiler authors, and sometimes shows up in other places as well. It's used to change the performance characteristics of working with strings.
This trick is very simple: rather than storing any given string, I keep a table of every string I've ever seen and map them all to the same underlying memory:
type InternTable struct {
    entries map[string]*string
}

func (t *InternTable) Intern(s string) *string {
    if p, ok := t.entries[s]; ok {
        return p
    }
    // First time seeing this string: record it and hand out a stable
    // pointer to it.
    t.entries[s] = &s
    return &s
}
We use this by first interning any string we observe:
func main() {
    t := InternTable{entries: make(map[string]*string)}
    x := t.Intern("foo")
    y := t.Intern("foo")
    z := t.Intern("bar")
    fmt.Println(x == y)
    fmt.Println(x == z)
}
> go run main.go
true
false
Once we've done this to all the strings we're tracking, it suffices to compare two strings by the pointer to their data to determine if they're the same string or not.
Something that is true in any context is that pointer equality implies value equality. No matter what kind of equivalence relation I've put on my objects, if they're represented by the same pointer, they must be equal. I think of interning as strengthening that relationship to go the other way as well: if two strings have the same contents, they also have the same pointer. Now, value equality implies pointer equality.
This is a really popular trick when implementing compilers, partly because any program will have a lot of repeated strings for keywords, variable names, etc., and being able to do fast comparisons and lookups for strings like that is really convenient.
This general idea has a lot of applications and analogues in the world of databases. If you think of the pointers as opaque identifiers, string interning begins to look suspiciously like dictionary compression. Plus, it's basically just normalization:
Before

| S |
|---|
| foo |
| bar |
| foo |
| foo |
| baz |
| bar |

After

| table | id |
|---|---|
| foo | 1 |
| bar | 2 |
| baz | 3 |

| S |
|---|
| 1 |
| 2 |
| 1 |
| 1 |
| 3 |
| 2 |
We can then recover the original list of strings by joining the two tables.
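To make that join concrete, here's a minimal sketch of the same recovery in Go; dict and ids are names I've made up, and I've used 0-based indexes where the tables above use 1-based ids:

package main

import "fmt"

func main() {
    // The dictionary (the "table" above): index -> string.
    dict := []string{"foo", "bar", "baz"}
    // The compressed column (the "S" above): indexes into dict.
    ids := []int{0, 1, 0, 0, 2, 1}

    // The "join" is just a lookup per row, which prints the
    // original column: foo bar foo foo baz bar.
    for _, id := range ids {
        fmt.Println(dict[id])
    }
}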
There's a generalization of this technique called hash consing, which is, to my knowledge, more popular in computer algebra circles than it is in compiler ones. It's basically the same idea, but for representing recursive structures.
Whenever we construct an expression, we check a hash table to see if we've seen that expression before, and if we have, we return a pointer to the existing one, like before (excuse the boilerplate, I'm not sure if there's a nicer way to implement this sort of thing in Go):
type Expr interface{}

type Number struct {
    value int
}

type Plus struct {
    left  Expr
    right Expr
}

type Times struct {
    left  Expr
    right Expr
}

// Table maps each expression's value to its canonical pointer.
type Table struct {
    numbers map[Number]*Number
    pluses  map[Plus]*Plus
    timeses map[Times]*Times
}

func (t *Table) NewNumber(n int) *Number {
    number := Number{n}
    if p, ok := t.numbers[number]; ok {
        return p
    }
    t.numbers[number] = &number
    return &number
}

func (t *Table) NewPlus(left Expr, right Expr) *Plus {
    plus := Plus{left, right}
    if p, ok := t.pluses[plus]; ok {
        return p
    }
    t.pluses[plus] = &plus
    return &plus
}

func (t *Table) NewTimes(left Expr, right Expr) *Times {
    times := Times{left, right}
    if p, ok := t.timeses[times]; ok {
        return p
    }
    t.timeses[times] = &times
    return &times
}
Something that might not be immediately obvious is that we don't need to hash the whole tree recursively, since we know that the children are also hash consed, and thus it suffices to hash their pointers. We can use this and see that indeed, if we build the same tree twice, we get the same pointers, but a different tree is distinguishable:
func main() {
    table := Table{
        numbers: make(map[Number]*Number),
        pluses:  make(map[Plus]*Plus),
        timeses: make(map[Times]*Times),
    }
    expr1 := table.NewPlus(
        table.NewTimes(
            table.NewNumber(3),
            table.NewNumber(4),
        ),
        table.NewNumber(5),
    )
    expr2 := table.NewPlus(
        table.NewTimes(
            table.NewNumber(3),
            table.NewNumber(4),
        ),
        table.NewNumber(5),
    )
    expr3 := table.NewPlus(
        table.NewTimes(
            table.NewNumber(3),
            // NOTE: different.
            table.NewNumber(6),
        ),
        table.NewNumber(5),
    )
    fmt.Printf("%p %p %p\n", expr1, expr2, expr3)
}
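If you run this, the first two addresses printed are identical, since expr1 and expr2 are literally the same pointer, while expr3's address differs.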
This has some very cool benefits:

- O(1) equality checks,
- a compact representation for trees with lots of repetition (consider the shape of `((1+1)+(1+1))+((1+1)+(1+1))`; see the sketch after this list), and
- easy caching of analysis work (because you can easily find all the instances of a given expression).
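Here's the sketch promised in the second bullet, continuing inside the main from above (the variable names one, pair, quad, and oct are mine, just for illustration):

// The naive tree for ((1+1)+(1+1))+((1+1)+(1+1)) has fifteen nodes:
// eight leaves and seven interior ones. Hash consed, only four
// distinct nodes are ever allocated, since every repeated subtree
// collapses to a single pointer.
one := table.NewNumber(1)
pair := table.NewPlus(one, one)   // (1+1)
quad := table.NewPlus(pair, pair) // (1+1)+(1+1)
oct := table.NewPlus(quad, quad)  // the whole expression
_ = oct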
All of these are really nice to have in a query planning context in particular, since lots of query transformation rules tend to lead you to the same expression multiple times, and it's helpful to have a fast way to detect those repetitions so you can stop exploring (or detect that you've proven an identity).
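To make the caching point concrete, here's a minimal sketch of memoizing an analysis over hash consed expressions; eval and evalMemo are hypothetical names, not part of any real planner. Because equal subtrees share a pointer, the memo table can be keyed directly on Expr values, and each distinct subexpression is analyzed exactly once:

var evalMemo = map[Expr]int{}

func eval(e Expr) int {
    // Pointer-keyed lookup: any previously-seen subtree, however
    // large, hits here in O(1).
    if v, ok := evalMemo[e]; ok {
        return v
    }
    var v int
    switch e := e.(type) {
    case *Number:
        v = e.value
    case *Plus:
        v = eval(e.left) + eval(e.right)
    case *Times:
        v = eval(e.left) * eval(e.right)
    }
    evalMemo[e] = v
    return v
}

Calling eval on expr1 and then on expr2 from the earlier example does the work once: the second call is a single map lookup.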
Of course, it's not without its drawbacks: maintaining a hash table of every expression in the system is not free, and paying for a hash table lookup on every construction might not be worth it in some cases. In general though, I think it's a cool technique to have in one's back pocket as a dimension along which a design can move.
Links
I'd seen this trick before, but Per Vognsen's Bitwise series (whose name might have had some influence on the naming of this newsletter) was where I first learned the name "hash consing" for it.
Housekeeping
I've successfully become content-brained. I had a topic I wanted to write about for a while, but realized it had a couple prerequisites that were not particularly well known. A brief moment of despair was followed by the realization that I can just turn those prerequisites into new issues.
If you are enjoying this newsletter please let me know what kinds of posts are of more interest to you—I like doing commentary on papers and current events, but I also like doing these kinds of time-agnostic, general purpose posts, and the only way I can know which of them are resonating more is if people reach out!