Hash Consing Absolutism

I might have mentioned before that I've been slowly working through the Simple tutorial for building a Sea of Nodes compiler. A lot of it is still sort of going over my head but I've learned a couple of interesting things in the process of working through it that I think I can bring back to a query planning context.
One thing I learned from it is that my understanding of Hash Consing was a bit myopic. Or incomplete.
I have always thought of hash consing as a way to turn value equality into pointer equality. Pointer equality of two objects has always implied value equality, of course, but to get the other direction as well requires a trick: maintain a hash table of every value that exists in your system, when you would construct a new object, check the table to see if you've seen it before. If you have, you just return the existing version. Compiler people actually tend to call this "global value numbering." It gets introduced in Chapter 9 of the Simple book.
Now, something sort of important about hash consing is that it's usually important that, if you're hash-consed, your inputs are also hash-consed. Because if you're going to hash an entire, possibly large value, you would be most happy if you could just hash the pointers it has to any children values. If your children are also hash-consed, then you are guaranteed that this is sufficient.
Notably, you have to be very careful mutating these values, for two reasons:
First, they're naturally shared via the hash consed table. "Shared mutable state," being the root of all evil, and all that. This can be actually fine, though, if you restrict yourself to mutations that don't modify the semantics of the result. Many "optimizations" work this way already, where we replace something equivalent but "better," in some way.
The more insidious-seeming problem is that changing a value in some way will result in the node hashing differently, so the existing hash table must be updated, and anyone pointing to the old value needs to be updated to point at the new value, if it changed. And then, you have the problem that egraphs spend a bunch of time trying to solve, which is that now, you might have two expressions which now have value equality but not pointer equality. Consider the following situation:
We have an expression named a and an expression named b. We have some constructor that consumes them, which is named f. We have, at some point, both the expressions f(a) and f(b):
expr1 = f(a)
expr2 = f(b)
Now, we do some optimization on a and learn that it actually simplifies to b. Now we have a problem: expr1 and expr2 are the same now; they are both f(b), but we don't really have a great way to detect this. Note that we didn't have to go all the way to egraphs to run into this problem, normal optimizations were enough.
So, whenever I've worked with this kind of system, I lean a lot on aggressive bottom-up simplification. With this technique, we would learn the a and b are the same at the time we construct them. It's impossible for us to learn new things about them after-the-fact.
This works great in a lot of cases, but breaks down if you ever have cycles in the graph, in which case it's impossible to recurse to a base case all the time. This is okay for a lot of more limited query languages, but a non-starter for programming languages which require loops. You can't fully optimize an expression on construction if, when you first build it, you don't know all of its inputs yet.
It's also just annoying; if you want to pass information down a tree, you generally have to do it via immutable rewrites, when in some cases it might just be simpler to mutate nodes in place then to do these back-and-forths of passing information down the tree via rewrites, and up the tree via properties.
So, naturally, when we got to this problem in Simple, I was very interested to learn how they handled it. And what they do is pretty simple: they treat hash consing (GVN) as an optimization that nodes do not need to always participate in:
Since we edit the graph a lot, Nodes can have their inputs change which changes their hash which means they can't be found in the GVN table. E.g. we swap out the
Addfor aConin:
(Mul (Add 4 8) x) // Fold the constant
(Mul 12 x) // The `Mul` changes hash and inputs
The fix for this is to require Nodes NOT be in the
GVNtable when their edges modify; they need some sort of "edge lock". If "edges are locked", we need to remove the Node fromGVNfirst, before modifying the edges.
I'd always thought of this as a binary choice; either your expressions are hash consed, or they are not. But it turns out there's another solution, which is to do best-effort hash consing.
Add a comment: