NULL BITMAP by Justin Jaffray logo

NULL BITMAP by Justin Jaffray

Archives
March 16, 2026

Solving Cryptic Crosswords with Egraphs

NULL BITMAP.png

I've liked Cryptic Crosswords...somewhat. In the past. They're a bit tricky for me and I haven't really put in the time to be comfortable enough to have fun solving them.

Anyway, the Wordle guy has a new website where he aims to teach people how to solve Cryptics.

If you're not familiar, Cryptic Crosswords follow a very particular format for their clues. Let's take one of the Parseword examples:

Cooked rustic orange (6).

The (6) refers to the number of letters in the result. So we're looking for a 6-letter word.

In a Cryptic, every clue is split into two halves: the definition half and the "wordplay" half. These rules are a bit loose, sometimes you'll see like, two definition halves, but I think at Parseword they seem to be fairly rigid about this kind of split. It's ambiguous in any given clue where the split is, and also which half is which.

In this clue, the two halves are Cooked rustic and orange.

"Cooked," in this clue, is called an "anagram indicator:" it suggests that the (or a) word it abuts should be rearranged to get some other word. In this case, it wants us to rearrange "rustic" to get "citrus." So we get "citrus" for the wordplay side, and "orange" for the definition side, and the answer is "citrus."

There are a bunch of these kinds of "indicators." Another Parseword example is

Meter in Poe's verses (5).

Here, the split is:

Wordplay half:

Meter in Poe's

Definition half:

Verses.

A "meter" is commonly represented with an M, and "in" is a "container indicator," which suggests that we should place the left thing inside of the right thing. So "M in Poes" becomes "Poems," which matches the definition half of "verses."

Something that Parseword highlighted to me is the extent to which we can represent this kind of puzzle solving as a series of successive algebraic rewrites. This is its whole deal: it presents an interface where you can click words to replace them with synonyms, and click sets of words and indicators to apply them.

For example, let's try a slightly harder puzzle from Parseword:

Stare, ignoring Eastern idol

We can treat this as applying a sequence of rewrites until we "prove" the solution is what we expect:

  • stare ignoring Eastern idol
  • "stare ignoring Eastern" "idol" (split)
  • "stare ignoring E" "idol" (synonym)
  • "star" "idol" (removal indicator "ignoring E")
  • "star" (matching sides: solution)

The presentation of these puzzles in this way set off my alarm bells: how automatable is this? Not perfectly, I suspect, since we will often have to use some sort of suspicious "definitions," but we can definitely automate enough of it to have a fun time.

We could do some kind of traditional search for this, but it turns out we have a great, different tool in the databases and programming languages communities for exploring expressions that are manipulatable via rewrite rules in this way: egraphs.

Egraphs are a cousin of Cascades, with some differences.

The important thing is that it's a way to represent a forest of possible representations of a particular expression, and it has some nice properties that will make themselves clear as we get into the implementation.

Now, I should note that egg and egglog are great, optimized, high-quality implementations of these ideas, but we can make a slow, simple version that will also work for our purposes, where we're just manipulating small expressions.

Let's begin with a few imports:

from collections import defaultdict
from itertools import product

An egraph is made up of two alternating objects: enodes, which represent a (somewhat) concrete expression, and eclasses, which represent a set equivalent expressions. The "alternating-ness" of it is that enodes always have eclasses as their children, and eclasses are a set of enodes.

We begin with a representation of enodes. Each enode will have a "tag," denoting what kind of expression it is, some "data," and then a set of children in the form of eclasses:

class Enode:
    def __init__(self, tag, data=None, *, children=[]):
        self.tag = tag
        self.data = data
        self.children = children

    def key(self):
        """A hashable representation of the enode."""
        return tuple([(self.tag, self.data), *self.children])

    def __repr__(self):
        label = f"{self.tag}:{self.data}" if self.data is not None else self.tag
        if not self.children:
            return label
        return f"({label} {' '.join(repr(c) for c in self.children)})"

Now the meaty part: we're going to implement the actual Egraph data structure. You might need to go read the egg paper to fully grasp this, but it's not too bad. The key insight is that we need to maintain the invariant that if a == b, then f(a) == f(b). This means that if we have f(a) and f(b) as separate expressions, but we later learn that a == b via unification, we need to go through and replace f(a) and f(b) with a single expression.

Since this is just a toy, we'll do a super dumb implementation: every time we unify two expressions, we'll rebuild the whole egraph to make sure this is maintained. A real implementation will be carefully optimized to only do updates where they're necessary.

class Egraph:
    """An e-graph: a data structure that compactly represents a set of expressions
    and equivalences between them.

    Stores enodes (expression nodes) grouped into eclasses (equivalence classes).
    When two eclasses are unified, the congruence invariant is maintained: if
    a == b, then f(a) == f(b) for all enodes referencing them.
    """

    def __init__(self):
        # enodes and eclasses grow in sync: enodes[i] was assigned eclass i.
        self.enodes = []
        # Union-find: eclasses[i] points to the parent of i. The representative
        # is found by following until eclasses[i] == i.
        self.eclasses = []
        # Maps enode keys (hashable tuples) to eclass IDs. Ensures structurally
        # identical enodes share an eclass and enables congruence detection.
        self.expr_to_id = {}
        # Incremented on every new node or non-trivial unification.
        self.version = 0

    @property
    def size(self):
        """The number of distinct enodes in the graph."""
        return len(self.expr_to_id)

    def iter_enodes(self, tag=None):
        """Snapshot all enodes, optionally filtered by tag. Yields (key, eclass_id)."""
        for key, eid in list(self.expr_to_id.items()):
            if tag is None or key[0][0] == tag:
                yield key, eid

    def nodes(self, eclass_id):
        """Return list of enode keys belonging to the given eclass."""
        canon = self.find(eclass_id)
        return [k for k, eid in self.expr_to_id.items() if self.find(eid) == canon]

    def add(self, enode):
        """Insert an enode into the graph, returning its eclass ID.

        Children are canonicalized first. If a structurally identical enode
        already exists, its existing eclass ID is returned instead.
        """
        enode.children = [self.find(c) for c in enode.children]
        k = enode.key()
        if k not in self.expr_to_id:
            self.enodes.append(enode)
            eid = len(self.eclasses)
            self.eclasses.append(eid)
            self.expr_to_id[k] = eid
            self.version += 1
        return self.expr_to_id[k]

    def find(self, a):
        """Return the canonical representative of the eclass containing `a`."""
        while self.eclasses[a] != a:
            a = self.eclasses[a]
        return a

    def unify(self, a, b):
        """Merge the eclasses of `a` and `b`, maintaining congruence.

        After merging, any enodes that become structurally identical (because
        their children now belong to the same eclass) are themselves merged.
        Returns the canonical ID of the merged eclass.
        """
        a = self.find(a)
        b = self.find(b)
        if a == b:
            return a
        self.eclasses[b] = a
        self.version += 1
        self._rebuild()
        return a

    def unify_batch(self, pairs):
        """Merge multiple pairs of eclasses, rebuilding only once at the end."""
        changed = False
        for a, b in pairs:
            a = self.find(a)
            b = self.find(b)
            if a != b:
                self.eclasses[b] = a
                self.version += 1
                changed = True
        if changed:
            self._rebuild()

    def _rebuild(self):
        """Re-canonicalize all enode children and merge any congruent eclasses."""
        merged = True
        while merged:
            merged = False
            self.expr_to_id = {}
            for i, node in enumerate(self.enodes):
                # Canonicalize all the children.
                node.children = [self.find(c) for c in node.children]
                k = node.key()
                eid = self.find(i)
                # Check if we've seen this same key before on this loop.
                if k in self.expr_to_id:
                    existing = self.find(self.expr_to_id[k])
                    if eid != existing:
                        # We've learned about a new equivalence; merge the two
                        # eclasses and loop again on the next iteration, because
                        # this might trigger more merges.
                        self.eclasses[eid] = existing
                        merged = True
                else:
                    self.expr_to_id[k] = eid

    # Debugging utilities.

    def _fmt_node(self, key):
        """Format an enode key as a readable s-expression."""
        (tag, data), *children = key
        label = f"{tag}:{data}" if data is not None else tag
        if not children:
            return label
        args = " ".join(f"e{self.find(c)}" for c in children)
        return f"({label} {args})"

    def __repr__(self):
        # group enode keys by canonical eclass
        classes = defaultdict(list)
        for key, eid in self.expr_to_id.items():
            classes[self.find(eid)].append(key)

        lines = []
        for eid in sorted(classes):
            nodes = " = ".join(self._fmt_node(k) for k in classes[eid])
            lines.append(f"  e{eid}: </span><span class="s2"> </span><span class="si">{</span><span class="n">nodes</span><span class="si">}</span><span class="s2"> </span><span class="se">")
        return "Egraph(\n" + "\n".join(lines) + "\n)"

Now that we have this implementation, we can make sure it behaves as expected:

eg = Egraph()

a = eg.add(Enode("a"))
b = eg.add(Enode("b"))

print("Just a and b:")
print(eg)

fa = eg.add(Enode("f", children=[a]))
fb = eg.add(Enode("f", children=[b]))

print()
print("After adding f(a) and f(b):")
print(eg)

eg.unify(a, b)

print()
print("After unifying a and b:")
print(eg)
Just a and b:
Egraph(
  e0: { a }
  e1: { b }
)

After adding f(a) and f(b):
Egraph(
  e0: { a }
  e1: { b }
  e2: { (f e0) }
  e3: { (f e1) }
)

After unifying a and b:
Egraph(
  e0: { a = b }
  e2: { (f e0) }
)

Now we can use our Egraph structure to implement a cryptic solver. We'll start by just seeding the synonyms and indicators with the ones we'll need for these examples.

container_indicators = [
    "in",
]

synonyms = [
    ("meter", "m"),
    ("verses", "poems"),
]

# build a lookup: word -> list of synonyms
synonym_map = {}
known_words = set()
for a, b in synonyms:
    synonym_map.setdefault(a, []).append(b)
    synonym_map.setdefault(b, []).append(a)
    known_words.add(a)
    known_words.add(b)

Now, here's how we're going to represent our cryptic clues.

We're going to start with our root node which is just the puzzle itself which just contains each of its words as children.

def create_puzzle(eg, clue):
    words = clue.split(" ")
    word_ids = [eg.add(Enode("raw", word)) for word in words]
    return eg.add(Enode("puzzle", children=word_ids))

eg = Egraph()
puzzle = create_puzzle(eg, "meter in poes verses")
print(eg)
Egraph(
  e0: { raw:meter }
  e1: { raw:in }
  e2: { raw:poes }
  e3: { raw:verses }
  e4: { (puzzle e0 e1 e2 e3) }
)

Next, we'll generate all the splits of that root node:

def add_splits(eg, puzzle):
    splits = []
    for enode in eg.nodes(puzzle):
        if enode[0][0] == 'puzzle':
            for i in range(2, len(enode)):
                splits.append((enode[1:i], enode[i:]))
    for (left, right) in splits:
        left = eg.add(Enode("words", children=list(left)))
        right = eg.add(Enode("words", children=list(right)))
        joined = eg.add(Enode("split", children=[left, right]))
        eg.unify(joined, puzzle)

def reduce_singleton_words(eg):
    for key, eid in eg.iter_enodes("words"):
        if len(key) == 2:
            eg.unify(eg.find(eid), key[1])


add_splits(eg, puzzle)
reduce_singleton_words(eg)

print(eg)
Egraph(
  e1: { raw:in }
  e2: { raw:poes }
  e5: { raw:meter = (words e5) }
  e6: { (words e1 e2 e12) }
  e8: { (words e5 e1) }
  e9: { (words e2 e12) }
  e11: { (words e5 e1 e2) }
  e12: { raw:verses = (words e12) }
  e13: { (puzzle e5 e1 e2 e12) = (split e5 e6) = (split e8 e9) = (split e11 e12) }
)

Now we can replace all the words with their synonyms.

def add_synonyms(eg):
    """Expand synonyms for all raw enodes, including multi-word groupings."""
    for key, eid in eg.iter_enodes("raw"):
        data = key[0][1]
        for syn in synonym_map.get(data, []):
            syn_id = eg.add(Enode("derived", syn))
            eg.unify(eg.find(eid), syn_id)

add_synonyms(eg)

print(eg)
Egraph(
  e1: { raw:in }
  e2: { raw:poes }
  e5: { raw:meter = (words e5) = derived:m }
  e6: { (words e1 e2 e12) }
  e8: { (words e5 e1) }
  e9: { (words e2 e12) }
  e11: { (words e5 e1 e2) }
  e12: { raw:verses = (words e12) = derived:poems }
  e13: { (puzzle e5 e1 e2 e12) = (split e5 e6) = (split e8 e9) = (split e11 e12) }
)

Next, to apply the in indicator, we can match on patterns shaped like x in y.

def is_indicator(eg, eclass_id, indicator_list):
    for enode in eg.nodes(eclass_id):
        tag, data = enode[0]
        if tag == "raw" and data in indicator_list:
            return True
    return False

def get_raw_words(eg, eclass_id):
    return [e[0][1] for e in eg.nodes(eclass_id) if e[0][0] == "raw"]

def get_all_words(eg, eclass_id):
    return [e[0][1] for e in eg.nodes(eclass_id) if e[0][0] in ("raw", "derived")]

def add_containers(eg):
    """For words nodes like [A, indicator, B], try inserting A inside B and B inside A."""
    for key, eid in eg.iter_enodes("words"):
        if len(key) != 4:
            continue
        left, mid, right = key[1], key[2], key[3]
        if not is_indicator(eg, mid, container_indicators):
            continue
        for inner, outer in [(left, right), (right, left)]:
            for iw in get_all_words(eg, inner):
                for ow in get_all_words(eg, outer):
                    for pos in range(1, len(ow)):
                        result = ow[:pos] + iw + ow[pos:]
                        res_id = eg.add(Enode("derived", result))
                        eg.unify(eg.find(eid), res_id)

add_containers(eg)

print(eg)
Egraph(
  e1: { raw:in }
  e2: { raw:poes }
  e5: { raw:meter = (words e5) = derived:m }
  e6: { (words e1 e2 e11) }
  e8: { (words e5 e1) }
  e9: { (words e2 e11) }
  e11: { raw:verses = (words e5 e1 e2) = (words e11) = derived:poems = derived:pmeteroes = derived:pometeres = derived:poemeters = derived:pmoes = derived:pomes = derived:mpoeseter = derived:mepoester = derived:metpoeser = derived:metepoesr }
  e13: { (puzzle e5 e1 e2 e11) = (split e5 e6) = (split e8 e9) = (split e11 e11) }
)

We can see in the above that we have the expression (split e12 e12): this indicates we found a solution! We derived the same thing on both sides. All that's left is to detect and extract that.

def add_solutions(eg):
    for key, eid in eg.iter_enodes("split"):
        if len(key) == 3 and eg.find(key[1]) == eg.find(key[2]):
            sol = eg.add(Enode("solution", children=[key[1]]))
            eg.unify(eg.find(eid), sol)

add_solutions(eg)

print(eg)

# Assume a derived word that is a real word is a solution.
def extract_solutions(eg, root):
    solutions = []
    for enode in eg.nodes(root):
        if enode[0][0] == 'solution':
            child = enode[1]
            for n in eg.nodes(child):
                tag, data = n[0]
                if tag == "derived" and data in known_words:
                    solutions.append(data)
    return solutions

print(extract_solutions(eg, puzzle))
Egraph(
  e1: { raw:in }
  e2: { raw:poes }
  e5: { raw:meter = (words e5) = derived:m }
  e6: { (words e1 e2 e11) }
  e8: { (words e5 e1) }
  e9: { (words e2 e11) }
  e11: { raw:verses = (words e5 e1 e2) = (words e11) = derived:poems = derived:pmeteroes = derived:pometeres = derived:poemeters = derived:pmoes = derived:pomes = derived:mpoeseter = derived:mepoester = derived:metpoeser = derived:metepoesr }
  e13: { (puzzle e5 e1 e2 e11) = (split e5 e6) = (split e8 e9) = (split e11 e11) = (solution e11) }
)
['poems']

This is not, in general, sufficient to solve these puzzles. I mean, not by a long shot, but in particular here, some of these rules might trigger other rules. So we need to run them in a loop until we hit a fixed point. But you get the idea. There's a bunch more work you'd need to do to get this to a functioning solver that could solve a problem that you like, didn't already know the solutions to, but I think it's a cool demonstration of the power of this kind of data structure.

There are also some other problems about how to restructure expressions to catch all the possible tree-structures here, but this post is already too long!

Don't miss what's next. Subscribe to NULL BITMAP by Justin Jaffray:

Add a comment:

GitHub
justinjaffray.com
Bluesky
Twitter
Powered by Buttondown, the easiest way to start and grow your newsletter.