A Trick That Doesn't Work and a Trick That Does
Ancestry
I came across a cool algorithm recently for efficiently answering ancestor queries in a tree ("is x an ancestor of y?"). The trick is to assign each node in the tree an interval [lo, hi] such that for any node, its children's intervals are contained within its own. It's easy to construct such intervals by doing a depth-first search of the tree and incrementing a counter upon each edge traversal: the counter's value on the downward traversal becomes the lower bound of the interval, and its value on the upward traversal becomes the upper bound. Then you can check whether A is an ancestor of B with the condition A.lo < B.lo && A.hi > B.hi. This is all explained more lucidly in Per's post. What I want to talk about is how I tried and failed to extend this trick from trees to DAGs.
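Here's a minimal sketch of what that looks like in Python (the Node type and function names are just for illustration, not the code from Per's post):

```python
class Node:
    def __init__(self, children=()):
        self.children = list(children)
        self.lo = self.hi = 0

def assign_intervals(root):
    counter = [0]
    def dfs(node):
        counter[0] += 1              # tick on the way down
        node.lo = counter[0]
        for child in node.children:
            dfs(child)
        counter[0] += 1              # tick on the way back up
        node.hi = counter[0]
    dfs(root)

def is_ancestor(a, b):
    # a's interval strictly contains b's exactly when a is an ancestor of b.
    return a.lo < b.lo and a.hi > b.hi
```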
In query planning, our representations of queries are often not merely trees but DAGs: not just because a query plan might logically be a DAG, but also because, when representing a query in the Cascades framework, you often wind up with a DAG. Being able to do this kind of ancestry query easily would be pretty nice and convenient in such cases.
I attempted to extend Per's original trick by doing the most obvious thing one might think to do: you can think of the original DFS as tracing all the paths from a node downward to its children and storing the information that generates. The DFS gives us exactly the relationship we want to check: "all the stuff that comes after me in this search, until you see me again, is beneath me." We can generalize this to the DAG world by doing the same thing, just DFSing outward without deduplicating visits to nodes. The obvious problem that comes up is: what do we do when we have to tag a node with multiple ranges, as we will for any node with two distinct ancestor chains?
It turns out there is a natural way to make this work: you can just assign each node a disjoint set of intervals instead of a single interval. Then A is an ancestor of B if any of A's intervals contains any of B's intervals, which means there was a path from A to B.
This actually works: you can store a sorted list of intervals in each node and then merge them to discover, in O(number of intervals) time, whether one node is an ancestor of the other. Great, right? What's extra nice is that this genuinely is a generalization of the original tree algorithm, in the sense that when you run it on a tree, you get exactly the original algorithm.
Thinking about it more, it's sort of obvious that this should work: it's equivalent to running the original tree algorithm on an inlined version of the DAG, the tree you'd get by duplicating every shared node once per path that reaches it.
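Here's a rough sketch of the interval-set version, again with made-up names. The only change to the DFS is that it no longer deduplicates visits, so a shared node accumulates one interval per path that reaches it:

```python
class DagNode:
    def __init__(self, children=()):
        self.children = list(children)
        self.intervals = []          # disjoint (lo, hi) pairs, kept in sorted order

def assign_interval_sets(root):
    counter = [0]
    def dfs(node):
        counter[0] += 1
        lo = counter[0]
        for child in node.children:
            dfs(child)               # no visited set: shared nodes are re-entered
        counter[0] += 1
        node.intervals.append((lo, counter[0]))
    dfs(root)

def is_ancestor(a, b):
    # Some interval of a strictly contains some interval of b iff there is
    # a path from a to b. Both lists are sorted, so a merge-style scan gives
    # O(number of intervals); the nested loop just keeps the sketch short.
    return any(alo < blo and ahi > bhi
               for alo, ahi in a.intervals
               for blo, bhi in b.intervals)
```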
Well... I got all excited about this and began to write it up, and then I realized I had failed to consider how many intervals this might create. In particular, there is one kind of DAG that is an eternal foil for attempts to translate tree algorithms into DAG algorithms. Any kind of lattice-y pattern is typically really bad:
```
  o
 / \
o   o
 \ /
  o
 / \
o   o
 \ /
  o
 / \
 ...
 \ /
  o
```
Sadly, the bottom node in a DAG like this will wind up with an exponential number of intervals, mirroring the exponential number of paths through the DAG (each diamond doubles the number of paths from the top). Oops. How sad.
Do you have a nice algorithm to solve this problem for DAGs? Let me know.
Dynamic Programming with Subsets
Here's a slightly happier story. If you are implementing a dynamic programming algorithm, you often need to iterate over cases in an order such that by the time you visit a given case, you've visited all of the cases that it depends on. Like if the problem you're solving looks like this:
C(x) = C(x-1) + C(x-2)
by the time you visit C(53), you'd better have visited C(52) and C(51).
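To make the ordering requirement concrete, here's the bottom-up version of that recurrence (assuming base cases C(0) = C(1) = 1, which the recurrence alone doesn't specify); the increasing loop order is exactly what guarantees the dependencies exist before they're needed:

```python
def C(n):
    table = [1, 1]                   # assumed base cases C(0) and C(1)
    for x in range(2, n + 1):
        # By the time we reach C(x), C(x-1) and C(x-2) are already computed.
        table.append(table[x - 1] + table[x - 2])
    return table[n]
```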
When using dynamic programming to find an optimal join order, the order needs to respect set containment: in order to find the optimal ordering for a set of tables S, we need to know the optimal ordering for every proper subset of S.
So how do I iterate over the subsets of a set in an order that ensures that when I hit a set, I've already hit all of its proper subsets?
It's typical in this problem to represent sets of tables as bitmaps, because when solving join ordering optimally you won't be able to handle particularly large sets anyway.
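For illustration, with hypothetical tables t0, t1, t2 mapped to bits 0, 1, 2:

```python
S = 0b101                  # the set {t0, t2}
T = 0b001                  # the set {t0}
assert S & (1 << 2)        # membership: t2 is in S
assert T & ~S == 0         # subset test: every table in T is also in S
```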
I convinced myself such an ordering of the sets had to exist in what I think was a natural way: you can visit every 1-element set, then every 2-element set, then every 3-element set, and so on. No two distinct sets of the same size can be proper subsets of each other, so you will always visit them in a safe order.
This isn't a great way to actually implement this. It's kind of clunky! I spent a bunch of time figuring out a way to iterate over all the integers with one bit set, then the ones with two bits set, and so on. It looks something like this:
```python
def visit(n, f):
    # Visit subsets in order of increasing size, so every proper
    # subset of a set is visited before the set itself.
    for size in range(n + 1):
        for x in ints_with_bits_set(n, size):
            f(x)
```
where ints_with_bits_set(n, size) is some annoying bit of code that generates all n-choose-size subsets of size size.
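For the curious, one reasonable way to write it is with Gosper's hack, which computes the next integer with the same number of bits set (the generator framing here is mine, not necessarily how I originally wrote it):

```python
def ints_with_bits_set(n, size):
    # Yield every integer below 2^n with exactly `size` bits set,
    # in increasing order, stepping with Gosper's hack.
    if size == 0:
        yield 0
        return
    x = (1 << size) - 1                  # smallest integer with `size` bits set
    while x < (1 << n):
        yield x
        c = x & -x                       # lowest set bit
        r = x + c                        # carry it into the block above
        x = (((r ^ x) >> 2) // c) | r    # move the displaced ones back to the bottom
```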
Do you want to see the much simpler way to accomplish this?
```python
def visit(n, f):
    # Visiting subsets in increasing integer order also works.
    for x in range(2 ** n):
        f(x)
```
That's right: iterating from 0 up to 2^n - 1 accomplishes this (with a typical unsigned integer representation). I find it weirdly hard to convince myself of this directly, unless I use exactly the argument "if X is a proper subset of Y, then Y has every bit of X set plus at least one more, so Y's integer representation must be strictly larger." You might be able to convince yourself of this more directly somehow.
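If you'd rather verify than squint, here's a brute-force check of that claim for small n:

```python
def check(n):
    for y in range(2 ** n):
        for x in range(2 ** n):
            # x & y == x means x's bits are a subset of y's bits.
            if x != y and (x & y) == x:
                assert x < y, (x, y)

check(4)  # every proper subset is numerically smaller, so it's visited first
```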