Avoiding Cross Products with the Query Graph
To compute the join of two relations, we find all pairs of rows, one from each relation, which have the same values for all columns with the same name. This is sometimes called the natural join in the software world, and in the join processing world is often just called the join.
fn main() {
    let r = Relation::new(["a", "b"])
        .row([1, 2])
        .row([3, 4])
        .row([5, 6]);
    let s = Relation::new(["b", "c"])
        .row([2, 10])
        .row([4, 20])
        .row([6, 30]);
    println!("R:");
    r.print();
    println!("S:");
    s.print();
    println!("R join S:");
    r.join(&s).print();
}
R:
+---+---+
| a | b |
+---+---+
| 1 | 2 |
+---+---+
| 3 | 4 |
+---+---+
| 5 | 6 |
+---+---+
S:
+---+----+
| b | c |
+---+----+
| 2 | 10 |
+---+----+
| 4 | 20 |
+---+----+
| 6 | 30 |
+---+----+
R join S:
+---+---+----+
| a | b | c |
+---+---+----+
| 1 | 2 | 10 |
+---+---+----+
| 3 | 4 | 20 |
+---+---+----+
| 5 | 6 | 30 |
+---+---+----+
(The complete code for this post is available here)
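To make the semantics concrete, here is a minimal nested-loop sketch of a natural join over a deliberately simplified relation type. This is not the post's actual Relation implementation (see the link above); it just illustrates the idea of matching rows on the shared columns and then concatenating the remaining ones:
// A deliberately simplified relation: named columns plus rows of i64 values.
struct SimpleRelation {
    col_names: Vec<String>,
    rows: Vec<Vec<i64>>,
}

impl SimpleRelation {
    fn join(&self, other: &SimpleRelation) -> SimpleRelation {
        // Positions of the columns the two relations share: (index in self, index in other).
        let shared: Vec<(usize, usize)> = self
            .col_names
            .iter()
            .enumerate()
            .filter_map(|(i, name)| {
                other.col_names.iter().position(|n| n == name).map(|j| (i, j))
            })
            .collect();
        // Columns of `other` that aren't shared get appended to the output schema.
        let extra: Vec<usize> = (0..other.col_names.len())
            .filter(|j| !shared.iter().any(|&(_, sj)| sj == *j))
            .collect();
        let mut col_names = self.col_names.clone();
        col_names.extend(extra.iter().map(|&j| other.col_names[j].clone()));
        // Nested loop: keep a pair of rows whenever they agree on every shared column.
        let mut rows = Vec::new();
        for left in &self.rows {
            for right in &other.rows {
                if shared.iter().all(|&(i, j)| left[i] == right[j]) {
                    let mut row = left.clone();
                    row.extend(extra.iter().map(|&j| right[j]));
                    rows.push(row);
                }
            }
        }
        SimpleRelation { col_names, rows }
    }
}
A real implementation would at least hash one side on the shared columns rather than comparing every pair of rows, but the result is the same.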
We can join in a third relation; the definition extends naturally to joining any number of relations:
fn main() {
    let r = Relation::new(["a", "b"])
        .row([1, 2])
        .row([3, 4])
        .row([5, 6]);
    let s = Relation::new(["b", "c"])
        .row([2, 10])
        .row([4, 20])
        .row([6, 30]);
    let t = Relation::new(["c", "d"])
        .row([10, 100])
        .row([20, 200])
        .row([30, 300]);
    r.join(&s).join(&t).print();
}
+---+---+----+-----+
| a | b | c | d |
+---+---+----+-----+
| 1 | 2 | 10 | 100 |
+---+---+----+-----+
| 3 | 4 | 20 | 200 |
+---+---+----+-----+
| 5 | 6 | 30 | 300 |
+---+---+----+-----+
Since joins are commutative (a join b = b join a) and associative (a join (b join c) = (a join b) join c), we can rearrange the order and it doesn't change the result:
…
r.join(&t).join(&s).print();
…
Still gives:
+---+---+----+-----+
| a | b | c | d |
+---+---+----+-----+
| 1 | 2 | 10 | 100 |
+---+---+----+-----+
| 3 | 4 | 20 | 200 |
+---+---+----+-----+
| 5 | 6 | 30 | 300 |
+---+---+----+-----+
However, although the two orderings are logically equivalent, if we make the relations bigger:
let r2 = Relation::new(["a", "b"])
    .rows((0..1000).map(|i| vec![i, i * 10]));
let s2 = Relation::new(["b", "c"])
    .rows((0..1000).map(|i| vec![i * 10, i * 100]));
let t2 = Relation::new(["c", "d"])
    .rows((0..1000).map(|i| vec![i * 100, i * 1000]));
something catastrophic can happen. This runs very quickly:
r2.join(&s2).join(&t2).print();
While this one chugs for a couple of seconds:
r2.join(&t2).join(&s2).print();
Since r and t have no columns in common, their join is their entire cross product, of size |r| × |t| (which in our case is a million rows), so it's not until we join in s that the intermediate result gets cut back down to size. Joining two relations without any common columns is called a cross product.
While finding the actual best sequence in which to join relations is NP-hard, one easy heuristic is to simply make sure we avoid performing cross products wherever possible.
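If you want to see the difference for yourself, one quick (if unscientific) way is to time the two orders. Here's a sketch using std::time::Instant and the r2, s2, t2 relations from above; the exact numbers will depend on your machine:
use std::time::Instant;

// The order that never produces a cross product.
let start = Instant::now();
let _ = r2.join(&s2).join(&t2);
println!("r2 join s2 join t2: {:?}", start.elapsed());

// The order that materializes the million-row cross product of r2 and t2 first.
let start = Instant::now();
let _ = r2.join(&t2).join(&s2);
println!("r2 join t2 join s2: {:?}", start.elapsed());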
One way to think about what is happening here, and how to resolve it, is via the query graph. The query graph of a join query has a vertex for each relation being joined and an edge between any two relations which share columns. For our example, R and S share the column b and S and T share the column c, so the query graph is a path: R connected to S, S connected to T, and no edge between R and T.
We had a big intermediate result because the order R, T, S, which translates to the join tree (R join T) join S, has a prefix (R, T) which is not connected.
When we join two relations, the corresponding operation on the query graph is to identify those two vertices into a single new vertex. We can fully evaluate a join query by repeatedly identifying pairs of vertices until there is just one left, which holds the final value of the join. If we only ever do this to vertices that have an edge between them, we'll never compute a cross product: in our example, joining R and S merges them into a vertex that is still connected to T via c, so the remaining join is not a cross product either. As we saw, if we always compute the join from left to right in the order it was entered, we might inadvertently compute a cross product.
In general, user-facing databases need to defend against this sort of thing. Lots of users will input tables to be joined without thinking about how they will be computed, and lots of tools will generate SQL without broader context of how that SQL will be embedded into some larger query. This is a big reason why an optimizer is an important part of a database (and not some external tool): we want optimization to occur at the point in time where we have the most context about the query we're executing.
To have a holistic view and not just eagerly evaluate our joins, we can introduce a planning layer that lets us defer actually performing any computation until we have a chance to make some changes.
#[derive(Default)]
struct Planner {
    joined_tables: Vec<Relation>,
    query_graph: Graph,
}
And we need a data structure to represent the query graph:
#[derive(Default, Debug)]
struct Graph {
    // Adjacency lists: edges[v] holds the neighbours of vertex v.
    edges: Vec<Vec<usize>>,
}

impl Graph {
    fn edge(&mut self, a: usize, b: usize) {
        while self.edges.len() <= a.max(b) {
            self.edges.push(Vec::new());
        }
        self.edges[a].push(b);
        self.edges[b].push(a);
    }

    fn neighbours(&self, vertex: usize) -> Vec<usize> {
        self.edges.get(vertex).cloned().unwrap_or_default()
    }
}
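As a quick illustration, here's the query graph for the R, S, T example built with this structure, using indices 0, 1, and 2 for R, S, and T:
let mut g = Graph::default();
g.edge(0, 1); // R and S share column b
g.edge(1, 2); // S and T share column c

assert_eq!(g.neighbours(0), vec![1]);    // R touches only S
assert_eq!(g.neighbours(1), vec![0, 2]); // S touches both R and T
assert_eq!(g.neighbours(2), vec![1]);    // T touches only S: R and T are not adjacent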
Then, instead of joining relations as soon as we see them, we can just store them and update our query graph:
impl Planner {
    fn join(mut self, rel: Relation) -> Self {
        for (i, t) in self.joined_tables.iter().enumerate() {
            // If there is a common column between the current relation and the
            // new relation, add an edge between them.
            let t_cols: HashSet<_> = t.col_names.iter().cloned().collect();
            if rel.col_names.iter().any(|c| t_cols.contains(c)) {
                self.query_graph.edge(self.joined_tables.len(), i);
            }
        }
        self.joined_tables.push(rel);
        self
    }
}
Then, to build up a plan, we can do a graph traversal where we pick an unjoined vertex and then expand outwards, adding connected vertices:
impl Planner {
    fn plan(mut self) -> Vec<Relation> {
        let mut plan = vec![];
        let mut remaining: HashSet<_> = (0..self.joined_tables.len()).collect();
        // Grab an unjoined relation.
        while let Some(next) = remaining.iter().next() {
            // Expand outwards adding relations that are connected to the
            // current result.
            let mut frontier = vec![*next];
            while let Some(relation) = frontier.pop() {
                remaining.remove(&relation);
                plan.push(relation);
                frontier.extend(
                    self.query_graph
                        .neighbours(relation)
                        .into_iter()
                        .filter(|n| remaining.contains(n)),
                );
            }
        }
        plan.into_iter()
            .map(|i| std::mem::take(&mut self.joined_tables[i]))
            .collect()
    }
}
Then, we can use this to generate a plan that doesn't use cross products:
let planner = Planner::default().join(r).join(t).join(s);
let plan = planner.plan();
let result = plan
    .into_iter()
    .reduce(|result, next| result.join(&next))
    .unwrap();
This gives us the same result as before, but guarantees no cross products. Of course, the resulting plan might not be optimal: there are lots of possible ways to come up with a join plan that avoids cross products and is still slower than necessary. Since join planning is NP-hard, there isn't going to be a silver-bullet solution to doing this sort of thing effectively (and even if we could compute such a plan easily, we'd need good estimates to actually guarantee a practically good result).
But when it comes to discrete optimization problems like join planning, having a way to generate "sort of good" solutions is a good starting point upon which you can then layer iterative improvements, like looking at small, local parts of your larger solution and doing some kind of brute-force optimization on them.
This is a very common strategy in the world of the Traveling Salesman Problem: you use a heuristic to find an "okay" tour and then apply local "moves" like 2-opt to improve it, without ever having to look at the whole tour holistically.
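To sketch what that might look like back in join-planning land (none of this is from the post's code: estimated_cost is a hypothetical stand-in for whatever cost model you have, say row-count estimates, and the "move" is just swapping adjacent relations in the plan):
// A sketch of a local-improvement pass over a join order. `estimated_cost`
// is hypothetical: it stands in for whatever cost model you have. A real
// pass would also want to preserve the no-cross-products property.
fn improve_plan(
    mut plan: Vec<Relation>,
    estimated_cost: impl Fn(&[Relation]) -> u64,
) -> Vec<Relation> {
    let mut improved = true;
    while improved {
        improved = false;
        for i in 0..plan.len().saturating_sub(1) {
            let before = estimated_cost(&plan[..]);
            plan.swap(i, i + 1);
            if estimated_cost(&plan[..]) < before {
                improved = true; // keep the swap, it made the plan cheaper
            } else {
                plan.swap(i, i + 1); // undo it
            }
        }
    }
    plan
}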