The Little Planner Chapter 4: A Pushdown Party

attribute set

                December 4, 2023

            The Little Planner Chapter 4: A Pushdown Party

            Hi everybody, apologies, I didn't have time to put together a newsletter this week. In lieu of that, I found this chapter of a book called "The Little Planner" in a trash can in Berkeley, California and have done my best to get the dirt off of it. I'm not sure where it came from, but I hope it's an adequate substitute.

What's an Expr?
It is an expression over scalar values, like numbers.
What's the definition of Expr we've been using so far?
It's this:
enum Expr {
    ColRef { id: usize },
    Int { val: i64 },
    Eq { left: Box<Expr>, right: Box<Expr> },
}

What's a "free variable?"
It is a column in an expression that is not bound.
You're kicking the can down the road: what is a "bound variable," then?
It is a column that is given its values within an expression and is not a parameter that comes from some other context.
Can you give me an example?
In the JavaScript expression,
function (a) {
    return a + b;
}

a is bound, because it doesn't refer to anything outside of the scope of the expression itself, but b is free.
If expr is an Expr, what is expr.free()?
It is the set of free variables in expr.
Implement Expr::free.
It's 
fn free(&self) -> HashSet<usize> {
    match self {
        Expr::ColRef { id } => {
            let mut set = HashSet::new();
            set.insert(*id);
            set
        }
        Expr::Int { .. } => HashSet::new(),
        Expr::Eq { left, right } => {
            let mut set = left.free();
            set.extend(right.free());
            set
        }
    }
}

What's a RelExpr?
It's an expression over relational values, like scans, filters, and joins.
What's the definition of RelExpr we've been using so far?
enum RelExpr {
    Scan {
        table_name: String,
        column_names: Vec<usize>,
    },
    Select {
        src: Box<RelExpr>,
        predicates: Vec<Expr>,
    },
    Join {
        left: Box<RelExpr>,
        right: Box<RelExpr>,
        predicates: Vec<Expr>,
    },
    Project {
        src: Box<RelExpr>,
        cols: HashSet<usize>,
    },
}

impl RelExpr {
    fn scan(table_name: String, column_names: Vec<usize>) -> Self {
        RelExpr::Scan {
            table_name,
            column_names,
        }
    }

    fn select(self, predicates: Vec<Expr>) -> Self {
        RelExpr::Select {
            src: Box::new(self),
            predicates,
        }
    }

    fn join(self, other: Self, predicates: Vec<Expr>) -> Self {
        RelExpr::Join {
            left: Box::new(self),
            right: Box::new(other),
            predicates,
        }
    }

    fn project(self, cols: HashSet<usize>) -> Self {
        RelExpr::Project {
            src: Box::new(self),
            cols,
        }
    }
}

What is the attribute set of a relational expression?
It is the set of columns that each row in the evaluated expression has.
If rel is a RelExpr, what's rel.att()?
It is the attribute set of the RelExpr.
Show me its implementation.
fn att(&self) -> HashSet<usize> {
    match self {
        RelExpr::Scan { column_names, .. } => column_names.iter().cloned().collect(),
        RelExpr::Select { src, .. } => src.att(),
        RelExpr::Join { left, right, .. } => {
            let mut set = left.att();
            set.extend(right.att());
            set
        }
        RelExpr::Project { cols, .. } => cols.clone(),
    }
}

Are "free variables" and "attribute sets" similar?
They're both sets of columns that we derive from an expression.
How are they different?
Well, "free variables" represent the sets of columns we need, and "attribute sets" represent the sets of columns we have.
Another difference is that "free variables" come from scalar expressions, and "attribute sets" come from relational expressions.
Could relational expressions have free variables?
Yes, if that relational expression itself had scalar expressions that referenced columns not produced by the relational expression.
Could scalar expressions have attribute sets?
No, since an attribute set is about the set of columns that are produced specifically by a relational expression.
What does it mean for an Expr to be "bound by" a RelExpr?
An Expr is bound by a RelExpr if all of its free variables are part of the RelExpr's attribute set.
True or false, @0 is bound by scan("a", [0]).
True, since scan("a", [0]) produces the column 0, it gives us sufficient context to evaluate @0.
True or false, @0 is bound by join([], scan("a", [0]), scan("x", [1])).
True, since a join's attribute set is the union of its inputs' attribute sets, we have the context necessary to evaluate @0.
True or false, @0 is bound by project([1], scan("a", [0, 1])).
False, since the project will throw away the column named 0, we won't have access to it and thus cannot evaluate the expression @0.
Why is that important?
When we attempt to evaluate the scalar expression, we do so in the context of a row. We can only evaluate a scalar expression if the row we have has values for all of the columns we reference.
What's a join?
It's the cross product of two relations, filtered according to some predicates.
Can you show me a join?
let left = RelExpr::scan("a".into(), vec![0, 1]);
let right = RelExpr::scan("x".into(), vec![2, 3]);

let join = left.join(
    right,
    vec![
        Expr::col_ref(0).eq(Expr::col_ref(2)),
        Expr::col_ref(1).eq(Expr::int(100)),
    ],
);

This gives a query plan that looks something like this:
-> join(@0=@2 && @1=100)
  -> scan("a", [0, 1])
  -> scan("x", [2, 3])

Explain what this join does.
It looks at every pair of rows from a and x,
and emits the ones that satisfy both a.0=x.2 and a.1=100.
How do those predicates differ?
The first one, a.0=x.2, requires both inputs to the join to be bound, but the second one, a.1=100 is bound by only a.
What does that mean?
We don't need the values from x to evaluate if a row satisfies a.1=100.
Does that suggest anything we could do to this query?
We could execute that predicate before doing the join. We could write it like this:
let join = left
    .select(vec![Expr::col_ref(1).eq(Expr::int(100))])
    .join(right, vec![Expr::col_ref(0).eq(Expr::col_ref(2))]);

For a query plan that looks like this:
-> join(@0=@2)
  -> select(@1=100)
    -> scan("a", [0, 1])
  -> scan("x", [2, 3])

Can we automate this process?
Probably.
How might we do that?
I guess we'd have to check if each predicate is bound by either side of the join alone.
Write bound_by.
impl Expr {
    fn bound_by(&self, rel: &RelExpr) -> bool {
        self.free().is_subset(&rel.att())
    }
}

How can we use this?
We can intercept calls to join:
fn join(self, other: Self, mut predicates: Vec<Expr>) -> Self {
    for i in 0..predicates.len() {
        if predicates[i].bound_by(&self) {
            // We can push this predicate down.
            let predicate = predicates.swap_remove(i);
            return self.select(vec![predicate]).join(other, predicates);
        }

        if predicates[i].bound_by(&other) {
            // We can push this predicate down.
            let predicate = predicates.swap_remove(i);
            return self.join(other.select(vec![predicate]), predicates);
        }
    }

    RelExpr::Join {
        left: Box::new(self),
        right: Box::new(other),
        predicates,
    }
}

Now show me a query.
let left = RelExpr::scan("a".into(), vec![0, 1]);
let right = RelExpr::scan("x".into(), vec![2, 3]);

let join = left.join(
    right,
    vec![
        Expr::col_ref(0).eq(Expr::col_ref(2)),
        Expr::col_ref(1).eq(Expr::int(100)),
        Expr::col_ref(3).eq(Expr::int(100)),
    ],
);

Running this outputs this:
-> join(@0=@2)
  -> select(@1=100)
    -> scan("a", [0, 1])
  -> select(@3=100)
    -> scan("x", [2, 3])

What happens with the following query?
let join = left.join(
    right,
    vec![
        Expr::col_ref(0).eq(Expr::int(100)),
        Expr::col_ref(1).eq(Expr::int(200)),
    ],
);

Let me see.
-> join()
  -> select(@1=200)
    -> select(@0=100)
      -> scan("a", [0, 1])
  -> scan("x", [2, 3])

Oh. Kind of ugly. We need to collapse those Selects. We can intercept calls to select too to flatten them.
fn select(self, mut predicates: Vec<Expr>) -> Self {
    if let RelExpr::Select {
        src,
        predicates: mut preds,
    } = self
    {
        preds.append(&mut predicates);
        return src.select(preds);
    }

    RelExpr::Select {
        src: Box::new(self),
        predicates,
    }
}

-> join()
  -> select(@0=100 && @1=200)
    -> scan("a", [0, 1])
  -> scan("x", [2, 3])

What else should we do to fully simplify these expressions?
We'd need to do a couple more simplifications to get these trees fully simplified:

Selects above Joins should merge their predicate sets,
Select should get pushed through Project where possible,
Project should get pushed through Join where possible.

Do that.

Apologies to Friedman and Felleisen.

Don't miss what's next. Subscribe to NULL BITMAP by Justin Jaffray:

Start the conversation: