Lowering in Reverse

lower in reverse

        November 3, 2025

Lowering in Reverse

            Recall last week, we talked about how to take a SQL query and turn it into a lower level plan, so this week I want to talk about how we can generate SQL queries. Having a high-quality query generator is an important component of testing SQL engines, because a lot of techniques for testing SQL engines can be reduced to generating queries that you then manipulate for testing purposes.
In a system that processes a language, remember we have multiple stages of processing:

Each stage (before optimization) acts as a sort of "shield" that filters out malformed inputs. Random bytes are not likely to make it past the lexing phase. Random sequences of tokens are not likely to make it past the parsing phase. Random walks of the language's grammar are not likely to make it past the lowering phase. If you're trying to test any of those stages in particular, you probably want a good number of both positive and negative examples of what they should reject, but that's supposing that you've constructed a query that is able to make it past all the stages that come before!
So, in particular, if we want to test the optimization phase, we need to make sure that we're constructing queries that successfully lex, parse, and lower. This could be a tall order!
Lexing is pretty easy: just only generate valid tokens. Parsing too, it's very easy to generate strings matching a particular context-free grammar, given some description of that grammar.
Lowering is a little harder, because we need to be concerned with all sorts of semantic things like scope and typechecking.
Last week we talked about lowering in SQL. The reason I spent so much time in that post talking about the dependency order of the various parts of a SQL query is that those things become a bit less obvious, but just as important, when we want to lower in reverse, by constructing a SQL query programatically.
Writing a query generator is structurally almost identical to lowering. To the extent that I'm sure that some Haskell person could explain to me some sense in which they're two interpretations of the same idea.
Even if that's not true, the similarity between the two processes is so striking that I think it's at least, informally, best to think of them as two sides of the same coin.
In a query like this:
SELECT <cols> FROM <table>

Recall that the scope for <cols> is provided by <table>, so, like before, how we had to:

lower <table> so that we know what columns are available, and then
lower <cols> using that information.

From there, the ideas are very simple:
// Scalar generates a scalar expression for the given scope.
func (g *Gen) Scalar(scope *Scope) Scalar {
    if rand.Intn(2) == 0 {
        // Bare column reference.
        return g.Column(scope)
    } else {
        // Equality.
        return &Eq{
            Lhs: g.Column(scope),
            Rhs: g.Column(scope),
        }
    }
}

// Table returns a random table in the schema and adds its columns to the
// scope.
func (g *Gen) Table(scope *Scope) Rel {
    idx := rand.Intn(len(g.schema.Tables))
    for _, col := range g.schema.Tables[idx].Cols {
        scope.Add(col)
    }
    return &Table{g.schema.Tables[idx].Name}
}

// Rel returns a relational expression.
func (g *Gen) Rel(scope *Scope) Rel {
    v := rand.Intn(4)
    if v < 2 {
        return g.Table(scope)
    } else if v < 3 {
        return &Select{
        // This first call will add the columns to the scope for the second
        // call.
            From:  g.Rel(scope),
            Where: g.Scalar(scope),
        }
    } else if v < 4 {
        return &Join{
            Lhs: g.Rel(scope),
            Rhs: g.Rel(scope),
            On:  g.Scalar(scope),
        }
    }
    return nil
}

Running this (you can see the full code here) we can see we get some sort of reasonable looking queries. Obviously we'd have to do a bit more work to get really high quality queries out of this thing (in particular, ones that typecheck, and that disambiguate table names, but these are all easily fixable), but it's a nice starting point:
((SELECT * FROM (abc INNER JOIN (SELECT * FROM (abc INNER JOIN (((SELECT * FROM ((def INNER JOIN abc ON a) INNER JOIN ((def INNER JOIN ((abc INNER JOIN
def ON f) INNER JOIN ((SELECT * FROM (((SELECT * FROM (abc INNER JOIN def ON (d = c)) WHERE (a = d)) INNER JOIN def ON (e = d)) INNER JOIN ((SELECT * FR
OM abc WHERE c) INNER JOIN (SELECT * FROM (SELECT * FROM def WHERE (d = f)) WHERE (b = f)) ON (b = a)) ON b) WHERE (e = f)) INNER JOIN ((abc INNER JOIN
(SELECT * FROM abc WHERE (c = d)) ON e) INNER JOIN abc ON (c = e)) ON (d = a)) ON a) ON (a = c)) INNER JOIN (SELECT * FROM def WHERE (c = d)) ON e) ON e
) WHERE (e = c)) INNER JOIN abc ON e) INNER JOIN (SELECT * FROM abc WHERE (f = d)) ON c) ON e) WHERE (b = e)) ON b) WHERE a) INNER JOIN def ON c)
abc
((SELECT * FROM def WHERE (f = f)) INNER JOIN (abc INNER JOIN def ON (d = b)) ON (d = c))
abc
def
(SELECT * FROM def WHERE e)
(abc INNER JOIN (SELECT * FROM ((SELECT * FROM def WHERE e) INNER JOIN (abc INNER JOIN (SELECT * FROM abc WHERE (d = a)) ON (c = b)) ON (e = f)) WHERE (
f = a)) ON e)
def
abc
(SELECT * FROM def WHERE (e = f))

                            Don't miss what's next. Subscribe to NULL BITMAP by Justin Jaffray:

          Add a comment: