So You Want to Generate SQL Queries (me too)
We have talked before about how to appropriately test query planners.
I wrote there:
I love metamorphic testing for SQL databases because it in large part reduces the problem of testing a database to building a high-quality query generator.
I want to talk a little bit today about building query generators.
The main way I think about this is via those diagrams you see in an introductory compilers course. You know the ones:
The context of this diagram is that each of these is phases that an input goes through as part of processing it. They're checks, or filters that you run over input programs to make sure they conform to some kind of specification, probably along with some additional annotation.
In the context of writing a query generator, these are obstacles, depending on what part of the program you're trying to test.
If you're just generating random bytes, then lexical analysis will catch almost all of your queries and you won't ever touch semantic analysis. This is great if you're trying to test lexical analysis, and not so good if you want to test semantic analysis.
A common technique I've seen to generate test queries is to walk a grammar file. Grammar walkers tend to be a lot of bang for buck—since they're automatically kept in sync with the grammar, changes to the grammar are automatically reflected in changes to the query generator (maybe with some small extra work). The problem is that this usually just makes queries that are guaranteed to parse, and not queries that are likely to do things like:
- typecheck,
- reference valid tables in the schema,
- reference valid columns in scope,
- call real functions,
- actually return any rows,
- etc.
If your goal is to test the "optimization" phase (which is where the bulk of testing tends to need to go, since that's where you tend to be doing the most nontrivial transformations), then you need to be generating queries that are going to make it past both lexical analysis and semantic analysis.
This is where tools like sqlsmith come in, which attempt to actually generate queries that are valid. Whenever I've built a query generator in the past, they've been based on the SQLSmith model, which in some sense walk a randomly generated tree.
Query Simplification
I've always written my query generators from scratch, and for the most part, been happy to do so. For something like this I think it's somewhat important to have full control over the stack and not ship out the mechanics of it to a dependency. The main place I've felt like I really needed some more mature tooling was in simplifying the resulting queries. Oftentimes you can figure out what part of a query is breaking, but it's often nice to be able to get a minimal, or close to minimal breaking case. The most recent version of this I implemented was based on minithesis, but if I tried doing this whole process again I might just use Hypothesis proper in the hopes that it would solve some of the trickier cases I ran into.
Structuring Generators
Generally, my observation has been that generators should resemble the deepest layer of the planner that they are trying to bypass. Things that are trying to parse should look like parsers, things that are trying to be lexically valid should look like lexers. And things that are trying to pass semantic analysis (which from what I've seen, is often combined with the "lowering" phase of going from AST to IR) should look like semantic analyzers: tracking column scopes, schemas, etc.
fn plan_select_operation(scope, sel: SelectNode) -> RelExpr {
...
}
becomes
fn generate_select_operation(scope) -> SelectNode {
...
}
Can we do better?
The elegance of a grammar-walker has always been really cool to me. The fact that we get to just, for free, generate exactly the set of strings that will make it past parsing seems like a cheat-code for testing the parts that come after that (though, of course, you'd use different techniques to test parsing itself).
I've spent some time thinking about how you could extend this model to the rest of semantic analysis, and things like typechecking. Is there a way we could define the semantic analysis phase such that we could run it in reverse and get it to spit out valid SQL queries? This sounds an awful lot like some kind of Kanren-style logic programming. I've come up dry, though, and haven't found a nice abstraction for it. This newsletter is my plea that if you are into logic programming and testing and databases and you think you have a way to make this abstraction work, let me know, because it's something I desperately want!