SQL's Grammar Ambiguity

scalar expressions

                September 2, 2024

            SQL's Grammar Ambiguity

            I was going to write a longer, more involved post this week, but I wound up getting Covid and didn't have it in me to work through all the stuff I had to for that. So here is a smaller issue.
I know of exactly one situation in which the grammar of SQL is "tricky" to parse. I use the word tricky because I've been out of the game too long to remember if this is exactly what "not LL(1)" means or not so I'm going to use the weasel phrasing "it's tricky."
To be more specific about what I mean, when you are hand-writing a parser, you typically need to know what it is you're parsing at any given time. SQL has both scalar expressions, like a + b * 3, representing the values of individual columns within a row, and relational expressions, like select * from t where t.x = 3, which represent an entire relation.
You'll probably have in your parser something like
impl Parser {
  fn parse_scalar_expr(&mut self) { ... }
  fn parse_rel_expr(&mut self) { ... }
}

(SQL's grammar is of course slightly more complicated than this but this is sufficient to understand the point I'm going to make.)
Most of the time, we can in fact determine what it is we're parsing ahead of time. The top level of a SQL query is always a relational expression. If so far, we've parsed (SELECT * FROM t) UNION ALL, the next thing we parse must be a relational expression. If we parsed SELECT 1 +, then the next thing must be a scalar expression.
This is really good for parsing, because it means you can just keep track of what it is you're hoping to parse next and then try to parse that thing, and then if it succeeds you just keep truckin', and if it fails then you report an error.
There is, I think, one situation in parsing SQL where we don't know what we're parsing ahead of time.
Consider the following two SQL queries (stolen from Postgres):
SELECT (((SELECT 2)) + 3)
SELECT (((SELECT 2)) UNION SELECT 2)
       ^^^
      / | \
     a  b  c

For the first query:

opening parenthesis a corresponds to a scalar expression,
opening parenthesis b corresponds to a scalar expression, and
opening parenthesis c corresponds to a scalar expression.

For the second query:

opening parenthesis a corresponds to a scalar expression,
opening parenthesis b corresponds to a relational expression, and
opening parenthesis c corresponds to a relational expression.

So by the time we've finished parsing ((SELECT 2)), we don't actually know what kind of thing it is we've parsed: we don't know if this is a subquery being used in a scalar context, or if it's a subselect being used in a relational context.
The traditional way you'd solve this in a parser would be to use backtracking: set a checkpoint at the beginning of the point of ambiguity and try to parse it one way, and then jump backwards to that if parsing fails so you can try again the other way.
Backtracking is not a perfect solution, though. It requires to you be able to stash and resume the state of your parser, which can sometimes be onerous if you're trying to optimize it, and it means you might have to parse a complicated expression multiple times in the parsing of a query, which is also not desirable. And I think just morally it's nicer if your parser is always marching forwards.
The solution used in Postgres is to introduce a new kind of expression select_with_parens which denotes a SELECT that may or may not actually correspond to a relational or scalar expression, to delay making the decision until more details are known.
Personally in my personal SQL parser I used backtracking, but I think either approach is fine.

Don't miss what's next. Subscribe to NULL BITMAP by Justin Jaffray:

Start the conversation: