A Sniff Test for Some Query Optimizers
One important part of query planning is performing transformations over queries. Today I want to see how a couple common databases perform on a completely made-up and unrepresentative benchmark.
I have a lot of caveats for this one. I need to emphasize before we start that the following test is not remotely indicative of overall optimizer quality, and it's not necessarily even a deficiency if an optimizer doesn't pass it, there's a lot of variables that go into deciding which transformations to do. I don't have a deep understanding of how all of these optimizers work, so it's possible they do the transformations I've listed in different situations, or in a cost-based way, or for their use-cases, they're not that useful. I'm also biased because I worked on CockroachDB and am only aware of several transformations because of that.
I have, out of my ass, plucked three SQL Query optimizations. Today we're going to survey a list (which I have also plucked out of my ass) of databases and see which of them do them.
The transformations I have chosen are:
- An
EXISTS
to semijoin decorrelation, DISTINCT
elimination based on primary key, and- a common subexpression elimination.
EXISTS
to semijoin decorrelation
Query
CREATE TABLE ab (a INT PRIMARY KEY, b INT);
CREATE TABLE xy (x INT PRIMARY KEY, y INT);
SELECT * FROM ab WHERE EXISTS (SELECT * FROM xy WHERE a = x);
Explanation
As written, this query suggests evaluating the subquery SELECT * FROM xy WHERE a = x
for each row. This can be transformed into a semijoin, or even a regular inner join to exploit the database's built-in join functionality to do it faster.
DuckDB
Verdict: ✅
EXPLAIN SELECT * FROM ab WHERE EXISTS (SELECT * FROM xy WHERE a = x);
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ SEMI ├──────────────┐
│ a IS NOT DISTINCT FROM a │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ab ││ xy │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ a ││ x │
│ b ││ │
└───────────────────────────┘└───────────────────────────┘
CockroachDB
Verdict: ✅
> EXPLAIN SELECT * FROM ab WHERE EXISTS (SELECT * FROM xy WHERE a = x);
info
---------------------------
distribution: local
vectorized: true
• merge join (semi)
│ equality: (a) = (x)
│ left cols are key
│ right cols are key
│
├── • scan
│ missing stats
│ table: ab@ab_pkey
│ spans: FULL SCAN
│
└── • scan
missing stats
table: xy@xy_pkey
spans: FULL SCAN
(17 rows)
Postgres
Verdict: ✅
EXPLAIN SELECT * FROM ab WHERE EXISTS (SELECT * FROM xy WHERE a = x);
QUERY PLAN
------------------------------------------------------------------
Hash Join (cost=60.85..99.39 rows=2260 width=8)
Hash Cond: (ab.a = xy.x)
-> Seq Scan on ab (cost=0.00..32.60 rows=2260 width=8)
-> Hash (cost=32.60..32.60 rows=2260 width=4)
-> Seq Scan on xy (cost=0.00..32.60 rows=2260 width=4)
(5 rows)
SQLite
Verdict: ❌
> EXPLAIN QUERY PLAN SELECT * FROM ab WHERE EXISTS (SELECT * FROM xy WHERE a = x);
QUERY PLAN
|--SCAN ab
`--CORRELATED SCALAR SUBQUERY 1
`--SEARCH xy USING INDEX sqlite_autoindex_xy_1 (x=?)
DISTINCT
elimination
Query
CREATE TABLE ab (a INT PRIMARY KEY, b INT);
SELECT DISTINCT * FROM ab;
Explanation
Since a
is a primary key of the table ab
, a
must be non-null and distinct for every pair of rows. This means that every row in the table is distinct from every other row, and the query can be simplified to:
SELECT * FROM ab;
DuckDB
Verdict: ❌
D EXPLAIN SELECT DISTINCT * FROM ab;
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ #1 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ab.a │
│ ab.b │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ab │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ a │
│ b │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 0 │
└───────────────────────────┘
CockroachDB
Verdict: ✅
> EXPLAIN SELECT DISTINCT * FROM ab;
info
-----------------------
distribution: local
vectorized: true
• scan
missing stats
table: ab@ab_pkey
spans: FULL SCAN
(7 rows)
Postgres
Verdict: ❌
=# EXPLAIN SELECT DISTINCT * FROM ab;
QUERY PLAN
------------------------------------------------------------
HashAggregate (cost=43.90..66.50 rows=2260 width=8)
Group Key: a, b
-> Seq Scan on ab (cost=0.00..32.60 rows=2260 width=8)
(3 rows)
SQLite
Verdict: ❌
> EXPLAIN QUERY PLAN SELECT DISTINCT * FROM ab;
QUERY PLAN
|--SCAN ab USING INDEX sqlite_autoindex_ab_1
`--USE TEMP B-TREE FOR DISTINCT
Common Subexpression Elimination
Query
CREATE TABLE ab (a INT PRIMARY KEY, b INT);
SELECT
(SELECT max(a) FROM ab),
(SELECT max(a) FROM ab)+1,
(SELECT max(b) FROM ab),
(SELECT max(b) from ab)+1;
Explanation
This one is a little contrived. I'm not even sure what the right way to do it is. I kind of just want to see what happens. I'll say a database passes if it doesn't scan the tables (even partially) four times.
DuckDB
Verdict: ❌
Oh god, this one is getting me sent to your spam filter for sure.
┌─────────────────────────────┐
│┌───────────────────────────┐│
││ Physical Plan ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ SUBQUERY │
│ (SUBQUERY + 1) │
│ SUBQUERY │
│ (SUBQUERY + 1) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ CROSS_PRODUCT ├────────────────────────────────────────────────────────────────────────┐
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐ ┌─────────────┴─────────────┐
│ CROSS_PRODUCT │ │ UNGROUPED_AGGREGATE │
│ ├───────────────────────────────────────────┐ │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ │ │ │ first(#0) │
└─────────────┬─────────────┘ │ └─────────────┬─────────────┘
┌─────────────┴─────────────┐ ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ CROSS_PRODUCT │ │ UNGROUPED_AGGREGATE ││ PROJECTION │
│ ├──────────────┐ │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ │ │ │ first(#0) ││ #0 │
└─────────────┬─────────────┘ │ └─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE ││ UNGROUPED_AGGREGATE ││ PROJECTION ││ STREAMING_LIMIT │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ │
│ first(#0) ││ first(#0) ││ #0 ││ │
└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PROJECTION ││ PROJECTION ││ STREAMING_LIMIT ││ UNGROUPED_AGGREGATE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 ││ #0 ││ ││ max(#0) │
└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ STREAMING_LIMIT ││ STREAMING_LIMIT ││ UNGROUPED_AGGREGATE ││ PROJECTION │
│ ││ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ││ ││ max(#0) ││ a │
└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE ││ UNGROUPED_AGGREGATE ││ PROJECTION ││ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ max(#0) ││ max(#0) ││ a ││ ab │
│ ││ ││ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ││ ││ ││ a │
│ ││ ││ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ││ ││ ││ EC: 0 │
└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ PROJECTION ││ PROJECTION ││ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ b ││ b ││ ab │
│ ││ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ││ ││ a │
│ ││ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ││ ││ EC: 0 │
└─────────────┬─────────────┘└─────────────┬─────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ab ││ ab │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ b ││ b │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 0 ││ EC: 0 │
└───────────────────────────┘└───────────────────────────┘
(Aside, I love this plan format.)
CockroachDB
Verdict: ❌
> EXPLAIN SELECT (SELECT max(a) FROM ab), (SELECT max(a) FROM ab)+1, (SELECT max(b) FROM ab), (SELECT max(b) from ab)+1;
info
-----------------------------------------------
distribution: local
vectorized: true
• root
│
├── • values
│ size: 4 columns, 1 row
│
├── • subquery
│ │ id: @S1
│ │ original sql: (SELECT max(a) FROM ab)
│ │ exec mode: one row
│ │
│ └── • group (scalar)
│ │
│ └── • revscan
│ missing stats
│ table: ab@ab_pkey
│ spans: LIMITED SCAN
│ limit: 1
│
├── • subquery
│ │ id: @S2
│ │ original sql: (SELECT max(a) FROM ab)
│ │ exec mode: one row
│ │
│ └── • group (scalar)
│ │
│ └── • revscan
│ missing stats
│ table: ab@ab_pkey
│ spans: LIMITED SCAN
│ limit: 1
│
├── • subquery
│ │ id: @S3
│ │ original sql: (SELECT max(b) FROM ab)
│ │ exec mode: one row
│ │
│ └── • group (scalar)
│ │
│ └── • scan
│ missing stats
│ table: ab@ab_pkey
│ spans: FULL SCAN
│
└── • subquery
│ id: @S4
│ original sql: (SELECT max(b) FROM ab)
│ exec mode: one row
│
└── • group (scalar)
│
└── • scan
missing stats
table: ab@ab_pkey
spans: FULL SCAN
(57 rows)
Postgres
Verdict: ❌
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Result (cost=76.92..76.94 rows=1 width=16)
InitPlan 2 (returns $1)
-> Result (cost=0.19..0.20 rows=1 width=4)
InitPlan 1 (returns $0)
-> Limit (cost=0.15..0.19 rows=1 width=4)
-> Index Only Scan Backward using ab_pkey on ab (cost=0.15..83.51 rows=2249 width=4)
Index Cond: (a IS NOT NULL)
InitPlan 4 (returns $3)
-> Result (cost=0.19..0.20 rows=1 width=4)
InitPlan 3 (returns $2)
-> Limit (cost=0.15..0.19 rows=1 width=4)
-> Index Only Scan Backward using ab_pkey on ab ab_1 (cost=0.15..83.51 rows=2249 width=4)
Index Cond: (a IS NOT NULL)
InitPlan 5 (returns $4)
-> Aggregate (cost=38.25..38.26 rows=1 width=4)
-> Seq Scan on ab ab_2 (cost=0.00..32.60 rows=2260 width=4)
InitPlan 6 (returns $5)
-> Aggregate (cost=38.25..38.26 rows=1 width=4)
-> Seq Scan on ab ab_3 (cost=0.00..32.60 rows=2260 width=4)
(19 rows)
SQLite
Verdict: ❌
QUERY PLAN
|--SCAN CONSTANT ROW
|--SCALAR SUBQUERY 1
| `--SEARCH ab USING COVERING INDEX sqlite_autoindex_ab_1
|--SCALAR SUBQUERY 2
| `--SEARCH ab USING COVERING INDEX sqlite_autoindex_ab_1
|--SCALAR SUBQUERY 3
| `--SEARCH ab
`--SCALAR SUBQUERY 4
`--SEARCH ab
Conclusion
I dunno. I wouldn't advise drawing any actual conclusions from the outcome of this, besides it being sort of fun to think about. If you know any of these systems intimately and think I misrepresented them, let me know. Or if you have other fun tests like this, also let me know.