Queryability and the Sublime Mediocrity of SQL
I'm going to a small academic workshop on design next week. This got me thinking of the various meanings of "design", which got me thinking about the various purposes of "design", which got me thinking about queryability.
A system is queryable if (and I'm speaking very roughly) you can easily retrieve information you need from the system data. As an example, let's say we record scientists and the papers they worked on. You could represent the data like this:
Scientist {
name: str
department: Department
papers: set of Paper
}
Paper {
name: str
published: date
}
In this model, it's easy to query the papers a scientist worked on, it's just one lookup. On the other hand, it's really hard to get all of the authors of a given paper, we'd have to iterate through all the Scientists to find out. Alternatively, we could do this:
Scientist {
name: str
department: Department
}
Paper {
name: str
published: date
authors: set of Scientist
}
Then it's reversed: really easy to find the authors of a paper, hard to find which papers someone authored. Which data model is more queryable? Depends on usage patterns. Most likely we'll want to both look up the authors of a paper and a given scientist's papers, so neither model is very queryable.
There's also queries that neither model is good at, like "which scientists have coauthored a paper with Mike." One heuristic of system evolution: the diversity of queries is always increasing. So even if one of the above data models is sufficiently queryable for your current use cases, it likely won't be for your future use cases.
(Right now I'm only considering the difficulty of writing the query. Query performance is a whole other rabbit hole.)
The Sublime Mediocrity of SQL
Most of the time you won't be using either model, because you'll be storing your data in a SQL database and representing authorships as a many-to-many table:
Scientists {
id
name: str
}
Papers {
id
name: str
published: date
}
Authorship {
id
paper: Paper
author: Scientist
}
Now getting a scientist's papers is "kinda annoying":1
SELECT papers.id FROM papers
INNER JOIN authorships AS a
ON a.paper = papers.id AND a.author = 27
But— and this is key— getting a paper's authors is also just "kinda annoying":
SELECT scientists.id FROM scientists
INNER JOIN authorships AS a
ON a.author = scientists.id AND a.paper = 96
So instead of one "easy" and one "awful", you have two "kinda annoyings", and that's a big win for queryability! In fact even something like "get all the people who are coauthors" rises to, at most, "pretty annoying":
SELECT s1.id, s2.id
FROM scientists AS s1, scientists AS s2
INNER JOIN authorships AS a1 ON s1.id = a1.author
INNER JOIN authorships AS a2 ON s2.id = a2.author
INNER JOIN papers ON papers.id = a1.paper AND a2.paper = papers.id
WHERE s1.id < s2.id;
That's why SQL is so popular: it doesn't make anything effortless, but it keeps a wide variety of things manageable. And since your diversity of queries is only growing, having most things be manageable gives your the critical flexibility needed to keep your system queryable.
There are also a lot of queries that SQL is awful at, like most things involving graphs. That's why things like graph- and time-series databases have market niches.
Other Notes on Queryability
-
Views lets you store data in one format but make it retrievable in a different format, which adds queryability at the cost of performance. You can use materialized views but then you have caching issues.
-
There are different classes of queries, that as a group have various requirements and best practices. Some of them:
- Analytics: How many papers did each department publish last year?
- Auditing/Historical: What department was X working in when they published Y paper?
- Bitemporal queries: What did we mistakenly record as Y's title when we published a press release about it?
- Metadata: How often do people look up the authors of a given paper?
- Transitive queries: What's Danica McKellar's Erdős number?2
- Administrative: what records would cause problems with this data migration?
Basically, any question someone could reasonably ask is potentially a query someone will need to write.
-
There's probably a tension between thinking of queryability as "what queries you can write" and "what you can easily write". Wikidata uses SPARQL, which seems insanely powerful but is only writable by immortal philosopher-kings. Maybe I should look into the query builder.
-
I haven't said anything about how to actually design for queryability and I have no idea how to do this.
Patreon update
Experiment continues! On the request of a few people I lowered the tiers to something a little less shitposty. A little: now getting chocolates in the mail is "only" 768 dollars a year.
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.