57: Asking the real questions
Okay assume this whole personal database thing works out, and I end up with a lil db stuffed full of information. Well, now I need to query it.
I can use that data to answer all sorts of questions, like:
- what’s the recipe for Sorpotel?
- which of my friends recommended me this book I loved?
- what’s on my todo list today and when did I actually say I’d do those tasks?
But, because computers, I can’t just ask the database these questions directly. I have to write them in a language it can understand. A query language.
Together with the data model, the query language is at the core of what makes a database a database.
The key goal is to separate the description of what you want to get from the actual process of getting it.
In a database there are usually many routes you can take to arrive at the answer you want. The query language is a way of describing your destination, and the database software figures out the directions to get there.
A query language is incredibly important, as asking questions is the main thing you actually do with a database!
Constructing my own
Unfortunately, none of the query languages I have experience with quite fit what I want, so I’m going to have to cobble together my own. I’m going to cut a lot of corners to get there, but my goal is to have a solid foundation to iterate upon.
Starting with Datalog
Continuing the tradition of cribbing from Datomic, my starting point is a query language called Datalog.
The basic intuition for Datalog is that you give it a set of facts with gaps in it, define how those gaps relate to each other, and the database finds all the real data that satisfies your “template”.
For example:
    {
      select: ['?name'],
      where: [
        ['?e', 'person/name', "Jared"],
        ['?e', 'person/friend', '?friend'],
        ['?friend', 'person/name', '?name']
      ]
    }
    // RESULT
    ["Matthew", "Jaakko", "Marisa", "Charles"]
This query finds the names of all my friends. All the bits that begin with ? are variables, and any time the same variable is used in multiple places it must refer to the same value.
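To make the “template with gaps” idea concrete, here’s a minimal sketch of how such a query could be matched against a pile of facts. It’s an illustration under assumptions, not the actual implementation: the sample facts, the numeric entity ids, and the names `unify` and `runQuery` are all made up, and it returns one row per match rather than the flat list shown above.

```javascript
// Hypothetical sample facts as [entity, attribute, value] triples.
const facts = [
  [1, "person/name", "Jared"],
  [1, "person/friend", 2],
  [1, "person/friend", 3],
  [2, "person/name", "Matthew"],
  [3, "person/name", "Marisa"],
  [4, "person/name", "Charles"], // not a friend of entity 1 in this sample
];

const isVar = (term) => typeof term === "string" && term.startsWith("?");

// Try to extend `bindings` so that `pattern` matches `fact`.
// Returns the extended bindings, or null if a constant or an
// already-bound variable disagrees with the fact.
function unify(pattern, fact, bindings) {
  const next = { ...bindings };
  for (let i = 0; i < pattern.length; i++) {
    const term = pattern[i];
    if (isVar(term)) {
      if (term in next && next[term] !== fact[i]) return null;
      next[term] = fact[i];
    } else if (term !== fact[i]) {
      return null;
    }
  }
  return next;
}

// Evaluate clauses left to right, keeping every set of bindings that survives.
function runQuery({ select, where }, facts) {
  let results = [{}];
  for (const pattern of where) {
    results = results.flatMap((bindings) =>
      facts
        .map((fact) => unify(pattern, fact, bindings))
        .filter((b) => b !== null)
    );
  }
  return results.map((bindings) => select.map((v) => bindings[v]));
}

const friendNames = runQuery(
  {
    select: ["?name"],
    where: [
      ["?e", "person/name", "Jared"],
      ["?e", "person/friend", "?friend"],
      ["?friend", "person/name", "?name"],
    ],
  },
  facts
);
console.log(friendNames); // [["Matthew"], ["Marisa"]]
```

The “same variable, same value” rule lives entirely in `unify`: once `?e` is bound by the first clause, later clauses can only match facts about that same entity.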
To be honest, Datalog is pretty great, and there’s a good chance I’ll end up just implementing it as the query language. But if I had to try and make some improvements…
The main problem I have with this is that the relationships this data describes are only implicitly derived from the variable usage.
An equivalent way of representing the query could be:
    query({
      "person/name": "Jared",
      "person/friend": {
        "person/name": '?'
      }
    })
    // Result
    {
      "person/friend": [
        [{"person/name": "Matthew"}],
        [{"person/name": "Jaakko"}],
        //...
      ]
    }
This format borrows a lot from a language called GraphQL, which was developed at Facebook, released publicly in 2015, and has been really exploding in recent years.
The key property in this case is that it easily represents hierarchical data, and that the result it returns is “shaped” like the query itself. This can get a little verbose, so it’s something you probably want to be able to opt out of, but it’s really useful if you’re querying many relationships deep, and pulling multiple values.
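Here’s a rough sketch of what “the result is shaped like the query” could mean in code. Everything in it is an assumption for illustration: the in-memory `entities` store, the convention that reference attributes hold arrays of ids, and the helper names `matches`, `pull`, and `query`. The result is also a bit flatter than the one shown above.

```javascript
// Hypothetical in-memory store: entity id -> attributes.
// Reference attributes (like person/friend) hold arrays of entity ids.
const entities = {
  1: { "person/name": "Jared", "person/friend": [2, 3] },
  2: { "person/name": "Matthew", "person/friend": [] },
  3: { "person/name": "Marisa", "person/friend": [1] },
};

const isObject = (x) =>
  typeof x === "object" && x !== null && !Array.isArray(x);

// Does `entity` satisfy every constant field of `query`?
// "?" placeholders and nested objects are outputs, so they never reject.
function matches(entity, query) {
  return Object.entries(query).every(
    ([attr, want]) => want === "?" || isObject(want) || entity[attr] === want
  );
}

// Build a result with the same shape as `query`:
// "?" pulls a value, a nested object follows references.
function pull(entity, query) {
  const out = {};
  for (const [attr, want] of Object.entries(query)) {
    if (want === "?") {
      out[attr] = entity[attr];
    } else if (isObject(want)) {
      out[attr] = (entity[attr] ?? [])
        .map((id) => entities[id])
        .filter((child) => matches(child, want))
        .map((child) => pull(child, want));
    }
  }
  return out;
}

function query(q) {
  return Object.values(entities)
    .filter((entity) => matches(entity, q))
    .map((entity) => pull(entity, q));
}

const result = query({
  "person/name": "Jared",
  "person/friend": { "person/name": "?" },
});
console.log(JSON.stringify(result));
// [{"person/friend":[{"person/name":"Matthew"},{"person/name":"Marisa"}]}]
```

Note how `matches` and `pull` have to guess which fields are inputs and which are outputs from the values alone, which is exactly the blending problem with this format.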
A problem you might notice is that this format blends together the information I want to retrieve with the information I’m providing. The query in actual GraphQL would look like this:
    {
      person(name: "Jared") {
        friends {
          name
        }
      }
    }
For one, this is a lot cleaner syntax than the JSON I’ve been doing, but more on that later. More importantly, it clearly separates the input, the (name: "Jared") bit, from the output, which is quite nice.
However, this kind of hierarchical model does make it trickier to do some kinds of things Datalog is really good at.
Finding cousins is a canonical example:
    {
      select: ['?name'],
      where: [
        ['?jared', 'person/name', "Jared"],
        ['?parent', 'person/child', '?jared'],
        ['?parent', 'person/sibling', '?parentSibling'],
        ['?parentSibling', 'person/child', '?cousin'],
        ['?cousin', 'person/name', '?name']
      ]
    }
Because you can put variables in any position in your set of facts, this is really easy to represent in Datalog. But in a GraphQL style language, it gets a little trickier.
Maybe you could do something like:
    query({
      "?jared": {"person/name": "Jared"},
      "?parent": {
        "person/child": "?jared",
        "person/sibling": {
          "person/child": {
            "person/name": "?"
          }
        }
      },
    })
As you can see, I’m still playing with this and figuring it out. I don’t actually think the query language will be stable any time soon, which is why I’m sticking with JSON as the notation. Treating the query as data makes it a lot easier for me to experiment with different formats, and use code to manipulate it. Eventually I could maybe define a fancy syntax, but for now this will do.
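As a small example of what “query as data” buys you, here’s a sketch that rewrites the nested format into Datalog-style clauses with plain code. It only handles the simple nested form (constants, "?" placeholders, and nested objects), not the named-variable keys from the cousin experiment. The function name `toDatalog` and the generated variable names (`?e0`, `?e1`, `?v2`) are made up for this sketch.

```javascript
// Translate a nested, GraphQL-style query object into Datalog-style clauses.
// Assumption: "?" marks a value to return, nested objects follow a
// relationship, and anything else is a constant to match.
function toDatalog(query) {
  const where = [];
  const select = [];
  let counter = 0;
  const fresh = (prefix) => `?${prefix}${counter++}`;

  function walk(node, entityVar) {
    for (const [attr, value] of Object.entries(node)) {
      if (value === "?") {
        // Output: bind a fresh variable and select it.
        const v = fresh("v");
        where.push([entityVar, attr, v]);
        select.push(v);
      } else if (typeof value === "object" && value !== null) {
        // Relationship: bind a fresh entity variable and recurse into it.
        const child = fresh("e");
        where.push([entityVar, attr, child]);
        walk(value, child);
      } else {
        // Constant: match it directly.
        where.push([entityVar, attr, value]);
      }
    }
  }

  walk(query, fresh("e"));
  return { select, where };
}

const datalog = toDatalog({
  "person/name": "Jared",
  "person/friend": { "person/name": "?" },
});
console.log(JSON.stringify(datalog, null, 2));
```

For the friends query this produces the same three clauses as the hand-written Datalog version, just with generated variable names.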
Ditching the query planner
Of course, actually executing the query is going to be tricky, no matter how easy it is to write. But, I think I can make it easier by leaning on the “personal” nature of this database.
In a big, “business logic” database no one person understands the properties of all the data in the database. There’s different teams changing schemas and adding new tables, users interacting with features and creating new data. So when you ask the database a query, the only entity that has all the information about how to best execute a query is the database itself.
This is still technically true for a personal database, but most of the information about patterns in the database is actually in one person’s head. So instead of bothering with a fancy query planner we can just execute the query in the order specified by the user, and assume they’ve specified a reasonable order.
The nice thing about having a query language to intermediate this interaction is we can grow a more complex query planner over time. There’s probably a ton of low hanging fruit, but given the amount of data we’ll be dealing with, and the speed of computers nowadays, we probably won’t even need that.
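To get a feel for how much clause order matters without a planner, here’s a sketch that runs the same three clauses in two different orders and counts the intermediate bindings each one produces. The generated data (200 people, two friendships) and the names `unify` and `runInOrder` are assumptions for illustration only.

```javascript
const isVar = (term) => typeof term === "string" && term.startsWith("?");

// Extend `bindings` so `pattern` matches `fact`, or return null on conflict.
function unify(pattern, fact, bindings) {
  const next = { ...bindings };
  for (let i = 0; i < pattern.length; i++) {
    const term = pattern[i];
    if (isVar(term)) {
      if (term in next && next[term] !== fact[i]) return null;
      next[term] = fact[i];
    } else if (term !== fact[i]) {
      return null;
    }
  }
  return next;
}

// No planner: run clauses strictly in the order given,
// counting how many intermediate bindings pile up along the way.
function runInOrder(where, facts) {
  let results = [{}];
  let intermediate = 0;
  for (const pattern of where) {
    results = results.flatMap((bindings) =>
      facts
        .map((fact) => unify(pattern, fact, bindings))
        .filter((b) => b !== null)
    );
    intermediate += results.length;
  }
  return { results, intermediate };
}

// Made-up data: 200 people, entity 0 is Jared, and Jared has two friends.
const facts = [];
for (let i = 0; i < 200; i++) {
  facts.push([i, "person/name", i === 0 ? "Jared" : `P${i}`]);
}
facts.push([0, "person/friend", 1], [0, "person/friend", 2]);

const nameIsJared = ["?e", "person/name", "Jared"];
const hasFriend = ["?e", "person/friend", "?friend"];
const friendName = ["?friend", "person/name", "?name"];

// Most selective clause first vs. least selective clause first.
const narrowFirst = runInOrder([nameIsJared, hasFriend, friendName], facts);
const broadFirst = runInOrder([friendName, hasFriend, nameIsJared], facts);

console.log(narrowFirst.intermediate, broadFirst.intermediate);
```

Both orders return the same two friends, but starting with the broad clause binds every person in the database before anything narrows it down. At personal-database scale that’s probably tolerable, which is the bet being made here.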
Getting it done
Okay, with this query language relatively set out and the data model in place, the next step is to actually implement the damn thing! There are lots more open questions, but I only have one newsletter left in this batch and so I need to get cracking.
I have an extremely rough draft of the essay, more of a loose collection of thoughts. It’s a tad rougher than I feel comfortable sharing, so if you really want to take a peek, just hit reply and I’ll share it with ya.
I’ve been talking a lot about the specifics of my database in these newsletters but in that essay I want to jump to a higher level, and talk about why I think personal databases even matter, and what the future holds for them.