56: Some Facts about Facts

data model

        February 11, 2021

        My project for the next couple of weeks is implementing a personal database. I
want it to store information about the things I read, the thoughts I’m thinking,
the websites I visit, the people that matter to me, and anything else I can cram
into it.
One of the biggest decisions in implementing this database is figuring out the
data model: How do we organize the data and its relationships?
I decided pretty early to go for a fact-based data model. In it, all the
information in a system is represented by a set of “facts”.
Each fact has three parts: 

an entity: the “thing” this fact is about
an attribute: the property of the entity this fact is describing
a value: the value of that property

Using this simple building block, you can represent a huge variety of useful
constructs. 
Let’s look at an example. 
Here’s a list of facts: 

[42, "person/name", "Jared"]
[42, "place-of-birth", "Muscat, Oman"]

What this means is that the entity 42 has two attributes, name and place of
birth, which have the values “Jared”, and “Muscat, Oman” respectively. 
The values in facts can also be references to other entities. So we could
represent the same information like this: 

[42, "person/name", "Jared"]
[42, "place-of-birth", 58]
[58, "city/name", "Muscat"]
[58, "city/country", "Oman"]

Instead of storing the value of “place of birth” as a piece of data, we store it
as a reference to another entity, 58. That entity then has the attributes
city/name and city/country, which define its name and country. 
By using these references you could build up a rich graph of data.
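To make the reference idea concrete, here is a minimal Python sketch. It assumes facts are kept as plain tuples, and the helper name `get` is made up for illustration. How to tell a reference like 58 apart from an ordinary number is left open.

```python
# Facts are (entity, attribute, value) triples, mirroring the list above.
facts = [
    (42, "person/name", "Jared"),
    (42, "place-of-birth", 58),
    (58, "city/name", "Muscat"),
    (58, "city/country", "Oman"),
]

def get(facts, entity, attribute):
    """Return the first value stored for (entity, attribute), or None."""
    for e, a, v in facts:
        if e == entity and a == attribute:
            return v
    return None

# Follow the reference from the person to the city entity.
birthplace = get(facts, 42, "place-of-birth")
print(get(facts, birthplace, "city/name"))  # Muscat
```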
Why facts?
Every data model has different trade-offs and is suited for different domains.
I’m making the case that facts are particularly well suited for modelling the
domain of “personal data”; roughly, all the information a single human being
might want to keep and interconnect.
The most important constraint of this domain is that it’s dynamic and
inconsistent. Human lives are messy, and human thoughts even more so. The
information we’re dealing with does not fit into fixed categories, and is
constantly shifting. 
You have much richer information about your closest friends than about passing
acquaintances. The same goes for a website you frequent daily and one you clicked once.
Yet they’re all people, and all websites. Facts let us represent individual pieces
of information instead of fitting things into these large categories. 
That doesn’t mean that it’s a data model completely without constraints. Instead
of constraining what attributes things can have (e.g. saying “all people have a
name”), you constrain what values an attribute can have (e.g. “all names are
text”). Constraining on the attribute level lets you coordinate with yourself
more easily, without making it too difficult to add new information.
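An attribute-level constraint could be sketched roughly like this in Python. The `ATTRIBUTE_TYPES` table, the `check_fact` helper, and the `book/rating` attribute are all hypothetical names chosen for illustration, not a settled design.

```python
# Hypothetical schema: each attribute maps to the type its values must have.
ATTRIBUTE_TYPES = {
    "person/name": str,
    "city/name": str,
    "book/rating": int,
}

def check_fact(fact):
    """Return True if the fact's value fits its attribute's declared type.

    Attributes with no declared type are accepted, so new kinds of
    information can be added without touching the schema first.
    """
    _entity, attribute, value = fact
    expected = ATTRIBUTE_TYPES.get(attribute)
    return expected is None or isinstance(value, expected)

print(check_fact((42, "person/name", "Jared")))  # True
print(check_fact((42, "person/name", 7)))        # False
```

Note that nothing here says which attributes an entity must have; only the values are checked.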
Ultimately, I think facts strike the right balance between simplicity and
expressivity, and flexibility and constraints. 
What does this actually look like?
Deciding on a data model doesn’t give you any clue about how to actually
store and represent that data. As I mentioned last time, one of my constraints
for this project is to lean on the file system, so the question is:
how am I going to represent facts in files? 
The main goal is to represent them in such a way that it makes reading and
writing data easy, given the patterns of reading and writing that are going to
occur often.
Grouping facts in files by entity is the simplest answer here, as entities
already correspond to the most “meaningful” relationships between facts. For
example, two facts about the same person are more closely related than, say, the
names of two random people.
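As a rough sketch of that grouping, assuming facts are (entity, attribute, value) tuples and using a made-up helper name `group_by_entity`, each resulting bucket would correspond to one file:

```python
from collections import defaultdict

facts = [
    (42, "person/name", "Jared"),
    (42, "place-of-birth", 58),
    (58, "city/name", "Muscat"),
    (58, "city/country", "Oman"),
]

def group_by_entity(facts):
    """Group facts into one bucket of (attribute, value) pairs per entity."""
    groups = defaultdict(list)
    for entity, attribute, value in facts:
        groups[entity].append((attribute, value))
    return dict(groups)

by_entity = group_by_entity(facts)
print(by_entity[58])  # [('city/name', 'Muscat'), ('city/country', 'Oman')]
```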
I experimented briefly with using YAML as a format to write this data, but it’s
honestly more complicated than I need it to be, and that complexity created some
problems. So I’m defining my own little syntax. It looks something like this: 
name: Jared Pereira
notes: [[[
---
This is a note about me
---
---
This is another note about me!

So much information!
---
]]]

I still have to work out the kinks, but the idea is to represent each fact as a
property on the entity. Some properties have a single value, some have
multiple, and values can span either a single line or multiple lines. 
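Here is a minimal Python sketch of a parser for the example above. The syntax is still unsettled, so the rules are assumptions read off that one example: `key: value` is a single-line property, `key: [[[` opens a block of values that ends at `]]]`, and each value inside the block sits between a pair of `---` lines. The name `parse_entity` is hypothetical.

```python
def parse_entity(text):
    """Parse one entity file into a dict of attribute -> list of values."""
    attributes = {}
    lines = iter(text.splitlines())
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between properties
        key, _, rest = line.partition(":")
        key, rest = key.strip(), rest.strip()
        if rest != "[[[":
            attributes.setdefault(key, []).append(rest)
            continue
        # Multi-value block: keep reading from the same iterator until "]]]".
        values, current = [], None
        for inner in lines:
            marker = inner.strip()
            if marker == "]]]":
                break
            if marker == "---":
                if current is None:
                    current = []  # opening delimiter of a value
                else:
                    values.append("\n".join(current))
                    current = None  # closing delimiter of a value
            elif current is not None:
                current.append(inner)
        attributes.setdefault(key, []).extend(values)
    return attributes

sample = """name: Jared Pereira
notes: [[[
---
This is a note about me
---
---
This is another note about me!

So much information!
---
]]]
"""

entity = parse_entity(sample)
print(entity["name"])        # ['Jared Pereira']
print(len(entity["notes"]))  # 2
```

One kink this sketch does not handle: a note whose content contains a line that is exactly `---` or `]]]` would be cut short.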
I’ve started writing a very simple parser for this format and it’s a lot of fun!
It’s evocative of the point I was at in spring last year on, I suppose, this
very same project.
How does this turn from files to a database?
What’s the difference between a file system and a database? Constraints. You can
dump any kind of information into a file system, but the purpose of a database
is to obstruct access to data¹, so that all the data you put in is structured.
The most fundamental constraint is the data model. A file can represent
anything, but the database forces everything to be facts. A higher-level
constraint is the kind on attribute values that I talked about earlier. 
Really, these constraints can get arbitrarily complex. You could say every book
has an author, or you have to rate every book you finish, or the average rating
of every book you’ve rated should be 5/10.
Big databases would call this “business logic” and they’re very good at
representing it. But people aren’t businesses and we need a different kind of
logic, one that’s a lot more flexible and emergent. I’m not yet quite sure how
to implement this. 
One guess I have is that instead of actually enforcing constraints, it’s
enough to make the user aware of them, by creating an interface to them. This
leaves the choice up to the user and maintains the property that they are always
the one taking the actions that change the database, while still pushing the
database in the direction a past version of the same user wanted it to go. This
is one of the most exciting areas of exploration for me.
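One way this guess could look in Python is a check that reports violations instead of rejecting anything. This is only a sketch under assumptions: the helper `missing_attribute_warnings` and the attributes `book/name` and `book/author` are invented for illustration.

```python
facts = [
    (1, "book/name", "Book A"),
    (1, "book/author", "Author A"),
    (2, "book/name", "Book B"),  # no author recorded yet
]

def missing_attribute_warnings(facts, kind_attribute, required_attribute):
    """List entities that have `kind_attribute` but lack `required_attribute`.

    Nothing is rejected or changed: the result is only a list of messages
    to show the user, who decides what to do about them.
    """
    have_kind = {e for e, a, _ in facts if a == kind_attribute}
    have_required = {e for e, a, _ in facts if a == required_attribute}
    return [
        f"entity {e} has {kind_attribute} but no {required_attribute}"
        for e in sorted(have_kind - have_required)
    ]

warnings = missing_attribute_warnings(facts, "book/name", "book/author")
print(warnings)  # ['entity 2 has book/name but no book/author']
```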
One more difference is that a filesystem has a very simple query mechanism:
“give me a file”, or “give me a folder”. But a database can leverage the
structure it enforces to give you the ability to ask richer questions, like
“give me the name of all books I rated more than 10”. The way you ask those
questions is the query language, which we’ll get into next time!
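Before there is any query language, a question like that can be hand-written as a plain Python function over the facts. This sketch stands in for a real query language rather than showing one; `book/name`, `book/rating`, and the sample data are assumptions.

```python
facts = [
    (1, "book/name", "Book A"),
    (1, "book/rating", 8),
    (2, "book/name", "Book B"),
    (2, "book/rating", 4),
]

def names_of_books_rated_above(facts, threshold):
    """Return the names of entities whose book/rating exceeds `threshold`."""
    ratings = {e: v for e, a, v in facts if a == "book/rating"}
    names = {e: v for e, a, v in facts if a == "book/name"}
    return [names[e] for e, r in ratings.items() if r > threshold and e in names]

print(names_of_books_rated_above(facts, 5))  # ['Book A']
```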

Last week I forgot to link you to the notes I wrote on this topic for my final
essay. This week I’m preparing a little
outline.

¹ Taken from “The Image of Postgres”, a talk by r0ml
