Some Rambling About Structured Text
TLA+ Workshop
You know the drill. TLA+ Workshop, three days, May 24-26. Only four slots, so sign up before they're gone!
New Essay
Why do Interviewers Ask Linked List Questions? Mostly the same as the sneak peek I shared a couple weeks back, but now it's public and you can share it and stuff.
Some Rambling About Structured Text
In addition to writing 'n workshops 'n stuff, I spent last week updating my website. The big change was adding tag categories, so you can go to the Formal Methods tag and see this:
Neat, huh? As part of that I needed to do some tag pruning and updates on old posts, which is something my static site generator makes difficult. In Hugo, post metadata is stored in a "front matter" YAML at the head of every post. For example, here's the front matter for the LL post:
---
date: '2021-03-28'
categories:
- Programming
tags:
- History
title: Why Do Interviewers Ask Linked List Questions?
---
Blah blah blah first line of content
I have 80+ posts on the blog. If I want to remove a tag, I have to go into every single post with that tag and manually delete it from the YAML. That sucks! To make this bearable, here's what I did instead:
- Wrote a script that manually parsed out the tags from every post, and then dumped them into a post-tags.yml. Keys are filepaths, values are lists of tags.
- Made modifications to the values in post-tags.
- Wrote a second script that opened every post, split the front matter from the content, loaded it into a Python dict, modified the tags key, dumped it back into YAML, and stitched it back to the content.
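To make the split-and-stitch step concrete, here's a minimal sketch in Python. It is not the actual script: it assumes front matter delimited by --- lines and a block-style tags list like the example above, and it edits the text with the standard library only instead of loading it into a dict through a YAML library. The names split_front_matter, replace_tags, and retag are made up for illustration.

```python
from typing import List, Tuple

DELIM = "---"


def split_front_matter(text: str) -> Tuple[str, str]:
    """Split a post into (front matter, content), dropping the --- lines."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != DELIM:
        raise ValueError("post does not start with front matter")
    for i in range(1, len(lines)):
        if lines[i].strip() == DELIM:
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    raise ValueError("front matter is never closed")


def replace_tags(front_matter: str, new_tags: List[str]) -> str:
    """Swap the block-style `tags:` list for new_tags; other keys are untouched.

    Assumption: tags are written one per line as `- Name`, directly
    under a bare `tags:` line, as in the example front matter.
    """
    out: List[str] = []
    in_tags = False
    replaced = False
    for line in front_matter.splitlines():
        if in_tags and line.startswith("- "):
            continue  # drop an old tag entry
        in_tags = False
        if line.strip() == "tags:":
            in_tags = True
            replaced = True
            out.append("tags:")
            out.extend(f"- {tag}" for tag in new_tags)
        else:
            out.append(line)
    if not replaced:
        # No tags key yet: add one at the end of the front matter.
        out.append("tags:")
        out.extend(f"- {tag}" for tag in new_tags)
    return "\n".join(out) + "\n"


def retag(post_text: str, new_tags: List[str]) -> str:
    """Return the post with its tags replaced and the content stitched back on."""
    front, content = split_front_matter(post_text)
    return f"{DELIM}\n{replace_tags(front, new_tags)}{DELIM}\n{content}"
```

Running retag over every file listed in post-tags.yml would be the "second script" in this sketch. A real version would probably want a proper YAML parser, since hand-rolled parsing breaks as soon as a post writes its tags in a different style.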
Hugo naturally has "one file = one page", with the page metadata as part of the file. Here I'm storing all of the relevant metadata in one file, so I can look at and edit it in one place. Both of these are valid and useful views of the data. But only one of them can be the canonical source of truth. I have to generate the other myself and risk them getting out of sync.
If we think of the tags as data and not just parseable text, it's evident that there's a many-to-many relationship between posts and tags. If we stored both posts and tags in a SQL database, we could represent each data view as a literal view, different representations of the same underlying data. We could also represent other facets of the blog: show me the LL post and all posts that link to it, for example, or all posts with tag X and/or tag Y. I could do those things in the canonical-text world, but I'd have to write a separate script for each, whereas with a relational database it's trivial.
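Here's a rough sketch of what that could look like, using Python's built-in sqlite3 module. The table names, the post_tag_names view, the helper functions, and the file paths in the usage note are all invented for illustration; nothing here is how Hugo or my site actually stores anything.

```python
import sqlite3
from typing import Iterable, List

# Posts and tags are many-to-many, so they meet in a join table.
SCHEMA = """
CREATE TABLE posts (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE post_tags (
    post_id INTEGER NOT NULL REFERENCES posts(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (post_id, tag_id)
);
-- A literal view: one row per (post path, tag name) pair.
CREATE VIEW post_tag_names AS
    SELECT p.path AS path, t.name AS tag
    FROM post_tags pt
    JOIN posts p ON p.id = pt.post_id
    JOIN tags t ON t.id = pt.tag_id;
"""


def open_blog() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_post(conn: sqlite3.Connection, path: str, tags: Iterable[str]) -> None:
    conn.execute("INSERT INTO posts (path) VALUES (?)", (path,))
    for name in tags:
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        conn.execute(
            "INSERT INTO post_tags (post_id, tag_id) "
            "SELECT p.id, t.id FROM posts p, tags t "
            "WHERE p.path = ? AND t.name = ?",
            (path, name),
        )


def tags_for(conn: sqlite3.Connection, path: str) -> List[str]:
    rows = conn.execute(
        "SELECT tag FROM post_tag_names WHERE path = ? ORDER BY tag", (path,)
    )
    return [row[0] for row in rows]


def posts_with_any(conn: sqlite3.Connection, tags: Iterable[str]) -> List[str]:
    """Posts with tag X or tag Y."""
    wanted = sorted(set(tags))
    marks = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        f"SELECT DISTINCT path FROM post_tag_names "
        f"WHERE tag IN ({marks}) ORDER BY path",
        wanted,
    )
    return [row[0] for row in rows]


def posts_with_all(conn: sqlite3.Connection, tags: Iterable[str]) -> List[str]:
    """Posts with tag X and tag Y."""
    wanted = sorted(set(tags))
    marks = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        f"SELECT path FROM post_tag_names WHERE tag IN ({marks}) "
        f"GROUP BY path HAVING COUNT(*) = ? ORDER BY path",
        wanted + [len(wanted)],
    )
    return [row[0] for row in rows]


def remove_tag(conn: sqlite3.Connection, name: str) -> None:
    """Remove a tag from every post at once."""
    conn.execute(
        "DELETE FROM post_tags WHERE tag_id = (SELECT id FROM tags WHERE name = ?)",
        (name,),
    )
    conn.execute("DELETE FROM tags WHERE name = ?", (name,))
```

With a hypothetical posts/linked-lists.md added via add_post, removing a tag from every post is one remove_tag call instead of a pass over 80+ files, and the "X and/or Y" questions are a query each.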
Why, then, do we store the metadata in text files and not a database? Pragmatic reasons. Text is the lowest common denominator for information, it's cross-platform, and the tooling is vastly, vastly, vastly better. There's nothing for editing SQL data that is remotely as good as Vim is for editing text data. I've built a few different tools that try to store structured data in SQL and I always abandon them for that reason; too much friction in using SQL for small-scale, interactive data. And that's not even getting into things like how mainstream VCS is optimized for text files, and the legacy of Unix tooling, and and and...
Structured Metadata
Even though I don't expect canonical text to go away anytime soon, it's still worth exploring what the alternatives could look like. I can accept that canonical text is the best we have right now while also thinking that we can do much better. Let's imagine that we have normal text content files, but store all metadata in a relational database that is always in sync. Let's also assume that we have sufficiently powerful tooling for working with this model, and that this is all mainstream. What does that open up?
For one, we can get rid of those header comments in all of our code. You know, the ones with the author, copyright date, and license? Move all of those to the database. We can also add new metadata, like "every time the file was modified", which can be useful for auditing.
We can also move information about the semantics of the code and the purpose of the code. Semantics would be things like "what module is this part of" or "what packages do we import". I don't think that would be a good idea, but it's demonstrably possible. Purpose is more interesting. Many testing libraries infer your test files by matching against their names, like foo.test.js. We're storing data about the file in the filename, which we can move to the database.
Come to think of it, we often encode structural information in the path. In Rails, they use a folder for model files, a folder for controller files, and a folder for view files. This groups things by structure but separates them by feature: a User model and controller are in separate folders, and the User model tests are in yet another folder. Django would instead group them by feature: the User model and the Email model are potentially far apart. But if we store the structure as metadata, it's now possible to present both folder structures to the developer. They can either group the files by feature or by structure, or one then the other, or feature and then production/test code or vice versa.
(I think that's possible in today's world, too, using symbolic links. I've never tried it. I feel like trying to generate the different layouts becomes difficult if you're inferring structure from the path and filename.)
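As a small illustration of presenting both layouts from the same metadata, here's a sketch in Python. The FileMeta record, its three metadata keys (feature, role, kind), and the layout function are all assumptions invented for this example, as are the file names in the usage note.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class FileMeta(NamedTuple):
    name: str     # the file itself
    feature: str  # e.g. "User" or "Email"
    role: str     # e.g. "model" or "controller"
    kind: str     # "production" or "test"


def layout(files: Iterable[FileMeta], *keys: str) -> Dict[str, List[str]]:
    """Group files into virtual folders, nesting by the given metadata keys in order."""
    tree: Dict[str, List[str]] = defaultdict(list)
    for f in files:
        folder = "/".join(getattr(f, key) for key in keys)
        tree[folder].append(f.name)
    return {folder: sorted(names) for folder, names in sorted(tree.items())}
```

Calling layout(files, "role") gives the group-by-structure folders, layout(files, "feature") gives the group-by-feature ones, and layout(files, "feature", "kind") nests feature and then production/test code. The files never move; only the presentation changes.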
There's one more bit of metadata we store in the filename: the programming language! We write foo.js to mean "this is a JavaScript file." That's a very coarse-grained measure and prone to conflicts. Does .ps mean PowerShell or PostScript? It also makes embedding harder. If I embed some HTML in a Python file, how do I indicate that the file is mostly Python but has some HTML in the strings? Right now we have handrolled solutions for some combinations that come up a lot, but that's more complex than adding an embeds: HTML metadata key.
So we already talked about some uses this opens up: smarter tooling and dynamic file grouping. Those seem like the big advantage categories to me, but there are lots of small things in each category. "If you call the interpreter on a file with default-flags metadata, use those if the command doesn't redefine them." Tag your secrets file with the uncommitable flag. Show all tests for feature N that were changed in the last ten commits. Again, a lot of this is still possible now, but it's hard. The tools are fragile and error-prone, because they're trying to infer the data they need from the filename and text, when it should be structured for them.
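Two of those small things are easy to sketch once the metadata is just data. This is Python again, and everything in it is hypothetical: the Metadata mapping stands in for the imagined database, and effective_flags and committable are names made up for illustration, not features of any real interpreter or VCS.

```python
from typing import Any, Dict, Iterable, List

# path -> metadata keys for that file; a stand-in for the imagined database
Metadata = Dict[str, Dict[str, Any]]


def effective_flags(meta: Metadata, path: str, cli_flags: Dict[str, str]) -> Dict[str, str]:
    """Use the file's default-flags unless the command redefines them."""
    defaults = meta.get(path, {}).get("default-flags", {})
    return {**defaults, **cli_flags}


def committable(meta: Metadata, paths: Iterable[str]) -> List[str]:
    """Filter out anything tagged with the uncommitable flag."""
    return [p for p in paths if not meta.get(p, {}).get("uncommitable", False)]
```

Neither function has to guess anything from a filename or peek inside the file, which is the whole appeal.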
None of this is happening
For one, unstructured text files are already the norm and that's what all the tooling's for, so any different format has to be obviously and vastly better than unstructured text for anybody to adopt it. The advantages I listed above are nice but aren't "vastly" better. We could get more benefits, arguably more dramatic ones, if we structured the text itself, but there's an even bigger gulf between that and the norm than there was with structured metadata. And this newsletter is already 1300+ words.
I like exploring the idea, though. It's a look into how things can be tangibly different. People can think about what's possible in that paradigm that they wish they could do right now, but can't. That's a big change from formal methods, where part of the challenge is convincing developers it's worthwhile in the first place.
(That's probably a blogpost/newsletter all of its own: getting people to notice problems. Which is a different thing from noticing they have the problem.)