Code is Data is YAML
This is a weirder theorycrafting post and I don't know how much I actually believe in it, but it seems like something you all'd enjoy reading.
A couple weeks back I was trying out Semgrep, a static analysis tool that matches the structure of the code you're writing. For example, you can write exec(...) to match all exec calls in Python, even aliased ones. Then you could write a rule telling it to raise an error whenever it finds one. The rule file would look like this:
rules:
  - id: my_pattern_id
    pattern: |
      exec(...)
    message: |
    severity: WARNING
Ah, YAML. Lots of people are not fans of YAML these days. There's the Go problem! The Norway problem! Ambiguous parsing! So why do we keep using it?
Well, what are the alternatives? What's the space like? YAML is called a "human readable data format". It's most popular (and notorious) for configuration. When people propose alternatives, here's what they usually are:
- Lateral moves in the space to what people consider similar formats, like JSON, TOML, XML.
- Full-on programming language as configuration, like emacs, setup.py, or Chef.
- Binary configs, like sqlite files.
- Next-gen configuration languages, like HCL and Dhall.
I want to focus on the first bullet point. People use JSON, YAML, TOML, and XML for configuration. But only TOML is actually intended as a configuration language. JSON is a data serialization format and XML is a markup language.[1] They are all used in the configuration space, but only one of them is actually a configuration language!
This tells me that YAML "isn't really in" the space of configuration languages. It casts a shadow there, but it's not "in there", just like JSON and XML aren't "in there". The alternative to YAML isn't JSON, or Dhall, or a sqlite database.
The alternative to YAML is a parser.
Structured Text
Here's another semgrep rule:
patterns:
  - pattern-not-inside: assert(...)
  - pattern-either:
      - pattern: $X == $X
      - pattern: $X != $X
  - pattern-not: 1 == 1
Do you see that pattern-either? We're using the YAML to say A || B. Combined with the pattern-not, that's (A || B) && !C. If semgrep stored its rules in a custom textual language, its developers would need to write a parser to convert that text into an AST. The YAML form is already parsed.[2]
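To make the "already parsed" point concrete, here's a rough sketch that walks the rule as plain Python data and prints the boolean formula it encodes. RULE is what I'd expect a YAML loader to hand back for the snippet above, and to_formula is a made-up helper; none of this is how semgrep itself evaluates rules.

```python
# Hypothetical helper: renders a semgrep-style pattern tree as a boolean formula.
# RULE mirrors what a YAML loader would plausibly return for the rule (an assumption).
RULE = {
    "patterns": [
        {"pattern-not-inside": "assert(...)"},
        {"pattern-either": [
            {"pattern": "$X == $X"},
            {"pattern": "$X != $X"},
        ]},
        {"pattern-not": "1 == 1"},
    ]
}


def to_formula(node):
    """Turn a single-key mapping into a formula string."""
    (key, val), = node.items()
    if key == "patterns":          # every child must hold
        return "(" + " && ".join(to_formula(child) for child in val) + ")"
    if key == "pattern-either":    # at least one child must hold
        return "(" + " || ".join(to_formula(child) for child in val) + ")"
    if key == "pattern":
        return f"match[{val}]"
    if key == "pattern-not":
        return f"!match[{val}]"
    if key == "pattern-not-inside":
        return f"!inside[{val}]"
    raise ValueError(f"unknown operator: {key}")


FORMULA = to_formula(RULE)
print(FORMULA)
```

There's no tokenizer and no grammar anywhere in that: the nesting of the YAML is the syntax tree, and the only "parsing" left is a lookup on the key names.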
This is eerily similar to Lisp s-expressions (sexprs). Parsing Lisp is really easy in part because the Lisp code is already in the form of an AST. Lisp also makes it easy to treat code like data. Can we do the same with YAML? I think so! Here's a very quick demo:
add:
  - 2
  - sub:
      - 3
      - 4
Loading that in Python gives us {'add': [2, {'sub': [3, 4]}]}. Now I can write:
# yes this is fragile as hell
def change(y):
    if type(y) == dict:
        for key, val in y.items():
            if key == "add":
                # Change to multiplication
                return {"mul": change(val)}
            elif key == "sub":
                # Reorder the terms
                return {"sub": list(reversed(change(val)))}
    elif type(y) == list:
        return [change(x) for x in y]
    elif type(y) == int:
        return y+1
And get a new YAML:
mul:
  - 3
  - sub:
      - 5
      - 4
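Since the tree is just nested dicts and lists, the same data can also be run instead of rewritten. Here's a minimal sketch of an evaluator, assuming every mapping has exactly one operator key holding a two-element list of arguments. The evaluate name and the OPS table are mine, purely for illustration.

```python
import operator

# Hypothetical operator table; "mul" is included because change() produces it.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def evaluate(node):
    """Evaluate a tree of single-key {op: [left, right]} mappings."""
    if isinstance(node, int):
        return node
    (op, args), = node.items()
    left, right = (evaluate(arg) for arg in args)
    return OPS[op](left, right)


original = {"add": [2, {"sub": [3, 4]}]}   # 2 + (3 - 4)
changed = {"mul": [3, {"sub": [5, 4]}]}    # 3 * (5 - 4)
print(evaluate(original), evaluate(changed))
```

The same structure gets transformed by one function and executed by another, which is the code-is-data trick in miniature.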
Oh wait, new idea. Let's add notes to our syntax![3]

# Python
def to_eqn(y):
    if type(y) == dict:
        for key, val in y.items():
            if key == "note":
                return to_eqn(val[-1])
            elif key == "add":
                return f"({to_eqn(val[0])}) + ({to_eqn(val[1])})"
            elif key == "sub":
                return f"({to_eqn(val[0])}) - ({to_eqn(val[1])})"
    elif type(y) == int:
        return y

# YAML
add:
  - 2
  - note:
      - blah blah blah
      - look at me
      - sub:
          - 3
          - 4

# Output
(2) + ((3) - (4))
The note isn't present in our final output! There no longer needs to be a 1-1 mapping between the code as written and the code as executed. This makes adding metadata really easy. A core problem with metadata is that anything added to the code file becomes part of the code. We need to simultaneously embed it in the code and keep it conceptually distinct, when the operating assumption in every single programming tool is that One File = One Conceptual Domain. Structured text editing circumvents that problem by allowing code and not-code to coexist.[4]
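One way to picture that coexistence is two readers over the same tree, one that only sees code and one that only sees notes. This is a sketch under the same convention as the note example (the last item of a note list is the code being annotated); strip_notes and collect_notes are hypothetical names.

```python
# Hypothetical helpers: two readers of the same tree, one for code, one for notes.
DOC = {"add": [2, {"note": ["blah blah blah", "look at me", {"sub": [3, 4]}]}]}


def strip_notes(node):
    """Return the tree with every note wrapper replaced by the code it wraps."""
    if isinstance(node, dict):
        (key, val), = node.items()
        if key == "note":
            return strip_notes(val[-1])   # last item is the annotated code
        return {key: [strip_notes(child) for child in val]}
    return node


def collect_notes(node):
    """Return every note string in the tree, outermost first."""
    if not isinstance(node, dict):
        return []
    (key, val), = node.items()
    if key == "note":
        return list(val[:-1]) + collect_notes(val[-1])
    return [text for child in val for text in collect_notes(child)]


CODE = strip_notes(DOC)
NOTES = collect_notes(DOC)
print(CODE)
print(NOTES)
```

Neither reader has to know the other exists, so a tool that only cares about the code never has to step around the metadata.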
I'm being a little sloppy here. There are two things going on with YAML. First, it preparses the content for us. This isn't a big enough draw to YAMLize complex programming languages, but it's a huge huge deal for small DSLs. I suspect that's the main reason why semgrep uses YAML for its rules and why so many technologies jury-rig it into a configuration language. Second, it allows for much more orthogonality in its content than you'd get with raw text. This doesn't seem to be something people care about; I'm mostly into it as a theoretical idea. But still. It's fascinating. Adding value to the domain of coding and the domain about coding. It could be something that, in the long run, makes structured text appealing for things more complex than DSLs.
Prior Art
There are a couple of languages out there that are natively written in a structured format:
- JANI: A probabilistic modeling language. Written in JSON.
- SMT-LIB: standardized input language for SMT solvers. Written as sexprs.
(I believe Event-B specifications are stored as XML, but I'm not sure.)
Oh yeah, there's also every Lisp ever made. Sexprs are really polarizing, you either love them or you hate them. Now I'm wondering if some of the love for sexprs is "really" love for structured text.
There's also been work on structured text editors. The main one I know off the top of my head is Leo, which I tried and gave up on. The problem with a lot of these is there's no way to edit the structured representation in the raw format. You're cut off from the fifty years of tooling we've developed for working with raw text. At least YAML is also a raw text file, just one that defines a structure. It's "less powerful" than Leo but also more compatible with our current workflows.
(There's a tradeoff here: you now need to distinguish "text indicating structure" from "text that's just text". Kind of the same problem with escaping quotes in a string, or strings with commas in a CSV. But losing access to tooling is a bigger problem.)
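The CSV version of that tradeoff is easy to see with Python's standard csv module. A small sketch with a made-up row: a field containing a comma and quotes has to be wrapped and escaped so the reader can tell content from structure.

```python
import csv
import io

# Made-up sample data: the second field contains both a comma and quotes.
ROWS = [["id", "message"], ["1", 'Hello, "world"']]

# Write the rows; the writer quotes any field that would otherwise be ambiguous.
buffer = io.StringIO()
csv.writer(buffer).writerows(ROWS)
ENCODED = buffer.getvalue()

# Read them back; the quoting is undone and the original text comes out.
DECODED = list(csv.reader(io.StringIO(ENCODED)))

print(ENCODED)
print(DECODED == ROWS)
```

The raw file now carries extra quote characters that aren't part of the message, which is the price of keeping structure and content in the same stream of text.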
The problems with YAML
So first of all, programming in YAML would be godawful. Maybe you could do it in a Lisp but other than that, no chance in hell. Encoding a DSL as structured text is orders of magnitude easier than encoding a full programming language.
Second, structured text needs to be as lean as possible. You add structure on top of your content, but there's no additional meaning to the content: it's just text. But YAML can assign information to the text, too. For example, under YAML 1.1 an unquoted no was parsed as boolean false as part of the spec (that's the Norway problem).
Third, node anchors turn YAML from flat text into a directed graph, adding basic behavior to the format. There's no longer a 1-1 mapping between the text in the file and the data you load in.
Many of these are the same things that make YAML problematic as a configuration format, too. Next-gen configuration languages like HCL and Dhall tackle those problems by adding additional behaviors, like types or functions. That makes them better as configuration languages but worse as structured text formats. It'd be akin to a code generator for our DSL. I think there's space for a simple structured text format that makes writing DSLs and metadata easy without YAML's baggage.
One more thought: YAML puts a hierarchical structure on text. Node anchors mess with this, but overall it's hierarchical. There are other kinds of structures too, like graphs. I don't think I've ever seen a decent graph format, but if one exists it could be an interesting structured text format for a dataflow DSL. There are also table structures like CSV, although I have no idea what that would be useful for.
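To give a feel for the dataflow idea, here's a speculative sketch. FLOW stands in for whatever a graph-shaped structured text format might load into; the schema (op, value, inputs) is entirely made up, and the ordering comes from the standard library's graphlib rather than any real graph format.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dataflow description: each node names its operation and inputs.
# This is the shape a structured-text loader might return; the schema is made up.
FLOW = {
    "a": {"op": "const", "value": 2},
    "b": {"op": "const", "value": 5},
    "total": {"op": "add", "inputs": ["a", "b"]},
    "doubled": {"op": "add", "inputs": ["total", "total"]},
}


def run(flow):
    """Evaluate every node after the nodes it depends on."""
    deps = {name: set(spec.get("inputs", [])) for name, spec in flow.items()}
    results = {}
    for name in TopologicalSorter(deps).static_order():
        spec = flow[name]
        if spec["op"] == "const":
            results[name] = spec["value"]
        elif spec["op"] == "add":
            results[name] = sum(results[dep] for dep in spec["inputs"])
        else:
            raise ValueError(f"unknown op: {spec['op']}")
    return results


RESULTS = run(FLOW)
print(RESULTS["doubled"])
```

Note that total feeds doubled twice, which a strict hierarchy can't express without copying the subtree or reaching for something like anchors.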
That's all I've got. Enjoy your week!
[1] There's probably a formal definition somewhere, but my intuition says that a markup language is distinguished by the capacity to add inline structure.
[2] Of course they still have to parse the individual patterns, but that's a vastly simpler problem. They offloaded language features to YAML structure.
[3] Yes, the parser is terribad and the YAML structure is godawful. This is for explanation purposes only, please don't write something like this in real life.
[4] The widespread substitute for this is comments. Lots of languages put pragmas in comments, or use special glyphs to indicate structured information inside a comment.
Also goddamn there is so much more I want to write about "canonical text" and metadata but I need to figure out what I'm actually thinking about it first.