AI is a gamechanger for TLA+ users
There has never been a better time to learn formal specification.
New Logic for Programmers Release
v0.10 is now available! This is a minor release, mostly focused on logic-based refactoring, with new material on set types and on testing that refactors are correct. See the full release notes on the changelog page. Due to conference pressure, v0.11 will also likely be a minor release.
TLA+ is a specification language to model and debug distributed systems. While very powerful, it's also hard for programmers to learn, and there are always questions about connecting specifications with actual code.
That's why The Coming AI Revolution in Distributed Systems caught my interest. In the post, Cheng Huang claims that Azure successfully used LLMs to examine an existing codebase, derive a TLA+ spec, and find a production bug in that spec. "After a decade of manually crafting TLA+ specifications", he wrote, "I must acknowledge that this AI-generated specification rivals human work".
This inspired me to experiment with LLMs in TLA+ myself. My goals are a little less ambitious than Cheng's: I wanted to see how LLMs could help junior specifiers write TLA+, rather than handling the entire spec automatically. Details on what did and didn't work below, but my takeaway is that LLMs are an immense specification force multiplier.
All tests were done with a standard VSCode Copilot subscription, using Claude 3.7 in Agent mode. Other LLMs or IDEs may be more or less effective, etc.
Things Claude was good at
Fixing syntax errors
TLA+ uses a very different syntax than mainstream programming languages, meaning beginners make a lot of mistakes by reaching for "programming syntax" instead of TLA+ syntax:
NotThree(x) = \* should be ==, not =
x != 3 \* should be #, not !=
The problem is that the TLA+ syntax checker, SANY, is 30 years old and doesn't provide good information. Here's what it says for that snippet:
Was expecting "==== or more Module body"
Encountered "NotThree" at line 6, column 1
That only isolates one error and doesn't tell us what the problem is, only where it is. Experienced TLA+ users get "error eyes" and can quickly see what the problem is, but beginners really struggle with this.
The TLA+ Foundation has made LLM integration a priority, so the VSCode extension naturally supports several agent actions. One of these is running SANY, meaning an agent can get an error, fix it, get another error, fix it, etc. Provided the above sample and asked to make it work, Claude successfully fixed both errors. It also fixed many errors in a larger spec and figured out why PlusCal specs weren't compiling to TLA+.
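For reference, here's what the fixed snippet looks like:

NotThree(x) == \* == for definitions
    x # 3      \* # for inequality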
This by itself is already enough to make LLMs a worthwhile tool, as it fixes one of the biggest barriers to entry.
Understanding error traces
When TLA+ finds a violated property, it outputs the sequence of steps that leads to the error. This starts in plaintext, and VSCode parses it into an interactive table.
Learning to read these error traces is a skill in itself. You have to understand what's happening in each step and how it relates back to the actually broken property. It takes a long time for people to learn how to do this well.
Claude was successful here, too, accurately reading 20+ step error traces and giving a high-level explanation of what went wrong. It could also condense error traces: if ten steps of the error trace could be condensed into a one-sentence summary (which can happen if you're modeling a lot of process internals), Claude would do it.
I did have issues doing this in agent mode: while the extension does provide a "run model checker" command, the agent would regularly ignore it and run a terminal command instead. This would be fine, except that the LLM consistently hallucinated invalid commands. I had to amend every prompt with "run the model checker via vscode, do not use a terminal command". You can skip this if you're willing to copy and paste the error trace into the prompt.
As with syntax checking, if this was the only thing LLMs could effectively do, that would already be enough¹ to earn a strong recommend. Even as a TLA+ expert I expect I'll be using this trick regularly.
Boilerplate tasks
TLA+ has a lot of boilerplate. One of the most notorious examples is UNCHANGED rules. Specifications are extremely precise: so precise that you have to specify what variables don't change in every step. This takes the form of an UNCHANGED clause at the end of relevant actions:
RemoveObjectFromStore(srv, o, s) ==
    /\ o \in stored[s]
    /\ stored' = [stored EXCEPT ![s] = @ \ {o}]
    /\ UNCHANGED <<capacity, log, objectsize, pc>>
Writing this is really annoying. Updating these whenever you change an action, or add a new variable to the spec, is doubly so. Syntax checking and error analysis are important for beginners, but this is what I wanted for myself. I took a spec and prompted Claude:

Add UNCHANGED <...> for each variable not changed in an action.

And it worked! It successfully updated the UNCHANGED in every action.
(Note, though, that it was a "well-behaved" spec in this regard: only one "action" happened at a time. In TLA+ you can have two actions happen simultaneously, each updating half of the variables, meaning neither of them should have an UNCHANGED clause. I haven't tested how Claude handles that!)
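To illustrate what I mean, here's a contrived sketch (the variables are made up):

\* Two sub-actions, each responsible for exactly one variable.
SendMsg  == msgs' = msgs \cup {"ping"}
CountMsg == sent' = sent + 1

\* A step where both happen at once: every variable changes,
\* so neither sub-action gets an UNCHANGED clause.
SendAndCount == SendMsg /\ CountMsg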
That's the most obvious win, but Claude was good at handling other tedious work, too. Some examples include updating vars (the conventional collection of all state variables), lifting a hard-coded value into a model parameter, and changing data formats. Most impressive to me, though, was rewriting a spec designed for one process to instead handle multiple processes. This means taking all of the process variables, which originally have types like Int, converting them to types like [Process -> Int], and then updating the uses of all of those variables in the spec. It didn't account for race conditions in the new concurrent behavior, but it was an excellent scaffold to do more work.
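Roughly, the shape of that refactor looks like this (hypothetical names, not the actual spec):

\* Before: a single implicit process.
VARIABLE counter                  \* counter \in Int
Incr == counter' = counter + 1

\* After: one value per process.
VARIABLE counter                  \* counter \in [Process -> Int]
Incr(p) == counter' = [counter EXCEPT ![p] = @ + 1]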
Writing properties from an informal description
You have to be pretty precise with your intended property description, but Claude handles converting that precise description into TLA+'s formalized syntax, which is something beginners often struggle with.
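For example, an informal request like "every submitted job eventually finishes" might come back as something along these lines (hypothetical names, assuming a jobs variable that maps job ids to statuses):

EveryJobFinishes ==
    \A j \in JobIds:
        (jobs[j] = "submitted") ~> (jobs[j] = "done")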
Things it is less good at
Generating model config files
To model check TLA+, you need both a specification (.tla) and a model config file (.cfg), which have separate syntaxes. Asking the agent to generate the second often led to it using TLA+ syntax. It automatically fixed this after getting parsing errors, though.
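For context, a model config file is just a handful of keyword declarations, in a syntax that looks nothing like TLA+ (names here are illustrative):

SPECIFICATION Spec
CONSTANTS
    Procs = {p1, p2}
INVARIANT TypeOK
PROPERTY Termination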
Fixing specs
Whenever it ran model checking and discovered a bug, it would naturally propose a change to either the invalid property or the spec. Sometimes the changes were good, other times the changes were not physically realizable. For example, if it found that a bug was due to a race condition between processes, it would often suggest fixing it by saying race conditions were okay. I mean yes, if you say bugs are okay, then the spec finds that bugs are okay! Or it would alternatively suggest adding a constraint to the spec saying that race conditions don't happen. But that's a huge mistake in specification, because race conditions happen if we don't have coordination. We need to specify the mechanism that is supposed to prevent them.
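To make that concrete, the kind of "fix" I mean looks something like this hypothetical fragment: the added guard makes the race vanish from the model, but nothing in the system actually enforces it.

Write(p) ==
    \* Wishful thinking: assume no other process is writing right now.
    \* There's no lock or other mechanism in the spec that enforces this.
    /\ \A q \in Procs \ {p} : state[q] # "writing"
    /\ state' = [state EXCEPT ![p] = "writing"]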
Finding properties of the spec
After seeing how capable it was at translating my properties to TLA+, I started prompting Claude to come up with properties on its own. Unfortunately, almost everything I got back was either trivial, uninteresting, or too coupled to implementation details. I haven't tested if it would work better to ask it for "properties that may be violated".
Generating code from specs
I have to be specific here: Claude could sometimes convert Python into a passable spec, and vice versa. It wasn't good at recognizing abstraction. For example, TLA+ specifications often represent sequential operations with a state variable, commonly called pc. If modeling code that nonatomically retrieves a counter value and increments it, we'd have one action that requires pc = "Get" and sets the new value to "Inc", then another that requires it be "Inc" and sets it to "Done".
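Here's a minimal sketch of that pattern (my own toy example, not one of the specs I tested):

---- MODULE NonatomicIncrement ----
EXTENDS Naturals

VARIABLES counter, tmp, pc

Init ==
    /\ counter = 0
    /\ tmp = 0
    /\ pc = "Get"

\* Read the current counter value, then move to the "Inc" step.
Get ==
    /\ pc = "Get"
    /\ tmp' = counter
    /\ pc' = "Inc"
    /\ UNCHANGED counter

\* Write back the incremented value, then finish.
Inc ==
    /\ pc = "Inc"
    /\ counter' = tmp + 1
    /\ pc' = "Done"
    /\ UNCHANGED tmp

Next == Get \/ Inc
====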
I found that Claude would try to somehow convert pc into part of the Python program's state, rather than recognizing it as a TLA+ abstraction. On the other side, when converting Python code to TLA+ it would often try to translate things like sleep into some part of the spec, not recognizing that it is abstractable into a distinct action. I didn't test other possible misconceptions, like converting randomness to nondeterminism.
For the record, when converting TLA+ to Python, Claude tended to make simulators of the spec, rather than possible production code implementing the spec. I wasn't really expecting otherwise, though.
Unexplored Applications
Things I haven't explored thoroughly but could possibly be effective, based on what I know about TLA+ and AI:
Writing Java Overrides
Most TLA+ operators are resolved via TLA+ interpreters, but you can also implement them in "native" Java. This lets you escape the standard language semantics and add capabilities like executing programs during model-checking or dynamically constraining the depth of the searched state space. There are a lot of cool things I think would be possible with overrides. The problem is there's only a handful of people in the world who know how to write them. But that handful have written quite a few overrides, and I think there's enough there for Claude to work with.
Writing specs, given a reference mechanism
In all my experiments, the LLM only had my prompts and the occasional Python script as information. That makes me suspect that some of its problems with writing and fixing specs come down to not having a system model. Maybe it wouldn't suggest fixes like "these processes never race" if it had a design doc saying that the processes can't coordinate.
(Could a Sufficiently Powerful LLM derive some TLA+ specification from a design document?)
Connecting specs and code
This is the holy grail of TLA+: taking a codebase and showing it correctly implements a spec. Currently the best ways to do this are by either using TLA+ to generate a test suite, or by taking logged production traces and matching them to TLA+ behaviors. This blog post discusses both. While I've seen a lot of academic research into these approaches there are no industry-ready tools. So if you want trace validation you have to do a lot of manual labour tailored to your specific product.
If LLMs could do some of this work for us then that'd really amplify the usefulness of TLA+ to many companies.
Thoughts
Right now, agents seem good at the tedious and routine parts of TLA+ and worse at the strategic and abstraction parts. But, since the routine parts are often a huge barrier to beginners, this means that LLMs have the potential to make TLA+ far, far more accessible than it previously was.
I have mixed thoughts on this. As an advocate, this is incredible. I want more people using formal specifications because I believe it leads to cheaper, safer, more reliable software. Anything that gets people comfortable with specs is great for our industry. As a professional TLA+ consultant, I'm worried that this obsoletes me. Most of my income comes from training and coaching, which companies will have far less demand for now. Then again, maybe this is an opportunity to pitch "agentic TLA+ training" to companies!
Anyway, if you're interested in TLA+, there has never been a better time to try it. I mean it, these tools handle so much of the hard part now. I've got a free book available online, as does the inventor of TLA+. I like this guide too. Happy modeling!
¹ Dayenu. ↩