LLMs are bad at vibing specifications
When it fails to do your thinking for you
No newsletter next week
I'll be speaking at InfoQ London. But see below for a book giveaway!
LLMs are bad at vibing specifications
About a year ago I wrote AI is a gamechanger for TLA+ users, which argued that AI is a "specification force multiplier". That was written from the perspective of a TLA+ expert using these tools. A full 4% of GitHub TLA+ specs now have the word "Claude" somewhere in them. This is interesting to me, because it suggests there was always an interest in formal methods; people just lacked the skills to act on it.
It's also interesting because it gives me a sense of what happens when beginners use AI to write formal specs. It's not good.
As a case study, we'll use this project, which was kind enough to have vibed out both TLA+ and Alloy specs.
Looking at a project
Starting with the Alloy spec. Here it is in its entirety:
module ThreatIntelMesh

sig Node {}
one sig LocalNode extends Node {}

sig Snapshot {
    owner: one Node,
    signed: one Bool,
    signatures: set Signature
}

sig Signature {}

sig Policy {
    allowUnsignedImport: one Bool
}

pred canImport[p: Policy, s: Snapshot] {
    (p.allowUnsignedImport = True) or (s.signed = True)
}

assert UnsignedImportMustBeDenied {
    all p: Policy, s: Snapshot |
        p.allowUnsignedImport = False and s.signed = False implies not canImport[p, s]
}

assert SignedImportMayBeAccepted {
    all p: Policy, s: Snapshot |
        s.signed = True implies canImport[p, s]
}

check UnsignedImportMustBeDenied for 5
check SignedImportMayBeAccepted for 5
Couple of things to note here: first of all, this doesn't actually compile. It uses the Boolean standard module, so it needs open util/boolean to function. Second, Boolean is the wrong approach here anyway; you're supposed to use subtyping (a subset signature).
sig Snapshot {
    owner: one Node,
-   signed: one Bool,
    signatures: set Signature
}
+ sig SignedSnapshot in Snapshot {}

pred canImport[p: Policy, s: Snapshot] {
-   s.signed = True
+   s in SignedSnapshot
}
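For reference, here's one way the two fixes could be combined into a spec that should actually load. This is a sketch I haven't run through the Analyzer; I've kept the Policy field as a Bool to stay close to the original diff, and only updated the one assertion that mentions signed:

```alloy
module ThreatIntelMesh
open util/boolean  // required for Bool, True, False

sig Node {}
one sig LocalNode extends Node {}

sig Snapshot {
    owner: one Node,
    signatures: set Signature
}
// subset signature instead of a Bool field
sig SignedSnapshot in Snapshot {}

sig Signature {}

sig Policy {
    allowUnsignedImport: one Bool
}

pred canImport[p: Policy, s: Snapshot] {
    (p.allowUnsignedImport = True) or (s in SignedSnapshot)
}

assert UnsignedImportMustBeDenied {
    all p: Policy, s: Snapshot |
        p.allowUnsignedImport = False and s not in SignedSnapshot
            implies not canImport[p, s]
}

check UnsignedImportMustBeDenied for 5
```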
So we know the person did not actually run these specs. This is somewhat less of a problem in TLA+, which has an official MCP server that lets the agent run model checking. Even so, I regularly see specs that I'm pretty sure won't model check, with things like using Reals or assuming NULL is a built-in and not a user-defined constant.
The bigger problem with the spec is that UnsignedImportMustBeDenied and SignedImportMayBeAccepted don't actually do anything. canImport is defined as P || Q. UnsignedImportMustBeDenied checks that !P && !Q => !(P || Q). SignedImportMayBeAccepted checks that Q => P || Q. These are tautologically true! If they do anything at all, it is only checking that canImport was defined correctly.
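You don't need a model checker to see this; a four-row truth table settles it. Here's a quick sketch (names are mine, not from the spec) that brute-forces every assignment and confirms both assertions hold no matter what the rest of the model does:

```python
from itertools import product

def can_import(allow_unsigned: bool, signed: bool) -> bool:
    # canImport is defined as P || Q
    return allow_unsigned or signed

# Enumerate all truth assignments for (P, Q). Both "checks" hold in
# every row, so they can never fail, regardless of the model.
for p, q in product([False, True], repeat=2):
    # UnsignedImportMustBeDenied: !P && !Q => !canImport
    assert not (not p and not q) or not can_import(p, q)
    # SignedImportMayBeAccepted: Q => canImport
    assert not q or can_import(p, q)

print("both assertions are tautologies")
```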
You see the same thing in the TLA+ specs, too:
GadgetPayload ==
    /\ gadgetDetected' = TRUE
    /\ depth' \in 0..(MaxDepth + 5)
    /\ UNCHANGED allowlistedFormat
    /\ decision' = "block"
NoExploitAllowed == gadgetDetected => decision = "block"
The AI is only writing "obvious properties", which fail for reasons like "we missed a guard clause" or "we forgot to update a variable". It does not seem to be good at writing "subtle" properties that fail due to concurrency, nondeterminism, or bad behavior separated by several steps. Obvious properties are useful for orienting yourself and ensuring the system behaves like you expect, but the actual value in using formal methods comes from the subtle properties.
(This ties into Strong and Weak Properties. LLM properties are weak, intended properties need to be strong.)
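To make the contrast concrete, here's a hypothetical sketch (mine, not from the project) of what a subtler property over the same toy variables might look like: an action property, which constrains pairs of states rather than individual states, so it can catch regressions that no single-state invariant can see:

```tla
\* Hypothetical: once a gadget has been detected, no later step may
\* relax the decision away from "block". This constrains transitions,
\* not states -- an invariant like NoExploitAllowed can't express it.
BlockIsSticky == [][gadgetDetected => decision' = "block"]_vars
```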
This is a problem I see in almost every FM spec written by AI. The LLMs aren't providing one of the core benefits of a spec. Articles like Prediction: AI will make formal verification go mainstream and When AI Writes the World's Software, Who Verifies It? argue that LLMs will make formal methods go mainstream, but being able to easily write specifications doesn't help with correctness if the specs don't actually verify anything.
Is this a user error?
I first got interested in LLMs and TLA+ from The Coming AI Revolution in Distributed Systems. The author of that later vibecoded a spec with a considerably more complex property:
NoStaleStrictRead ==
    \A i \in 1..Len(eventLog) :
        LET ev == eventLog[i] IN
        ev.type = "read" =>
            LET c == ev.chunk IN
            LET v == ev.version IN
            /\ \A j \in 1..i :
                   LET evC == eventLog[j] IN
                   evC.type = "commit" /\ evC.chunk = c => evC.version <= v
This is a lot more complicated than the (P => Q && P) => Q properties I've seen! It could be because the corresponding system already had a complete spec written in P. But it could also be that Cheng Huang is already an expert specifier, meaning he can get more out of an LLM than an ordinary developer can. I've also noticed that I can usually coax an LLM to do more interesting things than most of my clients can. Which is good for my current livelihood, but bad for the hope of LLMs making formal methods mainstream. If you need to know formal methods to get the LLM to do formal methods, is that really helping?
(Yes, if it lowers the skill threshold: if it means you can apply FM with 20 hours of practice instead of 80. But the jury's still out on how much it lowers the threshold. What if it only lowers it from 80 to 75?)
On the other hand, there also seem to be some properties that AI struggles with, even with explicit instructions. Last week a client and I tried to get Claude to generate a good liveness or action property instead of a standard obvious invariant, and it just couldn't. Training data issue? Something in the innate complexity of liveness? It's not clear yet. These properties are even more "subtle" than most invariants, so maybe that's it.
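For reference, the shape we were asking for is something like this hypothetical leads-to property over the earlier toy variables (and note that TLC also needs fairness assumptions on the actions before it can check anything like it):

```tla
\* Hypothetical liveness property: every detected gadget is eventually
\* blocked. "~>" is "leads-to": whenever the left side holds, the right
\* side must hold at some later point in the behavior.
EventuallyBlocked == gadgetDetected ~> (decision = "block")
```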
On the other other hand, this is all as of March 2026. Maybe this whole article will be laughably obsolete by June.
Logic for Programmers Giveaway
Last week's giveaway raised a few issues. First, the New World copies were all taken before all of the emails went out, so a lot of people did not even get a chance to try for a book. Second, due to a Leanpub bug the Europe coupon scheduled for 10 AM UTC actually activated at 10 AM my time, which was early evening for Europe. Third, everybody in the APAC region got left out.
So, since I'm not doing a newsletter next week, let's have another giveaway:
- This coupon will go up 2026-03-16 at 11:00 UTC, which should be noon Central European Time, and be good for ten books (five for this giveaway, five to account for last week's bug).
- This coupon will go up 2026-03-17 at 04:00 UTC, which should be noon Beijing Time, and be good for five books.
- This coupon will go up 2026-03-17 at 17:00 UTC, which should be noon Central US Time, and also be good for five books.
I think that gives the best chance of everybody getting at least a chance of a book, while being resilient to timezone shenanigans due to travel / Leanpub dropping bugfixes / daylight savings / whatever.
(No guarantees that later "no newsletter" weeks will have giveaways! This is a gimmick.)