What if the spec doesn't match the code?
Address people's biggest concern about formal methods
Whenever I talk about formal methods, I get the same question:
Can I use the spec to generate my code?
People are worried about two things. One, that they'll make a mistake implementing the specification and have bugs. Two, that over time the implementation will "drift" and people won't update the spec, leading to unexpected design issues.
These are valid fears, and both are things that happen when writing specifications! Which is why it extra sucks that my answer is usually "you can't". This newsletter will be a quick explanation as to why, plus why it's still a net win to write a spec, and an alternate trick you can do instead to keep things in sync.
Synchronization is too expensive
There's two ways we can keep code in sync with a spec. The first way is program extraction, where you take a spec and generate a complex program from it. Lots of theorem provers, like Coq and Agda, have this functionality. The second way is refinement, where you write code independently and then prove it somehow matches the spec. This is what the seL4 people did with their C code.
Extractable/refineable spec are much, much harder to write than an "unsynced" spec because you lose a level of abstraction. Specifications can exist at a higher level of abstraction than a program can, which is why they can be so much more concise and focused than a program can. For example, here's a quick PlusCal snippet1 for a reader pulling a value off a queue and adding it to that reader's total:
await Nonempty(queue);
with r \in Readers do
total[r] := total[r] + Head(queue);
queue := Tail(queue);
end with;
What would that look like in a program? You'd probably have code for setting up the readers, code for reading from the queue, code for persisting the value of total
somewhere. If it's in another system like Postgres or redis, you'd have a client library for that, too. And around all of that you'd have logging, heartbeats, event hooks, a whole bunch of orthogonal concerns.
You might not even want to implement the whole spec! This is the machine vs world dichotomy: the spec can encode both the system you're designing and also the environment around that system. Let's say you're using AWS SQS for the queue, so there's a chance you can't connect to it, and a chance it duplicates a message. You can't implement either of those things in your system, only the code that detects and reacts to them. But adding those are par for the course in specification. It might look like this:
await Nonempty(queue);
with r \in Readers do
either
\* unreachable!
status[r] := "error";
or
\* duplicate!
total[r] := total[r] + Head(queue);
or
\* happy path
total[r] := total[r] + Head(queue);
queue := Tail(queue);
end either;
end with;
To synchronize spec and code, you need to bring the spec closer to the level of code, which means losing a lot of abstraction power. It's definitely possible, but considerably more expensive. Synchronization is not affordable for most companies.
But specs are still worth it
Even if you can't keep the spec in sync, it's still a net benefit. Consider what happens if you have a bug in production:
- With a spec: either you didn't implement the design correctly, or an assumption in the design ("the queue is perfectly reliable") was invalid
- Without a spec: you didn't implement the design correctly, an assumption in the design was invalid, or the design itself has a flaw.
That last one is the most dangerous and the most expensive to fix — this talk estimated that one design flaw cost their team a year of rework. They're also really hard to correctly diagnose without a spec. I've often seen people think a design bug was actually an implementation bug. They'd "fix" the bug and it'd go away for just long enough for everybody to forget about it before suddenly resurfacing.
And then you gotta make sure that the new design isn't buggy, either.
That's why formal methods are a net win, even without syncing with code. With a verified design you can build systems faster and you can fix bugs faster, because you're confident that the bugs you got are "only" implementation errors.2
We're working on it
A confession: the "this is impossible" section was glazing over a lot of interesting work.
First of all, cheaply generating code from specs might not be viable in general, but it's possible for specific subsets of specification. This is mostly the realm of research projects like PGo, which converts (a variant of) PlusCal to Go code. You can see some examples here. I don't think it's going to be production ready for a long time, but at least it's being worked on.
A more exciting approach is generating tests from specs. I know a few companies who've done this, with the most public example being MongoDB. They discuss their approach in eXtreme modeling in practice:
We wrote a new TLA+ specification for this purpose and used the state space explored by the TLC model checker to generate C++ and Golang test cases for aspects of the OT algorithm. These test cases can be automatically generated and run after any change to the model, and ensure that both the C++ and Golang implementations make appropriate changes to maintain equivalence.
More specifically, they took a TLA+ spec and dumped the state graph, then decomposed it into a set of linear behaviors and converted those into about 5,000 tests. This works better than code generation because tests are, in a strange way, "more abstract" than code. Lots of implementations pass the same test suite. So it's less of a jump.
The TLA+ model checker has a bunch of features to make test generation easier: simulation mode lets you grab random traces and ALIAS can fine-tune the trace outputs. There's also an IO extension for calling subprocesses from model checking, which can be used to drive a test harness.
Two important caveats. One, the MongoDB team wrote a bunch of custom code to make this work; there's no off-the-shelf solution for turning specs into tests. Second, they were specifying a low-level, deterministic program, so were able to keep the TLA+ relatively close to the code. Writing a test harness for a distributed system is a lot harder. It's still a promising avenue!
Comments Section
I got a lot of responses to my graph types essay. I compiled some of the most interesting ones into a "Comments Section", here. There's some cool stuff about datalog and Rubik's Cubes there!
-
PlusCal is a DSL that compiles to pure TLA+. I'm using it here because the equivalent TLA+ constructs are a little harder for outsiders to read. ↩
-
One weak point in this is the "An assumption in the design was wrong" problem, but generally it's much easier to adapt a working design that fails under one broken assumption than an ad hoc design that might not have worked in the first place. ↩
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.