Computer Things

October 10, 2025

Three ways formally verified code can go wrong in practice

"Correct" doesn't mean "correct" when correctly using "correct"

New Logic for Programmers Release!

v0.12 is now available! This should be the last major content release. The next few months are going to be technical review, copyediting and polishing, with a hopeful 1.0 release in March. Full release notes here.

Cover of the boooooook

Three ways formally verified code can go wrong in practice

I run this small project called Let's Prove Leftpad, where people submit formally verified proofs of the eponymous meme. Recently I read Breaking “provably correct” Leftpad, which argued that most (if not all) of the provably correct leftpads have bugs! The Lean proof, for example, should render leftpad('-', 9, אֳֽ֑) as ---------אֳֽ֑, but actually produces ------אֳֽ֑.

You can read the article for a good explanation of why this goes wrong (Unicode). The actual problem is that correct can mean two different things, and this leads to confusion about how much formal methods can actually guarantee. So I see this as a great opportunity to talk about the nature of proof, correctness, and how "correct" code can still have bugs.

What we talk about when we talk about correctness

In most of the real world, correct means "no bugs". Except "bugs" isn't a very clear category. A bug is anything that causes someone to say "this isn't working right, there's a bug." Being too slow is a bug, a typo is a bug, etc. "Correct" is a little fuzzy.

In formal methods, "correct" has a very specific and precise meaning: the code conforms to a specification (or "spec"). The spec is a higher-level description of the properties the code is supposed to have, usually something we can't just directly implement. Let's look at the most popular kind of proven specification:

-- Haskell
inc :: Int -> Int
inc x = x + 1

The type signature Int -> Int is a specification! It corresponds to the logical statement all x in Int: inc(x) in Int. The Haskell type checker can automatically verify this for us. It cannot, however, verify properties like all x in Int: inc(x) > x. Formal verification is concerned with verifying arbitrary properties beyond what is (easily) automatically verifiable. Most often, this takes the form of proof. A human manually writes a proof that the code conforms to its specification, and the prover checks that the proof is correct.
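To make the distinction concrete, here's a Python sketch (the function and property names are mine). The type annotation plays the role of Int -> Int, while the stronger property is a separate predicate that we can only spot-check here, since Python has no prover; spot-checking a few thousand inputs is evidence, not proof:

```python
def inc(x: int) -> int:
    # The annotation is the type-level spec: Int -> Int.
    return x + 1

def prop_inc_increases(x: int) -> bool:
    # The stronger property, all x: inc(x) > x, lives outside the
    # type system. Here we can only sample it, not prove it.
    return inc(x) > x

# Passing this check is not a proof, just evidence.
assert all(prop_inc_increases(x) for x in range(-1000, 1000))
```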

Even if we have a proof of "correctness", though, there are a few different ways the code can still have bugs.

1. The proof is invalid

For some reason the proof doesn't actually show the code matches the specification. This is pretty common in pencil-and-paper verification, where the proof is checked by someone saying "yep looks good to me". It's much rarer when doing formal verification but it can still happen in a couple of specific cases:

  1. The theorem prover itself has a bug (in the code or introduced in the compiled binary) that makes it accept an incorrect proof. People are really concerned about this, but it's so much rarer than every other way verified code goes wrong that I only include it for completeness.

  2. For convenience, most provers and FM languages have a "just accept this statement is true" feature. This helps you work on the big-picture proof and fill in the details later. If you leave a shortcut in, and the compiler is configured to allow code-with-proof-assumptions to compile, then you can compile incorrect code that "passes the proof checker". You really should know better, though.
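Lean's version of this feature is sorry: an unfinished proof compiles with only a warning, and anything depending on the theorem still builds (the theorem name below is mine):

```lean
-- Lean 4: `sorry` admits the statement without proof. The file
-- compiles with a warning, so dependent code still builds.
theorem inc_gt (x : Nat) : x + 1 > x := by
  sorry
```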

2. The properties are wrong

"The horrible bug you had wasn't covered in the specification/came from some other module/etc" (Galois)

This code is provably correct:

-- Haskell
inc :: Int -> Int
inc x = x - 1

The only specification I've given is the type signature Int -> Int. At no point did I put the property inc(x) > x in my specification, so it doesn't matter that it doesn't hold; the code is still "correct".

This is what "went wrong" with the leftpad proofs. They do not prove the property "leftpad(c, n, s) will take up either n spaces on the screen or however many characters s takes up (if more than n)". They prove the weaker property "len(leftpad(c, n, s)) == max(n, len(s)), for however you want to define len(string)". The second is a rough proxy for the first that works in most cases, but if someone really needs the former property they are liable to experience a bug.
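Here's a Python sketch of that gap, using a combining accent in place of the article's Hebrew example. The implementation below is the generic len-based leftpad that the proofs verify, not anyone's actual submission:

```python
import unicodedata

def leftpad(c, n, s):
    # Pads to n *code points*: the proven property is
    # len(leftpad(c, n, s)) == max(n, len(s)).
    return c * max(n - len(s), 0) + s

# "e" followed by a combining acute accent: two code points, one glyph.
s = "e\u0301"
padded = leftpad("-", 9, s)

assert len(padded) == 9  # the proven property holds...

# ...but a reader sees only 8 glyphs, since the accent stacks on the "e".
glyphs = sum(1 for ch in padded if unicodedata.combining(ch) == 0)
assert glyphs == 8
```

The proof is fine; it just proves the code-point property, not the visual one.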

Why don't we prove the stronger property? Sometimes it's because the code is meant to be used one way and people want to use it another way. This can lead to accusations that the developer is "misusing the provably correct code", but it should more often be seen as the verification expert failing to educate devs on what was actually "proven".

Sometimes it's because the property is too hard to prove. "Outputs are visually aligned" is a proof about Unicode inputs, and the core Unicode specification is 1,243 pages long.

Sometimes it's because the property we want is too hard to express. How do you mathematically represent "people will perceive the output as being visually aligned"? Is it OS and font dependent? These two lines are exactly five characters but not visually aligned:

|||||

MMMMM

Or maybe they are aligned for you! I don't know, lots of people read email in a monospace font. "We can't express the property" comes up a lot when dealing with human/business concepts as opposed to mathematical/computational ones.

Finally, there's just the possibility of a brain fart. All of the proofs in Nearly All Binary Searches and Mergesorts are Broken are like this. They (informally) proved the correctness of binary search with unbounded integers, forgetting that many programming languages use machine integers, where a large enough sum can overflow.

3. The assumptions are wrong

This is arguably the most important and most subtle source of bugs. Most properties we prove aren't "X is always true". They are "assuming Y is true, X is also true". Then if Y is not true, the proof no longer guarantees X. A good example of this is binary search, which only correctly finds elements assuming the input list is sorted. If the list is not sorted, it will not work correctly.
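A quick Python sketch: the same binary search that is "correct" on sorted input silently misses a present element when the precondition fails.

```python
def bsearch(xs, target):
    # Standard binary search: returns the index of target,
    # or -1 if "not found". Correct only if xs is sorted.
    lo, hi = 0, len(xs)
    while lo < hi:
        mid = (lo + hi) // 2
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo if lo < len(xs) and xs[lo] == target else -1

assert bsearch([1, 3, 5, 7], 5) == 2    # sorted: works
assert bsearch([5, 1, 7, 3], 5) == -1   # unsorted: 5 is present but "not found"
```

Nothing crashes; the proof's guarantee just doesn't apply once the assumption is violated.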

Formal verification adds two more wrinkles. One: sometimes we need assumptions to make the property valid, but we can also add them to make the proof easier. So the code can be bug-free even if the assumptions used to verify it no longer hold! Even if a leftpad implements visual alignment for all Unicode glyphs, it will be a lot easier to prove visual alignment for just ASCII strings and padding.

Two: we need to make a lot of environmental assumptions that are outside our control. Does the algorithm return output or use the stack? Need to assume there's sufficient memory to store stuff. Does it use any variables? Need to assume nothing is concurrently modifying them. Does it use an external service? Need to assume the vendor doesn't change the API or response formats. You need to assume the compiler worked correctly, the hardware isn't faulty, the OS doesn't mess with things, etc. Any of these could change well after the code is proven and deployed, meaning formal verification can't be a one-and-done thing.

You don't actually have to assume most of these, but each assumption you drop makes the proof harder and the properties you can prove more restricted. Remember, the code might still be bug-free even if the environmental assumptions change, so there's a tradeoff in time spent proving vs doing other useful work.

Another common source of "assumptions" is when verified code depends on unverified code. The Rust compiler can prove that safe code doesn't have a memory bug assuming unsafe code doesn't have one either, but it depends on the human to confirm that assumption. Liquid Haskell is verifiable but can also call regular Haskell libraries, which are unverified. We need to assume that code is correct (in the "conforms to spec" sense), and if it's not, our proof can be "correct" and still cause bugs.


These boundaries are fuzzy. I wrote that the "binary search" bug happened because they proved the wrong property, but you can just as well argue that it was a broken assumption (that integers could not overflow). What really matters is having a clear understanding of what "this code is proven correct" actually tells you. Where can you use it safely? When should you worry? How do you communicate all of this to your teammates?

Good lord it's already Friday

If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.

My new book, Logic for Programmers, is now in early access! Get it here.
