Formalizing Stability and Resilience Properties
Via TLA+
Sent to me via mailbag:
Have you used TLA+ to model resilience as in resilience engineering (systems in general)? — Feodrippe
Resilience Engineering, to my understanding, refers to building systems that can function in the presence of disruptions. For example, if a backend service gets overloaded, its dependencies can automatically switch to fallback logic to keep serving customers and give the service time to recover.
I've not done much work personally with "resilience engineering". But the idea comes up a lot in formal methods, in the form of resilience properties. These are goals of the systems that we want (and use resilience engineering) to achieve. And I think the details of resilience properties are interesting enough for a deep dive. I'll use syntax, but it should be applicable to other specification languages, too.
Formalizing the Properties
Let x
be some measurable value (temperature, number of servers online, queue size, weight, etc) and S be a range of desirable values. The goal of the system is to keep x in the range, or in TLA+ x \in S
. We'll call this being "in control": Control == x \in S
.
The strongest possible property is the invariant STRONG: []Control
, "in all states x is in control". If for a single state x is outside control, this property fails. This is undesirable, as often the system starts with x out of the range and has to get it in first. So []Control
will automatically fail on the starting step.
(A stronger property is one that's harder to guarantee.)
On the other side of the spectrum, WEAK: <>Control
is the weakest possible property: "in at least one state, x is in control". This allows x to start outside the range and go inside, but it only needs to do so once, then everything can break down. While there are cases for both the strongest and weakest versions of the prop, we usually want something in between.
So first, we could say that we don't start in control, but once we reach it, we never lose control (TRIPWIRE
). In TLA+, this is <>Control /\ [](Control => []Control)
. Notice that by conjoining a property to WEAK
we make the property stronger. TRIPWIRE
is better than STRONG
but still too sensitive. If there's some random variance in x
, a tiny fluctuation might trigger the tripwire and then immediately fail the property.
We could instead allow a bit of jostle but say eventually we converge on being in control (STABLE
). Formally, "eventually always" we're in control, or there's eventually a point where, from that point on, we are forever in control. We can write STABLE as <>[]Control
.
Finally, resilience. It's possible to achieve stability in a closed system, where the only thing that matters is the other components of the system, but in an open system, the outside world has a vote too. The temperature can be thrown out of the right range by a sudden heat wave, or a storm knocks out one of our servers, and so puts us out of control. Often the best we can do is guarantee we can recover from a shock: RESILIENT
is the property that we always eventually return to control, or in TLA+ []<>Control
.
Relationships between properties
Formal methods isn't just about how properties relate to systems. It's also useful to talk about how properties relate to other properties. A => B
means that A is stronger than B:1 it is plausible for a system to satisfy B without satisfying A, but not the other way around.
Property strength is a partial order: Most properties aren't comparable to each other. A and B can be incomparable even if they both imply C. Fortunately, partial orders are transitive: A => B
and B => C
means that A => C
. Here's the current spectrum
STRONG => TRIPWIRE => STABLE => RESILIENT => WEAK
The important one here is STABLE => RESILIENT
, or <>[]Control => []<>Control
. To explain this intuitively, if you eventually converge on reaching control, then nothing in the future kicks you out of control, so you will successfully recover from all (zero) shocks.
These are relationships between different types of properties. We can also create stronger or weaker properties within the same type by tweaking the size of the target range. Let's redefine all of our properties to be parameterized on a set of values, so f.ex STABLE(1..3) == <>[]Control(1..3) = <>[](x \in 1..3)
. Then any system satisfying STABLE(1..3)
also satisfies STABLE(1..10)
, so the former property is stronger. For every property P on our spectrum,
IF narrow \subseteq wide
THEN P(narrow) => P(wide)
This follows because every value of narrow
is also a value of wide
, so being restricted to narrow
means we're restricted to wide
too. This doesn't hold if narrow
isn't a subset, even if it's much smaller: STABLE(1..3)
doesn't automatically guarantee STABLE(2..1000)
.
Resilient Systems
Given the resilient system, the ideal property to guarantee is STABLE(narrow)
. In practice, that's infeasible, because the reason we're designing a resilient system in the first place is because we're expecting disruptions. There's going to be some level of strength that's beyond our reach, either impossible or too expensive. There are two weaker properties we can aim for:
RESILIENT(narrow)
STABLE(wide)
These are not necessarily comparable. A system can satisfy (1) without (2) or (2) without (1). So we can make both of them goals:
Goal == RESILIENT(narrow) /\ STABLE(wide)
Metastability
There's one more way we can weaken a property: we can precondition it on another property.
STABLE(wide) => RESILIENT(narrow)
"If we can guarantee we're stable in the wide
range, then we can guarantee we're resilient in narrow
range". This creates a "metastable state": As long as we remain in a certain value range, we can always return to control. If we're pushed too far, though, we leave the metastable state and all bets are off.
We can define multiple metastable states in our system:2
Metastable(wide, narrow) == STABLE(wide) => RESILIENT(narrow)
Props ==
\A <<x, y>> \in {<<w1, n1>>, <<w2, n2>>}:
Metastable(x, y)
Time to Recovery
What about saying "we recover quickly"? That's trickier. One thing we can do is say there's an implicit "time horizon" in our system, so that a failure of []<>Resilient
could mean both "we never recover" and "we recover the next day, which is equivalent to 'never recovering' for our purposes."
If we want stricter time bounds, we need to embed a notion of time in our model of the system, and then track how many time units various system actions take. This is a LOT hard to do so you probably won't do it unless you absolutely need to.
Conclusions
I should probably turn this into a learntla topic or something
Also reminder that I'm running a TLA+ workshop on March 20. Use the code C0MPUT3RTHINGS
for 15% off!
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.