Computer Things

Subscribe
Archives
March 8, 2023

Formalizing Stability and Resilience Properties

Via TLA+

Sent to me via mailbag:

Have you used TLA+ to model resilience as in resilience engineering (systems in general)? — Feodrippe

Resilience Engineering, to my understanding, refers to building systems that can function in the presence of disruptions. For example, if a backend service gets overloaded, its dependencies can automatically switch to fallback logic to keep serving customers and give the service time to recover.

I've not done much work personally with "resilience engineering". But the idea comes up a lot in formal methods, in the form of resilience properties. These are goals of the systems that we want (and use resilience engineering) to achieve. And I think the details of resilience properties are interesting enough for a deep dive. I'll use syntax, but it should be applicable to other specification languages, too.

Formalizing the Properties

Let x be some measurable value (temperature, number of servers online, queue size, weight, etc) and S be a range of desirable values. The goal of the system is to keep x in the range, or in TLA+ x \in S. We'll call this being "in control": Control == x \in S.

The strongest possible property is the invariant STRONG: []Control, "in all states x is in control". If for a single state x is outside control, this property fails. This is undesirable, as often the system starts with x out of the range and has to get it in first. So []Control will automatically fail on the starting step.

(A stronger property is one that's harder to guarantee.)

On the other side of the spectrum, WEAK: <>Control is the weakest possible property: "in at least one state, x is in control". This allows x to start outside the range and go inside, but it only needs to do so once, then everything can break down. While there are cases for both the strongest and weakest versions of the prop, we usually want something in between.

So first, we could say that we don't start in control, but once we reach it, we never lose control (TRIPWIRE). In TLA+, this is <>Control /\ [](Control => []Control). Notice that by conjoining a property to WEAK we make the property stronger. TRIPWIRE is better than STRONG but still too sensitive. If there's some random variance in x, a tiny fluctuation might trigger the tripwire and then immediately fail the property.

We could instead allow a bit of jostle but say eventually we converge on being in control (STABLE). Formally, "eventually always" we're in control, or there's eventually a point where, from that point on, we are forever in control. We can write STABLE as <>[]Control.

Finally, resilience. It's possible to achieve stability in a closed system, where the only thing that matters is the other components of the system, but in an open system, the outside world has a vote too. The temperature can be thrown out of the right range by a sudden heat wave, or a storm knocks out one of our servers, and so puts us out of control. Often the best we can do is guarantee we can recover from a shock: RESILIENT is the property that we always eventually return to control, or in TLA+ []<>Control.

Relationships between properties

Formal methods isn't just about how properties relate to systems. It's also useful to talk about how properties relate to other properties. A => B means that A is stronger than B:1 it is plausible for a system to satisfy B without satisfying A, but not the other way around.

Property strength is a partial order: Most properties aren't comparable to each other. A and B can be incomparable even if they both imply C. Fortunately, partial orders are transitive: A => B and B => C means that A => C. Here's the current spectrum

STRONG => TRIPWIRE => STABLE => RESILIENT => WEAK

The important one here is STABLE => RESILIENT, or <>[]Control => []<>Control. To explain this intuitively, if you eventually converge on reaching control, then nothing in the future kicks you out of control, so you will successfully recover from all (zero) shocks.

These are relationships between different types of properties. We can also create stronger or weaker properties within the same type by tweaking the size of the target range. Let's redefine all of our properties to be parameterized on a set of values, so f.ex STABLE(1..3) == <>[]Control(1..3) = <>[](x \in 1..3). Then any system satisfying STABLE(1..3) also satisfies STABLE(1..10), so the former property is stronger. For every property P on our spectrum,

IF narrow \subseteq wide
THEN P(narrow) => P(wide)

This follows because every value of narrow is also a value of wide, so being restricted to narrow means we're restricted to wide too. This doesn't hold if narrow isn't a subset, even if it's much smaller: STABLE(1..3) doesn't automatically guarantee STABLE(2..1000).

Resilient Systems

Given the resilient system, the ideal property to guarantee is STABLE(narrow). In practice, that's infeasible, because the reason we're designing a resilient system in the first place is because we're expecting disruptions. There's going to be some level of strength that's beyond our reach, either impossible or too expensive. There are two weaker properties we can aim for:

  1. RESILIENT(narrow)
  2. STABLE(wide)

These are not necessarily comparable. A system can satisfy (1) without (2) or (2) without (1). So we can make both of them goals:

Goal == RESILIENT(narrow) /\ STABLE(wide)

Metastability

There's one more way we can weaken a property: we can precondition it on another property.

STABLE(wide) => RESILIENT(narrow)

"If we can guarantee we're stable in the wide range, then we can guarantee we're resilient in narrow range". This creates a "metastable state": As long as we remain in a certain value range, we can always return to control. If we're pushed too far, though, we leave the metastable state and all bets are off.

We can define multiple metastable states in our system:2

Metastable(wide, narrow) == STABLE(wide) => RESILIENT(narrow)

Props ==
  \A <<x, y>> \in {<<w1, n1>>, <<w2, n2>>}:
    Metastable(x, y)

Time to Recovery

What about saying "we recover quickly"? That's trickier. One thing we can do is say there's an implicit "time horizon" in our system, so that a failure of []<>Resilient could mean both "we never recover" and "we recover the next day, which is equivalent to 'never recovering' for our purposes."

If we want stricter time bounds, we need to embed a notion of time in our model of the system, and then track how many time units various system actions take. This is a LOT hard to do so you probably won't do it unless you absolutely need to.

Conclusions

I should probably turn this into a learntla topic or something

Also reminder that I'm running a TLA+ workshop on March 20. Use the code C0MPUT3RTHINGS for 15% off!


  1. "At least as strong", actually. => is also the symbol for boolean implication: this is intentional, and in fact identical! "A is stronger than B" is the same as "A implies B", where A and B are both booleans. ↩

  2. This is valid TLA+, but I'm not sure if TLC can check this and am too lazy to check. ↩

If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.

My new book, Logic for Programmers, is now in early access! Get it here.

Don't miss what's next. Subscribe to Computer Things:
Start the conversation:
This email brought to you by Buttondown, the easiest way to start and grow your newsletter.