New Workshop, Test Inference
Delayed newsletter next week
I'm moving Monday, so newsletter should go out Tuesday. This is entirely unrelated to why the newsletter was delayed this week, which was due to a combination of tornado warning and Frog Fractions 3.
New Alloy Workshop!
Sept 21, 10 AM - 6 PM CST. Use the discount code [redacted] for $500 off.
(Going forward I'm planning alternate Alloy and TLA+ months. So two Alloy workshops in September, one TLA+ workshop in October, etc. I'm also sketching out a two hour predicate logic workshop; I'll slot that into the schedule when it's ready.)
Test Inference
One of the nice things about modern static FP languages is that they have good type inference: for the most part, you don't have to annotate the types. The language can figure it out for you. This is a huge affordance that goes a long way to making static typing useful. I wondered if we could transplant the same idea to other forms of correctness, like contracts and tests. Below are some sketches of what that could look like for testing.1
Since this is mostly just me experimenting with ideas, I'm playing fast and loose with the syntax and assuming the existence of a suitably-powerful property based testing library. I figure any first-stage project would be a mix of manual labor and code generation, where the human writes the hard bits and the computer writes the boring bits.
Round-trip properties
A common thing in property based testing is a round-trip property: we have two functions that are inverses of each other, such that applying both functions to an input returns that input. For example, encode_json and decode_json form a round-trip.
Round-trip properties are usually introduced as an easy property test for people just starting out with PBT. But two functions being inverses of each other is a semantic thing and can be annotated in our code:
def encode_json(dict) -> str:
    ...

@inverse_of(encode_json)
def decode_json(str) -> dict:
    ...
(We place it as a decorator, or program-level annotation, so our tool can directly import the functions and examine their annotations as data.)
By explicitly listing these two functions as inverses we should be able to infer the following two tests:2
@given(json_strings())
def test_round_trip1(js):
    assert js == encode_json(decode_json(js))

@given(json_dicts())
def test_round_trip2(jd):
    assert jd == decode_json(encode_json(jd))
We might want to do this only one way: sqrt(x)**2 == x should always be true, but sqrt(x**2) == x might not be. Presumably we could have round-trip transformers, like sqrt(x**2) == abs(x), but that's at the level of polish/UX and not the core idea.
Preconditions and postconditions
Historically Design By Contract (tee em) has focused on two kinds of invariants: preconditions, things that must be true for the function call to be valid, and postconditions, things that should be true after the function is called.
# `out` is the returned value
def sqrt(x):
    requires: x >= 0
    ...
    ensures: out >= 0
Usually this is done with assertions. The problem is that you need quantifiers to write good contracts, which most languages don't provide.3 Preconditions and postconditions also rarely "fit" in a single expression. So it might make more sense to extract them out to separate functions:
def sqrt(x):
    ...

@precondition_of(sqrt)
def is_positive(x) -> bool:
    return x >= 0

@postcondition_of(sqrt)
def is_sqrt(x, out) -> bool:
    return out*out == x
(Yes, I'm being sloppy with how we get out in is_sqrt. Oh well ¯\_(ツ)_/¯ )
We can then infer several tests:
- All call stacks involving sqrt should satisfy its preconditions. Whenever we are running tests and we see a call to sqrt anywhere, we immediately insert assert is_positive(x) as a check.
- Similarly, we can insert assert is_sqrt(x, out) whenever sqrt(x) is called.
This is where making these dynamic annotations becomes really handy, as we can analyze existing tests as they run to create new ones.
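Here is one way the dynamic version could be prototyped. Everything below is invented for illustration: the contracts live in a registry keyed by function, and a `checked` wrapper inserts the assertions around every call, which is roughly what a test runner would do automatically.

```python
import math

# Hypothetical contract registry: maps a function to its pre/postconditions.
_CONTRACTS = {}

def precondition_of(fn):
    def register(pred):
        _CONTRACTS.setdefault(fn, {"pre": [], "post": []})["pre"].append(pred)
        return pred
    return register

def postcondition_of(fn):
    def register(pred):
        _CONTRACTS.setdefault(fn, {"pre": [], "post": []})["post"].append(pred)
        return pred
    return register

def sqrt(x):
    return math.sqrt(x)

@precondition_of(sqrt)
def is_positive(x):
    return x >= 0

@postcondition_of(sqrt)
def is_sqrt(x, out):
    # Tolerance needed because we're in floating point
    return abs(out * out - x) < 1e-9

def checked(fn):
    # Insert `assert pre(...)` / `assert post(..., out)` around each call,
    # the way a test runner could do for every call it observes.
    contracts = _CONTRACTS.get(fn, {"pre": [], "post": []})
    def wrapper(*args):
        for pre in contracts["pre"]:
            assert pre(*args), f"precondition {pre.__name__} failed on {args}"
        out = fn(*args)
        for post in contracts["post"]:
            assert post(*args, out), f"postcondition {post.__name__} failed"
        return out
    return wrapper

checked_sqrt = checked(sqrt)
```

During a test run the tool would substitute `checked_sqrt` for `sqrt` everywhere, so every existing test doubles as a contract check.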
Oracles
A common PBT technique is to use an oracle: a function that has identical behavior to what you're writing but for some reason cannot be used in production.4 Then our property becomes "our function is identical to the oracle." This is inferable:
def fast_solver():
    ...

@oracle(fast_solver)
def brute_force_solver():
    ...

# This is what we infer
@given(some_input_strategy())
def test_oracle_matches(x):
    assert fast_solver(x) == brute_force_solver(x)
Note that our function and the oracle don't necessarily take the same input, so we might need to finagle between input formats. If the two inputs are identical, we can replace our function with the oracle during testing call stacks and see if it gives identical output.
Instrumented Functions
Alternatively, instead of testing the equivalence of a function and its oracle,
we can show the equivalence of a function and its
"instrumented form", or i_f
. The i_f
carries additional metrics and debugging info as part of its implementation:
def binary_search(l, x) -> Optional[int]:
    # find code
    return index
    # couldn't find it
    return None

@instruments(binary_search)
def i_b_search(l, x):
    number_of_loops = 0
    # ^^ incremented by one each step
    ...
    return index, number_of_loops
This is information that is useless to us in the behavioral/production use of the function, but is very useful in testing or debugging. With the extra information, we can write more useful preconditions and postconditions.
# These two are on binary_search
@precondition(binary_search)
def is_sorted(l, x) -> bool:
    return l == sorted(l)

@postcondition(binary_search)
def is_index(l, x, out) -> bool:
    if out is None:
        return x not in l
    else:
        return l[out] == x

# This is on just i_b_search
@postcondition(i_b_search)
def ologn(l, x, out) -> bool:
    index, number_of_loops = out
    return 2**number_of_loops <= len(l)
This seems like it would interact in an interesting way with the rest of the test inference chain. First, we test the preconditions and postconditions on binary_search. Then we verify i_b_search is an oracle for binary_search. Finally, we test all the contracts on i_b_search. Presumably we can also apply all of the contracts of the original to the instrumented form, but that should be redundant with the oracle tests.
(I am being a little handwavey here. We need some way to expose the instrumentation, which means changing the output to expose it. So there's no way for a function to have identical output to its instrumented form. Ideally we'd want some kind of additional output channel besides return to extract data without breaking call stack substitutability. I'm waving this away as a (fairly important) implementation detail.)
TODO
Let's end with a sillier example. I have a function that has a temporary fix for a temporary problem, like working around a third-party vendor bug that we know will be patched in a couple months. I want to ensure that I don't forget to remove the fix after the problem goes away. So I give it the following annotation:
@todo("Remove this", date(2020, 12, 12))
def f():
    ...
    some_horrible_kludge
    ...
From this, we should be able to infer the following test:
# AUTOGEN "Remove this" file foo.py:linno
def test_remove_the_kludge():
    assert date.today() <= date(2020, 12, 12)
Now our build fails after the deadline. This forces us to either remove the kludge or extend the deadline. Either way, we're guaranteed to at least look at it.
Limitations
The obvious limitation with a lot of this is purity. If functions have side effects we cannot guarantee that f(x) == f(x), so call stack substitution and instrumentation break down. I see this being easier to prototype in a pure FP language, though I also worry that the sheer amount of weird runtime code analysis could be tricky to statically type. Maybe Clojure or Racket would be good starting candidates?
Another limitation is exponential test blowup. How do we know what tests are useful and what are a waste of time? I'm imagining this as the inferrer actually spitting out test templates a human can customize, but that raises issues of keeping them in sync. Implementation details.
Miscellaneous thoughts
I haven't really thought about how we test class invariants, but I imagine we can infer things about those too.
One thing I like about this is that it decouples a function from its contracts. Having contracts inline is ergonomically very nice, but most languages can't inline complex ones (single expressions, etc). The usual approach is to avoid complexity and write overly simple contracts, but we can do more complex things by abandoning the inlining. Additionally, we can write more generic contracts and apply them to multiple different functions. I haven't thought about what happens when we want to take is_positive and apply it more generally.
1. Why a sketched idea and not a prototype? I'm drowning in projects already and figure at least this way I can share the idea. ↩
2. I'm using a pseudo-hypothesis syntax for the input generators. ↩
3. "Is every element of the list L unique?"
   - Without quantifiers: all([(L[i] != L[j] if i != j else True) for i in range(len(L)) for j in range(len(L))]).
   - With quantifiers: ∀ i, j ∈ 1..Len(L): i < j ⇒ L[i] != L[j] ↩
4. Some possible reasons: it could be slower, it could be unsafe, it's only correct for a limited domain, it takes a different format. ↩
My new book, Logic for Programmers, is now in early access! Get it here.