Ergonomic APIs, channel invariants, and data views
Musings on a tempting antipattern
Hi everyone! It's two days after the March 20 TLA+ workshop, which means it's time to start getting feedback and revising things for the May 15th workshop. At some point I want to talk about my "workshop technology" I use, but that'll take some time to write up, so maybe next week? We'll see.
(If you're on the fence about the workshop: attendees can share specs they write after the class and I give them feedback. So not only are you learning TLA+, you get help applying it to your job!)
Anyway, here's a software engineering idea I've been playing with but haven't tried in practice yet, so I don't know how well it works. Constraint solving1 has this technique called "channeling constraints", where you represent data in multiple connected ways. Like if you're constraint solving a schedule, you have two possible representations of "which talk is in slot T and room R" (in MiniZinc):
% for a time/room combo, which talk
array[TIMES, ROOMS] of var talks: sched;
% for a talk, which time & which room
array[talks] of var TIMES: talk_time;
array[talks] of var ROOMS: talk_room;
Which do you choose? It depends on what constraints you want to express. Some, like "a slot can have at most one talk", are easier to express in one or the other, but also some are more efficient to check in one form or the other. The right constraint and format can be the difference between the solver taking 8+ hours and taking five seconds.
Instead of choosing one, choose both! Then add a "channeling constraint" that says the two representations are identical. Then you can target each constraint against the most efficient representation.
constraint forall(t in talks) (t = sched[talk_time[t], talk_room[t]]);
By having redundant data representations, we make the constraint model simpler and more efficient. Constraint solving is a very different domain from regular programming. But maybe there's a way to use this?
Ergonomic Data Layouts
All data is stored in some layout, which is a collection of structures + format + schema. A hash table is a data structure, hash table object (or a JSON) is a format, "every talk has a time and a room" is a schema. Combining them all, a hash table object mapping "talks" to times is a layout.
Your choice of layout has two consequences: it makes invariants easier to enforce and it makes operations (queries and transforms) easier. Let's say we're modeling a schedule of which talks are in which rooms at which times, here are two ways we could layout that data:
// Map of time + room to talk
sched1 = {(Time, Room): Talk}
// Map of talk to time + room
sched2 = {Talk: (Time, Room)}
There are other ways we can layout the data, like {(Room, Time): Talk}
, {Time: {Room: Talk}}
, or [(Time, Room, Talk)]
, etc. We'll focus on just the two above. Which is better? Depends! Some are better for one or the other!
First, they implicitly maintain different invariants, by making illegal states unrepresentable. In sched1
, it's impossible two talks to be scheduled in the same room/time "slot". In sched2
, it's impossible for a talk to have two separate assignments. If we want both invariants, regardless of what representation we choose, we have to encode one explicitly at the code level.
Second, they make different queries easier. In sched2
I can find when a talk is happening with a simple lookup, sched2[talk]
. But to find which talk is in a given slot, I have to iterate through every talk and check their value. sched1
is the reverse, making finding the talk for a slot easy and the slot for a talk hard. And for both, finding all the talks happening at time t
isn't easier (though it's harder for sched2
).
Finally, they make different transformations easier. Let's say I want to swap the slots of talk1
and talk2
. Easy in sched2
, hard in sched1
. Swapping the talks in two slots is the reverse. Swapping talk1
with whatever's in slot (time, room)
is hard in both.
All of these would be easier if we could just have both sched1
and sched2
and use whichever is most appropriate for the task. But then it's really, really easy for them to fall out of sync, and the problems that would cause are so major it's not worth the risk.
Could we reduce the risk? Maybe.
Channel invariants
Let's import the idea of channel constraints. Assume we could enforce the following invariant in our code:
forall talk, time, room:
sched1[time, room] = talk
if-and-only-if
sched2[talk] = (time, room)
If this invariant is enforced, like by an assertion, we can guarantee that both are always updated in sync. If they're not, the program crashes. This also gives us both invariants explicitly. If in sched2
two talks are assigned to the same slot, then it must be out of sync with sched1
, so the program crashes.
In an actual program, we could represent this by placing both schedules in a class, and then add a class invariant. Most programming languages don't have this built-in, but you can always find a library!
class Schedule {
sched1: (Time, Room) -> Talk
sched2: Talk -> (Time, Room)
invariant {
for talk in sched2.keys() {
assert(sched1[sched2[talk]] == talk)
}
for (t, r) in sched1.keys() {
assert(sched2[sched1[t, r]] == (t, r))
}
}
}
We can even make sched1
and sched2
implementation details. Schedule[talk]
maps to Schedule.sched2[talk]
, and Schedule[t, r]
is Schedule.sched1[t, r]
.
The hard part of course is updating, since you have to update both in sync or the program crashes. Let's say you update the time of a talk by doing sched2[talk] = (t, r)
. Now you have to set sched1[t, r] = talk
. You have to make sure this doesn't also trigger another "update" on sched2
, because then you'd "update" sched1
again in an infinite loop.
But even after we solve that, what if this unintentionally changes data? Like if sched1[t, r]
was already assigned to talk2
, so setting sched1[t, r] = talk
assigns that slot away from talk2
without us knowing. Isn't this just as bad as data going out of sync?
Fortunately, this can't unintentionally happen. Consider that if sched1[t, r] == talk2
, then sched2[talk2] == (t, r)
. After the assignments sched2[talk] = (t, r)
and sched1[t, r] = talk
, we check the class invariant and assert sched1[sched2[talk2]] == talk2
, which evaluates to talk1 == talk2
, which fails and the program crashes.
To update sched2
without breaking the invariant, we have to first check that (t, r)
isn't a key in sched1
, and prevent the update if it is. But then something like swapping two talks becomes more difficult to implement. Things start to look more like database transactions.
Something I see here is that two representations + channel invariant makes the implementation considerably more complicated, and in return the API becomes more streamlined for users. In other words, we're trading implementation simplicity for interface simplicity. This is the exact approach that Richard Gabriels called "the Right Thing", which he said consistently loses out to "Worse is Better". I don't know if that's an argument for or against.
wait isn't this just database views
You could argue this approach doesn't actually make queries any easier to write. Because if we're writing a bunch of logic anyway, we can just do this:
class Schedule {
sched1: (Time, Room) -> Talk
def sched2(): Talk -> Time, Room {
out = {}
for r, t in sched1.keys() {
out[sched1[r, t]] = (r, t)
}
return out
}
}
ie make sched1
the single source of truth and sched2
something computed from it. Sure we don't get the free invariant from sched2
but we've already added in explicit class invariants at the code level, just add one more.
Aaaaaaaand we just reinvented database views. Views are tables you can query on that themselves come from queries, just like we sched2
is a function that returns a value we can query. In some specific cases, views can be "bidirectional", and you can write an insertion statement on a view to update the original table.
Most databases with views also support materialized views. If you are querying a view a lot you're recomputing the view table over and over again, so if you don't expect it to change much, this is a waste of resources. Materializing views are computed once and then have to be refreshed to update them. Basically a cache. If we make sched2
a cached function, and wipe the cache every time we update sched1
, then we get that same benefit.
Now that the only difference between having a variable sched2
and a cacheable function sched2()
is you can't update the schedule by modifying the latter. If you're working with immutable data, or are fine only updating sched1
, it seems like there's no benefit to having a sched2
variable.
...Oh man I just remembered that aliasing is a thing. Consider what happens when one part of the code is using sched1
and the other updates sched2
, which means sched1
changes. We've replaced a sync bug with an aliasing bug.
This is probably a bad idea
This all seems like a lot of work for very little benefit. Which is a shame. Channel invariants seemed like a cool idea before I started poking at it, but that's how it goes sometimes. I do still like the idea of providing alternate data layouts in an API. Especially when you have to do a lot of different things with the same data: having it in the right forms can cut down on a ton of boilerplate.
-
Constraint solvers solve constraint problems. Like "what's the most area we can fence given we pay X per yard of fence and Y per acre fenced in?" ↩
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.
My new book, Logic for Programmers, is now in early access! Get it here.