Git Workflow is Snapshot-Isolated
We are apparently continuing our series of posts born of thinking about Amazon DSQL. Today we are talking about skew.
Abby and Bill are two developers working on a shared codebase via GitHub. They use pull requests, and have a CI server that verifies that their code builds and all of their tests pass before they merge any code.
This is, in principle, intended to keep the codebase valid at all times. Anyone who pulls it down and checks out any commit should be able to run it and have all the tests pass.
Unfortunately, a CI check is insufficient for this goal: since Git will merge a commit as long as there are no merge conflicts, the following sequence of events is possible:
The initial state of the codebase is this:
struct Foo {
    a: i64,
    b: i64,
}

fn bar() {
    ...
}
Abby adds a new field to Foo:
struct Foo {
    a: i64,
    b: i64,
    c: bool,
}

fn bar() {
    ...
}
and also updates all the locations where Foo is constructed. This commit passes CI.
At the same time, without Abby's commit, Bill constructs a new instance of Foo in bar:
struct Foo {
    a: i64,
    b: i64,
}

fn bar() {
    ...
    let foo = Foo {
        a: 1,
        b: 2,
    };
    ...
}
This commit also passes CI, since it was made against the version of the codebase that didn't have the c field.
Now, once both Abby and Bill click the merge button, the code looks like this:
struct Foo {
    a: i64,
    b: i64,
    c: bool,
}

fn bar() {
    ...
    let foo = Foo {
        a: 1,
        b: 2,
    };
    ...
}
which doesn't pass CI: the construction of Foo in bar doesn't supply the new field c, so the code no longer compiles.
I've heard this phenomenon called merge skew.
On teams with enough code movement for it to be a problem, it's sometimes fixed by having a bot that asynchronously serializes all pull requests to make sure they all pass CI in aggregate.
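As a rough sketch of what such a bot (often called a merge queue) might do, here is a minimal, hypothetical version in Python; rebase, run_ci, and merge are stand-in callbacks for illustration, not any real CI system's API:
from collections import deque

def run_merge_queue(queue: deque, main_tip, rebase, run_ci, merge):
    # Land pull requests one at a time, so CI always runs against the state
    # the codebase will actually be in after each merge.
    while queue:
        pr = queue.popleft()
        candidate = rebase(pr, main_tip)  # replay the PR on top of current main
        if run_ci(candidate):             # CI sees the aggregate result
            main_tip = merge(candidate)   # only now does main advance
        # otherwise the PR goes back to its author instead of breaking main
    return main_tip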
Now let's talk about something completely different.
Snapshot isolation is a transactional isolation level common in SQL databases. A "transactional isolation level" is a specification that describes to what extent transactions can observe the effects of other transactions that are running concurrently.
Speaking precisely, a history obeys snapshot isolation if:
* every transaction can be assigned a read timestamp t_r and a write timestamp t_w such that t_r and t_w are globally unique, and a transaction can see exactly the writes of transactions whose write timestamps precede its read timestamp, and
* no two committed transactions which are concurrent (meaning their [t_r, t_w] intervals intersect) have overlapping write sets.
Additionally, if t_r = t_w for all transactions, the history is said to be serializable.
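To make the second condition concrete, here's a tiny sketch of the check it describes (the helper names are made up for illustration): two committed transactions are only in conflict if their [t_r, t_w] intervals intersect and their write sets overlap.
def intervals_intersect(i1, i2):
    (r1, w1), (r2, w2) = i1, i2
    return r1 <= w2 and r2 <= w1

def si_permits(tx1, tx2):
    # Each tx is ((t_r, t_w), write_set). Non-concurrent transactions are
    # always fine; concurrent ones must have disjoint write sets.
    if not intervals_intersect(tx1[0], tx2[0]):
        return True
    return not (tx1[1] & tx2[1])

# The write-skew example later in the post: the intervals overlap, but one
# transaction writes only "y" and the other writes only "x", so SI permits both.
print(si_permits(((1, 3), {"y"}), ((2, 4), {"x"})))  # True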
That's a very declarative definition and it's sometimes easier to understand this sort of thing via an algorithm that implements it.
To implement snapshot isolation:
* Every transaction is given a timestamp when it begins. All of its reads occur as of this timestamp (usually using MVCC).
* When a transaction tries to commit, it checks all of the keys it wrote to. If anyone else wrote to them since it started (i.e., any of them have newer values than this transaction would have read), the transaction must abort instead of committing.
(This is actually a fine way to define snapshot isolation too, despite being operational rather than declarative: any algorithm that admits a subset of the histories allowed by this algorithm is snapshot isolated.)
There are some possible optimizations on top of this to allow more transactions to commit, but this is the gist of it: you see a consistent snapshot, and when you go to write, if anyone else wrote to the same keys as you, you have to start over from the beginning.
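Here's a minimal sketch of that algorithm, assuming a single-node, in-memory store; the names (SIStore, Txn, ConflictError) are invented for illustration rather than taken from any real database:
class ConflictError(Exception):
    pass

class SIStore:
    def __init__(self):
        self.clock = 0
        self.versions = {}  # key -> list of (write_ts, value), ascending by write_ts

    def begin(self):
        self.clock += 1
        return Txn(self, read_ts=self.clock)

    def latest_ts(self, key):
        history = self.versions.get(key, [])
        return history[-1][0] if history else 0

    def read_as_of(self, key, ts):
        # MVCC read: the newest version written at or before ts
        value = None
        for write_ts, v in self.versions.get(key, []):
            if write_ts <= ts:
                value = v
        return value

class Txn:
    def __init__(self, store, read_ts):
        self.store, self.read_ts, self.writes = store, read_ts, {}

    def read(self, key):
        # see the snapshot as of read_ts, plus this transaction's own writes
        return self.writes.get(key, self.store.read_as_of(key, self.read_ts))

    def write(self, key, value):
        self.writes[key] = value  # buffered until commit

    def commit(self):
        # first-committer-wins: abort if any key we wrote has a newer version
        # than this transaction could have read
        for key in self.writes:
            if self.store.latest_ts(key) > self.read_ts:
                raise ConflictError(key)
        self.store.clock += 1
        write_ts = self.store.clock
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((write_ts, value))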
It's not super easy to see why this is not serializable ("serializable" being "perfect" isolation: you can't tell a concurrent execution apart from one where the transactions ran one at a time).
Consider a simple data model with two registers, x="A" and y="B". Transaction T1 wants to copy the value of x to y, and transaction T2 wants to copy the value of y to x:
def t1():
    x = read("x")
    write("y", x)

def t2():
    y = read("y")
    write("x", y)
If these two transactions are executed serially, there are two possible histories. T1, T2:
T1 reads x="A"
T1 writes y="A"
T2 reads y="A"
T2 writes x="A"
result: x=y="A"
or T2, T1:
T2 reads y="B"
T2 writes x="B"
T1 reads x="B"
T1 writes y="B"
result: x=y="B"
In either history, both x and y end up with the same value.
Under snapshot isolation, a third outcome is possible, which is that the two values swap:
T1 reads x="A"
T2 reads y="B"
T1 writes y="A"
T2 writes x="B"
result: x="B", y="A"
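You can reproduce the swap with the hypothetical SIStore sketch from earlier: both transactions take their snapshots before either one commits, and their write sets ({"y"} versus {"x"}) are disjoint, so both commits succeed.
store = SIStore()
setup = store.begin()
setup.write("x", "A")
setup.write("y", "B")
setup.commit()

t1 = store.begin()           # T1's snapshot: x="A", y="B"
t2 = store.begin()           # T2's snapshot: x="A", y="B"
t1.write("y", t1.read("x"))  # T1 copies x -> y, seeing x="A"
t2.write("x", t2.read("y"))  # T2 copies y -> x, seeing y="B"
t1.commit()                  # nobody else wrote "y", so this commits
t2.commit()                  # nobody else wrote "x", so this commits too

check = store.begin()
print(check.read("x"), check.read("y"))  # B A -- the values have swapped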
This kind of deviation is called write skew.
Seeing merge skew and write skew laid out next to each other, it's clear that they're exactly the same concept: in Git, a programmer gets a consistent view of the codebase that doesn't get updated while they're working (because they work off of a particular commit). If they modify the same lines as another person, they'll get a merge conflict and have to rework their changes, but if the two changesets don't have any physical conflicts (i.e., their write sets are disjoint), then they can both be merged. This can lead to skew, when one person's changes were dependent on reading something that was changed concurrently by another person.
It took me a long time to gain an intuition for how snapshot isolation works, but it's actually very natural, so natural that it's part of the workflow that most programmers use every day.
Links
- This paper is the best source I've found for feeling out snapshot isolation.