Ordering types in SQL
A common problem that comes up in apps that use relational databases is the need to maintain some kind of user-defined order over rows. Imagine you have a drag-and-drop list of items that the user is free to reorder, and you need to persist the order that they use in the database. The relational model, it turns out, is not really so good at this, as a consequence of it not having a particular coherent notion of ordering in the first place, ignoring that which it inherits from the ordering over values.
One article I like on this topic is User-defined order in SQL, because it arrives at something of a platonic ideal for what a type that represents this kind of structure must look like, in the Stern-Brocot tree. Practically speaking, the operations we really need to represent this kind of user-defined order are:
The extreme values -inf and +inf,
some operation to take the midpoint of two values x
and y
, mid(x, y)
,
satisfying a handful of natural properties like:
values are totally ordered,
-inf <= x <= inf
for all x
,
* mid(x, y) = mid(y, x)
, and
* for x <= y
, x <= mid(x, y) <= y
.
But you don't actually need like, to know the concrete underlying type to make this sort of thing work, you can just have the abstract set of operations on it that obey these properties.
Most of the ways you might end up representing this kind of suck for one reason or another though. It's generally pretty easy to come up with pathological reorderings that cause problems, for example, if you use arbitrary precision arithmetic and have mid(x,y) = (x+y)/2
, you can force the representations of the numbers to be large, or if you use fixed-size floats or integers, you can run out of bits without performing too many swaps.
I think in practice, the best thing to do is to just live with the pain of O(n)
updates, bite the bullet, and number your rows [0, n)
.
But I think this approach is only really the best way to do things if we accept that we are limited to the types and paradigms that SQL provides natively. If we're willing to go a level deeper, and provide a feature like this natively at the SQL level, we can do a little better, though not without a couple pain points.
I haven't really worked out all the kinks in this design, but after some discussion with Alex Miller, I don't think it breaks anything foundational in SQL.
Imagine a new type, ORDERING
, which has the properties we described above: the only way to construct an ORDERING
is like we said, there is a constant -infinity
and a constant +infinity
, and you can construct new ORDERING
s by taking the midpoint of two existing orderings, which is guaranteed to order between them.
You can't print out an ORDERING
—you can't even project it anywhere it would reach the user's screen. The only thing you're allowed to do with it is ORDER BY
it and read it, probably passing it to ordering_mid(...)
to construct a fresh one.
Now, this breaks a handful of possibly foundational assumptions in SQL, namely things like, every datum has a stringified form. This can still sort of be true, maybe, but the datum won't necessarily be canonical.
You also have to specify which columns precede an ORDERING
in an index, and it can only appear in a single index (perhaps, then, it would be better as some kind of annotation on an index). Something like this:
```
CREATE TABLE items (
name TEXT,
preference ORDERING(name),
INDEX ordered_idx (name, preference)
)
```
What this all buys you is that you get around the problem with all of the solutions in pure SQL: because the database is not committed to the actual values that the ORDERING
type is storing—only their relative orders—which means it's free to rewrite those values silently.
During compaction, or maybe amortized on an update, the database can "re-smooth" out the ordering values. If it stores them as spread out integers, say:
```
10
20
30
40
50
```
and you've done a bunch of inserts that gum up the innards, and wind up with
```
10
11
12
13
14
15
16
17
18
19
20
...
```
The database, by lack of commitment to an actual underlying value, is free to rewrite that whole thing at its leisure to actually be
```
100
110
120
130
140
150
160
170
180
190
200
...
```
and make updates fast again, going forward.
I'm sure there are problems with this approach, I haven't seen any databases that implement this sort of thing but it seems like a natural idea so I can't believe that nobody has ever added it as a feature. I think it would be really cool and interesting to design fully, though!
Is any of this worth it? In practice, is it totally fine to just do the O(n)
update thing? Probably! But it's a fun thought experiment.