Notes on some PostgreSQL implementation details
Ranger update
Seriously friends he has gotten so large. He knows his name and comes when called, like, 80% of the time. He turned four months old today, just in time to celebrate National Puppy Day!
A Mysterious PostgreSQL Performance Bug
This week is a bit different (insofar as we have a “normal” over here). I spent a week or so digging into a PostgreSQL performance issue last week, and I’ve decided it was worth writing up the things I learned while digging into the cause, in case they are ever helpful for anyone in the future.
I find these kinds of writeups interesting because I love learning how software systems of all sorts work, and comparing and contrasting them with other implementations and exploring the implications and interactions between various different design choices.
However, if you don’t care much about Postgres implementation details, you can jump ahead to the “Takeaways” section, where I have some musings about my broader lessons and conclusions from this incident.
The symptoms
Under load, we saw our query latency to our Postgres database spiking precipitously, and our throughput plummeting virtually to zero; the database would almost completely lock up and make essentially no progress at all for a while. During these lockups, we would dig into pg_stat_activity and other tools, and see a large number of queries blocked on a number of apparently-related wait_event/wait_event_type pairs:
- LWLock:MultiXactMemberControlLock
- LWLock:MultiXactOffsetControlLock
- LWLock:multixact_member
- LWLock:multixact_offset
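If you want to pull this kind of breakdown yourself, a query along these lines against pg_stat_activity (which is where we were seeing these wait events) should do it. This is just a sketch; the wait_event and wait_event_type columns only exist on Postgres 9.6 and later:

```sql
-- Sketch: count active backends by wait event (Postgres 9.6+).
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
```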
Some googling quickly revealed that LWLocks are internal mutexes used to protect shared data structures inside the Postgres implementation, but we had much less luck understanding what those specific locks meant. The documentation contained only the most spare descriptions of these locks, which did not really help us understand why we were so contended on them.
We eventually came to understand what we were seeing, by way of digging into a lot of Postgres implementation details, which I will now review for your benefit.
Some implementation details
Postgres’ internals documentation in the source code is actually generally quite good. Much of the following is adapted from the documentation on tuple-locking; head there if you want to read along.
Row-level locks
Postgres’ MVCC concurrency model allows for high concurrency for many operations without the need for locking records. However, Postgres also implements row-level locking, which is used to support explicit row-level locking via SELECT FOR UPDATE, and also internally to serialize write operations.
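For concreteness, the explicit form looks something like the following; the table and column names here are invented for illustration:

```sql
BEGIN;
-- Take an exclusive row-level lock; other writers (and other FOR UPDATE /
-- FOR SHARE lockers) on this row will block until we commit or roll back.
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;
```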
As a design goal, Postgres needs to support locking arbitrarily-many rows at a time. This implies potentially locking more rows than fit into memory (e.g. SELECT * from [some_massive_table] WHERE [some condition matching half the records] should succeed, and should also not degrade to a full-table lock). As a result, row locks cannot be stored exclusively in memory, but must be (potentially) persisted to disk. Thus, Postgres actually stores row-level locks in the header of the on-disk tuple that stores a table row. The “xmax” field is repurposed to store the transaction ID of the locking transaction, and a bit in the “infomask” flag word denotes that this transaction ID is a lock-holder.
One immediate consequence of this design is that SELECT FOR UPDATE actually dirties on-disk data; for this reason, the performance of SELECT FOR UPDATE — even uncontended, and even without later performing an UPDATE — is much closer to that of an UPDATE on the locked row than it is to a bare SELECT! If you’re locking a row because you intend to update it anyways this typically doesn’t matter all that much, but it’s good to know if you were considering using SELECT FOR UPDATE for some other purpose.
MultiXact IDs
In addition to SELECT FOR UPDATE, Postgres supports SELECT FOR SHARE, which implements reader/writer locks: any number of transactions can SELECT FOR SHARE the same row concurrently, but no row can be selected FOR SHARE and FOR UPDATE concurrently.

This poses a challenge for the lock implementation: with SELECT FOR SHARE, an arbitrary number of transactions may have the same tuple locked simultaneously. How do we represent this? We’ve already seen that row locks are stored in the tuple header, but the tuple header is a fixed size, and is by design quite compact.
Postgres solves this by introducing a concept called MultiXact IDs. If multiple transactions lock the same row, the transaction ID in the “xmax” field is replaced with a separate type of ID, known as a MultiXact ID. A MultiXact ID represents a set of locking transaction IDs (and some flags for each); the locking engine maintains, in a separate store, a lookup table mapping each MultiXact ID to the corresponding set of transaction IDs.
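As a concrete illustration (using a hypothetical test table, and assuming I have the mechanics right), two sessions holding FOR SHARE locks on the same row at the same time should be enough to force Postgres to mint a MultiXact ID:

```sql
-- Session A
BEGIN;
SELECT * FROM test WHERE i = 1 FOR SHARE;  -- xmax holds A's transaction ID

-- Session B, concurrently, before A commits
BEGIN;
SELECT * FROM test WHERE i = 1 FOR SHARE;  -- xmax can no longer hold a single
                                           -- transaction ID, so it is replaced
                                           -- with a MultiXact ID covering A and B
```

You can confirm what happened to the tuple header using the pageinspect technique described later in this post.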
The multixact store
The MultiXact mapping is stored in a dedicated directory (pg_multixact) inside Postgres’ data directory, using a pair of “SLRU” (“Simple LRU”) stores. One store (the “members” store) is a tightly-packed flat array of transaction IDs and flag bytes, and the other store (the “offset” store) maps MultiXact IDs to ranges of transaction IDs in the members store.
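If you want to peek inside this store, I believe modern Postgres versions ship a pg_get_multixact_members() function that resolves a MultiXact ID back into its member transactions and their lock modes; the ID here is just a placeholder:

```sql
-- Resolve a MultiXact ID (placeholder value) into its member transaction IDs
-- and their lock modes.
SELECT * FROM pg_get_multixact_members('12345');
```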
One fun fact about the MultiXact store is that entries are immutable; if a new transaction locks a row that is already locked, a new MultiXact ID must be created, containing all the old transaction IDs as well as the new ID. This means that having N processes locking a row concurrently requires potentially quadratic work, since the list of transaction IDs must be copied in full, each time! This behavior is usually fine, since it’s rare to have many transactions locking the same row at once, but already suggests that SELECT FOR SHARE has some potentially-disastrous performance cliffs.
Locking
Looking at our blocking data from earlier, we were blocked on the following four LWLocks:
- LWLock:MultiXactMemberControlLock
- LWLock:MultiXactOffsetControlLock
- LWLock:multixact_member
- LWLock:multixact_offset
We can now explain what each of those is; each SLRU region protects metadata with a “Control” lock, and protects the actual page data using a set of LWLocks per page, which are the lower-cased locks we see.
The ControlLocks, in particular, must be taken for any access to the MultiXact store! So, under heavy access to the MultiXact store — which, for whatever reason, our application code was triggering — all queries essentially end up contending on a single global mutex!
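One note on visibility: if you are on Postgres 13 or newer, the pg_stat_slru view exposes hit/miss counters for these SLRU caches, which gives at least some insight into how hot this path is. The exact SLRU names vary a bit across versions, so I'm matching loosely here:

```sql
-- SLRU cache activity for the MultiXact stores (pg_stat_slru is Postgres 13+;
-- the names are spelled 'MultiXactOffset'/'MultiXactMember' on some versions
-- and lower-cased on others, hence the loose match).
SELECT name, blks_hit, blks_read, blks_written
FROM pg_stat_slru
WHERE name ILIKE '%multixact%';
```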
Vacuuming
Another phenomenon we noticed in our application was that, even when the application was not under heavy load, performance would steadily degrade over time, and we would see more and more of the above locks showing up in wait profiles.
It turns out that, when a row has been locked by a MultiXact ID, that ID is not synchronously removed; even if all locking transactions have completed, the ID stays around, and future transactions must consult the MultiXact store in order to determine that the lock has expired. Thus, as our application accrued these MultiXact locks on more and more rows, over time more and more transactions would have to refer to the MultiXact store, and would become contended on the above locks.
These IDs are periodically cleaned up by Postgres’ VACUUM process; the Postgres VACUUM documentation explains its role in cleaning up MultiXact IDs pretty well. However, the first rule of PostgreSQL tuning is that you aren’t VACUUMing enough, and indeed our database’s autovacuum process was not configured aggressively enough to fully keep up with all the MultiXact traffic that our application was apparently generating. So that explained one mystery of our performance troubles.
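If you want a rough sense of whether vacuum is keeping up on your own tables, something like the following query (using mxid_age(), which I believe exists since Postgres 9.5) shows which tables have the oldest unfrozen MultiXact IDs; you can compare those ages against autovacuum_multixact_freeze_max_age:

```sql
-- Rough proxy for MultiXact vacuum debt: how old is each table's oldest
-- unfrozen MultiXact ID? Compare against autovacuum_multixact_freeze_max_age.
SELECT relname, mxid_age(relminmxid) AS multixact_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY multixact_age DESC
LIMIT 10;

SHOW autovacuum_multixact_freeze_max_age;
```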
Subtransactions
At this point in our investigation, though, we weren’t clear why we were generating MultiXact IDs at all. Our application did make use of explicit row-level locks using SELECT FOR UPDATE, but we were definitely not using SELECT FOR SHARE. MultiXact IDs only come into play if multiple transactions actually hold a lock concurrently, but SELECT FOR UPDATE is an exclusive lock, meaning only one transaction can hold the lock at a time!
The tuple locking documentation includes this tantalizing hint, though:
In earlier PostgreSQL releases, a MultiXact always meant that the tuple was locked in shared mode by multiple transactions. This is no longer the case; a MultiXact may contain an update or delete Xid.
But it did not elaborate on precisely when that could happen.
After a bunch of experimentation, though, we eventually found the culprit. Our application had inadvertently ended up containing recursive calls to Django’s transaction.atomic, which Django implements by emitting the SQL SAVEPOINT statement, which Postgres in turn implements using subtransactions. We discovered that the following sequence suffices to make Postgres generate a MultiXact lock:
SELECT [some row] FOR UPDATE; SAVEPOINT save; UPDATE [the same row];
I am not completely clear on the details, but locking a row explicitly and updating a row both take out locks, which must be accounted for slightly differently. If both locks are held by the same transaction this can be managed by just updating the tuple flags, but when the update is performed in a subtransaction, Postgres degrades to storing the two locks separately using a MultiXact ID.
This was, in fact, the sequence our application was using. We changed the inner transaction.atomic to pass savepoint=False, which turns it into a no-op if already inside an open transaction, and our problems immediately disappeared.
Debugging multixact issues
How did we determine which sequences do or don’t produce MultiXact IDs? It turns out Postgres comes with the very useful pageinspect extension, which allows easy inspection of the actual on-disk format of a table. We can observe the above sequence creating a MultiXact ID in a test database like so:
postgres=# create extension pageinspect;
CREATE EXTENSION
postgres=# create table test(i int);
CREATE TABLE
postgres=# insert into test (i) values (1);
INSERT 0 1
postgres=# select lp, t_xmin, t_xmax, t_ctid, t_infomask, (t_infomask&4096)!=0 as is_multixact from heap_page_items(get_raw_page('test', 0));
 lp | t_xmin | t_xmax | t_ctid | t_infomask | is_multixact
----+--------+--------+--------+------------+--------------
  1 |    488 |      0 | (0,1)  |       2048 | f
(1 row)

postgres=# begin;
BEGIN
postgres=*# select * from test where i = 1 for update;
 i
---
 1
(1 row)

postgres=*# savepoint a;
SAVEPOINT
postgres=*# update test set i = i+1 where i = 1;
UPDATE 1
postgres=*# commit;
COMMIT
postgres=# select lp, t_xmin, t_xmax, t_ctid, t_infomask, (t_infomask&4096)!=0 as is_multixact from heap_page_items(get_raw_page('test', 0));
 lp | t_xmin | t_xmax | t_ctid | t_infomask | is_multixact
----+--------+--------+--------+------------+--------------
  1 |    488 |      1 | (0,2)  |       4416 | t
  2 |    490 |    489 | (0,2)  |       8336 | f
(2 rows)
heap_page_items(get_raw_page(…)) lets us inspect the actual on-disk tuples in a physical page in a Postgres table. I won’t explain all the fields here, but I will note that bit 0x1000 (4096) in t_infomask records whether a row has a MultiXact ID, which lets us select (t_infomask&4096)!=0 as is_multixact to determine which rows have been locked using MultiXact IDs. We can also see that t_xmax on the MultiXact-locked row is much smaller than the other t_xmin and t_xmax values, because it is coming from the MultiXact namespace, and not the normal XID namespace.
Takeaways
Well, that was a lot of diving deep into PostgreSQL implementation details that I hope most of you will never have to care about. I want to wrap with some high-level conclusions I take away from this deep dive.
Operating PostgreSQL at scale requires deep expertise
This experience really reinforced my pre-existing belief that Postgres is enormously complex and full of poorly-labelled operational landmines. Postgres is incredibly powerful and sophisticated, but it also contains a million and one ways to accidentally tank your performance, lock up your database for extended periods, and even leave your entire database nearly-unrecoverably messed up. And when you do encounter these failure modes, there may or may not be any warning that you are approaching or have passed the point of disaster, or any explanation of why.
It is entirely possible to operate Postgres safely and with high performance and throughput at scale … but essentially the only way to do it is to have ready access to deep PostgreSQL experience and expertise on your team; people who Just Know where the landmines are because they’ve seen them before. The only alternative is to repeatedly run into these issues, experience painful outages, and spend time and money recovering and learning the lessons yourself.
If you don’t have an expert DBA on your team, you probably need a contact at, and a contract with, an expert Postgres consultancy. PGExperts and 2ndQuadrant both come well recommended to me, although I haven’t worked with either one personally.
I continue to hate performance cliffs
A “performance cliff” is any time a small change to a workload or application can result in a dramatic performance degradation. They are often introduced as a result of optimization: If you optimize a program for some common case, you may introduce a performance cliff for future programs that stray off of that garden path.
I consistently believe we don’t take performance cliffs seriously enough. In some ways, it’s worse to have a performance cliff — with a fast sweet spot, and slow performance outside of it — than it is to be uniformly slow. If a program is uniformly slow, then its behavior is predictable and no one will depend on it being fast. If there’s a fast sweet spot, users will build architecture relying on that performance, and are now at risk of incidents if they inadvertently stray off of the fast path.
Performance cliffs are probably unavoidable in modern software; our very hardware contains steep performance cliffs like exceeding the size of the L3 cache or the TLB. However, I think we can mitigate them and reduce their risk in a few ways:
- Make performance degradation smooth where possible. Instead of sharp cutoffs at some threshold, can we design adaptive or heuristic systems that degrade somewhat gracefully? That will provide early warning as we approach a performance cliff.
- Consider failing instead of providing a slow path. If your code has to have a special-case fast path, consider adding an option to hard-fail if you exit the fast path, instead of becoming slow. That will make failures much more visible, and, importantly, more likely to show up in testing. Imagine if Postgres, by default, threw an error any time it would presently create a MultiXact ID, with a link to a page documenting how to enable that feature, and including a caveat about the performance costs. Most users would never hit this case, since most applications never need to concurrently lock a row; we would have hit that error in CI and known we were treading on dangerous ground; but users who do need SELECT FOR SHARE or subtransactions and have low throughput needs could still explicitly opt in to the present behavior.
- Provide really good visibility! Surface in any way you can when code runs out of an optimized fast-path into a slow fallback. Maybe it’s error logs or metrics or custom debugging tools, but make sure to provide your users with visibility so they can understand why their application just got 10x slower. As much as is feasible, make these tools always-on and enabled by default, so that they can surface errors live, instead of requiring custom configuration, re-builds, or otherwise too much fiddling.
Tactical conclusions
More tactically, if you do run Postgres, here are a few conclusions I take away about designing your application and working with Postgres.
Be wary of explicit row-level locks
This experience has made me very wary of SELECT FOR UPDATE, and extremely skittish about SELECT FOR SHARE. As mentioned earlier, the implementation of both locking mechanisms makes them comparably expensive to an UPDATE in latency — even when uncontended — and they both enable scary performance cliffs in the wrong circumstances.

I will never write code using SELECT FOR SHARE, and I will be very cautious about reaching for SELECT FOR UPDATE in the future, preferring to find some other means of concurrency control.
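One example of what “some other means” might look like is optimistic concurrency control: keep a version column and make your UPDATE conditional on it, rather than holding a row lock across application logic. A minimal sketch, using a hypothetical jobs table:

```sql
-- Hypothetical schema: a jobs table with a version counter.
-- Read without taking any row lock:
SELECT id, state, version FROM jobs WHERE id = 42;
-- ...do application work...

-- Apply the change only if nobody else touched the row in the meantime;
-- "UPDATE 0" means a concurrent writer won, so re-read and retry.
UPDATE jobs
SET state = 'done', version = version + 1
WHERE id = 42 AND version = 7;  -- 7 = the version we read above
```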
Avoid subtransactions
I told a version of this story to a friend with years more Postgres experience at scale than I will ever have, and he commented:
Subtransactions are basically cursed. Rip em out.
While they are technically supported, they definitely seem to have very scary performance behaviors in some cases. Just don’t. I’ll add them to the list of features I wish I could hard-disable in database configuration.
Postgres has pretty good internals documentation
I found the official Postgres docs almost useless for understanding what behavior we were running into. However, once I started digging into the source to try to understand the implementation, I was quite pleased by the amount of very readable documentation I found. Most directories have README files containing notes on the implementation, and most files have good overall comment blocks explaining their function and purpose. If you do run Postgres, don’t fear the source! Get familiar with it, and it will serve you well in understanding your application’s weird performance issues.