Yesterday (March 31st), Buttondown experienced two periods of downtime — approximately seven minutes and thirteen minutes, respectively — both stemming from the same root cause: database connection exhaustion.
To be specific: our database itself was healthy. Queries were fast, CPU and memory were fine, replication lag was nominal. The problem was simpler and, frankly, more embarrassing than that: we hit the ceiling on the number of connections our database was configured to accept. Once that ceiling was hit, new requests couldn't acquire a connection and failed.
First off, apologies for the disruption — particularly because this happened twice in one day.
How did we detect the issue?
This is where things get uncomfortable. Our health checker mostly reported things as fine. The health check endpoint returned 200s for the majority of requests because the checker's requests happened to land on workers that already held open connections. Think of it like a house party where the front door has collapsed: if you're already inside, or you know where the back door is, everything seems fine. Our external monitoring was essentially getting lucky — squeezing through just often enough to not trigger alerts.
We were alerted by user reports and our own manual observation, not by automated systems. That's not acceptable.
How did we mitigate the issue?
First incident: We identified the connection exhaustion, killed active queries to free up connections, and earmarked follow-up work for later. Downtime: ~7 minutes.
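The "kill active queries" step can be sketched roughly like this. The session-selection criteria below (active backends running longer than a cutoff) are illustrative assumptions, not our actual runbook:

```python
# Sketch of the first-incident mitigation: terminate long-running backends
# so their connection slots are released. The cutoff and filters here are
# illustrative, not Buttondown's actual incident runbook.

def build_terminate_sql(max_query_seconds: int = 30) -> str:
    """Return SQL asking Postgres to terminate long-running backends.

    pg_terminate_backend() kills the backend process, which frees its
    connection slot for new clients.
    """
    return f"""
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE state = 'active'
          AND pid <> pg_backend_pid()  -- don't kill our own session
          AND now() - query_start > interval '{max_query_seconds} seconds'
    """

# During an incident you'd run this over any connection you can still
# acquire, e.g. via psql or psycopg2: cur.execute(build_terminate_sql(30))
```

The catch, as the second incident showed, is that this only works if you can still get a connection at all.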
Second incident: Same root cause, but this time the connection count was so thoroughly saturated that we couldn't even connect to the database to kill queries. The tool we'd used to fix the first incident required the very resource that was exhausted. We had to restart the database to force-kill the ongoing queries. Downtime: ~13 minutes.
How will we prevent this from happening again?
Five things, roughly in order of "should have already existed" to "genuinely new investment":
Reserved administrative connections. This is the single highest-leverage change. Most Postgres configurations support reserving a small number of connections specifically for administrative access. If we'd had even one reserved connection for ops, we could have killed queries during the second incident the same way we did during the first. The reason the second incident lasted nearly twice as long as the first is that we were locked out of our own fix. That won't happen again.
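In stock Postgres this maps to the `superuser_reserved_connections` setting (default 3), which holds back a few slots that ordinary clients can never consume; Postgres 16 also added `reserved_connections` for designated non-superuser roles. A hypothetical `postgresql.conf` fragment, not our production values:

```ini
# postgresql.conf -- illustrative values only
max_connections = 200                 # overall connection ceiling
superuser_reserved_connections = 3    # slots usable only by superusers
```

With something like this in place, application traffic can exhaust its share of slots without locking operators out of the database.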
Database-level alerting. We're adding direct monitoring on connection count as a percentage of the configured maximum. This is distinct from our end-to-end health checking — it doesn't care whether HTTP requests are succeeding. It watches the database itself and alerts when we're approaching capacity. By the time you read this, that alerting should be live.
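A minimal sketch of that check, assuming direct query access to Postgres; the query and the 80% threshold are illustrative choices, not our actual alerting config:

```python
# Sketch of a database-level connection alert. Counts sessions in
# pg_stat_activity against the configured max_connections; the 0.8
# threshold is an assumed value, not our production setting.

CONNECTION_USAGE_SQL = """
    SELECT count(*), current_setting('max_connections')::int
    FROM pg_stat_activity
"""

def should_alert(used: int, max_connections: int, threshold: float = 0.8) -> bool:
    """Fire when connection usage crosses the threshold, regardless of
    whether HTTP requests are still succeeding."""
    return used / max_connections >= threshold
```

The point of splitting the decision into a pure function is that the alert logic watches capacity directly, so it fires even while lucky health-check requests are still returning 200s.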
Health check hardening. Our health check endpoint currently returns a static 200 — it doesn't actually verify that the process can acquire a database connection. We're changing it to attempt a lightweight query so that connection exhaustion is immediately visible to our health checker and load balancer. A health check that can't detect the most common failure mode isn't much of a health check.
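The shape of the hardened check, sketched under the assumption that the app hands out connections via some factory (the `get_connection` parameter is a stand-in, not our actual API):

```python
# Sketch of the hardened health check: instead of returning a static 200,
# actually acquire a connection and run a trivial query. `get_connection`
# is a hypothetical stand-in for the app's connection factory.

def health_check(get_connection):
    """Return (status_code, body). 200 only if we can actually query."""
    try:
        conn = get_connection()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")  # cheapest possible round trip
            cur.fetchone()
        finally:
            conn.close()
    except Exception:
        # Connection exhaustion (or any other DB failure) now surfaces to
        # the health checker and load balancer instead of hiding behind
        # a static 200.
        return 503, "database unavailable"
    return 200, "ok"
```

With this shape, a worker that can't acquire a connection fails its health check immediately rather than reporting fine because its process happens to be up.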
Connection headroom review. The configuration that was "too low" had been set a long time ago and never revisited as our traffic patterns changed. We're adding connection limits to our quarterly capacity review so this kind of slow drift doesn't catch us off guard again.
Out-of-band recovery tooling. Beyond reserved connections, we're documenting and scripting a fallback for when even administrative access is blocked: force-recycling workers to release connections without needing a database connection at all. Not as surgical as killing individual queries, but it releases connections immediately and gives us a way out when nothing else works.
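That fallback can be sketched as below. The pidfile convention and signal choice assume a gunicorn-style master/worker setup and are not a description of our actual deployment:

```python
# Sketch of the out-of-band fallback: recycle application workers without
# needing any database access. Pidfile path and signal semantics assume a
# gunicorn-style setup, which is an assumption, not our real topology.
import os
import signal

def read_master_pid(pidfile_contents: str) -> int:
    """Parse the master process id from a pidfile's contents."""
    return int(pidfile_contents.strip())

def recycle_workers(master_pid: int) -> None:
    """Ask a gunicorn-style master to gracefully restart its workers.

    Each worker that exits drops its database connections, freeing slots
    even when Postgres itself is refusing new logins.
    """
    os.kill(master_pid, signal.SIGHUP)  # SIGHUP: reload and respawn workers
```

This is the blunt instrument: it drops every worker's connections rather than the specific offenders, but it works from the application side alone, which is exactly what the second incident demanded.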
