Public postmortem: app downtime

TL;DR

On Thursday, from 20:31 to 20:50 UTC, Buttondown's backend was down, taking the app, API, sending, and automations offline.

We pushed an update with a database migration that removed a schema constraint to improve performance in certain scenarios. Counterintuitively, this is an extremely heavy operation for Postgres that locks the entire database indefinitely if it's done on a big table. All queries were blocked behind this operation, which filled up all of the database's connection slots.
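To give a rough feel for why blocked queries exhaust connection slots so quickly, here is a minimal back-of-the-envelope sketch. It assumes a simplified model in which every new query opens a connection, blocks behind the migration, and never releases its slot. The function name and all numbers are hypothetical and are not Buttondown's real configuration.

```python
def seconds_until_slots_exhausted(max_connections, in_use, blocked_queries_per_second):
    """Estimate how long until every connection slot is held by a blocked query.

    Simplified model: while the migration holds its lock, each new query
    takes a connection slot, blocks, and does not give the slot back.
    """
    if blocked_queries_per_second <= 0:
        raise ValueError("blocked_queries_per_second must be positive")
    # Slots still available when the migration starts (never negative).
    free_slots = max(max_connections - in_use, 0)
    return free_slots / blocked_queries_per_second


# Hypothetical numbers: 100 slots, 20 already busy, 4 new queries per second.
print(seconds_until_slots_exhausted(100, 20, 4))  # 20.0
```

Under these assumed numbers the database would stop accepting new connections after about 20 seconds, long before anyone could react to an alert.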

How did we detect the issue?

Our automated monitoring for our API and database notified us within minutes and paged a team member to take a look.

How did we mitigate the issue?

We identified that the migration was the issue at 20:34, and by 20:39 we realized the problem was connection slot exhaustion and started trying to disconnect database clients to improve the situation. Unfortunately, because of the connection slot exhaustion, we had a hard time connecting to the database ourselves, which delayed the fix. We ended up getting the database restarted by 20:49, at which point service was immediately restored.
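As an illustration of what disconnecting clients can look like, here is a minimal sketch, not the exact commands we ran. It relies on the standard Postgres view pg_stat_activity and the function pg_terminate_backend. The helper names, the protected application name, and the sample rows are hypothetical.

```python
# Lists other sessions on the current database. pg_stat_activity,
# pg_backend_pid() and current_database() are standard Postgres.
LIST_SESSIONS_SQL = """
SELECT pid, state, wait_event_type, application_name
FROM pg_stat_activity
WHERE datname = current_database() AND pid <> pg_backend_pid()
"""


def pids_to_terminate(sessions, keep_applications=("incident-console",)):
    """Pick sessions that are stuck waiting on a lock.

    `sessions` is a list of dicts shaped like rows of LIST_SESSIONS_SQL.
    Sessions whose application_name is in `keep_applications` are skipped
    (a hypothetical safeguard so responders do not disconnect themselves).
    """
    return [
        s["pid"]
        for s in sessions
        if s["wait_event_type"] == "Lock"
        and s["application_name"] not in keep_applications
    ]


def terminate_statements(pids):
    """Build one pg_terminate_backend call per process id."""
    return [f"SELECT pg_terminate_backend({int(pid)});" for pid in pids]


# Hypothetical rows, as they might come back from LIST_SESSIONS_SQL.
sample = [
    {"pid": 101, "state": "active", "wait_event_type": "Lock", "application_name": "web"},
    {"pid": 102, "state": "idle", "wait_event_type": "Client", "application_name": "worker"},
    {"pid": 103, "state": "active", "wait_event_type": "Lock", "application_name": "incident-console"},
]
for statement in terminate_statements(pids_to_terminate(sample)):
    print(statement)  # SELECT pg_terminate_backend(101);
```

Note that running these statements still requires a free connection slot, which is exactly what we were short of during the incident.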

How will we prevent this from happening again?

We're investigating exactly what impact this kind of migration has, and figuring out how to systematically prevent such migrations from causing downtime (for example, by timing out if they take too long).
Literally the day after this incident, PlanetScale (our database provider) added better tools to mitigate an incident like the one we had without having connection slot issues. We're documenting these and making sure we have them on hand if this happens again, to be able to investigate and recover much faster.
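The timeout idea mentioned above could look roughly like the sketch below. This is one possible approach under stated assumptions, not a description of our final tooling. It uses the standard Postgres settings lock_timeout and statement_timeout; the helper name, the timeout values, and the table and constraint names are hypothetical.

```python
def with_timeouts(migration_sql, lock_timeout="5s", statement_timeout="30s"):
    """Wrap migration statements in a transaction that gives up instead of waiting.

    SET LOCAL limits both settings to this transaction only:
    - lock_timeout aborts a statement that waits too long to acquire a lock
    - statement_timeout aborts a statement that runs too long overall
    """
    return [
        "BEGIN;",
        f"SET LOCAL lock_timeout = '{lock_timeout}';",
        f"SET LOCAL statement_timeout = '{statement_timeout}';",
        *migration_sql,
        "COMMIT;",
    ]


# Hypothetical migration: table and constraint names are made up.
statements = with_timeouts(
    ["ALTER TABLE emails DROP CONSTRAINT emails_subject_check;"]
)
for statement in statements:
    print(statement)
```

If either timeout fires, the migration fails and can be retried later, rather than leaving other queries queued behind it.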

Our public postmortem for the incident on June 11th.