Public postmortem: external events backlog

Our public postmortem for Incident #0016.

Justin Duke
March 27, 2025

A misconfiguration on one of our self-hosted SMTP servers led to a crash that was difficult to recover from, causing many emails to become “stuck” in the system. The effects varied—some emails were delayed, some were sent multiple times, and some never went out at all. We have since corrected the configuration, are actively investing in improved tooling and alerting, and are building safeguards to prevent this kind of situation going forward.

What happened?

Buttondown uses various providers to deliver emails from authors to subscribers. Alongside third-party vendors, we also run our own fleet of servers for this purpose—referred to here as Postal servers, after the open-source Postal project we rely on.

On a recent Wednesday morning, we received automated alerts from our monitoring system indicating an unusual backlog of emails. Further investigation showed that one postal server was taking an excessive amount of time to send each message, eventually reaching a point where it did little but time out repeatedly. Logging into the server, we identified the culprit: a database handling pending messages was malfunctioning.

While we initially suspected overall message volume, we discovered that the real issue was excessive connection attempts to the database from too many worker threads. The database was not configured to recover cleanly from this, nor to properly alert downstream connections about the situation. Our immediate fix was straightforward: we rebooted the database, scaled down the worker count, and brought connections back to a manageable level.
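To make the failure mode concrete: database load scales with the number of worker threads unless concurrent connections are capped separately from the worker count. The sketch below is not our actual worker code; it is a minimal Python illustration, with hypothetical delivery and queue plumbing, of bounding database access with a semaphore so that adding workers cannot exhaust the database.

```python
import queue
import threading
from typing import Optional

# What the database can sustain; an assumed number for illustration only.
MAX_DB_CONNECTIONS = 20
db_slots = threading.BoundedSemaphore(MAX_DB_CONNECTIONS)


def deliver(message_id: str) -> None:
    """Placeholder for the real delivery path (SMTP handoff, status update)."""
    print(f"delivering {message_id}")


def worker(pending: "queue.Queue[Optional[str]]") -> None:
    while True:
        message_id = pending.get()
        if message_id is None:  # sentinel: shut down cleanly
            return
        # Cap concurrent database access independently of the worker count,
        # so scaling workers up cannot overwhelm the database.
        with db_slots:
            deliver(message_id)


if __name__ == "__main__":
    pending: "queue.Queue[Optional[str]]" = queue.Queue()
    for i in range(100):
        pending.put(f"msg-{i}")

    workers = [threading.Thread(target=worker, args=(pending,)) for _ in range(50)]
    for t in workers:
        t.start()
    for _ in workers:
        pending.put(None)  # one sentinel per worker
    for t in workers:
        t.join()
```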

This left us with a challenging recovery: about 70,000 messages were stuck in limbo. Some were marked as pending but in reality had been sent, others were erroneously marked as sent, and so on.

Essentially, we entered a state where we couldn’t trust our sources of truth. Our standard operating procedure in such cases is to act conservatively. This meant isolating the affected server, shutting down its workers, leaving its messages as pending, and shifting traffic elsewhere—ensuring we didn’t worsen the situation or make decisions based on unreliable information.

This is what we did: traffic was shifted, the backlog queue was drained, and we resent only those emails we were certain had not gone out. Once the problematic server was cleared, we returned it to service.
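As a rough illustration of that decision logic (with hypothetical field names, not our actual schema): a message is only resent when nothing independent of the broken server suggests it was delivered, and anything ambiguous is held for review rather than guessed at.

```python
from enum import Enum


class Disposition(Enum):
    RESEND = "resend"              # safe to send again
    SKIP = "skip"                  # already delivered; resending would duplicate it
    NEEDS_REVIEW = "needs_review"  # conflicting records; a human decides


def classify(queue_status: str, independently_confirmed: bool) -> Disposition:
    """Decide what to do with one stuck message.

    `queue_status` is the state recorded on the affected server;
    `independently_confirmed` is whether a source outside that server
    (provider logs, delivery webhooks) shows the message actually went out.
    Both are hypothetical names for this sketch.
    """
    if independently_confirmed:
        return Disposition.SKIP
    if queue_status == "pending":
        # No evidence it was ever handed off, so resending is safe.
        return Disposition.RESEND
    # Marked "sent" locally, but nothing independent confirms it.
    return Disposition.NEEDS_REVIEW
```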

How are we fixing it?

You might be wondering what we’re doing to improve things. The first step—already in progress as of this post—is to implement much more rigorous monitoring and alerting. Previously, we relied on broad integration-level metrics, which suffice for well-defined, obvious problems, but not for more nuanced or structural issues.

To be specific: we already had per-server alerts for pending or stuck messages, but these relied on an active database connection—which we didn’t have during this incident.
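One way to remove that dependency, sketched below under assumed names (HEARTBEATS standing in for an external store, page_oncall for whatever alerting hook is in place), is a dead-man's-switch style check: each server reports a heartbeat after a successful send cycle, and an independent checker alerts when a heartbeat goes stale. The alert then fires precisely when a server can no longer do its job, whether or not its database is reachable.

```python
import time
from typing import Dict, Optional

# Stand-in for an external store (e.g. a key-value service) that does not
# live on the servers being monitored. Assumed for illustration only.
HEARTBEATS: Dict[str, float] = {}
STALE_AFTER_SECONDS = 300


def record_heartbeat(server: str) -> None:
    """Called by a delivery server after each successful send cycle."""
    HEARTBEATS[server] = time.time()


def page_oncall(message: str) -> None:
    """Placeholder for the real alerting hook."""
    print(f"ALERT: {message}")


def check_heartbeats(now: Optional[float] = None) -> None:
    """Runs somewhere other than the servers it watches."""
    now = now if now is not None else time.time()
    for server, last_seen in HEARTBEATS.items():
        if now - last_seen > STALE_AFTER_SECONDS:
            page_oncall(
                f"{server} has not completed a send cycle in "
                f"{int(now - last_seen)} seconds"
            )
```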

The broader effort is to give authors better visibility into delivery patterns. One of the worst experiences as an author is seeing an email marked as sent but never receiving it. We intend to be more transparent about email states, so you can look at your dashboard and see whether delays or problems are occurring.

Customer impact

During this incident, approximately 13,000 subscribers across 40 authors were affected. They experienced:

  • Delays of multiple hours before receiving messages
  • Not receiving emails at all (although we have since retried these deliveries)
  • Receiving duplicate emails

Looking ahead

Frankly, we’ve experienced too many incidents recently.

We’ve spent the last six months fixing bugs and improving stability at a granular level, but haven’t invested enough in end-to-end, infrastructure-wide reliability. The last few weeks have underscored the need for that investment. Our primary responsibility is to reliably deliver your writing, and we’re now dedicating significant resources over the next six months to improving our observability, diagnostics, and resolution capabilities. If you’ve read this far, it’s probably out of frustration, not just curiosity—we know you’ve entrusted us with your work, and when we fall short, we take it seriously. We’re addressing this with urgency and commitment.

Buttondown is the last email platform you’ll switch to.