Public postmortem: archive downtime

Our public postmortem for Incident #0023.

Justin Duke
January 2, 2026

On Friday, January 2nd, archives were down for seven minutes, from approximately 9:38pm to 9:45pm Eastern. The root cause was a commit that, among many other things, added template tags for auto-reloading in Django's debug mode. The problem wasn't the tags themselves: the third-party package that provides them was only installed in development, so production could not load them. This is a classic Django anti-pattern.
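
One way to catch this class of mistake is to confirm, in an environment that has only the production dependencies installed, that every module the code relies on can be imported. The following is a minimal sketch of that idea, not the check we actually run; the function name missing_modules and the package name dev_only_reload_tags are hypothetical stand-ins.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    missing = []
    for name in module_names:
        try:
            spec = find_spec(name)
        except ModuleNotFoundError:
            # Raised for dotted names whose parent package is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "dev_only_reload_tags" is a hypothetical name for a dev-only package.
    required = ["json", "dev_only_reload_tags"]
    print(missing_modules(required))
```

Run against a production-like install, a non-empty result would flag a package that templates or settings depend on but that only exists in development.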

How did we detect the issue?

Our 500 checker alerted us immediately, and we pinpointed the issue almost instantly.

How did we mitigate the issue?

This is where things got interesting. We attempted three different remediation strategies before landing on one that worked:

  1. Rollback attempt #1: We ran a command to roll back the deploy. This failed because the same commit included a migration to drop the welcome_message column on our newsletter table. Rolling back the code does not reverse that migration, so the previous release expected a column that no longer existed.

  2. Hotfix attempt #1: We tried committing a fix and pushing directly to Heroku, bypassing CI. This also failed because our deployment process pulls pre-built assets for each commit, and those assets are only built in GitHub Actions.

  3. Manual database fix + rollback: We manually added the welcome_message column back onto the newsletter table and rolled back to the previous commit. This stopped the bleeding while the hotfix worked its way through CI.

A few minutes later, the hotfix passed CI, deployed successfully, and the incident was resolved.
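
To illustrate why the first rollback failed and the third step worked, here is a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for the production database; the table layout, the name column, and the TEXT type are assumptions made for the example, and the SQL is not the statement we actually ran.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# State after the migration ran: welcome_message has already been dropped.
# (Table layout and the "name" column are assumptions for illustration.)
conn.execute("CREATE TABLE newsletter (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO newsletter (name) VALUES ('example')")


def old_release_query(connection):
    """What the previous release effectively does: read the dropped column."""
    return connection.execute(
        "SELECT name, welcome_message FROM newsletter"
    ).fetchall()


try:
    old_release_query(conn)
    rollback_works = True
except sqlite3.OperationalError:
    # The old code references a column the schema no longer has.
    rollback_works = False

# Manual fix: put the column back as nullable, then the old code runs again.
conn.execute("ALTER TABLE newsletter ADD COLUMN welcome_message TEXT")
rows_after_fix = old_release_query(conn)
```

The old query fails against the post-migration schema and succeeds once the column exists again, which is the same sequence as steps 1 and 3 above.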

How will we prevent this from happening again?

This incident highlighted a few things we need to improve:

  1. Staging deploys with health checks: Going forward, we'll change our deployment setup to first deploy to the demo site and run a naive health check against it. Such a health check would have caught this specific error before it reached production. Once that health check passes, we'll deploy to production and then to all other auxiliary deploy targets.

  2. Better rollback tooling: The front-end build issue — where we couldn't deploy without CI-built assets — is a true own goal. This is largely a documentation issue, since the workaround was there all along, but we need to make it more obvious and accessible in the heat of the moment.

  3. Commit hygiene: The size and complexity of this commit (bringing in an external dependency and dropping a database field) made remediation harder than it needed to be. This was a post-code-freeze commit, so not indicative of a larger pattern, but worth noting.
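
The staging gate described in the first item could look roughly like the sketch below. It is a minimal illustration under stated assumptions rather than our real deploy script: the function names fetch_status and gate_production_deploy, the list of URLs, and the deploy_production callback are all hypothetical.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; keep the status code.
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, timeouts, and similar.
        return None


def gate_production_deploy(staging_urls, deploy_production, fetch=fetch_status):
    """Call deploy_production only if every staging URL answers with 200.

    Returns the list of failing URLs (empty when the deploy went ahead).
    """
    failures = [url for url in staging_urls if fetch(url) != 200]
    if failures:
        return failures
    deploy_production()
    return []
```

Passing fetch as a parameter keeps the gate easy to test without network access. A request to an archive page on the demo site that returned a 500 would stop the production deploy from running.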
