One of the scariest parts about hitting "send" on an email is the feeling that there's no takesies-backsies. As soon as it starts hitting people's inboxes, you are left without a mechanism to fix any mistakes. A broken link in a blog post is a minor embarrassment; a broken link in an email to ten thousand subscribers is a small catastrophe.
This is why Buttondown checks every link in your email before you send it, and why the machinery behind that check is more involved than you might expect.
Our link-checking infrastructure makes the most sense when viewed through the lens of its development: as with so many things, we started naively and grew more robust over time.
- First, the simplest version: extract every URL from the email body, make an HTTP request, see if you get a 200 back.
- Then you discover some sites aggressively block automated requests (LinkedIn, Amazon, CodePen, the list goes on and on). We maintain a denylist of domains we know will lie to us, and we spoof a browser User-Agent for everything else.
- Then you realize you're sending requests for obviously malformed URLs, so you move some of the checking logic client-side to do a better job filtering out bad inputs.
- Then you notice that starting with a HEAD request, the polite and cheap way to check things, has a catch: many servers return 404s and 405s for HEADs even though a GET for the same URL returns a 200. So we added fallback logic and a mild backoff.
- Then we added aggressive timeouts for servers that chew up the clock.
- Then we realized that, for good and bad senders alike, we needed to check outbound links against a handful of external services (Google Web Risk, SURBL, Spamhaus, etc.). We don't want to block the actual client-side link checking on this, but we do want to block sending. So we introduced a standalone link model that can hold onto some state and ensure that no email goes out until we've verified that its links won't trigger a flag in some database we care about.
- Finally, we cache the hell out of it. Many of our authors send emails with dozens of links, and it's important to keep the performance loop tight even for them. So we cache client-side, with logic that varies slightly depending on the prior performance of both the author and the link itself, and server-side, which lets us dedupe link checks across newsletters.
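
To make the first few steps concrete, here is a minimal sketch in Python of the well-formedness filter, the denylist, the spoofed User-Agent, the HEAD-to-GET fallback, and the timeout. It is not Buttondown's actual code: every name here (`check_link`, `fetch_status`, `DENYLISTED_DOMAINS`), the five-second timeout, and the User-Agent string are assumptions made for illustration, and the backoff, reputation checks, and caching are left out.

```python
import urllib.error
import urllib.request
from typing import Callable
from urllib.parse import urlsplit

# Hypothetical values, chosen for illustration only.
DENYLISTED_DOMAINS = {"linkedin.com", "amazon.com", "codepen.io"}
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"
TIMEOUT_SECONDS = 5.0


def is_well_formed(url: str) -> bool:
    """Reject anything that is not an absolute http(s) URL with a host."""
    try:
        parts = urlsplit(url)
        return parts.scheme in ("http", "https") and bool(parts.hostname)
    except ValueError:  # e.g. an unterminated IPv6 literal
        return False


def is_denylisted(url: str) -> bool:
    """True if the host is a denylisted domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in DENYLISTED_DOMAINS)


def fetch_status(method: str, url: str) -> int:
    """Make one request with a browser-like User-Agent and return the status code."""
    request = urllib.request.Request(
        url, method=method, headers={"User-Agent": BROWSER_USER_AGENT}
    )
    try:
        with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; we only want the code.
        return error.code


def check_link(url: str, fetch: Callable[[str, str], int] = fetch_status) -> str:
    """Classify a link as 'malformed', 'skipped', 'unreachable', 'ok', or 'broken'."""
    if not is_well_formed(url):
        return "malformed"
    if is_denylisted(url):
        return "skipped"  # these sites block bots, so their answer means nothing
    try:
        status = fetch("HEAD", url)
        if status in (404, 405):
            # Some servers mishandle HEAD; retry with GET before giving up.
            status = fetch("GET", url)
    except OSError:  # covers URLError, DNS failures, and timeouts
        return "unreachable"
    return "ok" if 200 <= status < 400 else "broken"
```

Passing the `fetch` function in as a parameter is a choice made for this sketch: it lets the classification logic be exercised without touching the network, for example by handing in a stub that returns 405 for HEAD and 200 for GET.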
All in all, it looks something like this:
