Musings on computer systems

July 3, 2024

New blog post: Finding near-duplicates with Jaccard similarity and MinHash

New blog post: On Jaccard similarity and the MinHash trick

I learned about this algorithm and hashing trick while reading about LLMs and GPT-3. I thought it was really cool, and it was something I hadn't encountered before. I also had trouble finding a good self-contained writeup online, so I wrote one.

I was hoping this would be a quick post, but it ended up sucking more time and wordcount than I'd intended. Oh well. I learned a lot of new things in the process, though, including some neat connections to HyperLogLog, everyone's favorite sketch.

As is tradition, Ranger, this time waiting very patiently for mom while she orders our sandwiches.

[Image: Ranger sitting patiently]

Until next time,

- Nelson
