New blog post: Finding near-duplicates with Jaccard similarity and MinHash
New blog post: On Jaccard similarity and the MinHash trick
I learned about this algorithm and hashing trick while reading about LLMs and GPT-3. I thought it was really cool, and it was something I hadn't encountered before. I also found it hard to find a good self-contained writeup online, so I wrote one.
I was hoping this would be a quick post, but it ended up sucking up more time and word count than I'd intended. Oh well. I learned a lot of new things in the process, though, including some neat connections to HyperLogLog, everyone's favorite sketch.
As is tradition, Ranger, this time waiting very patiently for mom while she orders our sandwiches.
Until next time,
- Nelson