Rails Against the Machine
There are many legitimate reasons not to like the NCAA, which governs most college athletics in the United States. Maybe you don’t like that your favorite athlete just transferred to a rival. Or that your conference now includes schools that are 3,000 miles away. Good reasons, both of them.
For my money, though, a strong contender is the alleged “website” for NCAA Stats. This is strange for two reasons: first, I do love college sports stats, and second, this app is built using Ruby on Rails, one of my favorite web frameworks.
When I started at Maryland, I knew I wanted to teach a sports data class. So I did the honorable thing and basically took my friend Matt Waite’s textbook and, well, de-Nebraskified it, for obvious reasons.
The trouble is that I needed both Maryland stats and those from other NCAA schools. Luckily, Matt had me covered there, too, with code that scraped women’s volleyball and women’s soccer stats, along with game and player data.
Eventually I adapted those scrapers, written in R, so that I had them working for lacrosse and field hockey in addition to the original sports Matt had done.
And then the NCAA, as it so often does, made my life difficult. Those R scrapers? Now blocked. Not just mine, but basically any that use rvest to do the job. Apparently the NCAA is filled with, idk, Stata people?
But I’ve fought back in this cat-and-mouse game, and I’m pleased to announce that I’ve got another set of scrapers, this time in Python. So here are the updated repositories, with both code and data:
Women’s volleyball (2018-forward)
Lacrosse (men’s and women’s)
Women’s soccer (2018-forward)
Field Hockey (2020-forward)
With more to come. One of the good things about the NCAA’s data system is that it has unique IDs for a season within a sport. To keep track of those, last year I forked and updated a GitHub repository that did a bunch of scraping, mostly on football and basketball. One of its key features is a file that contains the season IDs for various sports.
That repo now helps generate the URLs I need for scraping the NCAA stats. At some point I’ll consolidate all of this, but in the meantime you can get CSV files from each of the sport-specific repositories.
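For the curious, here is a rough sketch of the idea of turning a season ID lookup into scraping URLs. It is not the code in that repo: the CSV columns, the ID values and the URL pattern below are all placeholders I made up for illustration, so swap in the real file and the real route before using it.

```python
import csv
import io

# Hypothetical stand-in for the season ID file. The real file's column
# names and ID values may differ; these are invented for illustration.
SEASON_IDS_CSV = """sport,season,season_id
womens_volleyball,2023,10001
womens_soccer,2023,10002
field_hockey,2023,10003
"""

# Placeholder URL pattern, NOT the real NCAA stats route.
URL_TEMPLATE = "https://example.org/seasons/{season_id}/teams"


def load_season_ids(text):
    """Map (sport, season) pairs to season IDs from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["sport"], int(row["season"])): row["season_id"] for row in reader}


def season_url(season_ids, sport, season):
    """Build the URL for one sport and season; raises KeyError if unknown."""
    return URL_TEMPLATE.format(season_id=season_ids[(sport, season)])


season_ids = load_season_ids(SEASON_IDS_CSV)
print(season_url(season_ids, "womens_soccer", 2023))
```

Keeping the IDs in one lookup means a new season is a one-row change rather than an edit to every scraper.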
Sharp readers will notice that most of these repositories are about women’s sports. That’s intentional, from the standpoints of scarcity and of teaching. Most of my sports data students are men. They should know more about women’s sports. If you want to, I can help.
Washington Post Alerts
If you’re a Washington Post subscriber, you might have gotten a notification on Saturday night about the attack at the White House Correspondents Association dinner.
In a way, I did, too, although I haven’t signed up for actual SMS notifications from the Post. What I do have is more than 7,000 news alerts that the Post has sent since March 1, 2024, give or take some number lost when my collection efforts broke down. These come via an undocumented API on washingtonpost.com that provides details about each alert, including the story it points to, the timestamp and the category, which the data calls the alert title.

By far the largest number of alerts - more than 1,700 - are categorized as “Breaking News”, which makes sense. But that leaves thousands that didn’t meet that criterion, including alerts on advice columns, opinion pieces and a regular news quiz.
What’s interesting about this data, other than that it conveys a sense of what Post editors think is worth interrupting your day to let you know about, is that you’d think those categories would be fairly broad. But there are more than 400 of them, many of them single-use categories that are much more specific than “Breaking News”, including “Nor’easter Updates” and “Inside FEMA”. So either editors don’t really have a good collection of categories or they don’t use them. One hint: the last alert under the “Editor’s Picks” category went out on March 22.
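If you wanted to run that kind of tally yourself, a minimal sketch might look like the following. The records here are invented sample data, and the field names ("title" for the category, "sent" for the timestamp) are my assumptions about how you might store the alerts, not the actual shape of the Post's feed.

```python
from collections import Counter

# Invented sample records; field names are assumptions for illustration.
alerts = [
    {"title": "Breaking News", "sent": "2024-03-01T12:00:00"},
    {"title": "Breaking News", "sent": "2024-03-02T08:30:00"},
    {"title": "Editor's Picks", "sent": "2024-03-22T17:00:00"},
    {"title": "Inside FEMA", "sent": "2024-04-10T09:15:00"},
    {"title": "Breaking News", "sent": "2024-04-11T21:45:00"},
]


def category_counts(alerts):
    """Count how many alerts were sent under each category."""
    return Counter(alert["title"] for alert in alerts)


def single_use(counts):
    """Return the categories that appear exactly once, sorted by name."""
    return sorted(name for name, n in counts.items() if n == 1)


def last_sent(alerts, category):
    """Latest timestamp for a category, or None if it never appears.

    Assumes timestamps are ISO 8601 strings in one consistent format,
    so plain string comparison orders them correctly.
    """
    times = [a["sent"] for a in alerts if a["title"] == category]
    return max(times) if times else None


counts = category_counts(alerts)
print(counts.most_common(1))
print(single_use(counts))
print(last_sent(alerts, "Editor's Picks"))
```

The single-use list is the quickest way to see how many categories were created for one alert and then never touched again.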
Last fall, I had Claude Code build a slightly unserious website called Post Ping that allows you to explore the alerts. Silly? Sure. Might there actually be some interesting patterns in the data? Definitely.