Parsing Pennsylvania Election PDFs

April 12, 2026


            
OpenElections Pennsylvania
I spent the most time this week on openelections-data-pa, working with Claude Code to produce more precinct-level results parsers for the 2025 general election. Using Natural PDF, I added parsers for results files produced by the Electionware system, updated existing parsers, and added a vote_for column to the parser output to identify races where voters can choose more than one candidate.
I also wrote new county-level precinct parsers, again with Natural PDF, for Cameron, Huntingdon, Mifflin, and Snyder counties, then refactored the shared Electionware parser into a reusable module so future counties are easier to add.
All of that fed into new precinct-level general election results for more than a dozen counties, including Berks, Blair, Centre, Chester, Clearfield, Elk, Franklin, Lawrence, Northampton, Northumberland and Washington.
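For anyone curious what the vote_for column involves, here is a minimal sketch of the idea, not the actual parser code. It assumes the contest header text has already been extracted from the PDF and that it carries a phrase like "Vote for 2" or "Vote for not more than 3"; the real Electionware layouts may word this differently. The function name, regex and sample headers are all hypothetical.

```python
import re

# Assumption: the contest header contains "Vote for N" (any case),
# optionally worded as "Vote for not more than N".
VOTE_FOR_RE = re.compile(
    r"\bvote\s+for\s+(?:not\s+more\s+than\s+)?(\d+)\b", re.IGNORECASE
)


def parse_vote_for(contest_header: str) -> int:
    """Return how many candidates a voter may choose in a contest.

    Falls back to 1 when the header carries no "Vote for N" phrase.
    """
    match = VOTE_FOR_RE.search(contest_header)
    return int(match.group(1)) if match else 1


# Hypothetical headers, standing in for text pulled from a results PDF.
headers = [
    "School Director (Vote for 4)",
    "COUNCIL MEMBER VOTE FOR 2",
    "Mayor",
]
rows = [{"office": h, "vote_for": parse_vote_for(h)} for h in headers]
```

Defaulting to 1 keeps single-seat races consistent with the rest of the output, while anything greater than 1 flags a multi-seat race.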
NCAA Sports Data
I expanded NCAALacrosseData, which formerly covered only women's lacrosse, to include men's lacrosse match and player stats. With Claude Code doing most of the work, I added men's 2026 team URLs, wrote a generator script, and published 2026 match stats for men alongside women's player stats and match stats. The code was originally written in R but is now in Python.
Separately, I switched NCAAWomensSoccerData's tooling to Python from R and updated the 2025 match data.
Congress Press
The daily automated task of updating congressional press releases kept running, but the hands-on work this week was about data quality and historical depth. I backfilled press release text from 2009–2012 and updated text for 2025–2026. I fixed and replaced releases for several members (Trahan, Brecheen, Trent Kelly, Arrington, Griffiths), removed bad DocumentQuery URLs, wrote a script to cleanly remove a member's releases when systemic errors meant they had to be replaced, and rebuilt the dashboard. Nearly all of that work was Claude Code-assisted.
Scraper/Library Work
python-statement got several fixes: I merged a PR switching to urljoin for relative URL resolution, fixed some DocumentQuery scrapers, and added RSS support for Houlahan's office (plus a fix to that scraper). I had Claude Code debug and fix a scraper in religion-data that pulls United Methodist Church clergy assignment data.
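To show why urljoin is a good fit for relative URL resolution, here is a small sketch using Python's standard library. It is not the code from the merged PR; the helper name and the example.house.gov URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Hypothetical listing page, standing in for a member's press release index.
page = "https://example.house.gov/media/press-releases"

# Root-relative href: the path replaces the page's path entirely.
root_relative = resolve_link(page, "/media/press-releases/some-release")

# Bare relative href: resolved against the page's parent "directory".
bare_relative = resolve_link(page, "release-123")

# Already-absolute href: returned unchanged.
absolute = resolve_link(page, "https://other.example.com/x")
```

Naive string concatenation handles at most one of these three cases correctly, which is usually where scraper bugs of this kind come from.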
Uplink
IRE used to have a print newsletter called Uplink dedicated to the practice of computer-assisted reporting. It had a brief but lively run during the late 1990s and early 2000s. I’ve obtained a nearly complete collection of those newsletters, scanned them in and had Claude extract the text and stitch together individual articles into JSON files. This week I reorganized the repo, backfilled some missing issues, and added a README. Looking forward to doing more with this text-as-data project.
You can see a full list of commits in my dataset-related repositories here.
    
