Label data by hand
We often start coding to avoid "by hand" work. Here's why labeling your own data is worth the time.
Welcome to Data Dash
Compressing an avalanche of thoughts about data into byte-sized chunks. In your inbox every two weeks on Wednesdays.
Wait, what?
Avoiding “by hand” work is often why we got into coding. Labeling data when there are already piles of it sounds like a fool’s errand.
I almost turned my nose up when my job proposed this kind of project. Now, I can’t get enough of labeling my own data. Here’s how I think you could benefit from doing the same.
Expanding what’s solvable
I don’t want to know what a podcast’s average rating is on Spotify. I want to know if I’ll think the latest episode is a banger. We often have to settle for an indirect proxy of our desired outcome. When we label data by hand we can start down the road toward the answer we want.
If you happen to have clout in your organization, advocating for this extra step can pay dividends. And if you’re doing a project for fun, you can explore what you want to know instead of a less interesting substitute.
You can label data in a way no one else can
Only you have your background, skills, experiences, and current environment. “The great man theory of history is hot garbage...you are irreplaceable.” This is just as true for you as it is for anyone.
Creating data is work, and there’s work out there only you can do. You’re not obligated to do it, but everyone will benefit if you give data labeling a shot.
Activation energy is low
If you want to label the data, you can just do it. You can get cranking in a spreadsheet, code up a quick web app, or whatever you want. If you hate the idea of a spreadsheet and don’t want to code the web app from scratch, take this Shiny for Python one I threw together and retrofit it.
Worst case, this exercise goes nowhere and you’re better acquainted with your data. Best case, you get to build something unique that makes you proud.
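If neither a spreadsheet nor a web app appeals, a terminal prompt can also get you started. Here is a minimal sketch using only Python's standard library. It is not the Shiny for Python app mentioned above, and the function names, the `id` and `label` CSV columns, and the y/n label set are all placeholders I made up for illustration.

```python
import csv
from pathlib import Path


def load_labels(path):
    """Return {item_id: label} for rows already saved, or {} if no file yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as f:
        return {row["id"]: row["label"] for row in csv.DictReader(f)}


def label_items(items, path, ask=input, allowed=("y", "n")):
    """Prompt for a label on each unlabeled item and append it to a CSV.

    items: iterable of (item_id, text) pairs.
    ask: function that shows a prompt and returns the answer (input by default).
    """
    path = Path(path)
    done = load_labels(path)
    write_header = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "label"])
        if write_header:
            writer.writeheader()
        for item_id, text in items:
            if str(item_id) in done:
                continue  # labeled in an earlier session, skip it
            prompt = f"{text}\nLabel ({'/'.join(allowed)}) or q to quit: "
            answer = ask(prompt).strip().lower()
            if answer == "q":
                break
            if answer not in allowed:
                continue  # unrecognized input: leave it unlabeled for next time
            writer.writerow({"id": item_id, "label": answer})
            f.flush()  # write each label right away so a crash loses little
    return load_labels(path)
```

Calling `label_items([("ep1", "Episode 1 description"), ("ep2", "Episode 2 description")], "labels.csv")` asks about each item in turn. Because labels are appended as you go and already-labeled ids are skipped, you can quit with `q` and pick up where you left off later.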
Bonus: Avoiding horrific mistakes
We’ve all heard horror stories of data that seemed fine until someone looked at it. Getting into the habit of labeling your data has the great side effect of forcing you to look at the data.
This practice isn’t a substitute for robust, automated checks. It’s a safety net for the stuff we swore those checks should have caught and somehow didn’t.
To argue against myself
Of course, labeling your own data isn’t always the right move. More often than not, the indirect proxy will do well enough. And the work is non-zero even in ideal circumstances.
Still, switching my approach from “never do anything by hand” to “what’s a good opportunity to label some data” has helped me build stuff I find more compelling.
Maybe you’ll have the same experience, maybe not, and either way I’d love to hear how it goes for you. Feel free to reach out on Bluesky if you give data labeling a shot!
A data thing I liked
A tour de force overview of data viz accessibility by Sarah L. Fossheim
A non-data thing I liked
The sham legacy of Richard Feynman