What I look for in empirical software papers
Oh God it's already Thursday
Behind on the talk and still reading a lot of research papers on empirical software engineering (ESE). Evaluating a paper that studies software engineers is a slightly different skill than evaluating other kinds of CS papers, which feel a little closer to the hard sciences or mathematics. So even people used to grinding through type theory or AI papers can find ESE studies off-putting.
So here are some of my thoughts on how I approach it. This is mostly unedited and unorganized, just getting my thoughts down.
The high-level view
The goal of ESE is to understand how we develop software better in the real world. Compared to "regular" computer science, ESE has two unique problems:
- As with all kinds of knowledge work, software development is highly varied and lots of factors affect the final outcome.
- Software engineers develop new tools/processes/paradigms incredibly quickly and there's maybe a 1000:1 ratio of developers to researchers.
This makes it hard to say anything with a high level of confidence. Instead of thinking about whether a paper is correct or incorrect, I look for things that make me trust it more or less. I'll use this interchangeably with "like/dislike": "I like this study" means the same as "I trust it more".
Quantitative Metrics
I like metrics that are clear, unambiguous, and sensible to practitioners. "Quality" and "defects" are too ambiguous, "coupling" and "cohesion" aren't sensible. "Number of support tickets filed" and "distribution of model code lengths" are better.
In practice, clear and unambiguous metrics end up being more narrowly-scoped. I trust research on narrow metrics more than research on broad metrics, even if the former are less applicable to software at large.
Clinical Trials
I don't like "clinical trials", where developers are split into control and experiment groups and asked to complete a task. The problem is sample size: actual clinical trials in medicine are run on thousands of people at a time and cost millions of dollars. Most software clinical trials study less than 50 engineers at a time. With such a small sample size, there's just too much room for confounding factors.
It doesn't help that most studies are done on university students, and those results may not generalize to more experienced professionals.
There are a couple of exceptions, though. I'm more inclined to trust clinical trials on how to teach CS, as students are the target population anyway. And I pay more attention to clinical trials that find enormous effects. For example, Need for Sleep finds that a night of sleep deprivation reduced the trial group's coding quality by fifty percent.1 That's not the kind of effect you can write off as a confounding factor!
(I'd still like to see it replicated, though. Also, does mild sleep deprivation affect coding skill?)
Other Quantitative Research
Scraping GitHub for data is fine, as long as the researchers are careful to audit the data. Don't publish a groundbreaking paper claiming a correlation between language and program quality but accidentally count each Bitcoin bug as 17 C++ bugs.
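One basic audit step is deduplicating mined commits, so the same upstream fix that's been vendored or forked into many repositories only counts once. A minimal sketch; the input file and its fields are hypothetical stand-ins for whatever a real mining pipeline produces:

```python
# Deduplicate mined bug-fix commits by SHA so the same upstream fix (say, a
# Bitcoin patch copied into many forks) is only counted once per language.
# "bug_commits.json" and its fields are hypothetical, not a real dataset.
import json
from collections import Counter

with open("bug_commits.json") as f:
    commits = json.load(f)  # list of {"sha": ..., "language": ..., "repo": ...}

seen = set()
bugs_per_language = Counter()
for c in commits:
    if c["sha"] in seen:  # already counted via another fork or mirror
        continue
    seen.add(c["sha"])
    bugs_per_language[c["language"]] += 1

for lang, n in bugs_per_language.most_common():
    print(f"{lang}: {n} distinct bug-fix commits")
```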
Developer surveys are good for measuring things like developer perceptions, emotional state, how widespread a technology is, etc. I don't think they're good at measuring the effectiveness of practices or determining whether those opinions match reality. I.e., if a survey showed that 90% of developers think their Java code is buggier than their Kotlin code, that tells us developers perceive Kotlin as higher quality, not that it actually is!
Qualitative Studies
I like qualitative studies: things that aren't gathering hard metrics. If a quantitative study of correctness in imperative vs logic programming would ask "which has fewer bugs" or "what kinds of bugs does each have", a qualitative one would be more like "how do imperative and logic programmers debug stuff differently" or "let's dissect this one unusual bug and see if we learn anything interesting."
IME science fields undervalue the impact of qualitative research. I find far fewer qualitative papers than quantitative ones, but I tend to trust the qualitative papers more. Maybe they're just overall higher quality; more likely it's because they're more narrowly scoped. And I believe (without evidence) that a field needs a strong foundation of qualitative knowledge to do stellar quantitative work.
Some papers do both qualitative and quantitative work. The researchers in Abbreviated vs. Full-word Identifier Names started with a clinical trial, and then shadowed individual developers to see how they navigated a code file. I trust the quantitative sections of hybrid papers a little more than just-quantitative papers, but it's not an enormous factor.
"Quantitative" and "qualitative" aren't words anymore.
Reading the Paper
I use the same basic approach with all CS papers: read the abstract and intro, read the conclusion and post-results discussion, then decide if I care enough to look further. If I like the results, I read the methodology more carefully. I skim background info and "related work" mostly to see if they make any claims I'd want to read up on.
Sometimes I'll use an AI to generate a paper summary. I only do this if I'm going to read the full paper; the summary is just to help me digest it.
I trust papers that share their data.
Threats to validity
Most software papers have a "threats to validity" section, where they talk about possible confounding factors. In practice it's mostly a way for the writers to show they covered their bases, by explaining why said confounding factors aren't a problem. I like papers that include TtVs they can't dismiss, or that discuss unusual-but-important TtVs. I don't like papers that have bad explanations for why TtVs aren't a problem, or that miss really obvious TtVs.
Example of doing it wrong: Quantifying Detectable Bugs in JavaScript claims that TypeScript catches 15% of "public bugs for public projects on GitHub". Their threats to validity were "we only looked at public bugs" and "we may have been too inexperienced in TypeScript to catch even more bugs." One TtV they do not list is that they uniformly sampled all of GitHub for bugs and didn't filter the projects at all. When I checked some of their projects, a bunch were simple lesson repos by novices in code bootcamps.2 Why didn't they mention this TtV?
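For contrast, here's the kind of project filtering I'd want a study like that to at least discuss. A minimal sketch using GitHub's public search API; the star and activity cutoffs are arbitrary assumptions for illustration, not anything from the paper:

```python
# Sample non-trivial JavaScript projects instead of uniformly sampling all of
# GitHub. The cutoffs (stars, last-push date) are arbitrary illustrations.
# Unauthenticated requests to the search API are heavily rate-limited.
import requests

query = "language:javascript stars:>=50 pushed:>=2023-01-01"
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": query, "sort": "stars", "per_page": 30},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()
for repo in resp.json()["items"]:
    print(repo["full_name"], repo["stargazers_count"])
```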
Example of doing it right: oh God it's already Thursday and I haven't gotten this newsletter out yet
References
For many papers, the most useful part is the reference section, because odds are I'll find a more interesting paper there. Even if none of the references look interesting, I'll still check a few to make sure the original paper is using them correctly. I don't like it when a paper says reference R makes claim C about topic T, and then I read R and it never even mentions T.
I wish journals let papers have two reference sections, one for the really essential references and one for the incidental ones, like background information or related work.
...Okay, that's all I got this week. Back to the talk mines.
Things I'm curious about
- DORA Report: What's the state of research about the DORA report? Has anybody been able to replicate their findings? Has anyone evaluated them critically?
- Puzzle Games: Puzzle games like Flow Free have thousands of levels. Are these programmatically generated? If so, how is it done, and how do they control the difficulty?
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.
My new book, Logic for Programmers, is now in early access! Get it here.