Aug. 20, 2025, 5:26 p.m.

Entscheidungsprobleme

Closed Form

One of the boulders that I am most commonly pushing uphill in my statistical work for clinicians is the recognition that missing data can be meaningful, depending on how they arise. This is a subroutine of a larger program, one in which I aim to impart some kind of coherent global understanding of What We’re Doing When We Do Statistical Inference. This global understanding involves much more than just the internal mechanics of statistical estimation (t-tests and regression coefficients); it involves thinking about data distributions, how they arise, and the patterns of missingness within them. In short, it involves thinking a bit more like an epidemiologist, and the way I phrase it is always some variation of “it’s more of an art than a science!” in a fake-ass chirp that gets me, more often than not, a vacant Gen Z stare.

The epidemiology jargon for the categories of missing data is murderously stupid and confusing (“missing completely at random,” “missing at random,” “missing not at random”) and so I completely avoid using it. It’s easier and more clarifying to talk about mechanisms of missingness. Most of the data we analyze is collected observationally, which is to say, without an investigator controlling the experimental conditions and assigning an experimental treatment to different subjects at random. Usually, we’re just passively collecting data on people as they do stuff, making note of important characteristics, like exposures they might have that we’re interested in, clinical outcomes, demographic and social characteristics, and so forth. It is often the case that some of these data are incomplete – missing. This could be for completely meaningless reasons, like somebody just forgot to enter something, or a random Excel error turned a couple of fields into unreadable nonsense. Or, it could be for reasons that are, as we say, “systematic,” related somehow to the characteristics of the populations under investigation. 

Not knowing why the pattern of missing data looks the way it does can introduce serious bias when, in Doing Statistical Inference, we’re using the tools of statistics to estimate the magnitude of association between an exposure and a clinical or health-related outcome we care about. Here’s an extremely basic example. Say we want to estimate the association between a certain type of exercise and a cardiovascular disease outcome. If participants in the study with worse cardiovascular problems or general health to begin with are unable to complete the exercise, their data for the exercise field will be missing, and – if the investigator is not careful – these folks with missing exercise data will be grouped, in the analysis, with the people who could have done the exercise, but didn’t. In this case, the estimated association between the exercise and the cardiovascular outcome will be biased – in a setup like this one, likely overstated, since the sickest participants inflate the apparent risk in the “didn’t exercise” group (though, as always, it depends).
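Here is a toy simulation of that example, with made-up numbers of my own (nothing here comes from any real study): the sickest participants can’t complete the exercise, their exercise field goes blank, and the naive analysis codes them as non-exercisers. The size and direction of the bias depend on the particulars; under these particular numbers the naive contrast is exaggerated relative to an apples-to-apples comparison.

```python
# Toy simulation (hypothetical numbers): miscoding missing exercise data
# as "didn't exercise" biases the estimated exercise-outcome association.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Baseline cardiovascular severity (higher = sicker).
severity = rng.normal(size=n)

# The sickest participants can't complete the exercise at all; among the
# able, half happen to do it (assigned at random, for a clean benchmark).
able = severity < 1.0
exercises = able & (rng.random(n) < 0.5)

# Exercise genuinely lowers outcome risk; severity raises it.
p_outcome = 1 / (1 + np.exp(-(-1.0 + 1.5 * severity - 1.0 * exercises)))
outcome = rng.random(n) < p_outcome

# Benchmark contrast: exercisers vs. able-but-didn't. Exercise was random
# among the able, so this comparison is apples to apples.
true_rd = outcome[exercises].mean() - outcome[able & ~exercises].mean()

# Naive analysis: the unable (whose exercise field is missing) get lumped
# in with the non-exercisers.
naive_rd = outcome[exercises].mean() - outcome[~exercises].mean()

print(f"benchmark risk difference: {true_rd:+.3f}")
print(f"naive risk difference:     {naive_rd:+.3f}")
```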

There’s a bit of a wrinkle here. The distinction between “missing at random” and “missing completely at random” in the epi jargon (ugh, it’s so horrible) has to do with whether you can use the information that you do have in the data set you’re using (what we call the observed variables) to generate a reasonable estimate of the probability of missingness. So in our example, if we had collected information about the severity of existing cardiovascular disease, we could use that to predict whether the exercise field would be missing for a given participant. If we hadn’t, then we’d be out of luck; the exercise data would be “missing not at random.” (Various strategies for dealing with this exist, which I won’t get into here.) Crucially, this little illustration raises some general issues about what you can tell about the mechanism of missingness from within the data itself, versus what you can infer about the mechanism by looking outside the data. Investigators I work with constantly want to do formal statistical tests that will give them a p-value to decide whether their data is missing completely at random, at random, or not at random, but I never promote or encourage these tests, because the closed system of a data set and the variability therein is not really all that informative as to why certain patterns of missingness are what they are. Sooner or later, you’re better off supplying some kind of outside-the-data explanation, even if it’s just educated conjecture.
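The “use what you observe” idea can be made concrete. Below is a minimal sketch with invented data (the variable names and numbers are mine): severity is recorded for everyone, and the chance that the exercise field is blank climbs with severity. Tabulating missingness by severity tertile surfaces the pattern from inside the data.

```python
# Minimal sketch (hypothetical data): an observed variable that predicts
# which participants have a blank exercise field.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

severity = rng.normal(size=n)                    # observed for everyone
p_missing = 1 / (1 + np.exp(-(severity - 1.0)))  # sicker -> more likely blank
exercise_missing = rng.random(n) < p_missing

# Crude version of "use what you observe": missingness rate by severity tertile.
cuts = np.quantile(severity, [1 / 3, 2 / 3])
for label, mask in [
    ("low severity ", severity <= cuts[0]),
    ("mid severity ", (severity > cuts[0]) & (severity <= cuts[1])),
    ("high severity", severity > cuts[1]),
]:
    print(label, f"missing rate: {exercise_missing[mask].mean():.2f}")
```

What a table like this cannot tell you is whether severity is the whole story (“missing at random”) or whether missingness also tracks something you never measured (“missing not at random”) – that judgment has to come from outside the data.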

Douglas Hofstadter’s 1979 nerd classic Gödel, Escher, Bach makes a distinction between such in-system and out-system thinking, designated “M mode” (for mechanical, or in-system) and “I mode” (for intelligent, able to surpass the interior workings of the system to think about the system). There’s also a very interesting illustration, using the concepts of figure and ground from visual perception, of theorems in a given formal system covering a subset of a larger set of so-called unreachable truths, and negations of those theorems covering a subset of a larger set of so-called unreachable falsehoods. I’m not going to go further into this book, or into Gödel’s incompleteness theorems, both to save myself the pain of it and to avoid inappropriately extending Gödel’s reasoning (or Hofstadter’s for that matter) – narrowly concerned with formal logical systems – into the pseudo-philosophical nether realms I’m skipping around in here. I just want to establish this point about different modes of thinking about (and approaching) the universe of truths and non-truths, and which truths (or non-truths) are reachable from within a given system or at a given level of analysis (M vs. I).

Historian and poet Peter Dale Scott – the unsung Rosetta Stone of a certain brand of very-online political Twitter posting that was popular around 2016 – is often lumped in with conspiracy theorists. PDS coined the term “deep state” and his many books give the “deep politics” treatment to everything from the JFK assassination to 9/11. I don’t think he minds being identified with the ‘noided and the conspiracy-minded in the slightest, but to call him an outright conspiracy theorist is, to my mind, a little bit unfair. His is an epistemologically interesting kind of historiography; his books are hard to read because they meticulously document gaps and redactions in official documents, and aim, through even more meticulous sifting of different eyewitness accounts, supplemental documents, testimonies, and other flotsam, at constructing narratives that can explain the patterns of suppression in the official record. (This is conspiracist, sure, but on a different level than a book like SK Bain’s The Most Dangerous Book in the World, which argues that 9/11 was literally a mass occult ritual.) (And just what was Cheney doing for those 15 minutes in the tunnel outside the PEOC?) I mention this in the first place because it’s cool, and because I like his books, and because I think his method of reading history is an interesting one, and often more transparent about its methodology and assumptions than the official stories. I also mention it because it’s a way of reading historical data that is similar to the epidemiological way of reading observational data. Some things are not necessarily discernible or provable from within the system, whatever that system is, and recourse to another level of analysis – in both the conspiracy-history and epidemiological cases, a higher level of analysis, thinking about the record or the body of data – is necessary. 

This leads to interesting places because we don’t always have formal systems or objective rules for these outside-system explanations. Of course, conspiracy thinking, unconventional historiography, and epidemiologic data analysis are not mutually exclusive. This is why, to my thinking anyway, it doesn’t make sense to try to counter, say, health-related conspiracy thinking with explainer-type scientific reading of different experimental data or results. The conspiracy thinking resides at the higher level, the I-level, where some kind of outside-system explanation has to be supplied, at least provisionally, to make sense of what the data actually say.

You just read issue #74 of Closed Form. You can also browse the full archives of this newsletter.
