“All models are wrong, but some are useful.”
Are the Pandemic Mitigation Collaborative models — which are wrong — useful?
I wrote a post about this already, about the wastewater-to-cases conversion that the model purports to do; I found it extremely unconvincing, to say the least. Here, I want to get down a few quick thoughts about the forecasting model. A new version of the forecasting model launched today; here is Dr. Hoerger’s X post about it. The only substantive difference is that the underlying data sources have changed: instead of relying exclusively on Biobot wastewater data, the model now draws on both Biobot and CDC wastewater data. According to the post: “Essentially, we link all three data sources, which have been active over different points of the pandemic to derive a composite ‘PMC’ indicator of true levels of transmission.”
These three sources are, again, Biobot and CDC wastewater data, and modeled IHME case estimates from prior to April 2023. No matter which way you slice it, this is not good data; converting wastewater to cases remains impossible to do reliably, and the PMC do not have access to any actual “ground truth” data about COVID transmission. They are as in the dark about it as any member of the general public. Claims about the greatness of the data are simply exaggerations. I remain skeptical that reliable modeling of wastewater concentrations to daily new cases is even possible, let alone forecasting based on those new case estimates.
I watched the video on the website again yesterday, before the “new” model dropped, to critique the forecasting procedure. I am very concerned by what I saw. The forecasting model extrapolates cases four weeks into the future accounting for two things: “stability” and “change.” I guffawed at first because it sounds a bit unscientific to me, but really, stability and change aren’t bad as metaphorical groupings for abstract modeling parameters. I’ll concede this one, haha.
This part of the video is pretty vague, and it would be most fruitful to look at the software code used to run the models (a call I will repeat at the end), but in the absence of that, here is what I gather about how it works. First, Dr. Hoerger splits the current year into 26 two-week intervals. Then comes the stability: he looks back at what case rates were for the same interval in years prior, and “updates” that based on the circumstances of the previous four weeks. Not the previous four weeks of this year, though; rather, the four weeks that preceded the same two-week interval in years prior. If transmission in those prior-year weeks was higher than it is currently, the forecast estimate is adjusted up; if lower, it is adjusted down. The video is vague about how the model “takes into account” what is happening in the most recent four weeks, including the slope of the case estimate line, but apparently it does; I’m guessing this is the “change” part.
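To make that concrete, here is a minimal sketch in Python of the procedure as I understand it. To be perfectly clear: this is my hypothetical reconstruction of a vaguely described method, not the PMC’s code. The function name, the inputs, and the size of the adjustment are all my own placeholders.

```python
def pmc_style_interval_forecast(prior_year_interval_mean,
                                prior_year_runup_mean,
                                current_level,
                                nudge=0.10):
    """Hypothetical reconstruction of the 'stability + change' step
    described in the video. Not the PMC's actual code."""
    # "Stability": start from case rates in the same two-week interval in prior years.
    estimate = prior_year_interval_mean

    # "Change": compare the four weeks that preceded that interval in prior years
    # to current transmission, and nudge the estimate up or down accordingly.
    # The video does not say how large the adjustment is; 10% is a placeholder.
    if prior_year_runup_mean > current_level:
        estimate *= (1 + nudge)
    elif prior_year_runup_mean < current_level:
        estimate *= (1 - nudge)

    return estimate
```

Even in this stripped-down form, notice that every input is itself a modeled estimate rather than an observed case count.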
(Just a note: this is learning from the past in an inflexible way, which may not be appropriate for something like COVID that really hasn’t settled into a predictable pattern, even if transmission has generally been lower over the past year or so than in previous years. To paraphrase Levins and Lewontin, “things are the way they are because they got that way,” and it’s not clear that this forecasting model has any way to account for qualitative transformations in the course, nature, or pattern of the pandemic.)
In the video, Dr. Hoerger narrates: “If I run a regression model using these variables to predict what transmission will be like in a week, it accounts for 96% of the variation in what transmission will be like in a week.” Then he goes into an (in my opinion) unnecessary analogy using a pie/pie chart to explain what 96% means. This is not what I am confused about at this point. What I’m confused about is: what does this regression model look like?
To be clear, my first and most important issue with this model is that the data and assumptions going into it are fundamentally flawed. The wastewater to case translation is extremely unreliable and there are good reasons to believe that the case estimates are extremely inaccurate. These inaccurate case estimates are then going into the forecasting model to generate predictions for the future. Accurate prediction is not possible without accurate “ground truth” data inputs, though.
My second issue — “what does this regression model look like?” — is that the forecasting methodology seems wrong. It’s possible that I am wrong about this and there’s more going on than what is talked about in the video (if so, please correct me). But based on what I saw and heard, it sounds like these case estimates are just being put into a regression model with some different variables. These might be something like… one variable representing cases last year at this time, one representing case estimates last week, one representing the direction of the change from last week to this week (0 for decreasing, 1 for increasing), and so on. (Again, I cannot stress enough, these are not cases!!! They are essentially numbers from out of a hat. Using made-up numbers to predict other, similarly made-up numbers!)
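To illustrate what I mean, here is what a model of that general shape might look like as an ordinary least squares regression in Python, fit to completely fake data. The variable names and the structure are my guesses; I have no idea whether the PMC model actually looks like this, which is, again, part of the problem.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
days = np.arange(2 * 365)

# Two years of fake daily "case estimates" -- made-up numbers standing in for made-up numbers.
cases = 500_000 + 150_000 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 20_000, days.size)

df = pd.DataFrame({"cases": cases})
df["cases_next_week"] = df["cases"].shift(-7)            # outcome: the estimate one week ahead
df["cases_last_year"] = df["cases"].shift(365)           # the estimate at this time last year
df["increasing"] = (df["cases"].diff() > 0).astype(int)  # 1 if rising vs. yesterday, else 0
df = df.dropna()

X = sm.add_constant(df[["cases", "cases_last_year", "increasing"]])
fit = sm.OLS(df["cases_next_week"], X).fit()
print(f"R-squared: {fit.rsquared:.2f}")  # very high, because the series is smooth and slow-moving
```

The point is not that this is what the PMC did; it is that a model of this general shape is consistent with everything said in the video, and nothing in the video rules out something this simple.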
The 96% number is the proportion of the variance in the outcome “explained” by the variables in the model, also known as R-squared. This does not have a causal interpretation. (Although Dr. Hoerger notes in his X post that the new forecasting model has an R-squared of 98%! Wow, a two-point improvement!) This is a statistical metric summarizing how close the data points are to the regression line that was “drawn” through them by the model. Which brings me to another point…
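For anyone who wants the formula rather than the pie chart: R-squared is 1 − (sum of squared residuals) / (total sum of squares), where the residuals are the gaps between the observed values and the model’s fitted values. It measures how tightly the historical points hug the fitted line; it says nothing about causation and, importantly for a forecasting model, nothing by itself about how well the model will predict data it hasn’t seen.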
Is this just a linear regression model? This is the core of my second issue about the forecasting model. A linear regression model would not be appropriate to use here for a number of reasons. Here’s just one: COVID cases (or, in this case, the made-up case estimates the model uses) are a time series. Time series data usually exhibit “autoregression” — case estimates that are closer together in time, like on successive days of the same week, are generally more similar to each other than case estimates more distant in time, like today vs. one year ago today. This autoregressive structure has to be addressed analytically in order to get accurate predictions out of the model. (I wrote exactly one paper as a postdoc building a forecasting model; I am no expert in this and not specifically trained to know how to do it, so I was learning from scratch as I went and LET ME TELL YOU, it was a big pain in the ass to prepare the time series data to appropriately meet all the requirements for time series/predictive modeling. Just mentioning this to say that I know from painful experience what a heavy lift it is to model these data accurately enough to use them for forecasting. The technique I used is called ARIMA, which is much more lovely-sounding than its companion and sometimes-preferable approach GLARMA.)
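For the curious, here is roughly what fitting an ARIMA model looks like in Python with statsmodels. This is a generic illustration on fake, autocorrelated data, not a prescription for the PMC; the simulated series, the (1, 1, 1) order, and the four-week horizon are placeholders I chose for the example.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)

# Fake daily series with autocorrelation: each day depends on the day before.
n = 365
series = np.zeros(n)
series[0] = 500_000
for t in range(1, n):
    series[t] = 0.95 * series[t - 1] + 25_000 + rng.normal(0, 10_000)

y = pd.Series(series, index=pd.date_range("2024-01-01", periods=n, freq="D"))

# ARIMA(p, d, q): autoregressive order, differencing, moving-average order.
model = ARIMA(y, order=(1, 1, 1))
fit = model.fit()

# Forecast four weeks ahead, with uncertainty intervals.
forecast = fit.get_forecast(steps=28)
print(forecast.summary_frame(alpha=0.05)[["mean", "mean_ci_lower", "mean_ci_upper"]])
```

Note that you get uncertainty intervals along with the point forecast, which is half the reason to model the time-series structure properly in the first place.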
Speaking of accuracy, I guess it doesn’t really matter because all the numbers are made up anyway, but typically, some degree of predictive accuracy is desired. If your model predicts that there will be X cases a few weeks from now (and you’re all over Twitter parroting the figures you derived from it, like 1.2 million new cases per day), you want to be pretty sure that there will be X cases, give or take some margin of error, a few weeks from now. Forecasting models are usually rigorously evaluated in terms of their predictive accuracy before they are actually used to make predictions about the future. One common approach to this is a training/testing approach.
Say you want to build a forecasting model. The first thing you need is good data that you are confident reliably represent some kind of truth. (The PMC model, as I will say over and over until the end of time, does not have this.) But if we did have this, we would build the forecasting model on only part of the historical data we have — the training set. Then, we’d use the remainder of that historical data — the “test set” — to test it out and see how accurately the model can make predictions based on data that it hasn’t “seen” before. Splitting your original data to do this is a best practice because while you don’t know the ground truth in the future (which hasn’t happened yet), you do know the ground truth in your historical data (or you would, if you were modeling correctly). If your predictive model generates reasonably accurate predictions on the portion of data you “hid” from it (which you can check, because that portion of historical data has the true values you are trying to predict), then you’re good to go. If it doesn’t, you need to go back and tweak the model parameters, repeating the process iteratively until you get reasonable accuracy.
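In code, the bare-bones version of that workflow looks something like the following (again in Python, again on fake data, since the point is the procedure rather than the numbers). The split point, the model, and the error metrics are choices I am making purely for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
n = 365
series = np.zeros(n)
series[0] = 500_000
for t in range(1, n):
    series[t] = 0.95 * series[t - 1] + 25_000 + rng.normal(0, 10_000)
y = pd.Series(series, index=pd.date_range("2024-01-01", periods=n, freq="D"))

# Chronological split: never shuffle a time series before splitting.
train, test = y[:-28], y[-28:]

fit = ARIMA(train, order=(1, 1, 1)).fit()
predictions = fit.forecast(steps=len(test))

# Compare predictions to the held-out "truth" (which, for the PMC, doesn't exist).
mae = np.mean(np.abs(predictions.values - test.values))
mape = np.mean(np.abs((predictions.values - test.values) / test.values)) * 100
print(f"Mean absolute error: {mae:,.0f} cases/day ({mape:.1f}% on average)")
```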
A linear regression model is also not appropriate because it imposes strong assumptions on the data. One of these is that change happens in a linear fashion (y = mx + b, where y is the outcome, x is the predictor variable, m is the slope of the regression line, and b is the intercept), when outbreaks of infectious disease usually exhibit exponential (that is, nonlinear) growth on the way up and a similarly nonlinear decline on the way down. Another important assumption linear regression imposes is that the events we are modeling (in this case, new COVID infections per day) are independent: that someone else becoming a COVID case has nothing to do with my risk of being exposed to or infected by COVID.
This is obviously, uh, a highly inappropriate assumption for infectious diseases (remember, once we’re in the realm of supposed cases, we’re modeling, or at least pretending to model, COVID transmission; infectious disease modelers have a whole suite of methods, adapted from traditional statistics, to account for the uniqueness of infectious disease data). We are dealing with events that are very, very much dependent, but not straightforwardly so. There are spatial and temporal components to this “statistical dependence,” the non-independence of infection risk between people. Someone getting infected with COVID today in Washington State has basically zero impact on my risk of contracting COVID myself. Someone getting COVID on my block, in my family or social group, or among my coworkers raises my risk of contracting COVID dramatically. The PMC model ignores these dependencies in the data during modeling and, in the reporting of the modeling, artificially smooths them away. The estimates of “% chance someone in a room is infectious with increasing numbers of people in the room” are misleading because they ignore this dependency and its spatial and temporal clustering. They treat COVID transmission as if it happens randomly and uniformly among billiard balls whose interactions have no effect on each other’s risk, rather than in a highly structured way among living organisms enmeshed in multiple overlapping systems of contact and dependence.
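For context on where those room figures presumably come from: if you assume each of n people in a room independently has the same probability p of being infectious, the chance that at least one of them is infectious is 1 − (1 − p)^n; with p = 2% and n = 30, that works out to 1 − (0.98)^30, or roughly 45%. I am assuming this is the calculation behind the PMC’s figures, since the code isn’t public, which is rather the point. But the formula is only as trustworthy as the independence assumption baked into it, and for a disease that spreads through clustered, structured contact, that assumption is exactly what’s in question.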
Fundamentally, we just do not know what the modeling approach is. The data quality issues and the incorrect approach to estimating cases from wastewater data are enough to put this model to bed forever (it is not reliable, and no one should trust it), but for the sake of transparency, and because this model is being cited and peddled so widely, I think it’s time for the PMC to publicly post their modeling code and details about the modeling approach. By this I mean whatever R scripts or Stata code (or whatever the case may be) were written to generate and evaluate the model, along with information about how accuracy is evaluated before forecasts are made, how model performance is monitored, what types of modeling approaches are being applied to deal with the autoregressive and dependent structures of infectious disease time-series data, and so forth.
I know this is a bit technical. If you have questions about any specific component of the model, whether I have touched on it here or not, please feel free to email me. I’ve been told that even mentioning this model in a critical light is akin to “doing DoorDash discourse,” which is ridiculous. It’s all too easy to make a scientific career out of lying to people with semi-technical language. The motivations of the PMC seem fine, but it is not acceptable to peddle scientific-sounding falsehoods to the general public, and especially not to the segment of the general public that is most desperate for usable information. I don’t like it when antivaxxers do it, I don’t like it when COVID deniers do it, and it turns out I still don’t fucking like it when well-intentioned COVID commentators and influencers do it. Numbers are seductive because numbers are powerful. But if you want COVID to be in the news again, spreading numerically precise but completely false information is a self-defeating way to do it. Sooner or later the emperor will be revealed to have no clothes.