A short disclaimer about Death Panel up top: I am no longer affiliated with the podcast. The TL;DR is: my views do not reflect the views of the show, don’t harass them if you don’t like this post. Don’t harass them if you do, either. Basically, don’t harass them at all. If you have any feelings about the podcast in relation to this post, I encourage you to go subscribe to the show, or write an entry in your journal about how much you hate my guts, or both.
The aim of this post is to give you, the reader, some tools to understand what you’re looking at when you see figures and estimates from the Pandemic Mitigation Collaborative (PMC). If you are online at all, you have probably seen these figures and estimates, as I have many times (just in the past few days, I have seen this model cited in outlets as diverse as the World Socialist Web Site, Current Affairs, Self Magazine, and People Magazine). Before I launch into my detailed critique of the model, a few general considerations. In my opinion, it is not “COVID minimization” to want to know the truth about COVID, and to want other people to know the truth about COVID, even when this truth is less comforting than fabricated certainty. People have a lot of different ideas, some unconscious, about how scientific claims translate into political or social activity, or change in “the real world.” I think exaggeration and outright fabrication of claims about COVID, its impacts, and the level of transmission currently underway in the country correspond to a well-meaning but (in my opinion) incorrect theory of change – the idea that if enough people understand how bad it is, some kind of corrective action will be taken (just look at climate change). If the idea is to use scientific data to develop programs for political organizing, it is crucial to subject the truth claims supported by the data to verification. Organizing that is built around incorrect interpretations of scientific data is self-undermining, sooner or later. You don’t need to agree with my critique or my analysis, because the point is not that I have some kind of correct answer or magic bullet (one weird trick to solve the pandemic!) that other people don’t. I’m making this critique because it is part of my praxis as a leftist and a scientist to subject scientific claims with political import to scrutiny. That’s it.
Proper identification of the problem is key to strategizing about the problem politically. “I have to wear a mask everywhere and individually take on the burden of COVID precautions because COVID is airborne HIV and it is surging out of control worse than at almost any time previously” is a different articulation of the political pickle we are in now than “I must always assume the worst and individually take on the burden of COVID precautions because we have only very vague and belated indicators of where the virus is spreading thanks to the federal government’s abandonment of COVID monitoring.” If the idea is to make continuing or reinstating COVID precautions seem reasonable, I think a more effective way to do that is to emphasize the uncertainty people are being forced to live with rather than asserting a false certainty that things are secretly much worse than anyone knows. To be extremely, excruciatingly clear, I have no issue with people continuing to take whatever COVID precautions they can. I even think it’s a good thing to take COVID precautions right now and in general (I’m an epidemiologist, after all!). At the same time, a lot of the precautions discourse (much of which hinges on things like estimates from the PMC model) smooths over how burdensome and outright impossible it is to access “the tools” at this point in time. Masks are really expensive, and people are really broke. I was trying to find rapid antigen tests in the grocery store the other day and straight up couldn’t – four grocery stores, two drug stores, and half a tank of gas later, my friend came through in the clutch with some tests he had at home.
Finally, I am absolutely not saying that COVID infections aren’t up or increasing right now. From where I live, and from my anecdotal experience, it is clear that COVID is on the rise again. Just how much is unclear, and this is the crux of the problem. I look at my local county health department website and see that cases are increasing in recent weeks, but still lower than cases during this past winter; I don’t know if this is because transmission is really lower, or because more people are asymptomatic, or because more people are doing rapid antigen tests or not testing at all. It’s probably some combination of the three, which is all I can really say – and that is a problem.
Okay, with this throat-clearing out of the way, let’s actually talk about the PMC model. My critiques are based on watching the 27-minute video about the model on their website. In this post, I will focus just on the first part of the video, in which Dr. Michael Hoerger describes the PMC’s COVID infection estimation model. I will do a follow-up post on the second part, the forecasting part, at a later time. My comments here are based on the information presented in the video, which is not completely comprehensive; if anyone can tell me why I’m reading the modeling approach wrong, or if I’m missing something, please do, so that I can update this post.
The PMC case estimation model claims to provide estimates of the level of COVID transmission in the country based on Biobot wastewater data (the number of copies of viral RNA in a given sample of wastewater from one of a hundred or so wastewater sampling sites nationwide). As far as I know, no one else has really tried to do this. Also as far as I know, this is probably because it’s really an impossible task. It’s not possible to estimate transmission from existing wastewater data. This post will explain why.
The PMC case estimation model is based on two data components: 1.) the Biobot wastewater data, and 2.) estimates of COVID transmission from the Institute for Health Metrics and Evaluation (IHME). These data sources are both intractably flawed in terms of the stated objective here (using wastewater data to estimate transmission).
First, wastewater: as of 5/30/2024, Biobot has gotten rid of their COVID data dashboard. They are, as far as I can tell, still producing “respiratory risk reports” – combined weekly surveillance, by US region, of influenza A and B, respiratory syncytial virus (RSV), and COVID-19 in wastewater from participating sites. There are intractable limitations to these datasets. In the first place, it is absolutely not clear what is causing increases in COVID RNA in wastewater. It could be a raging outbreak in humans, it could be an undetected outbreak in animals (wild or domestic), it could be that COVID is out there circulating but subclinically, not translating into a huge number of infections due to shifting characteristics of evolving variants and fluid population immunity (more on this last point shortly).
Second, there is the issue of representativeness. Biobot samples wastewater from a large handful of participating sites – I seem to recall it’s around 150 or something like that, but I can’t find a list of sites on their website. Already, this does not give a granular picture of what’s happening across the United States, or in a particular location. Furthermore, the sites that are included are, as far as I can tell, municipal sewer systems. This introduces an element of selection bias into the data. By this, I mean that the data systematically exclude any locations without a sewer grid – places that are much more likely to be rural, poor, and majority non-white, all characteristics that we know are associated with higher risks of COVID transmission and poorer long-term outcomes from COVID. This is not a knock on Biobot, their program, or wastewater data itself – it’s just what it is: an intractable limitation of the data that limits the inferences that can be drawn from it. Further issues involve standardization and variability in the number of people and animals contributing to a given watershed at a given time.
The case/transmission estimates come from the IHME models, as Dr. Hoerger (the narrator of the explanatory video) says. Dr. Hoerger says that these are “not reported cases, but actual cases.” I think he is wrong about this. According to my reading of the IHME website, IHME paused their COVID-specific modeling in December of 2022 and folded COVID into their Global Burden of Disease Study (whose modeling strategy has been widely criticized elsewhere, including early in the pandemic, when IHME was providing modeled estimates of projected infections and hospital capacity that were highly unstable and, in the end, incorrect). The COVID modeling they were doing through April of 2023 (I think?) was exactly that – modeling – meaning the estimates are not “actual” cases, per Dr. Hoerger, but rather synthetic estimates produced by applying statistical assumptions and models to some kind of real data. The sources of real data to inform these models have become more and more scarce over time, so it is safe to say that the estimates have become more and more abstracted from the reality on the ground. It’s not clear what the state of the estimates is now, post-April 2023, but I think we can safely say they are less reliable than at previous points in the pandemic, because the real data used to inform the model have gone away due to administrative sunsetting and the like.
In order to come up with a model relating wastewater data to COVID transmission, you would need several things. The first two are reliable wastewater and case data for the relevant jurisdiction (in this case, the United States overall) – which we do not have. You would also need parameters relating the wastewater concentrations to active infections, which would be extremely difficult to estimate even with really good data. Transmission of COVID occurs when somebody exhales enough virus to infect you, and you breathe it in. That “enough” probably differs both between variants and subvariants of COVID and between individual people – my “enough” might be radically different from your “enough.” (In epi parlance, this “enough” is referred to as the “infectious dose.”) The relationship could be informed, crudely but adequately, with parameters describing population immunity and the infectious dose and R-naught of different variants, etcetera. Problem is, we don’t have anything like this anymore. We are not tracking COVID variants to this level of detail, and the picture of population immunity is highly muddied. Immunity doesn’t last forever, and the vaccines aren’t fully sterilizing (they don’t prevent all infections). Vaccine uptake is dogshit thanks to Biden’s anemic federal response; most people have had COVID, but not all of them recently, and so on. The number of unknown parameters becomes large very fast in this kind of information environment – so large as to effectively foreclose any modeling effort like this.
Let’s pause here and review. The PMC has built a model that purports to look at wastewater data and transmute it into case/transmission data for the United States. That model rests on two very limited data sources, reflecting highly abstract and/or synthetic modeled estimates of the relevant quantities. This model is already a bit of a house of cards, distant from the reality of what is being modeled and dependent upon a scaffold of highly simplifying and highly questionable assumptions. If any one of those assumptions is even slightly wrong, as at least one almost certainly is, the house of cards collapses. But let’s continue, and see what method is being applied to convert wastewater into cases.
I am not an expert on this type of modeling, far from it. As far as I know, though, there isn’t really a method for converting wastewater data into case data – as the foregoing discussion shows, both wastewater and COVID transmission are too geographically and temporally heterogeneous, and too complex, at least without collecting better data to inform the model. We don’t have the information on the parameters we would need to estimate a reasonably accurate (if crude) model. No burying the lede here: the method is a simple multiplier. What this means is that the “model” consists of multiplying the wastewater number by a constant number in order to get the case estimate. In the explanatory video, Dr. Hoerger talks about how this multiplier was developed. First, he explains that he correlated some observations from the Biobot and IHME data: “I took a sample of data from the Biobot wastewater that’s available online, took a sample of data from IHME and ran a correlation between them. The correlation between the samples I analyzed was 0.94, which is very high. It’s perhaps the most impressive correlation I’ll ever see in my life.”
A few words about correlation, because this seems to be an important motivator for the multiplier approach. You can calculate the correlation (co-relation) between two variables with a simple formula: in the numerator, the sum of the products of each observation’s deviations from the two variables’ means; in the denominator, the square root of the product of the sums of the squared deviations. Don’t worry too much about the formula. The upshot is that correlation is high if the two variables “vary” around their means together (the numerator is the “covariance”). An obvious way to get a high correlation coefficient is for two variables to genuinely co-vary, which seems plausible for something like wastewater copies/mL and incident COVID infections, especially in earlier years of the pandemic. But two caveats. First, correlation is dimensionless and invariant to rescaling: wastewater data are in copies/mL and COVID transmission is in cases per some unit of population (usually 100,000 people), and a correlation coefficient – however high – tells you nothing about how to convert one unit into the other. Second, both series are strongly trended over time, and correlating two trending time series routinely produces inflated coefficients; a high r can reflect the shared trend rather than a tight mechanistic link. There are other considerations, but I will stop here – this is just to give you the sense that this high correlation coefficient isn’t the slam dunk it is being presented as.
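One concrete way to see how a high coefficient can mislead: here is a small simulation (entirely synthetic numbers, not Biobot or IHME data) in which two series that merely share an upward trend correlate very strongly, even though their day-to-day fluctuations are independent.

```python
# Entirely synthetic illustration: two series that share only an upward
# trend correlate very strongly, even though their fluctuations are
# independent. A high r can reflect the shared trend, not a tight
# mechanistic link between the two quantities.
import random

random.seed(0)
n = 100
series_a = [t + random.gauss(0, 5) for t in range(n)]       # stand-in "wastewater"
series_b = [3 * t + random.gauss(0, 15) for t in range(n)]  # stand-in "case estimates"

def pearson_r(xs, ys):
    """Pearson correlation: covariance over the product of the spreads."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

print(pearson_r(series_a, series_b))  # comfortably above 0.9
```

The noise terms here are generated independently of each other; the coefficient is high purely because both series march upward together.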
Here is how Dr. Hoerger describes how he constructed the multiplier the model uses:
“Go take a sample of data from the Biobot dashboard. Go pick out 10 or 12, 20, 25, whatever, data points you want from their dashboard, try to get a representative series of dates – don’t just pick the dates with the highest levels or the lowest levels – prior to April 1st and put in an Excel file what those wastewater levels were for those dates. Then go to the IHME model and figure out what the estimated number of daily cases were for those dates… Take that IHME case estimate value and divide by the Biobot data, and do that for each of the dates… what you can do is then you’ll have this value, this multiplier, for each of those dates. You can take the average, you can take the median, you can take a trimmed mean if you know how to do that, but you’ll get some value for converting these raw Biobot estimates into daily cases… I get a value of 1,455 [with the trimmed mean].”
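For concreteness, the procedure in that quote can be sketched in a few lines of Python. All of the numbers below are invented placeholders, not actual Biobot or IHME values:

```python
import statistics

# Invented placeholder data (NOT actual Biobot or IHME values):
# wastewater levels in copies/mL and modeled daily case estimates,
# paired up on the same hand-picked dates.
wastewater = [150.0, 300.0, 450.0, 700.0, 520.0]
case_estimates = [2.2e5, 4.5e5, 6.4e5, 1.0e6, 7.6e5]

# One ratio per date, then a central value of the ratios.
ratios = [c / w for c, w in zip(case_estimates, wastewater)]
multiplier = statistics.median(ratios)  # a mean or trimmed mean would also "work"

# "Estimating" daily cases on a new date is then a single multiplication:
todays_wastewater = 400.0  # hypothetical copies/mL
estimated_daily_cases = todays_wastewater * multiplier
```

That is the entire conversion step: every downstream figure inherits whatever error lives in those hand-picked ratios.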
A few concerning things here. The multiplier is being developed using a non-random sample of dates; and what even is a “representative sample” of dates during the four years of the pandemic? The PMC model divides some case estimates by wastewater numbers, takes an average (or median, or trimmed mean) of the resulting ratios over however many dates were chosen (why not use all the data available?), and this is the multiplier: 1455. The PMC model is, literally, just multiplying wastewater numbers (in copies/mL) by 1455, and we are supposed to believe that this gives us an estimate of new daily cases. The multiplier does implicitly carry units – cases per (copies/mL) – but those units were conjured by dividing one shaky estimate by another on a handful of hand-picked dates. Treating the result as a conversion factor assumes that the ratio of cases to wastewater concentration is a fixed constant of nature, the same across variants, seasons, immunity landscapes, and surveillance regimes – which it is not.
This seems pedantic, but it is actually meaningful. At this point, what should concern you the most is that this is an extremely crude method applied to highly flawed and inadequate data – like building an expensive addition on the house of cards (concrete cantilevers, Frank Lloyd Wright style). The model takes some convenient sample of a few case estimates, divides them by a few corresponding wastewater estimates, and then uses the result as a multiplier. This approach is totally agnostic about the actual mechanisms relating wastewater to cases – all the stuff discussed earlier about how this relationship is probably changing over time due to new variants, fluctuating immunity, and so on. This approach also uses crude math to simply smooth over considerable and consequential heterogeneity. We all remember from the last four years of COVID that COVID transmission doesn’t rise linearly and simultaneously in all locations of the country. Outbreaks blister up in one place then subside, but not before seeding outbreaks in other, distant places, related to the complex alchemy of travel and commercial patterns. But the PMC model assumes exactly such homogeneity: it is baked into the method that COVID transmission is linear, 1:1, the same in space and time, uniformly rising or falling nationwide.
At this point, I think an example is warranted and probably a relief for the beleaguered reader. Let’s imagine an analogous situation but about something nice, baked goods, instead of something horrible, like the novel coronavirus. Let’s say that we have information about how many bananas are sold at one grocery store or produce wholesaler apiece in about a hundred cities in the United States. From this, we want to try to estimate how many banana muffins are bought each day. Our first problem is the missing data problem – this is analogous to the situation with wastewater. Our picture of how many bananas are being sold is too limited to make inferences about what’s going on in other areas of the country – it could be that other areas of the country have no banana sales, or that other areas of the country not measured with our banana surveillance are selling bananas at a rate that is… well… bananas (sorry, I couldn’t resist). See the problem? Right off the bat, we have a severely circumscribed picture of banana sales. Our second problem is the muffin data. To make this analogous to the PMC model, let’s say that we used to count every muffin, but now we don’t, and we rely on modeled estimates of past muffin sales, and maybe some other stuff that is correlated with muffin sales (like coffee sales) to estimate the number of muffins being sold today. Now, to relate the two to each other, we follow the same procedure as the PMC model. We divide the number of muffins by the number of bananas on some non-random handful of dates we choose. We take an average of the resulting numbers to get just one number, multiply that number by each day’s banana sales, and call the result the number of muffins sold each day. Do you see the problem? Bananas and banana muffins are in different units, reflecting the qualitative transformation that a banana must undergo to become a muffin (mashed up, mixed with other ingredients, baked, packaged, delivered to the point of sale).
Multiplying each day’s banana sales by the constant we got (from dividing muffins by bananas on a few chosen dates) just rescales the banana curve and relabels it “muffins.” If the relationship between bananas and muffins ever shifts – a new recipe, a muffin fad, a flour shortage – our “muffin estimates” will keep faithfully tracking bananas while silently diverging from actual muffin sales. We have actually learned nothing about the number of muffins; we’ve just done some division and multiplication and told ourselves we have.
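This failure mode can be simulated directly. Everything below is invented; the point is only that a multiplier calibrated while the relationship held one way silently mis-estimates once the relationship shifts:

```python
# Toy illustration with invented numbers: if the true cases-per-(copies/mL)
# ratio drifts between eras (new variant, shifted immunity, changed
# shedding), a multiplier calibrated in era 1 silently mis-estimates era 2.

true_ratio_era1 = 1500.0  # hypothetical cases per (copies/mL) in era 1
true_ratio_era2 = 600.0   # hypothetical drifted ratio in era 2

# "Calibrate" the multiplier on era 1, exactly as an average of ratios would:
era1_wastewater = [100.0, 200.0, 400.0]
era1_cases = [w * true_ratio_era1 for w in era1_wastewater]
multiplier = sum(c / w for c, w in zip(era1_cases, era1_wastewater)) / len(era1_wastewater)

# Apply it to era 2, where the relationship has changed:
era2_wastewater = 300.0
estimated = era2_wastewater * multiplier    # what the method reports
actual = era2_wastewater * true_ratio_era2  # what is "really" happening
print(estimated / actual)  # 2.5: a 150% overestimate, with no way to notice
```

Nothing in the procedure flags the drift; the estimate keeps tracking wastewater while its relationship to cases quietly breaks.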
With this in mind, let’s wrap up my remaining critiques of this model. It does not have what we in the biz call face validity. Face validity is a concept usually applied to psychometric constructs and things like that, and it’s a subjective thing: does it seem to you like this construct accurately represents what it is supposed to represent? (For example, if you ask different people what the face validity of something like IQ is, you’ll probably get different answers; somebody who really believes “intelligence” is a unitary thing that can be quantitatively measured will think IQ has high face validity, while somebody like me who doesn’t believe that will think that IQ has low face validity). The model relating wastewater estimates to case estimates shows 6.5 million new infections per day at the peak of the Omicron wave in the winter of 2021-2022. This is obviously false. There were a shit ton of new infections per day then, something like 1-2 million; this is probably an undercount, but likely not a super dramatic one, because home testing was not widely accessible at that point.
Additional “metrics” reported by the PMC seem to be extrapolations based on this approach – further additions on the house of cards. One of the claims with the most currency is that “1 in X” people in the US are currently infectious with COVID. This claim is based, first, on the flawed cases-from-wastewater estimation procedure, plus some additional unjustifiable assumptions. I am unable to replicate Dr. Hoerger’s “1 in X” calculation from the video, which demonstrates this procedure using data from November 2023. I feel confident in saying, though I am not sure of the exact mathematical procedure he used to arrive at this calculation, that this metric also assumes a uniformity and homogeneity of COVID transmission in space and time which we know from long experience is not a reasonable assumption. Right now, I would guess that COVID transmission is higher in Los Angeles than it is here in Pittsburgh; and even here in Pittsburgh, wastewater indicators are very low even as I know a lot of people who have had COVID in the last few weeks.
The assumptions made about infectiousness are also inappropriate. The five-day infectious period Dr. Hoerger assumes in making the calculations is arbitrary but fine – this is not the problem. The problem is that the model has no method (as far as I can tell from the video – please correct me if I am wrong here) for subtracting people who are no longer infectious from the count. (The most basic epidemiologic model of infectious diseases, the SIR model, takes this as one of its three compartments – SIR stands for Susceptible, Infectious, Recovered or Removed.) People who contract COVID do stop being infectious at some point, whether it’s after five days, or ten, or fifteen. The exact number of days is a coarsening modeling assumption, and different estimates could be tested. But if the PMC model does account for attrition from the infectious population via recovery or death (recovery, in the sense of no longer being contagious, is mercifully much more common than death), this is not addressed in the video. As it is, the failure to account for the cessation of infectiousness will inflate estimates of the number and proportion of people who are infectious in the United States. And the simple, aspatial mathematics smooth the concentration of infectious people evenly over the whole country. It’s likely that in some places in the US right now, more than one in 75 people are currently infectious with COVID, and in other places, substantially fewer.
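To make the recovery point concrete: under a fixed infectious period of D days, the number of currently infectious people on a given day is a rolling sum of the last D days of new cases – people recover out of the window – not a running total. A minimal sketch, with invented incidence numbers:

```python
# With a fixed D-day infectious period, prevalence on day t is the sum of
# new cases over the last D days (a rolling window), because people infected
# earlier than that have recovered out of the count. Incidence is invented.

def currently_infectious(daily_cases, infectious_days=5):
    """Prevalence under a fixed infectious period: sum of the last D days."""
    return [
        sum(daily_cases[max(0, t - infectious_days + 1): t + 1])
        for t in range(len(daily_cases))
    ]

daily_cases = [100, 200, 300, 300, 200, 100, 50, 50]  # hypothetical new cases/day
prev = currently_infectious(daily_cases, infectious_days=5)

# A "1 in X" figure then spreads that prevalence uniformly over the whole
# country -- the homogeneity assumption doing silent work:
US_POP = 333_000_000
one_in_x = US_POP / prev[-1]
```

Notice that prevalence turns over and declines as the big incidence days age out of the window; a method that never removes anyone can only ratchet upward.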
Finally, the model purports to estimate new weekly long COVID cases based on the estimated level of transmission. This is, again, a crude multiplicative manipulation lacking face validity. In order to model this correctly, you would first need an accurate count of how many new cases there are each day or week (whatever unit of time you like) – which we don’t have. On top of that, you would need some kind of reasonable parameter capturing how many new cases result in long COVID (and a working definition of long COVID). There could be a range of reasonable parameters here, and it would be reasonable to test them all, if they existed; as it is, the PMC model appears to just be plucking figures from published literature on long COVID – figures that correspond to earlier variants, earlier phases of the pandemic with different levels of infection- and vaccine-induced immunity, and different numbers of people susceptible to long COVID. The estimates of new weekly long COVID cases from the PMC model I have seen are implausibly huge – everyone in the US would have long COVID several times over by now if the model’s numbers and assumptions were correct. They don’t, because reality is more weird, complex, and qualitatively shifting (this will be important in the next installment of this critique) than can be captured with the crude, linear mathematical tools of the PMC model.
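To see how quickly sustained weekly figures compound, here is a back-of-envelope check with clearly hypothetical numbers (not the PMC’s actual figures); with larger weekly estimates, the implied cumulative total overshoots the population even faster:

```python
# Back-of-envelope sanity check with HYPOTHETICAL numbers (not the PMC's
# actual figures): a large weekly new-long-COVID estimate, sustained over
# the pandemic so far, implies more cumulative cases than there are people.

US_POP = 333_000_000
weeks_elapsed = 4 * 52               # roughly four years of pandemic
weekly_new_long_covid = 2_000_000    # hypothetical sustained weekly estimate

cumulative = weekly_new_long_covid * weeks_elapsed
population_equivalents = cumulative / US_POP
print(population_equivalents)  # about 1.25: more long COVID cases than people
```

This is the sense in which implausibly large weekly estimates fail a basic consistency check against the size of the population.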
One could offer some thoughts about what this all means, the political and personal significance of what’s happening in “the current moment” (hate that phrase), or whatever. I’m not going to do any of that. I’m also not going to be so arrogant as to pretend to know what you “should” do about this or in light of this information. We’re all on our own out here in 2024, as far as COVID and so many other things are concerned, and that simply fucking sucks. I can see how it might be tempting to lean into this model and its estimates because it seems like an easy enough way of getting COVID back in the national conversation and on people’s minds again. I could say that I worry about that sort of approach – a political agenda built on incorrect scientific claims is a fragile and vulnerable one – but I’m not even arguing that. I’m just trying to show you how to read the model.