Can someone build a Bayesian tool that takes into account your symptoms and where you live to estimate your probability of having coronavirus?

Carl Mears writes:

I’m married to a doctor who does primary care with a mostly disadvantaged patient base.

The problem her patients face is that if they get tested for COVID, they are supposed to self-quarantine until they get their test results, which currently takes something like a week. Also, their *family* is supposed to stop going to work, etc.

Seems like a reasonable approach, until you consider that there are often a lot of people in their house, none of whom can afford to lose their job and no extra rooms to quarantine in. And living with someone who has been tested but not yet “positive” is not a valid sick leave excuse. So this is a big ask that will likely lead to people not wanting to get tested.

What would really help would be a statistical tool that takes into account where the patient lives (and maybe where they work) to generate a Bayesian prior, and then integrates reported symptoms (cough, fever, tiredness, loss of taste, etc.) to find a statistically defensible probability of infection that could guide the request to quarantine.

For an extreme example, a persistent cough and loss of smell would mean something very different in someone that works in a warehouse in the Bronx vs a ranch owner in rural Montana.

It seems like the P(symptom | COVID) probabilities are available. I am not sure about the total-population P(symptom), across COVID and NOT_COVID, which one would also need, right?

What do you think? Have you heard of anyone making a tool like this?

My reply:

Yes, I guess you could do a formal (or informal) decision analysis.

A starting point is that people are already balancing their estimated risks and benefits in some way to decide whether to be going to work, how much to go out, etc.

You can roughly say that a person is in one of 3 x 2 possible states:
– You have no symptoms or you have symptoms
– You have not taken a test, you have a negative test result, or you have a positive test result.

Most people are in the “no symptoms, no test” state and they’ve already decided how to live their lives. If you are tested and the result is positive or negative, that gives you more information. The difficult case, as you note, is if you have symptoms but your test result isn’t in. Having symptoms increases the probability of having COVID, so it should make some people not go to work who were otherwise planning to go to work. But I’m not sure how helpful it would be to have some sort of Bayesian calculator for this, partly because these probabilities are not known very well and partly because it’s not so clear how to pipe this into a decision.

In any case, I agree with you that it’s not right to tell people to self-quarantine before the test results are in. If they want to tell people to self-quarantine if they have symptoms, that could make sense, but then the decision is based on the symptoms, not on the fact of having taken a test.

Carl’s response:

As far as COVID, I am thinking of a tool that takes into account symptoms and where the person lives to get a better estimate.

Simple tools are widespread in the MD community for making all kinds of decisions. You can play around with a huge collection of them here. (If you don’t like clicking links (you shouldn’t) just search for mdcalc.)

The innovation would be automatically using home address to inform the prior. This information could be automatically harvested from the web in an ongoing way (e.g., from the NYT site or Johns Hopkins). Even at the county-by-county level, this would be useful.

What would be specifically needed to make such a tool?

My reply:

Yes, I guess this would be possible. Maybe it’s a good idea. The basic model would go like this: You’d be trying to estimate your status (never exposed, currently contagious, or exposed and no longer contagious). Your geographic information and exposure (some measure of how many at-risk people you are in close contact with) would give you a prior probability, or base rate, for each of these three states. Then you’d also need the probability of each possible cluster of symptoms given each of the three states. This model implicitly assumes that these conditional probabilities don’t depend on your location and exposure. All these probabilities would have a lot of uncertainty, but it still seems better than nothing.
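To make the three-state model concrete, here is a minimal sketch in Python of the update it describes. The function name, the state labels, and every number below are hypothetical placeholders, not estimates; a real tool would need priors from local case data and symptom likelihoods from clinical studies, each with substantial uncertainty.

```python
def posterior_over_states(prior, likelihood):
    """Apply Bayes' rule over a discrete set of states.

    prior:      dict mapping state -> P(state), from location and exposure
    likelihood: dict mapping state -> P(observed symptom cluster | state)
    """
    unnormalized = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical prior for one location (illustration only).
prior = {"never_exposed": 0.94, "contagious": 0.01, "recovered": 0.05}

# Hypothetical P(persistent cough + loss of smell | state) (illustration only).
likelihood = {"never_exposed": 0.02, "contagious": 0.30, "recovered": 0.03}

post = posterior_over_states(prior, likelihood)
for state, p in post.items():
    print(f"{state}: {p:.3f}")
```

Changing only the prior (say, a warehouse worker in the Bronx versus a rancher in rural Montana) while holding the likelihoods fixed is exactly the assumption noted above: that the symptom probabilities given each state do not depend on location.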

So maybe someone wants to build this? Or has already done so?


  1. Jon Zelner says:

    Hmm – my only reservation about this is that the probabilities will be quite low, even when there is a lot of transmission going on. So for this to be effective, at the very least, the outputs would have to be gauged in terms of both *direct* risks to the individual as well as *indirect* risks associated with the person being ill and not quarantining.

    • Exactly. Even in NYC, which was clearly the hardest-hit location in the US, if we accept at face value that around 20% of people got it, and it took about 50 days to reach that level… Assuming the disease lasts about 20 days for most people, and you are most infectious for about 13 days (2 or 3 days before symptoms to 10 days after)… At any given time you probably had only on the order of 0.2 * 13/50 = 0.05, or 5%, of randomly chosen people from NYC infectious.

      For areas like southern CA, where we are currently probably at the stage where several percent have ever been infected (so let’s call it 5%), on any given day the probability that a randomly chosen person has infectious-stage COVID is probably at most 1%.

      Sure, add symptoms and it goes up, but when you look at the testing data, it’s something like 2M tests have been given in CA and somewhere around 100k cases… so any randomly chosen person who has sought a test has about .1/2 = 1/20 = 5% chance of being positive…

      When base rates are this low (~ 1% of people have the disease) a positive test is not even that diagnostic. p(disease | positive test) = p(pos test | disease)p(disease)/p(pos test) ~ 1 * .01 / (.05) ~ 0.2

      This is quick back of the envelope stuff, and should be informed by more information, but as it is right now, we’re doing a terrible and inefficient job of utilizing testing resources. We should be testing WAY MORE people using POOLED testing to find hot-regions of communities and using restricted movement and quarantine support (money) to limit spread while letting low-prevalence regions open more.

      Forcing individuals to bear the total cost of quarantine while it’s the general public that gets 100% of the benefit is stupid. Pay people to quarantine!

      • Hmm… rethinking the diagnostic calc, I was too quick there. Assuming a 2% false positive from contamination rate, and around 1% of people expressing the virus…

        p(pos) = p(pos | disease) p(disease) + p(pos | no dis) p(no dis) = 1*.01 + .02 * .99 = .03 so

        p(pos test | disease) p(disease)/p(pos test) = 1 * .01 / .03 = 33% chance you have the disease if you are randomly selected and test positive… a little higher than the .2 = 20% I calculated above… but still similar ballpark.

        Something seems weird about that. But I guess it comes down to the randomly selected bit. We don’t test random people, we test test-seekers. If you have symptoms you’re not randomly selected. Basically this tells you it’s not worth going to get a test if you aren’t symptomatic because even if it comes back positive it’s only maybe 1/3 chance you have the disease.

        Whereas, if you’re symptomatic one assumes the chance you have the disease with a pos test is much higher.
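The calculation in this comment can be written as a small Python function. This is only a sketch: the function name is made up, the 1% prevalence, 2% false positive rate, and perfect sensitivity come from the comment above, and the 20% prevalence among symptomatic test-seekers is a purely hypothetical value chosen to show how much the answer moves.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' rule."""
    p_pos = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    return sensitivity * prevalence / p_pos


# Randomly selected person: 1% prevalence, perfect sensitivity, 2% false positives.
ppv_random = positive_predictive_value(0.01, 1.0, 0.02)

# Symptomatic test-seeker: the 20% prevalence is a hypothetical assumption.
ppv_symptomatic = positive_predictive_value(0.20, 1.0, 0.02)

print(f"random: {ppv_random:.2f}, symptomatic: {ppv_symptomatic:.2f}")
```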

        • Ben says:

          > the probabilities will be quite low, even when there is a lot of transmission going on

          > When base rates are this low (~ 1% of people have the disease) a positive test is not even that diagnostic

          Knowing about this problem comes from the calculation though. Or it comes from calculation on a hypothetical model system.

          So I don’t think this specific reason at least can be the reason to not do the calculation — it requires the calculation to be done in the first place. Or something like that?

        • John Williams says:

          How should one think about this in the context of pooled sampling? For example, suppose that the health dept. in Humboldt County, CA (where I live) started sending teams to collect say 100 samples per day at the Costco parking lot or other similar places, and pooled splits of these into 20 lots of 10 for testing. The prevalence of Covid is low (102 confirmed cases so far from a population of ~136k), so the pooled samples would mostly be negative, and you could retest the reserved splits from any pools that came back positive to identify infected individuals. Thus, for any apparently positive case, there would be two separate tests, albeit on the same sample.

          Presumably, you would be testing asymptomatic people, and the main reason for doing it would be to get an early warning on a surge in new cases that might threaten hospital capacity. A secondary reason would be for contact tracing.

          • Hello to Humboldt. I had friends who lived on the Klamath River for decades until they moved recently. Beautiful region. I guess they were in Siskiyou but whatever, it’s all beautiful.

            Yes, pooled sampling works more or less like you’re talking about. The assumption is that we want to test a LOT of people but we can’t afford to do this 1-1 because of cost (let’s not get into the question of why does it cost ~ $100 to run a single PCR test when my wife’s bio lab can run PCR for like $1 each). You have a county population of 136k people. You have true prevalence in the range of 1% at most… Let’s send out an emergency alert system message to everyone’s phone, saying “if this phone’s number ends in xx please come to your nearest COVID testing location to help us detect and fight COVID infections” or some such thing.

            Now, suppose the county health board has done a good job of messaging around this and convinced the population that this is a safe and worthwhile activity. Around 1% of the population may show up, so that’s 1400 people or so, maybe at 5 or 10 different community locations.

            Now, to run 1400 tests each week would cost on the order of $140,000 a week in tests alone, not to mention time for the employees to collect them and the PPE and etc. And we’re pretty sure that all but around 15 people are going to be negative.

            Instead, we pool people into groups of say 40, and test each group in a single test. Cost is now $140,000/40 = $3500 which is a weekly cost that seems extremely doable.

            From this data, we can estimate prevalence in the community reasonably accurately. For 1400 people tested 40 per well, you can see differences in the posterior distribution down into the ~ 0.5% range and certainly can do probabilistic decision theory on things like whether there is evidence for growth or not.
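One way to see how pooled results inform prevalence is a grid approximation to the posterior. The sketch below assumes a perfect test (a pool is positive exactly when at least one member is infected), independent individuals, and a flat prior over prevalences from 0.1% to 10%; the function name and the count of 5 positive pools out of 35 are illustrative assumptions.

```python
from math import comb


def pooled_prevalence_posterior(n_pools, n_positive, pool_size, grid):
    """Posterior weights over candidate prevalences, flat prior on the grid."""
    weights = []
    for p in grid:
        # Probability that a pool contains at least one infected person.
        q = 1 - (1 - p) ** pool_size
        weights.append(
            comb(n_pools, n_positive) * q**n_positive * (1 - q) ** (n_pools - n_positive)
        )
    total = sum(weights)
    return [w / total for w in weights]


grid = [i / 1000 for i in range(1, 101)]  # candidate prevalences 0.1% .. 10%
post = pooled_prevalence_posterior(n_pools=35, n_positive=5, pool_size=40, grid=grid)

posterior_mean = sum(p * w for p, w in zip(grid, post))
posterior_mode = grid[post.index(max(post))]
print(f"mean {posterior_mean:.4f}, mode {posterior_mode:.4f}")
```

Comparing such posteriors week to week is one way to ask whether there is evidence for growth.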

            Yes, identifying individuals could be valuable, but for a place like what you have, it might be much better to simply monitor prevalence, and then when you see evidence for growth, send out a message saying “Weekly survey testing indicates COVID growth in our region. For the next two weeks we are moving to restriction level 3 where only essential services may operate starting tomorrow. Thank you all for helping us to control this disease”. Or whatever.

            Now, if you have to burn $150k a week to do this vs $3500/week it seems like there is a big advantage to pooling right?

            Let’s suppose you have 5 positive wells, you could re-test those 5 wells, pooled 1/10 with a different random assortment… and probably identify most of the individuals for an additional 5*40/10*100 = $2000 of testing.
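The cost arithmetic in this thread can be checked with a few lines of Python. The $100 per test, the 1400 people, the pools of 40, and the retest in pools of 10 are the figures from the comment; the function name is made up for this sketch.

```python
from math import ceil


def pooled_cost(n_people, pool_size, cost_per_test):
    """Cost of testing n_people in pools, one test per pool."""
    return ceil(n_people / pool_size) * cost_per_test


individual = 1400 * 100                  # one test per person
first_round = pooled_cost(1400, 40, 100)  # 35 pools of 40
retest = pooled_cost(5 * 40, 10, 100)     # 200 people from 5 positive wells, pools of 10

print(individual, first_round, retest)
```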

            I’m working on simulations surrounding the identifying individuals aspect next.

        • The one thing that puzzles me is that not all symptomatic people experience the same constellation of symptoms. For example, not all experience a fever, cough, or breathing issues.

          I like the idea of providing incentives to test. Preferably with an antigen test, which I understand is more accurate than an antibody test. Any of you have information on this?

      • Clyde Schechter says:

        “This is quick back of the envelope stuff, and should be informed by more information, but as it is right now, we’re doing a terrible and inefficient job of utilizing testing resources. We should be testing WAY MORE people using POOLED testing to find hot-regions of communities and using restricted movement and quarantine support (money) to limit spread while letting low-prevalence regions open more.
        Forcing individuals to bear the total cost of quarantine while it’s the general public that gets 100% of the benefit is stupid. Pay people to quarantine!”


      • Navigator says:

        I agree, but are you referring to antibody tests or nasal/oral swab ones when you propose mass testing? The results from antibody tests can give us some future planning tools, but randomly swabbing a population that can get infected 20 min. later doesn’t really help us learn a lot, unless the same individuals are tested multiple times.

        What am I missing?

        • Antibody tests are unhelpful when it comes to controlling the epidemic because by the time you have antibodies you’re recovered and are not infectious. If you want to control the epidemic you need to be able to react to people developing viral shedding.

          We need to be swabbing people for active virus and looking for changes in the prevalence of infectious people in the community. When prevalence increases, then restrictions on movement can be put in place, and when it decreases, then we can relax these restrictions.

          In a disease with asymptomatic individuals, we can’t rely on people seeking tests to detect changes in prevalence.

          See above response to John Williams that I’m about to write for more info on the math…

      • GH says:

        I so agree with this. And an easy, sustainable way to pay for this would have been to have people pay into a pool every time they take a test. The money could be used to pay for the cost of quarantining.

    • zbicyclist says:

      Thanks for the link. Sobering, though.

      “Among people who are the same age, sex, and health status as you and get sick from COVID-19, the risk of hospitalization is 83% , the risk of requiring an ICU is 51% , and the risk of dying is 68% .”

      If I don’t post here for a while, think of one of my old posts, and throw a mental 3 shovels of dirt on it.

      I have to hope that this reflects, in part, the nonrandom nature of the data involved in their model (testing being skewed, and nursing homes being hotspots), and the limitations of the medical history classifications.

      For example, my hypertension (high blood pressure) is being controlled to the normal range through relatively mild medications; that’s different than not being controlled, but I still checked that box. I checked the “current or former smoker” box, but I quit in 1995. If I pretend I’m still 69 instead of 70, and uncheck those health boxes, I get a risk of dying if I get COVID-19 of 3.2%.

      • confused says:

        Wait, how can the risk of dying (68%) be significantly greater than the risk of requiring an ICU (51%)? That seems very odd.

        • It’s probably not whether you “require” an ICU or not, but rather whether you’re likely to actually get one… There are plenty of people who die at home and by the definition they’re using those people probably didn’t “require” an ICU.

          • confused says:

            Possible, I suppose, but 68% seems way too high for any population large enough to have usable statistics (I mean, maybe 100+-year-olds with multiple comorbidities in nursing homes or something…) It’s not a 68% CFR even in nursing homes.

      • confused says:

        Yeah, the age seems to be divided into 10-year “boxes”. I’m 30; if I put in 29, my risk of dying drops by more than half, but if I put in 39, it stays the same… So this model will overestimate risk for people near the beginning of a decade and underestimate risk near the end of a decade.

        This really seems to suggest the risk for a lot of younger people in the US is not really that significant… it says “A score of 50 is defined as an equal disease burden as the flu, estimated based on total number of flu cases, hospitalizations, ICU admission, and deaths in the 2018-2019 flu season. For every 10x change in (Exposure*Susceptibility), the score will change by 50/3.” So it’s claiming my risk is less than the flu (below a risk score of 50), which must be because I live in an area with relatively low COVID numbers; I think by 30 COVID’s IFR is clearly higher than the flu, though it might not be in the college-student age group.

  2. David says:

    I think an even better idea is creating a ‘what is the risk’ planner for event planners that can help them sort out the probability of having a covid positive person at their event. I have been doing that math for family who live in states where churches are reopening to show the probability of Covid+ individual at church services conditional on the size of the meeting given the number of active cases in their county. I have found that it does give folks pause.
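A minimal sketch of that kind of event-risk calculation in Python, assuming each attendee is an independent draw from the county population with the same chance of being infected (a strong simplification, since households attend together). The function name and the 1% prevalence are hypothetical.

```python
def prob_at_least_one_infected(prevalence, group_size):
    """P(at least one infected attendee) under independence."""
    return 1 - (1 - prevalence) ** group_size


# Hypothetical 1% prevalence of active infection in the county.
for size in (10, 50, 200):
    risk = prob_at_least_one_infected(0.01, size)
    print(f"{size} attendees: {risk:.0%}")
```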

  3. Jacob Steinhardt says:

    A general problem here is that symptoms are non specific, so you won’t get much evidence to distinguish it from flu etc. In areas that are not hard hit, flu-like diseases may have higher prevalence than COVID.

    • confused says:

      Isn’t that the point of the suggestion?

      Someone who has generic symptoms (consistent with COVID, but also consistent with flu or allergies – say a cough) and lives in an area with low COVID prevalence is very unlikely to actually have COVID, but in an area with high prevalence, it’s a lot more likely.

      • Jacob Steinhardt says:

        But at the population level, if everyone who might have COVID assumes that they don’t, it’s a recipe for disaster. My point is that you just need a more effective way than symptoms to figure out who has COVID. It’s possible that thermometer readings could be good enough (people are surprisingly bad at self-assessing fever), though tests would obviously be the best case long term.

        • confused says:

          >>But at the population level, if everyone who might have COVID assumes that they don’t, it’s a recipe for disaster.


          But it doesn’t really make sense for everyone with a cough to assume they do have it, either. Where I live, this spring was a really bad allergy season, and *most* people I talked to had a cough at some point. There’s no way more than half of a relatively low-risk population (one person to a cubicle at my workplace even before we went to telecommuting – which we did well before it was required in our area, lack of congregate living, etc.) in a relatively low-prevalence area has had it.

          >>My point is that you just need a more effective way than symptoms to figure out who has COVID.

          Yeah. But for much of the US, at least, it might also be OK to limit the “stay home if you have a cough even without a positive test” kind of advice to people who have significant interaction with higher-risk groups.

          There seems to be no realistic chance of eradicating COVID in the mainland US/contiguous states (Hawaii, Guam, USVI, etc. might be able to). Most asymptomatic people and some people with mild symptoms won’t seek testing, so you’re never going to catch everyone. So focus your efforts where it really matters…

          I mean, if your workplace is 20s and 30s people and it’s one-person-to-a-cube, the “expected benefit”/”averted risk” of staying home if there’s, say, a 1% chance your cough is actually COVID may not be that high. But if you work in a nursing home, or a prison with lots of medically vulnerable individuals, you’d need to stay home even for a 1% chance.

          • Brent Hutto says:


            You consistently take the sort of stance in these discussions I would have historically associated with Public Health professionals. Targeted measures with a realization that neither risks nor costs apply homogeneously to a population.

            It will be for future historians to puzzle out exactly when Public Health as a profession went from that sort of nuanced thinking to knee-jerk crazy talk at the first sign of a viral outbreak. Could be the effect only dates back to three years ago but I think it had been festering much longer than that.

            • confused says:

              Well, I think a lot of what has happened with COVID may be less of a reflection of incompetence of “public health” overall, and more a case of this particular disease falling between the cracks.

              Flu pandemics are well-known, we’ve had 4 in the “modern era” (1918, 1957, 1968, and 2009) and there is a lot of planning to deal with them.

              We haven’t had a coronavirus pandemic before, and at the very beginning COVID was starting to look like SARS, which was very scary when it happened (it had a much higher IFR – though it ended up killing orders of magnitude fewer people overall). There are after all a lot of disease outbreaks which *are* successfully contained, and containment is obviously the first thing to try.

              So I think that COVID in March kind of fell into a “gap” of planning between “SARS-/MERS-/Ebola-like local outbreak, with limited travel-related spread of a few cases to other nations” and “global flu pandemic”. WHO was very late to declare it a pandemic (March 11th) and kept promoting containment strategies after that probably should have been abandoned as impossible (except for certain isolated island nations like New Zealand, Iceland, various Pacific islands etc.)

              WHO, CDC, etc. are after all bureaucracies, and bureaucracies have trouble dealing with things that fall in between, or outside, the “boxes” of existing plans – but aren’t different *enough* to make it instantly obvious that the existing “boxes” don’t apply.

              Now, there does seem to be a tendency to err on the side of high (worse/disastrous) estimates of the effect of epidemics/pandemics, but that might be a “selection bias” thing – people who go into epidemiology and pandemic planning are (almost by definition) very concerned about the possible impact of pandemics. Otherwise they would have gone into cancer research or Alzheimer’s research or something instead…

  4. Joseph Candelora says:

    Where is Carl Mears located that persons getting tested are asked to quarantine? I got tested in NYC a few weeks ago, and there were no recommendations that I take any special precautions.

  5. Kaiser says:

    It’s already been done, and it is a case study of what happens when obvious selection bias is ignored. Note that only people with severe symptoms were eligible to be tested, and based on the doctor’s note, even when more test kits are available, this bias will still remain. I explained why the first attempt to do this predictive model fails in this Wired column. A longer version is on my blog, where I show both of their major conclusions are wrong, because they didn’t do anything about the bias. The blog links to two other posts, where I discuss further problems with data collection and pre-processing.

    • Kaiser says:

      The key issue is that people with test results are those who have severe symptoms, so the selection criterion for the sample is correlated with the outcome being predicted.

      • Brent Hutto says:

        Which is one facet of the general problem. We continue to know next to nothing about who actually has this virus at any point in time.

        I can’t recall any precedent in my lifetime of so many scientific-sounding pronouncements being issued based on so few facts. Not knowing what we’re dealing with can’t be allowed to stand in the way of turning the entire world upside down based on fear, supposition and outright bullshit.

        • Kaiser says:

          Yes, it’s tragic that we are not given the data to make better decisions. Watching Asian news shows a different approach: they report on the precise building or bar or club where clusters are being investigated, they report what percent of the building’s residents has returned samples to be tested and traced, etc.

          I see the role of science, especially statistics, to be even more important when the data are scarce or low-quality. People in our field are trained to evaluate data sources, and to quantify the uncertainty of results.

          There are several precedents for “scientific-sounding pronouncements being issued based on so few facts”. My favorite is economic forecasting, or stock market predictions (when have their experts ever gotten anything right?). Another area rife with poorly supported conclusions is sports. Yet another is nutritional science. In fact, Andrew’s blog is a running compilation of examples from many, many fields.

          • +1 to those examples, the stock market in particular is irritating, they don’t even consistently report *percentage changes* instead of *point changes*. And of course, daily changes are mostly noise. So they put noise on a pedestal…

      • I don’t think that continues to be the case exactly. At least here in CA they are offering tests to anyone who wants them. The key is that if you don’t have symptoms, then unless you were told that you were exposed to someone with COVID, you probably don’t want a test. So while the early test results were entirely people with severe symptoms, currently you might say test results are people with some kind of symptoms, people who were in contact with known COVID patients, and people who are very cautious for various reasons.

        We still have very little in the way of systematic monitoring, but I did hear that several universities in the SoCal area were trying to design sentinel testing for their campuses… so there’s that.

        • confused says:

          Yeah, if you have no reason to think you have it, why get tested?

          And there aren’t any outpatient treatments that are specific for COVID (even remdesivir is IV and needs a hospital setting – so far) so if your symptoms are mild I don’t think there is any *personal* benefit from getting the right diagnosis (obviously if you are worried about spreading it to at-risk loved ones, that’s very different … but if you live alone…)

          I think it is likely also an underestimated factor that many Americans (anecdotally, mostly relatively young men) really don’t like to go to the doctor unless there is a really obvious need. (And not just because of healthcare costs, but because of “if it ain’t broken don’t fix it” mentality, an attitude of self-reliant toughness, etc.)

  6. Noah Haskell says:

    Nicholas Christakis (who seems like a pretty interesting and trustworthy person to follow on Twitter) posted recently about an app that is in the ballpark of this kind of thing:

    I say “in the ballpark”, because it’s (a) focused on respiratory disease in general, not just Covid-19, and (b) not entirely clear what it does under the hood. They’ve got some kind of ML model, so it’s probably not just a straightforward Bayesian estimate, but it seems like maybe worth checking out.
