
Censoring on one end, “outliers” on the other, what can we do with the middle?

This post was written by Phil.

A medical company is testing a cancer drug. They get 16 genetically identical (or nearly identical) rats that all have the same kind of tumor, give 8 of them the drug and leave 8 untreated…or maybe they give them a placebo, I don’t know; is there a placebo effect in rats? Anyway, after a while the rats are killed and examined. If the tumors in the treated rats are smaller than the tumors in the untreated rats, then all of the rats have their blood tested for dozens of different proteins that are known to be associated with tumor growth or suppression. If there is a “significant” difference in one of the protein levels, then the working assumption is that the drug increases or decreases levels of that protein and that may be the mechanism by which the drug affects cancer. All of the above is done on many different cancer types and possibly several different types of rats. It’s just the initial screening: if things look promising, many more tests and different tests are done, potentially culminating (years later) in human tests.

So the initial task is to determine, from 8 control and 8 treated rats, which proteins look different. There are some complications: (1) the data are left-censored, i.e. below some level a protein is simply reported as “low”; (2) even above the censoring threshold the data are very uncertain (50% or 30% uncertainty for concentrations up to maybe double the censoring threshold); (3) some proteins are reported only in discrete levels (e.g. measurements might be 3.7 or 7.4, but never in between); (4) sometimes instrument problems, chemistry problems, or abnormalities in one or more rats lead to very high measurements of one or more proteins.

For instance:
(“low” means < 0.10):

Protein A, cases: 0.31, 0.14, low, 0.24, low, low, 0.14, low
Protein A, controls: low, low, low, low, 0.24, low, low, low
Protein B, cases: 160, 122, 99, 145, 377, 133, 123, 140
Protein B, controls: 94, 107, 139, 135, 152, 120, 111, 118

Note the very high value of Protein B in case rat 5. The drug company would not want to flag Protein B as being affected by their drug just because they got that one big number.

Finally, the question: what's a good algorithm to recognize if the cases tend to have higher levels of a given protein than the controls? A few possibilities that come to mind:

(1) Generate bootstrap samples from the cases and from the controls, and see how often the medians differ by more than the observed medians do; if it's a small fraction of the time, then the observed difference is "statistically significant."
(2) Use the Mann-Whitney "U-test".
(3) Discard outliers, then use censored maximum likelihood (or similar) on the rest of the data, thus generating a mean (or geometric mean) and uncertainty for the cases and for the controls.

Which of those is the best approach, and if the answer is "none of them" then what do you recommend?
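As a rough illustration of option (2), here is a minimal sketch in plain Python. It computes the Mann-Whitney U statistic for the two example proteins and gets a one-sided p-value by shuffling the case/control labels rather than from a table. The names `LOW`, `u_statistic`, and `permutation_p_value` are made up for this example, and coding every "low" as one shared value below all detected values (so that low-vs-low comparisons count as ties) is an assumption, not something the data dictate.

```python
import random

# Assumption: every "low" is coded as one shared value below the 0.10
# detection limit, so "low" vs "low" is a tie and "low" loses to any
# detected value. Only the ordering matters, not the number itself.
LOW = 0.0

protein_a_cases = [0.31, 0.14, LOW, 0.24, LOW, LOW, 0.14, LOW]
protein_a_controls = [LOW, LOW, LOW, LOW, 0.24, LOW, LOW, LOW]
protein_b_cases = [160, 122, 99, 145, 377, 133, 123, 140]
protein_b_controls = [94, 107, 139, 135, 152, 120, 111, 118]


def u_statistic(cases, controls):
    """Count case-vs-control pairs won by the case; a tie counts as half."""
    u = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def permutation_p_value(cases, controls, n_shuffles=10000, seed=1):
    """One-sided p-value: share of label shuffles with U >= the observed U."""
    rng = random.Random(seed)
    observed = u_statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign the 16 values to "case"/"control" at random
        if u_statistic(pooled[:n_cases], pooled[n_cases:]) >= observed:
            hits += 1
    return observed, hits / n_shuffles


u_a, p_a = permutation_p_value(protein_a_cases, protein_a_controls)
u_b, p_b = permutation_p_value(protein_b_cases, protein_b_controls)
print("Protein A: U =", u_a, "of 64 pairs, one-sided p ~", p_a)  # U = 43.5
print("Protein B: U =", u_b, "of 64 pairs, one-sided p ~", p_b)  # U = 46.0
```

Because U depends only on the ordering of the values, the 377 in case rat 5 contributes exactly as much as a 161 would: it beats all eight controls and nothing more. That keeps one wild value from driving the result, although it does nothing about the large measurement uncertainty or the discrete reporting levels.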


  1. Maybe it would be better to spend fewer resources on testing rodents and instead move ahead, faster/sooner, with well-designed trials of new cancer drugs in humans.

  2. Phil says:

    Oops, "generate bootstrap samples from the cases and from the controls" is wrong, I meant "generate bootstrap samples from the entire group of 16 to simulate 'cases' and 'controls'."

    Elaine, of course it's possible that they should move to human testing sooner (it's also possible that they should do it later) but that will just move the problem to humans rather than rats. You'll still have the problem of judging whether the drug is working and, if so, how. But in this case the drugs are literally years from human trials: they haven't even been tested for toxicity to humans yet!

  3. Ben Bolker says:

    It would seem to depend a lot on the interpretation of the censored and 'outlier' points, i.e.: are the 'low' results likely to be driven by a qualitatively different process from the non-'low' results, or do they (hypothetically) come from the same distribution as the non-'low' results but are just below detection? Similarly, are the high 'outliers' data points that a biologist would (1) throw out, (2) treat as 'high' but not as high as they appear, (3) treat as evidence of an interesting, but again different, process?

    I actually like the M-W test in this case, even though it only gives you a p-value and not an estimate of the size of the difference (which I would usually prefer). I don't remember exactly how well M-W handles ties, so I might do a permutation test and compute the p-value from that if necessary …

  4. Ben Bolker says:

    PS I appreciate that this is very preliminary screening, but if they're going to test the rats' blood for dozens of proteins can't they get n>16? That doesn't solve the problem, but (unless the screening is very expensive) it would seem that boosting the sample size would help … power analysis??

  5. Eli Rabett says:

    You need more untreated rats to have some idea of what the natural variability of the various proteins is before you can test for differences. Ideally you choose your rats from a well-characterized population. If you can't afford this testing you are in the wrong business.

  6. Phil says:

    Answering some of the questions:
    (1) I don't know why they don't use more rats. I mean, obviously it must come down to money one way or another, but the company I'm asking about is enormous and has huge resources, so I agree it's hard to see why they don't use more rats if they need them. Maybe they think they don't need them. For one thing, all of the rats are essentially identical, both genetically and in the environments they've lived in for their whole lives, so they expect the drug to affect them all quite similarly.

    (2) Although I suppose it's possible for any of the measurements to be substantially in error, the problem with the "low" measurements is thought to be merely a matter of the detection limit. The concentrations of those proteins really are low; they just don't know how low.

    (3) The high outliers are different, they're not related to a detection limit. (a) Most of them are thought to be simply incorrect measurements — there are a few things that can go wrong with testing relatively small amounts of blood for relatively rare proteins, I think. (b) In some rare cases they may be rats that are very abnormal, e.g. the cancer (rather than the drug) can cause something to go haywire. In the (a) cases, you'd want to discard the measurements for sure, they have no bearing on anything you're interested in. In the (b) cases the situation is a bit less clear, but probably the best approach is again to simply discard them…if you can recognize them. It gets a bit dicey, though, since if an error can cause a _very_ high (incorrect) measurement that is easy to recognize then presumably it can also cause a moderately high (but still incorrect) measurement that looks normal or almost normal.

    Ben, I, too, am not sure about handling ties in the M-W test, nor about what to do with large and systematically varying measurement errors, nor about discrete rather than continuous data, nor about censoring. I'd be inclined to ignore the measurement errors, do repeated simulations that resolve ties at random, and treat "low vs low" censoring issues as ties (or, maybe, somewhat favor the "low" value that comes from the distribution with the higher median). All of this seems a bit arbitrary and ugly, though.

  7. bjs12 says:

    I'm not sure about the M-W test in this context: M-W (if I recall correctly) does not do well with lots of ties (since it is supposed to only be used for continuous data).

    Why not just do a t-test?

    You can replace "low" with the LOD. If that makes the SD artificially too low, you can replace it with the SD of the non-low values (or throw a couple of zeroes in to bring the SD back up). You can do a sensitivity where you do the t-test with the potential high outliers in and compare them to a t-test where they are left out. Typically, the t-test is quite robust to wacky things — even with n=16.

    Sure, these kinds of adjustments to a boring test are definitely going to fall in the 'ugly' category, as you point out. But how sexy does the analysis need to be for a screening experiment? Surely, when they have some candidate proteins, they'll move on to more rats, better assays, and fancier statistics.

    I don't think an algorithm exists that would convince me that cases are different from controls for either protein A or B. And the t-test is adequate for convincing me that there isn't enough evidence to conclude they are different.

  8. If you want probabilistic results (probabilities over outcomes, with and without the drug), don't you have to model the outliers with, say, a mixture model? Each component of the mixture would have means and variances (or maybe even more properties) with and without the drug, and so on…? Clipping out the outliers prevents you from inferring the probability distribution over outcomes, no?

  9. Phil says:

    David: Ideally I think one would model the process that generates outliers, yes. But note that they don't think the outliers are affected by the drug at all, they think they're just lab mistakes or something. If most of these mistakes are glaringly wrong, they can handle this problem simply by discarding them, sort of like weighing groups of 8 adults using a digital scale that, 5% of the time, generates a 100s digit at random: you can probably do pretty well by just throwing out any measurement over 350 pounds.
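The scale analogy is easy to try out. The sketch below is only a toy: the weight distribution (roughly normal, mean 170 pounds, SD 30, clamped to 90-299), the sample size, and the name `simulate_weighings` are all assumptions made for illustration. Only the 5% glitch rate, the random hundreds digit, and the 350-pound cutoff come from the analogy. It simulates many weighings rather than groups of 8 so that the averages settle down.

```python
import random
from statistics import mean


def simulate_weighings(n_people=8000, glitch_rate=0.05, cutoff=350, seed=2):
    """Simulate a scale that sometimes replaces the hundreds digit at random."""
    rng = random.Random(seed)
    true_weights, readings = [], []
    for _ in range(n_people):
        # Assumed adult weights: roughly normal, clamped to 90..299 pounds.
        weight = min(max(int(rng.gauss(170, 30)), 90), 299)
        reading = weight
        if rng.random() < glitch_rate:
            # Glitch: keep the last two digits, draw the hundreds digit from 0-9.
            reading = rng.randrange(10) * 100 + weight % 100
        true_weights.append(weight)
        readings.append(reading)
    # The proposed rule: throw out anything over the cutoff.
    kept = [r for r in readings if r <= cutoff]
    return true_weights, readings, kept


true_weights, readings, kept = simulate_weighings()
print("mean of true weights:      ", round(mean(true_weights), 1))
print("mean of all readings:      ", round(mean(readings), 1))
print("mean after discarding >350:", round(mean(kept), 1))
print("readings discarded:        ", len(readings) - len(kept))
```

The cutoff removes the glaring errors, which do most of the damage to the average. Glitches that happen to draw a hundreds digit of 0, 1, or 2 (and some that draw a 3) produce readings that stay under 350 and are kept, which is the "moderately high but still incorrect" problem raised earlier in the thread.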