What is/are bad data?

This post is by Lizzie; I also took the picture of the cats.

I was talking to a colleague about a recent paper, which has some issues, but I was a bit surprised by her response that one of the real issues was that it ‘just uses bad data.’ I snapped back reflexively, ‘it’s not bad data, it just needs a better analysis.’

But it got me wondering, what is bad data?

I think ‘bad data’ is fake or falsified data that you don’t know is fake/falsified. Fake data is great (when you know it’s fake), and I think real data is a fine thing. I don’t know that data can be good or bad, can it? It can be more or less accurate or precise. I can think of things that make it higher or lower quality, but I don’t think we should be assigning data as ‘bad’ or ‘good,’ and labeling it as such suggests to me a misguided relationship to data (which I do think we have in ecology).

The paper uses data from the Living Planet Index (LPI):

…is a measure of the state of the world’s biological diversity based on population trends of vertebrate species from terrestrial, freshwater and marine habitats. The LPI has been adopted by the Convention on Biological Diversity (CBD) as an indicator of progress towards its 2011-2020 target to ‘take effective and urgent action to halt the loss of biodiversity’.

These data have been used a lot to show wild species populations are declining. The new paper purports to show that most wild populations are not declining. The authors use a mixture model to find three parts to their mixture (I don’t know mixture models, so I encourage anyone who does to take a look and correct me, or just comment generally on the approach, but as best I can tell they looked a priori for three parts) and then took out the ‘extreme’ declining part and found the rest don’t decline. Okay. Then they wrote a press release saying most populations aren’t declining and we should all be hopeful. Yay.
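For readers who, like me, are hazy on mixture models, here is a minimal toy sketch of the general idea as I understand it (simulated growth rates and a hand-rolled EM fit; this is emphatically not the paper’s actual data or code). It fits a three-part mixture and then drops the component with the most extreme declines:

```python
# Toy illustration (NOT the paper's analysis): fit a 3-component 1-D
# Gaussian mixture to simulated population growth rates via EM, then
# drop the component with the lowest mean ("extreme decliners").
import numpy as np

def fit_gmm_1d(x, k=3, iters=200):
    """Minimal EM for a one-dimensional Gaussian mixture."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))  # spread-out initial means
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: reweighted component weights, means, and sds
        n = r.sum(axis=0)
        pi, mu = n / len(x), (r * x[:, None]).sum(axis=0) / n
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n)
    return pi, mu, sigma, r

rng = np.random.default_rng(42)
rates = np.concatenate([
    rng.normal(0.00, 0.02, 800),   # mostly stable populations
    rng.normal(-0.15, 0.05, 100),  # extreme decliners
    rng.normal(0.10, 0.04, 100),   # strong increasers
])
pi, mu, sigma, resp = fit_gmm_1d(rates, k=3)
labels = resp.argmax(axis=1)
rest = rates[labels != mu.argmin()]   # drop the "extreme decline" component
print("mean trend, all populations:   ", round(rates.mean(), 3))
print("mean trend, decliners removed: ", round(rest.mean(), 3))
```

On data simulated this way, the overall mean trend comes out slightly negative while the mean with the ‘decliner’ component removed comes out roughly positive, which is the flavor of the result described above.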

But the LPI data are best described (by Jonathan Davies) as polling data. We can’t measure most wild species populations and we tend to measure ones from well monitored areas, which are often less biodiverse (I could start saying which animal populations are like which human groups that answer polls a lot, but I will resist). The LPI knows this and they generally do a weighting scheme to try to correct for it (which isn’t great either), but this paper doesn’t seem to try to do much to correct the data. One author wrote a blog about it, noting:

We also note that these declines are more likely in regions that have a larger number of species. This is why the Living Planet Index uses a weighting system, otherwise it would be heavily weighted towards well-monitored locations.

I wish these researchers and others in biodiversity science would use some of the thoughtful stratification and other approaches used on polling data to try to give us a useful estimate of the state of global biodiversity, instead of press-release-friendly estimates. If anyone needs a project, the data are public!
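To make the polling analogy concrete, here is a toy poststratification sketch with invented numbers (this is not the LPI’s actual weighting scheme). The monitored populations are concentrated in a less biodiverse region, so the naive average is dominated by that region, while weighting by each region’s share of species richness gives a very different global trend:

```python
# Toy poststratification (invented numbers, not the LPI's real scheme).
# Each region: (mean observed trend, number of monitored populations,
#               species richness used as the stratification weight).
regions = {
    "temperate": (+0.01, 900, 2000),  # heavily monitored, fewer species
    "tropical":  (-0.08, 100, 8000),  # sparsely monitored, species-rich
}

# Naive estimate: every monitored population counts equally, so the
# well-monitored region dominates.
naive = (sum(t * n for t, n, _ in regions.values())
         / sum(n for _, n, _ in regions.values()))

# Poststratified estimate: weight each region's trend by its share of
# total species richness instead of its share of monitored populations.
total_richness = sum(r for _, _, r in regions.values())
weighted = sum(t * r / total_richness for t, _, r in regions.values())

print(f"naive trend:          {naive:+.3f}")    # dominated by temperate
print(f"poststratified trend: {weighted:+.3f}")  # dominated by tropical
```

With these made-up numbers the naive trend is slightly positive while the richness-weighted trend is clearly negative; the point is only that the weighting choice, not just the raw data, drives the headline.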


  1. Brent Hutto says:

    To me “bad data” means data which are not a useful representation of the thing they purport to describe. Now that can certainly be falsified or fiddled by someone seeking to deceive. But the more common (in my experience) ways for bad data to arise are through a) hopelessly vague measures and/or b) flawed collection or data management.

    I was semi-joking but I remember one of the first projects I worked on after grad school had some obviously off-target measures being “adapted” for use far from their original purposes. I told my colleague that if an analysis showed a large intervention effect using those measures, then we’d better check the data. Because the measures were so off-target any real effect would probably be obscured.

    The non-joking part came in when we did, in fact, find a huge intervention effect in one analysis. So we went back to the provider of the data and they discovered a rather major coding issue on their part before sending the data to us. The corrected data showed a whole lot of nothin’ much, as I’d expected all along.

    • Joachim says:

      My clunky criterion for good data is that the results described in published papers should not be extreme outliers relative to the population of outcomes of the empirical procedures described.

      It needs some work.

    • gec says:

      > “bad data” means data which are not a useful representation of the thing they purport to describe

      I agree with this characterization. I just want to add that data can fall into this category by virtue of coming from a poorly-designed experiment. For example, a lot of the priming studies in the “embodied cognition” literature are bad data, not because they are fake or low-precision or difficult to analyze properly. They’re bad because they don’t give you what you need to actually understand the relationship between physical contexts and conceptual representations. We already have psychophysics for studying those kinds of relationships, though the results are usually less marketable.

      And it’s especially sad because when you have a mismatch between experimental manipulations and constructs of interest, there’s no “saving” the data through better analyses or replications. In observational settings, almost any data is better than no data, so it’s hard to say that those data are “bad”. But in experimental settings, the data can only ever be as good as the original design, and if the original is inherently uninformative, there’s nothing to be done.

    • Ian Fellows says:

      Well said. When a data generating process doesn’t have a meaningful connection to the reality we wish to study, the data that arises from it is “bad.” Bad is the terminal point on the spectrum of data quality where the only ethical thing an analyst can do is throw it out.

  2. rm bloom says:

    Say I think I’m taking readings every hour of the temperature in my refrigerator, via a thermo-couple and a wire that leads to a sampler and A/D and then to my computer, which logs the readings. Curious-george has taken the thermo-couple out of the refrigerator and stuck it into a banana. Maybe the wire’s been cut. Maybe the A/D isn’t working. Maybe… Well I collect the data, fit a model to prove that my refrigerator was running; and publish in the IEEE transactions on Refrigerators and Machine Learning. Then the phone rings, “Hello, is this Bloom?” “Yes?” “This is IEEE. Is your refrigerator running?” “Well, isn’t that what my latest results have shown?” “Well, then, you’d better go catch it!”

    • Lizzie says:

      It seems like in these examples the data are only ‘bad’ because you are missing information on the generating process. If you had additional information, such as how the thermometer was working or something that predicted ‘off-target measures’, would the data still be ‘bad’? (These seem like potentially fine data, but with missing information, so the model ends up wrong.)

      • Robert says:

        To me “data” does not usually mean “a set of measured numbers, abstracted of context”. Data is a collection of numbers/values *with a claim about which real-world referents they refer to*. If that relationship is wrong then I’m totally comfortable saying the data is bad. There is no way to get anything good out without throwing the data away, because if the correspondence to reality was wrongly stated you probably have no way of figuring out what the right correspondence actually was. In this case the argument seems something like “the data is good, because we can now actually understand which thing it was a high quality measurement of”. I can accept that but it seems like a very artificial form of rescue that we don’t usually get. Information typically degrades, and usually data only gets worse as the moment of measurement recedes and knowledge of the measurement process drifts away.

        I realize that I am arguing that you now have “different data” after you know the truth even though the numbers are all the same, because the claim of correspondence has changed. I’m fine with this. If I stumble across a contextless spreadsheet solely consisting of raw numbers called Document1.xls then I would be comfortable calling it bad data even if I have no idea how it was generated. Maybe this sounds like a strange phrasing to others.

        • jim says:

          “There is no way to get anything good…if the correspondence to reality was wrongly stated you probably have no way of figuring out what the right correspondence actually was. ”


      • rm bloom says:

        Here’s another model for “bad data”. Call it the story of the snipe-hunt.

        When we were kids in summer-camp, and they’d run out of stories to tell, or ways otherwise to distract us, they’d say we’re going out tomorrow to hunt for “Snipes” so better get to bed early and so on. By the next day something else’d always come up and the “Snipe Hunt” was put down for another day. No one among us kids was either brave or tactless enough to say to those big guys, “C’mon there’s no such thing as a ‘snipe’!” Because suppose there was! After all those counselors talked about a lot of things we knew nothing about; what if they were right? And after all, weren’t we really there just for that reason: to be inducted into some of the mysteries of the wider world, which would’ve been out of reach back home in our comfortable, cozy, well-regulated but a bit dull routines? So we’d suspend our reflexive 8-year-old’s skepticism and say “Ok, let’s go on that ‘Snipe Hunt’ then; let’s go catch some Snipes, whatever the heck they are!” But the counselors could always say, “Naaah! We’ll hunt for Snipes another day; tonight we’re gonna have a BBQ down at the beach! How ’bout that!?” “Really! A BBQ at the beach!? Wild!” And that always settled the matter. Since you could always sink your teeth into a hamburger between a bun; but a Snipe?

        What’s the point of the story? What’d happen if you got a bunch of investigators to go out looking for “Snipes”? It’s agreed upon. But the secret flaw in the whole escapade is that no one can agree in the first place just what a proper “Snipe” would be. Too delicate a matter to bring up once all the crew’s assembled on the boat: the bales of rope, the recording devices, the bathyscaphe, the captain’s table decked out with charts of the Marianas trench, the bottles of port, and the rack-of-lamb. A “snipe”? Or is it a “Shnipe”? Not too sure, ‘mate.

        But we’re going to find ourselves some of them, aren’t we? Each’ll go out and at the end each’ll catch his own ‘snipe’ and then we’ll compare ‘snipes’ and surely there’ll be consensus at last: the consensual snipe!

  3. John Williams says:

    “Good” data can be bad if they reflect only part of the story, and distract attention from the other part; here is an example from Sacramento River salmon biology. There is a lot of variation in Chinook salmon juvenile life history patterns; some head for the ocean as fry, directly after they emerge from the gravel, and others rear for some time in fresh water before they do so. The US Fish and Wildlife Service monitors out-migrating Chinook with nets that are effective for larger “smolt” migrants, but not for smaller fry migrants. This is because the monitoring program began as an effort to quantify the survival of hatchery fish, which are released at smolt size, but over time attention shifted toward the naturally reproducing part of the run. The result is that people tend to forget about the fry migrants, which people don’t know how to sample effectively, even though there is good evidence that a substantial fraction of returning adults migrated as fry. As a consequence, new water projects have been assessed without consideration of their effects on the fry migrants.

  4. Jake says:

    I think if one starts working in NLP it’s super easy to get “bad data”. Things in the wrong language, things in the wrong encoding (whether they break a pipeline or not), spam, html escape codes (if not actual tags), OCR errors, the list goes on.

    Recent project had me dealing with data coming off a “device” with plenty of ways for data to be “bad”, such as “human error leading to corrupted or no metadata” or “machine overheated and sensors decalibrated” or “a cable got loose and 90% of the readings are 0”.

    • Jake says:

      For another example: you get an image corpus but they’re 95% .png, 4% .jpg, 1% .tif, with the occasional .bmp or zero-length file (and nothing has an extension). Or the files are supposed to be RGB but there’s a bunch of grayscale ones. Or the compression got corrupted and only the top 25% of an image is present. Or they’ve gone through different color correction pipelines and have very different brightness / contrast. And so on.

      And sure, Lizzie’s right in that for most of these it’s just “missing information on the generating process” and the fault’s not in our stars, etc, but there’s only so much that one can be charitable in those circumstances.

  5. Jonathan (another one) says:

    Bad data —

    “Econometric theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he uses shredded wheat; and he substitutes green garment dye for curry, ping-pong balls for turtle’s eggs, and, for Chalifougnac vintage 1883, a can of turpentine.” (Econometrics, Valavanis, 1959, p. 83)

    • Dale Lehman says:

      But, how does it taste? Seriously, isn’t that a relevant step? You can have bad data and perhaps you can have a good model – one that recognizes the imperfections in the data and accounts for them in a reasonable way. So, the taste may be ok, despite the bad ingredients. I suppose I am suggesting that you can’t talk about data being bad or good without simultaneously addressing the models you are using to analyze that data.

      • Jonathan (another one) says:


        If it’s tasty when you’re done, you’ve accomplished something, just not what you set out to do in the first place. (Unless of course your goal was to produce “anything tasty” because the tenure committee had a reservation at 8.)

        On the other hand, there are many, many recipes that, after substitutions, are actually tasty, but poison as well. And the goal was to be both tasty AND nutritious, right? Tasty and poisonous is actually worse than just foul-tasting.

  6. Michael Nelson says:

    Not to be *too* pedantic, but if you can imagine how the data you have could be better, then you can certainly imagine how it could be worse.

  7. Kyle C says:

    There is no Lizzie in this blog’s author list. Author, author!

  8. BR says:

    It’s a nice perspective shift, like “There’s no such thing as bad weather, just bad clothes.”

  9. Sean Mackinnon says:

    Well, Lynn’s National IQ dataset is often regarded as “bad” but I suppose semantically you could just say “extremely low quality” to avoid a value judgment. Hard to avoid a value judgment on that one though.

  10. Mark Pawelek says:

    Environmental science has been dominated by modeling for over 50 years. If one wants to make a big impact, write a model. IMHO, bad models have all the defects of bad statistics, but models give more leeway for making it up because they literally are made up.

  11. David Marcus says:

    I wasted three years of my life trying to analyze gradiometer data. The gradiometer was usually installed on subs, but the navy had one on a test surface ship. There was interest in what the gravity was in a certain area, so they sent the surface ship out there to collect data. I wasn’t involved in the planning, but the data was given to me (and a couple of colleagues) to analyze. It was clear from the beginning that the data was very noisy (you could actually see a storm they ran into in the data, just by graphing the data). We spent three years trying to model the noise to extract the signal (we didn’t use the data during the storm). At one point, we thought we had succeeded. But, on further analysis, I retracted our conclusions. I eventually gave up. Some time later, we ran into the people who built the gradiometer at a conference. We told them we had not been able to analyze the data from the gradiometer on the surface ship. They said that a surface ship bounces around a lot, while a sub is very smooth. So, the gradiometer won’t work on the surface ship (the gradiometer contained a rotating platform, which was being bounced around). It was installed there just to test that the various connections work, but they never expected it to produce useful data there.

    This was bad data.

  12. Nice topic.

    There is at least one relevant theory that supports many of the comments here, e.g. hypoiconicity, semiosis and Peirce’s immediate object (see Tony Jappy), like this one by Ian: “data generating process doesn’t have a meaningful connection to the reality we wish to study”.

    The simple diagram of the theory is Od → Oi → S → Ii → Id → If, all are briefly outlined below but we only need the Od and Oi here.

    Where Od (dynamic object) is the reality generating the potentially observable data.

    Oi (immediate object) is the specific form/nature of the data produced by the dynamic object [and made potentially available to us]

    Now, the Od is the connection to the reality we wish to study and the Oi provide hints to that. If those hints _would_ connect us adequately to the Od, the data Oi _would_ be good, but otherwise bad. For instance, if the data was incorrectly entered, if we could get correct entries that _would_ be good. Otherwise bad.

    Or as a former director used to say, you can’t make chicken salad out of chicken s**t!

    The other items are outlined below:

    S (sign) would be the medium in which the data are ‘embodied’ [recorded, made and kept available for analysis; includes meta-data] (paper, newspapers, digital media such as TV, computer screens, etc.).

    Ii (immediate interpretant) might be [what would be repeatedly observed given] the probability model that [we chose as] in an idealized sense we believe could have generated the data we observed (respecting causality, context and logic) [and gets us beyond just the recorded data as that is in the “dead past” whereas statistic _science_ is directed to the future]

    The Ii is thus what we make of the preliminary analysis, the Id (dynamic interpretant) is what we make of it after various critical assessments, and the If (final interpretant) is what we take of it when we have exhausted all critical evaluations that make sense. This would of course be repeatedly cycled through, with the hope that, with persistence, the If would adequately represent the Od.

  13. Ron Kenett says:

    Great blog. What we should care about is information quality.

    Missing data can be considered “bad” but in some cases it is actually good. In a study we did based on customer surveys, we found that customers who did not respond to the “what is your income level” question were more truthful on other questions, so responses with such missing values were more useful, having lower bias.

    The point is that good or bad is relative to the study goal. Same data, different goals and you get a different perspective.

    We deal with this in The Real Work of Data Science and in Information Quality: The Potential of Data and Analytics to Generate Knowledge.

    • > Missing data can be considered “bad” but in some cases it is actually good.
      Yup, reminds me of one of my first clinical prediction projects – the most predictive sign of a bad outcome was the 24-hour urine collection assessment being missing. Patients who did not have this were likely too sick to complete it. But standard practice at the time was to exclude patients who had missing values, or maybe use single imputation.
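      A toy numerical sketch of that phenomenon (all numbers invented): missingness itself carries information about the outcome, so complete-case analysis quietly discards the highest-risk patients:

      ```python
      # Invented-numbers sketch: sicker patients are more likely to be
      # missing the lab test, so "test is missing" predicts a bad outcome.
      import random

      random.seed(1)
      patients = []
      for _ in range(1000):
          sick = random.random() < 0.3
          # Sicker patients more often fail to complete the collection.
          missing = random.random() < (0.6 if sick else 0.1)
          value = None if missing else random.gauss(1.0 if sick else 0.5, 0.2)
          patients.append((value, sick))

      missing_group = [s for v, s in patients if v is None]
      present_group = [s for v, s in patients if v is not None]
      rate_missing = sum(missing_group) / len(missing_group)
      rate_present = sum(present_group) / len(present_group)
      print(f"bad-outcome rate when test missing: {rate_missing:.2f}")
      print(f"bad-outcome rate when test present: {rate_present:.2f}")
      ```

      Complete-case analysis would keep only the second group and throw away the strongest signal; one simple alternative is to include a “was missing” indicator as a predictor.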

      • Dale Lehman says:

        And that reminds me of my submission to the NEJM competition on open data, which I did not win (and am still sore about it). The competition released the data from the SPRINT blood pressure study (a famous trial that looked at the impacts of more aggressive treatment of high blood pressure, in terms of decreased heart attack risk and severe side effects of the increased medication). My discovery was that trial participants in the treatment group who missed any of their scheduled visits (particularly the early ones, according to the protocol) had outcomes as if they had not received the more aggressive treatment at all – actually somewhat worse: no better survival but more side effects. On the other hand, the participants who made all the scheduled visits had more beneficial impacts than reported in the original study. I was the only person in this international competition who picked up on the missed-visit impacts. Yes, I’m still sore about it – particularly since the winner made an app that permitted a clinician to enter individual data and receive a recommendation (binary, at that!) of whether or not the treatment would be beneficial.

      • Ron Kenett says:

        Keith – great examples of informative missing data. The point is that, in your example, “cleaning” the data would reduce information quality. I believe this occurs quite often :-)

  14. Tom Redman says:

    I had the good fortune of leading the data quality lab at Bell Labs in late 80s and early 90s and have been immersed in the subject since. Here are some ideas that have stood me in good stead:

    First, a definition: “Data is of high quality if it is ‘fit-for-use’ by customers (those using it) in operations, data science, decision-making and planning.”

    Second, I find three broad categories of criteria:
    1. Data values must be correct, or at least correct enough.
    2. The data must be relevant and appropriate to the task at hand. Depending on the use, there can be a lot here: comprehensiveness, freedom from bias, predictive ability, etc. I include in this category some technical criteria like identifiability, which means that data must uniquely tie to its real-world entities, and clear definition, which means there must be adequate metadata such that the customer can understand it.
    3. The data must be presented in a way that the customer can understand.

    Third, there are no absolutes. The “by customers” clause means that data can be fit-for-use by one person in one application and not fit-for-use in another.

  15. KL says:

    Here is another example of bad data that you might discuss.

    “Mortality rate for Black babies is cut dramatically when they’re delivered by Black doctors, researchers say”

    Around 4:10, Vinay Prasad explains that the doctor chosen to sign the physician record after a baby dies might be a medical director, not the physician who delivered care to the mother or who actually delivered the baby.
