In light of the recent article about drug-target research and replication (Andrew blogged it here) and l’affaire Potti, I have mentioned the “Forensic Bioinformatics” paper (Baggerly & Coombes 2009) to several colleagues in passing this week. I have concluded that it has not gotten the attention it deserves, though it has been discussed on this blog before too.
The authors try to reproduce published data, and end up “reverse engineering” what the original authors had to have done. Some examples:
- §2.2: “Training data sensitive/resistant labels are reversed.”
- §2.4: “Only 84/122 test samples are distinct; some samples are labeled both sensitive and resistant.”
- §2.7: Almost half of the data is incorrectly labeled resistant.
- §3.2: “This offset involves a single row shift: for example, … [data from] row 98 were used instead of those from row 97.”
- §5.4: “Poor documentation led a report on drug A to include a heatmap for drug B and a gene list for drug C. These results are based on simple visual inspection and counting, and are not documented further.”
- There are also gross statistical errors.
Continuing in the usual theme of my occasional posts, I’ll share what reproducible research means for me in practice.
- I use Sweave for almost everything. Here is my template. Here is my xetex template if you care about typography.
- For objects that take a long time to compute, I eventually save them to disk, set the chunk's evaluation to false, and load the saved object immediately below; crucially, the code that generated them is still right there. Here's what I mean:
<<computation1, eval=TRUE, echo=FALSE>>=
## long MCMC run or whatever here
object.1 <- thing_that_takes_forever()
save(object.1, file="computation1.Rdata")
@

<<process-comp1, eval=TRUE, echo=FALSE>>=
load("computation1.Rdata")
@
So once I was satisfied that computation1 produced the object.1 of my dreams, I could just flip eval=FALSE on the first code chunk and save myself the hassle of re-running it. And if I find a problem with it later on, I know exactly where to go to fix it.
- It is generally not painful to keep all pre-processing (data loading, joining, and recoding) in the first code chunk. This prevents you from ending up with a stylized data file that you no longer know how you produced, because you actually rebuild it from scratch every time. It sometimes makes sense to separate this step out into a file that you source().
- For presentations or other destinations, I can just copy paper.Rnw to pres-gfx.Rnw and make any necessary changes (to the size in the code chunk arguments, for example, to make Beamer-friendly images). Running R CMD Sweave pres-gfx.Rnw ensures that my figure names are consistent ("pres-gfx-codechunkname.pdf") and that I don't do something completely different or accidentally use the wrong model on the wrong graphic.
- I could, if I had truly mastered ediff, easily merge any changes I made for the presentation back into the paper.
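A minimal sketch of what that first pre-processing chunk might look like; the data frames and column names here are hypothetical stand-ins for real read.csv() calls:

```r
## Hypothetical stand-ins for read.csv("raw-data.csv") and the like
raw    <- data.frame(sample.id = 1:4, status = c("sens", "res", "sens", "res"))
labels <- data.frame(sample.id = 1:4, drug   = c("A", "A", "B", "B"))

## join, then recode -- redone from scratch on every run
dat <- merge(raw, labels, by = "sample.id")
dat$status <- factor(dat$status, levels = c("sens", "res"),
                     labels = c("sensitive", "resistant"))
```

Moving these lines into a prep.R file and calling source("prep.R") from the chunk gives the same from-scratch guarantee.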
Yeah, but remember that error we got the other day that we were unable to reproduce?
Yeah, it was stupid jags start values. Oops.
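For what it's worth, the standard fix for that class of bug is to seed the RNG (and record explicit start values) so the sampler's initial state is reproducible. A minimal base-R sketch; for JAGS specifically you would pass explicit inits to the model rather than letting it draw its own:

```r
## Seeding makes "random" start values reproducible across runs.
set.seed(20101217)
inits.a <- rnorm(3)  # start values, first run

set.seed(20101217)
inits.b <- rnorm(3)  # same seed, second run: identical start values

identical(inits.a, inits.b)  # TRUE
```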
My solution to the big computations is to use
file <- "computation1.RData"
if (file.exists(file)) {
  load(file)
} else {
  object.1 <- thing_that_takes_forever()  # big computation here
  save(object.1, file = file)
}
(I hope that’s readable.) If I want to re-run things, I just delete the computation1.RData file.
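That pattern can also be wrapped in a small helper; a sketch (the name cached() is mine, and it uses saveRDS/readRDS instead of save/load so the value is returned directly rather than loaded by name):

```r
## Load a cached result if the file exists; otherwise compute, cache, and return it.
## R's lazy evaluation means `expr` only runs when the cache file is missing.
cached <- function(file, expr) {
  if (file.exists(file)) {
    readRDS(file)
  } else {
    value <- expr
    saveRDS(value, file)
    value
  }
}
```

Usage would be something like object.1 <- cached("computation1.rds", thing_that_takes_forever()); deleting computation1.rds forces a re-run.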
Very nice – thanks!
I also enjoyed your 10 worst graphs page. One recommendation – on the “resources page,” you can link directly to the full text pdf of Andrew’s tables-to-graphs paper rather than the behind-a-paywall JSTOR: http://www.stat.columbia.edu/~gelman/research/published/dodhia.pdf
There’s also cacheSweave (and pgfSweave, which uses cacheSweave), both of which implement a caching mechanism for Sweave documents.