Cheating at Research


My news feed this morning contained this article about an unpleasant local situation that has caused one person I know to lose her job (not because she was involved with the malfeasance, but as fallout from this lab shutting down). On the positive side (I’m going from the article and the investigating panel’s report here — I have no inside information), it sounds like the panel came to the appropriate conclusions.

But my sense is that most research cheating is not nearly so overt. Rather, bogus results come out of a mixture of experimental ineptitude and unconscious squashing of results that don’t conform to expectations. A guide to cheating in a recent issue of Wired contains a nice little summary of how this works:

Create options: Let’s say you want to prove that listening to dubstep boosts IQ (aka the Skrillex effect). The key is to avoid predefining exactly what the study measures — then bury the failed attempts. So use two different IQ tests; if only one shows a pattern, toss the other.

Expand the pool: Test 20 dubstep subjects and 20 control subjects. If the findings reach significance, publish. If not, run 10 more subjects in each group and give the stats another whirl. Those extra data points might randomly support the hypothesis.

Get inessential: Measure an extraneous variable like gender. If there’s no pattern in the group at large, look for one in just men or women.

Run three groups: Have some people listen for zero hours, some for one, some for 10. Now test for differences between groups A and B, B and C, and A and C. If all comparisons show significance, great. If only one does, then forget about the existence of the p-value poopers.

I do not recommend the full article; I only read it because I was trapped on a long flight.

The pattern underlying all of these ways of cheating is to run many experiments, but report the results of only a few. We only have to run 14 experiments that measure no effect before we get a better than 50% chance of finding a result that is significant at the 95% level purely by chance. My wife, a psychologist, says that she has observed students saying things like “we’ll run a few more subjects and see if the effect becomes significant” without realizing how bad this is. In computer science, we are lucky to see empirical papers that make use of statistical significance tests at all, much less make proper use of them.
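
To see where that 14 comes from, here’s a quick back-of-the-envelope check (plain Python; nothing here is specific to any particular study):

    # Chance of at least one spuriously "significant" result among n
    # independent experiments on a true null effect, each tested at the
    # 5% level: 1 - 0.95^n.
    for n in range(1, 21):
        print(f"{n:2d} experiments: {1 - 0.95 ** n:.3f}")
    # The probability first passes 0.5 at n = 14 (1 - 0.95**14 is about 0.51).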


15 responses to “Cheating at Research”

  1. If you are worried about people making stuff up (consciously or not), asking for tests of statistical significance is like asking a bankster to provide a financial report… you are going to get what you ask for, but it won’t resolve anything.

    The way science deals with these matters is through reproduction. If you always keep in mind that others must be able to reproduce your results (“X is better than Y”), then you won’t do the “maybe we should collect more data” thing… because you’ll start worrying about how easy it is for others to reproduce this process…

    My experience with psychologists is that they are very averse to publishing their data… they feel that they “own” the data exclusively (oddly enough, they are unconcerned with the subjects… who should be the real owners…). In effect, they rely on the fact that collecting the data is expensive to protect themselves from “competitors”. This has the side effect of making cheating easier.

    In computer science, we could resolve many problems if only people published their software openly. That alone would go a long way toward keeping people honest.

  2. Your characterization of it as a mixture of experimental ineptitude and unconscious squashing really gets at the heart of this. Check out http://arxiv.org/ftp/arxiv/papers/1205/1205.4251.pdf:

    “Two of the present authors, Motyl and Nosek, share interests in political ideology. We were inspired by the fast growing literature on embodiment that demonstrates […] We calculated accuracy: How close to the actual shade did participants get? The results were stunning. Moderates perceived […] (p = .01). […] The ultimate publication, Motyl and Nosek (2012) served as one of Motyl’s signature publications as he finished graduate school and entered the job market. The story is all true, except for the last sentence; we did not publish the finding. […] Surely ours was not a case to worry about. We had hypothesized it, the effect was reliable. But, we had been discussing reproducibility, and […] we conducted a direct replication while we prepared the manuscript. We ran 1,300 participants, giving us .995 power to detect an effect of the original effect size at alpha = .05. The effect vanished (p = .59)”

    More generally, Andrew Gelman’s blog is an excellent source for this kind of stuff http://andrewgelman.com/ Gelman calls this “too many researcher degrees of freedom”. It’s a pervasive problem.

  3. ‘Our immediate reaction was “why the #&@! did we do a direct replication?”‘

    I love it!

    bcs, nice! There’s always an xkcd for that, it seems.

  4. Daniel, I agree, although I’m not super optimistic about how much of the overall problem is solved by making source available. Maybe 10%.

  5. I suspect that if I made my code available, other than usable tools, the chance that anyone would replicate my results (or much more famous ones than anything I’m involved in) would still be pretty close to zero. If I made my code better documented, it would increase, but not much. If I provided (1) a publication guarantee for reproductions, no matter the results, (2) a free month in which to figure out how to replicate, and (3) a pony, it might rise to 30%. I suspect that in a field with as many results as CS, hoping for much replication is not likely to be rewarded.

    There are subareas, like “how much does code coverage tell us about test suite quality?”, where you won’t see much replication, but you will see a long series of papers with slightly or radically different methodologies and some shared subjects that could arguably be seen as converging to some kind of “meta-replication”, in the sense of having somewhat similar results with somewhat similar experiments.

  6. As far as repeatable results go, CS is an interesting case in that you can in theory turn it into nothing but a matter of compute resources: automate EVERYTHING to the point that you can check out the raw source, run “$ make graphs_and_tables”, and get figures ready for typesetting in your paper.
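
    Something like this hypothetical sketch, say (Python rather than make, and the experiment and file name are invented purely for illustration): one entry point, fixed seeds, and every number in the paper regenerated from the raw inputs.

        #!/usr/bin/env python3
        # Hypothetical "regenerate everything" driver; the experiment, file
        # name, and parameters are made up for illustration only.
        import csv, random

        def run_experiment(seed, n=1000):
            # Stand-in for the real measurement; deterministic given the seed.
            rng = random.Random(seed)
            return sum(rng.random() for _ in range(n)) / n

        def main():
            rows = [("seed", "mean")]
            for seed in range(10):  # every configuration, not a cherry-picked subset
                rows.append((seed, f"{run_experiment(seed):.4f}"))
            with open("table1.csv", "w", newline="") as f:
                csv.writer(f).writerows(rows)  # raw material for the paper's Table 1

        if __name__ == "__main__":
            main()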

    OTOH, that makes the “selection” problem worse, since it becomes trivially easy to just re-run stuff till you get the results you like.

  7. I agree that “make graphs_and_tables” is a good goal.

    But take, for example, my student who defended not too long ago: step 1 of the instructions to reproduce his results would be something like “Acquire a Dell d710 manufactured around summer 2009 and configured to these specifications…”

    His results truly are sensitive to things like hard drive model, CPU stepping, etc., and not in a wimpy way either. The software simply will not run on many CPU revisions. No simulator / emulator exists at the appropriate level of detail, either. And if one did, it would be too slow to be useful.

  8. Toward promoting repeatability and reproducibility in computer science, I want to get back to working on the Emulab-based experimentation workbench.

  9. I don’t think you need to have extreme hardware dependencies to doubt the value of releasing source. I think there are both logical and practical issues.

    I don’t think that simply re-running software replicates results in a meaningful way. If you run software in the same, or sufficiently similar environment (hardware, OS, compiler, libraries, tools), you should get the same output — including the same bugs. This kind of replication would catch egregious fraud, but it doesn’t say much about the correctness of the result. Software is different from e.g. human/animal subject research, where there are so many incompletely controlled factors that replication is more complex and meaningful.

    Moreover, releasing software in a useful way is extremely resource intensive (and a non-useful way is … not very useful).

    Most research software has a very short shelf life: It evolves quickly, with little attention to compatibility and is often dropped entirely when funding/students are finished. It also often depends on other (equally chaotic) research software. Except for a very few long-lived tools, there is no maintenance/update path – often not even a short-term one. So trying to work with software from a paper that’s only a year or two old is often an exercise in frustration.

    This is very different from commercial software and large-scale open source development, where maintenance and compatibility take huge attention and resources; it’s not something to fund from a research grant.

    The other practical problem is packaging and configuration. Even a good ‘generate graphs and tables’ script is unlikely to be general enough to run on any computing base. For example, to be general, such a script couldn’t depend on running as a certain user on particular machines, with specific output directories, etc. For anything non-trivial, porting this to your own environment will be an exercise in frustration….

    A general configuration infrastructure that would manage machine allocation and configuration, authentication, and networking and storage resources in an arbitrary environment is a huge task. Again, a large-scale investment, not a little piece of a grant-funded project.

    In short – a tarball isn’t very useful…

    I think it’s much more useful to try to replicate CS results indirectly, i.e. by designing an independent experiment that would support the conclusion (or not).

    The problem is that it’s nearly as much effort as doing research that would support your own exciting new conclusions AND it’s hard to publish. Even if most reviews are positive, you’re almost guaranteed to be torpedoed by at least one ‘strong reject – no novelty’ or ‘weak reject – nicely done, but results are unsurprising’.

  10. Hi Laura, I generally agree with all of your points. On the other hand, people should still release their code: it can’t do any harm, and it may help. Just in the last couple of years I’ve been blocked several times from reproducing existing results (in order to support comparisons with new work that I’m trying to publish) due to lack of code. In principle I could reproduce results by reading a paper but all too often in practice important details are missing. Even if no details are missing, the effort to re-implement a substantial piece of systems software is often prohibitive.

  11. Hi,

    I don’t get one thing. Why would doing an experiment with more subjects be harmful? The more data, the better, right?

  12. Hi Ethan, you’re right that it is counter-intuitive that more data can invalidate an experiment. I’ve been thinking about writing a full-length post about this, but let me try to explain it briefly here. I may not do a good job; I have no real training in this stuff, I just kind of picked it up at a basic level.

    Let’s say that we’re trying to figure out if a coin is fair. So we devise an experimental plan with a statistical test. For example, we’ll flip it 40 times and declare it to be a fair coin if we get anywhere from 15 to 25 heads. So we do the experiment and get 15 heads, leading us to conclude that it’s a fair coin by whatever criteria led us to come up with the range 15-25. But in my heart of hearts, I believe this coin is not fair. I mean, it’s so close to being unfair! So I revise my experimental plan: I’ll flip it 2 more times and if I don’t get any heads, I’ll decide that it’s not a fair coin after all (let’s just pretend that this jibes with my chosen test for statistical significance). Was that OK? No! Because now I’ve run two different experiments, not just one — and I’ve conveniently ignored the first one by folding it into the second. Also, my decision to run the second experiment was guided by an outcome I didn’t like for the first experiment. This is all a big no-no, the worst kind of science. By carefully selecting when to stop taking data, I can skew the results in whichever direction I want (not with certainty, of course). I’ll write this up in a more quantitative way when I get a chance.
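
    In the meantime, here’s a rough sketch of the numbers (this just formalizes the setup above, with a truly fair coin assumed throughout):

        # How often does each procedure call a genuinely fair coin unfair?
        from math import comb

        def binom_pmf(k, n, p=0.5):
            return comb(n, k) * p**k * (1 - p)**(n - k)

        n = 40
        # Fixed plan: flip 40 times, declare "unfair" unless heads is in 15..25.
        p_fixed = sum(binom_pmf(k, n) for k in range(n + 1) if not 15 <= k <= 25)

        # Revised plan: the same, except that on exactly 15 heads we flip twice
        # more and switch the verdict to "unfair" if both extra flips are tails.
        p_revised = p_fixed + binom_pmf(15, n) * 0.25

        print(f"false 'unfair' rate, fixed plan:   {p_fixed:.3f}")
        print(f"false 'unfair' rate, revised plan: {p_revised:.3f}")
        # The revised plan is strictly more likely to condemn a fair coin, and
        # the real situation is worse than this, because the extra stage was
        # invented only after the first result landed on the "wrong" side.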

    Also keep in mind this kind of thing:

    http://blog.regehr.org/archives/836

  13. In your example, if the criterion for the coin being fair is that the number of heads is within +/-25% of the mean value, then, considering a total of 42 tosses (the ideal number of heads being 21), the “fairness boundary” shifts to 15.75 on the lower side and 26.25 on the higher side. If you finally got 15 heads, then it falls outside the new boundary, so the coin must NOT be fair!

    Anyway, I still don’t get why the motivation behind doing an expt should affect the conclusion. In another universe, there could be another experimenter who chooses to collect more data in the first experiment itself. Conclusions from his experiments should be the same as the conclusions from the experimenter in this universe, I think, even though our experimenter did the experiment in two sessions.

    Will look forward to your blog post!

  14. Re Ethan: “Anyway, I still don’t get why the motivation behind doing an expt should affect the conclusion.”

    It’s not about the motivation for the experiment as a whole (i.e., the hypothesis); it’s about the experimenter deciding when to collect more data for a given experiment, or to stop accepting data, based on the data accumulated thus far.

    If your experimental methodology is “collect data until my hypothesis is true, then stop; if things look hopeless then start a new experiment” then the only way to terminate is with a win!
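
    A quick simulation shows just how strong that guarantee is (the fair coin, the rough 5%-level test, and the batch size of 20 are all arbitrary choices for illustration):

        # A perfectly fair coin, tested after every additional batch of 20
        # flips; the experimenter stops the moment the test rejects.
        import math, random

        def rejects(heads, n):
            # Normal-approximation two-sided test at roughly the 5% level.
            return abs(heads - n / 2) > 1.96 * math.sqrt(n) / 2

        def wins_within(budget, rng, batch=20):
            heads = n = 0
            while n < budget:
                heads += sum(rng.random() < 0.5 for _ in range(batch))
                n += batch
                if rejects(heads, n):
                    return True  # "significant"! stop and publish
            return False

        rng = random.Random(1)
        trials = 2000
        for budget in (100, 1000, 10000):
            wins = sum(wins_within(budget, rng) for _ in range(trials))
            print(f"budget of {budget:5d} flips: {100 * wins / trials:.1f}% of fair coins rejected")
        # A single fixed-size test is wrong about 5% of the time; here the rate
        # keeps climbing with the budget, and with unlimited patience the "win"
        # eventually arrives with probability one.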