Do Not Just Run a Few More Reps

It’s frustrating when an experiment reveals an almost, but not quite, statistically significant effect. When this happens, the overwhelming temptation is to run a few more repetitions in order to see if the result creeps into significance. Intuitively, more data should provide more reliable experimental results. This is not necessarily the case.

Let’s look at a concrete example. We’ve been handed a coin and wish to determine if it is fair. In other words, we’ll conduct an experiment designed to see if the null hypothesis (“the coin is fair”) can be rejected at the desired level of confidence: 95% in this case. Just to review, informally a result with 95% confidence means that we are willing to accept a 5% chance that our experiment reaches a false conclusion.

Consider this graph:

The x-axis shows a range of experiments we might be willing to perform: flipping the coin between 10 and 100 times. The red line is the lower confidence bound: the smallest number of heads that indicates a fair coin at the 95% level; the green line indicates the expected number of heads; and, the dark blue line indicates the upper confidence bound: the largest number of heads that is consistent with the coin being fair at the 95% level. The remaining three lines are examples of the kind of random walk we get by actually flipping a coin (or consulting a PRNG) 100 times.

Let’s say that we have decided to flip the coin 40 times. In this case, the 95% confidence interval (using the Wilson score) has a lower bound of 14.1 and an upper bound of 25.9. If we perform this coin flip experiment many times, the coin is declared to be fair 96% of the time — a good match for our desired level of confidence. If we instead flip the coin 80 times, the 95% confidence interval is between 31.4 and 48.6. This new experiment only finds the coin to be fair about 94% of the time. This mismatch between desired and actual coverage probability — which is often, but not always, minor — is an inherent feature of binomial confidence intervals and I’ve written about it before.
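Here is a minimal Python sketch (not the code behind the figure or the numbers above) that reproduces these values, under the assumption that the coin is declared fair exactly when the Wilson score interval around the observed proportion of heads contains 0.5:

    import math

    Z = 1.959964  # two-sided 95% critical value

    def wilson_interval(heads, flips, z=Z):
        """Wilson score interval for the proportion heads/flips."""
        phat = heads / flips
        denom = 1 + z * z / flips
        center = (phat + z * z / (2 * flips)) / denom
        halfwidth = (z / denom) * math.sqrt(phat * (1 - phat) / flips
                                            + z * z / (4 * flips * flips))
        return center - halfwidth, center + halfwidth

    def declared_fair(heads, flips):
        """The fair-coin hypothesis survives if the interval contains 0.5."""
        lo, hi = wilson_interval(heads, flips)
        return lo <= 0.5 <= hi

    def coverage(flips):
        """Exact probability that a fair coin is declared fair after `flips` flips."""
        return sum(math.comb(flips, heads) / 2.0 ** flips
                   for heads in range(flips + 1) if declared_fair(heads, flips))

    for flips in (40, 80):
        # interval for exactly half heads, scaled to counts (the bounds quoted above)
        lo, hi = wilson_interval(flips // 2, flips)
        print("%d flips: heads interval [%.1f, %.1f], declared fair %.1f%% of the time"
              % (flips, flips * lo, flips * hi, 100 * coverage(flips)))

For 40 and 80 flips this prints the 14.1 to 25.9 and 31.4 to 48.6 bounds, and coverage close to the 96% and 94% figures above.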

So now let’s get back to the point of this post. The question is: How are the results affected if, instead of selecting a number of repetitions once and for all, we run a few reps, check for significance, run a few more reps if we don’t like the result, and so on? The setup that we’ll use here is:

  1. Flip the coin 10 times
  2. Check for an unfair coin — if so, stop the experiment with the result “unfair”
  3. If we’ve flipped the coin 100 times, stop the experiment with the result “fair”
  4. Flip the coin
  5. Go back to step 2

The graphical interpretation of this procedure, in terms of the figure above, is to look for any part of the random walk to go outside of the lines indicating the confidence intervals. The light blue and brown lines do not do this but the purple line does, briefly, between 60 and 70 flips.

The punchline is that this new “experimental procedure” is able to declare a fair coin to be unfair about 37% of the time! Obviously this makes a mockery of the intended 95% level of confidence, which is supposed to reach an incorrect conclusion only 5% of the time.
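Here is a sketch of this sequential procedure (again, not the code used for the post), assuming that “check for an unfair coin” means computing the Wilson interval for the heads seen so far and rejecting fairness whenever that interval excludes 0.5. The exact false-alarm rate depends on those details, but it lands in the same ballpark as the 37% quoted above:

    import math
    import random

    Z = 1.959964  # two-sided 95% critical value

    def rejects_fair(heads, flips, z=Z):
        """True if the Wilson interval for heads/flips excludes 0.5."""
        phat = heads / flips
        denom = 1 + z * z / flips
        center = (phat + z * z / (2 * flips)) / denom
        halfwidth = (z / denom) * math.sqrt(phat * (1 - phat) / flips
                                            + z * z / (4 * flips * flips))
        return not (center - halfwidth <= 0.5 <= center + halfwidth)

    def sequential_trial(max_flips=100, first_check=10):
        """Flip a fair coin, testing for unfairness after every flip from flip 10 on."""
        heads = 0
        for flip in range(1, max_flips + 1):
            heads += random.random() < 0.5
            if flip >= first_check and rejects_fair(heads, flip):
                return "unfair"
        return "fair"

    trials = 20000
    unfair = sum(sequential_trial() == "unfair" for _ in range(trials))
    print("fair coin declared unfair in %.1f%% of trials" % (100.0 * unfair / trials))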

Does this phenomenon of “run a few more reps and check for significance” actually occur? It does. For example:

  • People running experiments do this.
  • Reviewers of papers complain that the results are not “significant enough” and ask the experimenter to run a few more reps.
  • I’ve seen benchmarking frameworks that automatically run a performance test until a desired level of confidence is achieved.

In contrast, near the start of this paper there’s a nice story about experimenters doing the right thing.

A Few Good Books

I haven’t been getting a ton of reading done this summer, but here are a few books that you might find interesting or fun.

The Psychopath Test

I’m sort of emphatically not a fan of those lightweight nonfiction books we’ve been seeing a lot of in the past few years that would have made (and often did make) a great article in The Atlantic, but that end up being repetitive and boring when expanded to book length. I was afraid this was one of those books, but it isn’t. What it is is Ronson stringing together a collection of personal and historical anecdotes about psychopaths; this works because Ronson is unusually gifted at getting people — often wingnuts, but also regular people — to open up and talk. For example, there’s Tony, the guy who pretended to be insane to escape jail time but now can’t get himself released from the hospital that holds Britain’s most criminally insane people. There’s Brian, a Scientologist dedicated to debunking psychiatry. There’s Charlotte, who selects applicants for reality TV shows by making sure they’re crazy enough but not too crazy. I’m just scratching the surface, and Ronson is pretty entertaining himself, with a bit of a Woody Allen thing. This is a fun book about a really scary topic.

Are Your Lights On?

If you think about it, an awful lot of problems are caused by the fact that we’re often not very good at identifying the real problem in the first place. This book is a collection of mental tools we can use to do better problem identification. The general gist is to stop and think, and to make an effort to put aside our preconceptions, before just jumping in with a solution that may well make things worse. This kind of material should be useful for anyone, but for consultants, engineers, and researchers it is vital. The book is short and funny and out of print, but Amazon makes it easy to track down a copy. You should do it.

Tau Zero

Like all science fiction from 40 years ago, Tau Zero has become a bit of a period piece with respect to gender issues and politics, but otherwise it has aged reasonably well. The plot is pure hard SF: an interstellar ramjet is damaged in such a way that it cannot decelerate. To fix the engine, the crew is forced to take their ship into a region of extremely hard vacuum — the nearest such region being outside the galaxy. Due to relativistic time dilation, the long distances involved are not a subjective problem. One thing goes wrong after another and the solution always ends up being to accelerate, compressing time more and more. Early on, the crew realizes that nobody they know on Earth could still be alive; later, they notice that star formation has ceased. The book would be pretty horrible if Anderson turned it into a physics lecture, but instead he focuses on the crew members and their reactions to an increasingly desperate situation. If you don’t feel like tracking this book down, no problem: just read this classic article by Freeman Dyson while watching the old Keanu Reeves movie Speed.

Dirty Snow

Previously I thought of Simenon only as the author of some fun detective stories that I had to read in French class. Dirty Snow, on the other hand, is one of the darkest books I’ve ever read. It is set in occupied France, but really it could be any wartime police state. The first part follows a young man’s willful descent into evil; in the second part he gains a bit of unwanted self-knowledge.

According to Googer

This post is all Eddie’s fault.

Things that are exquisitely fortuitous

  • her sudden and unexpected departure with a drunken Hanes waving a gun
  • timing
  • positioning
  • several kinds of parameters of the universe
  • that they chose sh** to describe this newest of Beetoven’s competitors
  • that Phaidon, the British publisher, is issuing new editions of Ungerer classics, including “Moon Man”
  • the Boston bombing
  • mutations
  • a meteor strike landing on the venue of a future presidential debate

Exceedingly beautiful, yet …

  • frightfully poor
  • profoundly sad
  • whose beauty was her least charm
  • cruel at the same time
  • it seems not to rest upon solid masonry
  • Mary Howitt was–and you could not help but feel it–queen among them all
  • the sculptor proceeded with little pleasure
  • the soil is much thinner than that of the flat bottom land near the Mississippi
  • not one of the sailors responded to the offer
  • perfectly practical and becoming
  • in tropical scenery, the entire newness, & therefore absence of all associations, which in my own case (& I believe in others) are unconsciously much more frequent than I ever thought, requires the mind to be wrought to a high pitch, & then assuredly no delight can be greater; otherwise your reason tells you it is beautiful but the feelings do not correspond
  • in our opinion neither of them equals our old superb warbler or Blue Wren
  • she rubbed her face with pepper so as not to tempt anyone or take any focus from God
  • intellectually inept person
  • very aggressive
  • simple setting was designed for the Senior play
  • damaged
  • sad when an old campaigner reaches the twilight years
  • it never mattered

Things that are monumentally egregious

  • practices of the IRS
  • s*** you post
  • man-made errors and mistakes
  • displays
  • compensation
  • mismanagement
  • miscarriage of justice
  • mistakes in understanding
  • price-gouging
  • a betrayal of trust
  • crime
  • idiocy
  • errors
  • a blunder
  • their behavior
  • typos
  • the traffic on the way there
  • a failure of judgement
  • government censorship
  • gaps in logic
  • Steptacular or 5ive’s Greatest Hits
  • officiating blunders
  • exercise of power
  • ethical, moral and legal transgressions
  • operator error
  • the recent bank bailout bill [that] includes a provision which tacks on a $450 charge at closing for any residential real estate financing transaction
  • the behavior of Tech fans

Things that are too frightening to seriously consider

  • the thought of attempting to produce crops with these people
  • the prospect of walking away
  • that topic itself and the implications that follow
  • our sinfulness apart of comfort of the Cross
  • regard[ing] the Law as authoritative
  • the thought of quitting a career after over ten years of unquestionable success
  • the implications of Berger’s proposed ban
  • the thought of stepping off into independence
  • the implications of that finding
  • highly radioactive waste being perpetually stored on California’s seismically active coast

Mt Washington

Although I lived in the eastern USA for 10 years, all of my hiking experience has been in the west — so I was happy to take advantage of being in New England last week to climb Mt Washington, the highest mountain in that part of the world. The weather on Washington is famously erratic and harsh, with treeline at only about 4,400′ (as compared to around 11,000′ in Utah). Here are the current summit conditions.

It had rained the night before and the Crawford Path was fantastically green:

This is one of the longer routes on the mountain at 8.5 miles one-way, but it seemed like a nice choice that would let me climb several of Mt Washington’s sub-peaks. I ended up skipping all of them since most of my walk was in a whiteout:

However, the clouds occasionally opened up a bit:

The Lakes of the Clouds were nice, but at this point the wind picked up and I had to keep moving to stay warm:

I basically had the trail to myself, passing very few people until I got to the summit cone where it was a bit more crowded. Overall this was an exceptionally beautiful route.

On top of the mountain it was cold and windy and crowded. It’s always a bit of a bummer to hike a mountain that other people can drive up. On the positive side, the slices of pizza I bought from the cafe were a lot tastier than the random snack items I had brought along for lunch. There were no views to be had; the summit was in the clouds the whole time I was there.

Once I was warm and full, returning by the Crawford Path started to sound boring. Taking the advice of my hiking buddy Dave Hanscom, I returned by the 5.5-mile Jewell Trail, which was pretty, though not really comparable to the Crawford Path, and it had a lot more people on it. Happily, the clouds cleared as I descended, so I had good views once off the summit cone. The Jewell Trail goes near the cog railway for a little ways:

After getting to the trailhead I was pretty tired and would have loved to hitchhike the 4.5 miles back to my car, but what I hadn’t realized was that the connector road was quite lightly traveled; the only car I saw was going the wrong direction, so I had to walk the whole way.

Cheating at Research

My news feed this morning contained this article about an unpleasant local situation that has caused one person I know to lose her job (not because she was involved with the malfeasance, but as fallout from this lab shutting down). On the positive side (I’m going from the article and the investigating panel’s report here — I have no inside information) it sounds like the panel came to the appropriate conclusions.

But my sense is that most research cheating is not nearly so overt. Rather, bogus results come out of a mixture of experimental ineptitude and unconscious squashing of results that don’t conform to expectations. A guide to cheating in a recent issue of Wired contains a nice little summary of how this works:

  • Create options: Let’s say you want to prove that listening to dubstep boosts IQ (aka the Skrillex effect). The key is to avoid predefining exactly what the study measures — then bury the failed attempts. So use two different IQ tests; if only one shows a pattern, toss the other.
  • Expand the pool: Test 20 dubstep subjects and 20 control subjects. If the findings reach significance, publish. If not, run 10 more subjects in each group and give the stats another whirl. Those extra data points might randomly support the hypothesis.
  • Get inessential: Measure an extraneous variable like gender. If there’s no pattern in the group at large, look for one in just men or women.
  • Run three groups: Have some people listen for zero hours, some for one, some for 10. Now test for differences between groups A and B, B and C, and A and C. If all comparisons show significance, great. If only one does, then forget about the existence of the p-value poopers.

I do not recommend the full article; I only read it because I was trapped on a long flight.
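Out of curiosity, here is a toy simulation of just the “expand the pool” trick from the excerpt above. It is not the design from the article: it compares two groups of unit-variance normal data with no real difference using a two-sample z-test, and adds 10 more subjects per group whenever the first test comes up short. Even with only one extra look at the data, the false positive rate comes out above the nominal 5%:

    import math
    import random

    def z_test_rejects(group_a, group_b, z_crit=1.959964):
        """Two-sample z-test for equal means, assuming unit variance in both groups."""
        na, nb = len(group_a), len(group_b)
        diff = sum(group_a) / na - sum(group_b) / nb
        return abs(diff / math.sqrt(1.0 / na + 1.0 / nb)) > z_crit

    def expand_the_pool():
        """Test 20 vs. 20 subjects; if not significant, add 10 per group and test again."""
        a = [random.gauss(0, 1) for _ in range(20)]
        b = [random.gauss(0, 1) for _ in range(20)]
        if z_test_rejects(a, b):
            return True
        a += [random.gauss(0, 1) for _ in range(10)]
        b += [random.gauss(0, 1) for _ in range(10)]
        return z_test_rejects(a, b)

    trials = 100000
    hits = sum(expand_the_pool() for _ in range(trials))
    print("null effect declared significant in %.1f%% of trials" % (100.0 * hits / trials))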

The pattern underlying all of these ways of cheating is to run many experiments, but report the results of only a few. We only have to run 14 experiments that measure no effect before we get a better than 50% chance of finding a result that is significant at the 95% level purely by chance. My wife, a psychologist, says that she has observed students saying things like “we’ll run a few more subjects and see if the effect becomes significant” without realizing how bad this is. In computer science, we are lucky to see empirical papers that make use of statistical significance tests at all, much less make proper use of them.
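For the record, the 14 comes from the fact that the chance of at least one spuriously significant result in n independent null experiments is 1 - 0.95^n, which first exceeds 50% at n = 14:

    # 1 - 0.95**13 is about 0.487; 1 - 0.95**14 is about 0.512, so 14 null
    # experiments give a better-than-even chance of at least one spurious
    # result that is "significant" at the 95% level
    n = 1
    while 1 - 0.95 ** n <= 0.5:
        n += 1
    print(n)  # prints 14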