Gaining Confidence in Confidence Intervals

Many computer scientists, including myself, end up doing research that needs to be evaluated experimentally, and yet most of us have had little training in statistics or the design of experiments. We end up teaching ourselves this material as we go, which often works pretty well. On the other hand, sometimes the issues are surprisingly subtle. Today’s piece is about an example where I’ve found this to be the case; I ran across it a few years ago when some sanity checks on confidence interval code weren’t giving answers that seemed to make sense.

The topic is confidence intervals for a binomial proportion, where we take a series of independent yes/no measurements and use them to compute an interval that most likely contains the true proportion. For example, if we flip a coin 50 times and get heads 28 times, should we conclude that the coin is biased towards heads? In other words, does the confidence interval for the proportion of heads exclude 0.5? In this case, the 95% interval does contain 0.5 and so the data does not support the conclusion that the coin is unfair at the 95% level.
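
To make the arithmetic concrete: the observed proportion is 28/50 = 0.56, and the usual normal-approximation interval (described below) is 0.56 ± 1.96 × sqrt(0.56 × 0.44 / 50) ≈ 0.56 ± 0.14, or roughly [0.42, 0.70], which includes 0.5.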

If we reach some conclusion with 95% confidence, what does that actually mean? Informally, it means that if we repeated the entire experiment (in this case, 50 coin flips) many times, we would reach the correct conclusion 95% of the time. The other 5% of the time we would be incorrect. Statisticians formalize this as the coverage probability: “the proportion of the time that the interval contains the true value of interest.” At this point most of us would ask: Doesn’t the 95% confidence interval have a coverage probability of 95% by definition? Turns out the answer is “no” and this is where the fun starts.

Let’s take an example where n (the sample size) is 21 and p (the true proportion) is 0.479. We can empirically compute the coverage of the 95% confidence interval by repeatedly taking 21 random samples and seeing if the true proportion lies within the 95% confidence interval. For now we’ll use the normal approximation to the binomial. Here’s some code that does all this. Compile and run it like so:

$ gcc -O conf.c -o conf -lm
$ ./conf 21 0.479
n= 21, p= 0.479000, nominal coverage= 95.00 %
measured coverage= 91.62 % 
measured coverage= 91.63 %
measured coverage= 91.63 %
measured coverage= 91.61 % 
measured coverage= 91.60 %
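
The conf.c program itself isn’t reproduced here, but a minimal sketch of the same computation might look like the following; the iteration count, RNG, and exact output format are my own choices, not necessarily what the original does:

/* coverage.c -- a sketch, NOT the original conf.c: estimate the actual
   coverage of the nominal 95% normal-approximation (Wald) interval by
   repeated simulation */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
  if (argc != 3) {
    fprintf(stderr, "usage: %s n p\n", argv[0]);
    return 1;
  }
  int n = atoi(argv[1]);
  double p = atof(argv[2]);
  const double z = 1.96;      /* two-sided 95% critical value */
  const long REPS = 1000000;  /* simulated experiments per estimate (arbitrary) */

  printf("n= %d, p= %f, nominal coverage= 95.00 %%\n", n, p);
  for (int est = 0; est < 5; est++) {
    long covered = 0;
    for (long r = 0; r < REPS; r++) {
      /* draw one experiment: n Bernoulli(p) trials */
      int successes = 0;
      for (int i = 0; i < n; i++)
        if ((double)rand() / RAND_MAX < p)
          successes++;
      /* 95% Wald interval: phat +/- z*sqrt(phat*(1-phat)/n) */
      double phat = (double)successes / n;
      double half = z * sqrt(phat * (1.0 - phat) / n);
      if (p >= phat - half && p <= phat + half)
        covered++;
    }
    printf("measured coverage= %.2f %%\n", 100.0 * covered / REPS);
  }
  return 0;
}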

As we can see, the coverage is about 91.6%, which means that although we asked for a 5% probability of error, we are actually getting an 8.4% chance of error—a pretty significant difference. There are a couple of things going on. One problem is that the normal approximation is not a very good one. In particular, it is not recommended when np or n(1-p) is less than 10. Here, however, we pass this test. Other sources give a less conservative test where np and n(1-p) are only required to be 5 or more, which can lead to coverage as small as about 87.4% (for n=11, p= 0.455).

The unfortunate thing is that many stats classes, including the one I took many years ago, and also books such as the ubiquitous Art of Computer Systems Performance Analysis, do not go beyond presenting the normal approximation—I guess because it’s so simple. However, if you use this approximation and follow the np>5 rule, and aim for a 95% confidence interval, you could easily end up publishing results that are not even significant at the 90% level.

Many statisticians recommend avoiding the normal approximation altogether. But what should be used instead? The Clopper-Pearson interval is fully conservative: it is guaranteed to never have less coverage than you asked for. However, it is often overly conservative, so you might be getting 99% confidence when you only wanted 95%. What’s wrong with that? First, it may lead you to waste time running unnecessary experiments. Second, it may cause you to reject a hypothesis that actually holds at the desired level of confidence.

In my work for the last several years I’ve used the Wilson score interval. It is hardly any more difficult to implement than the normal approximation and has better behavior for small np. Also, it works for p=0 or p=1. The Agresti-Coull interval also has many good characteristics. Agresti and Coull’s paper from 1998 is short and easy to read for non-statisticians, I’d recommend it for people interested in learning more about these issues. A more exhaustive (and exhausting) discussion can be found in a 2008 paper by Pires and Amado.
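
For reference, here is what the Wilson score interval looks like in C. This is only a sketch; the function name and interface are mine, not taken from any particular library:

/* 95% Wilson score interval for x successes out of n trials -- a sketch */
#include <stdio.h>
#include <math.h>

static void wilson_interval(long x, long n, double *lo, double *hi)
{
  const double z = 1.96;  /* two-sided 95% critical value */
  double phat = (double)x / n;
  double denom = 1.0 + z * z / n;
  double center = (phat + z * z / (2.0 * n)) / denom;
  double half = (z / denom) *
    sqrt(phat * (1.0 - phat) / n + z * z / (4.0 * n * n));
  *lo = center - half;
  *hi = center + half;
}

int main(void)
{
  double lo, hi;
  wilson_interval(28, 50, &lo, &hi);  /* the coin example from above */
  printf("95%% Wilson interval: [%.3f, %.3f]\n", lo, hi);
  return 0;
}

For the coin example (28 heads in 50 flips) this gives roughly [0.42, 0.69], which, like the normal-approximation interval, still includes 0.5.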

Earlier I alluded to the fact that the normal approximation was just one problem with creating a confidence interval for a binomial proportion. The other problem—and it is a problem for all approximations, including the “exact” Clopper-Pearson interval—is that since the binomial distribution is discrete, it is not possible in general to make the actual coverage probability be the same as the nominal coverage probability.

The relationship between the nominal and actual coverage probabilities is interesting to observe. The following graph (produced using the C program above) shows the behavior of the normal approximation of the 95% confidence interval for n=50, between the np=5 and n(1-p)=5 bounds:

[Graph: measured coverage of the nominal 95% confidence interval (normal approximation) for n=50, as p varies between the np=5 and n(1-p)=5 bounds.]

I wrote this piece because the difference between nominal and actual coverage probabilities, which is well-understood by statisticians, does not seem to have been communicated effectively to practicing scientists. In some ideal world the details wouldn’t matter very much because people like me would simply rely on a convenient R library to do all confidence interval computations, and this library would be designed and vetted by experts. In practice, a one-size-fits-all solution is probably not optimal and anyway I always seem to end up with confidence interval code embedded in various scripts—so I’ve spent a fair amount of time trying to learn the underlying issues, in order to avoid screwing up.

Summary: Ignore what you probably learned about confidence intervals for proportions in college. Use the Agresti-Coull interval or Wilson score interval.

Spiral Jetty

Smithson’s Spiral Jetty is often totally submerged, or else high and dry as it was when I last saw it. Today a favorable lake level as well as warm weather and a feeling of having spent maybe one too many days around the house motivated us to take a day trip. The hazy air and calm water created an otherworldly effect. Afterwards, we stopped by the Golden Spike site and also the Thiokol rocket garden.

A Few Thoughts on Scratch

Over the last year or so my older son, who is about to turn eight, has spent maybe one morning a month creating a video game using Scratch. This doesn’t seem like a lot of time to spend learning to program, but I’ve been trying to avoid pushing. Our usual mode of operation is that he sits there at one computer while I work at another one; I’ll come over and answer questions or help out if he gets stuck, but he’s basically driving both the design and implementation.

The game is simple: you control an ant that is being chased by a bomb, which blows up the ant if it catches it. The ant cannot move through the walls of a simple maze, but the bomb can. Here the ant is being chased:

Here it got caught:

This morning I asked Jonas if his code used any variables and he asked: What’s a variable? So I explained that a variable could be used to store the number of lives you had left, or something like that, and he immediately wanted to give the player multiple lives. The mechanics of doing this were a bit beyond him—this was not only his first variable, but also his first use of inter-sprite messaging—so we sat down together and implemented it in around half an hour. Adding the reset-board and game-over logic was the most complicated thing we’ve done so far.

I also tried to tell Jonas how creating a smart adversary is perhaps the hardest aspect of making a good computer game, and that nobody even knows how to make an enemy as smart as he is. Jonas replied that in that case it should be pretty easy to create an enemy as smart as his five-year-old brother. I think he was joking but it wasn’t totally clear.

Does Scratch have any drawbacks? Perhaps a few. I found the lack of procedural abstraction distracting; Scratch seems to encourage a significant amount of code duplication that would grow onerous for any non-trivial task. Jonas and I both found the concurrent-object execution model to be a bit confusing, since it was often not clear where to put a particular piece of code. He ended up putting most of the game logic into the bomb that chases the player around the screen—this worked pretty well, but didn’t seem ideal. We ran into a few interface warts where, for example, a comparison widget seems to encourage you to type the name of a variable into one of its slots, but the resulting code doesn’t work—rather, you must drag a widget corresponding to the variable into the comparison object.

Overall Scratch is really well done; I found the programming environment to be enjoyable and Jonas was able to pick up new constructs from the menus and figure them out on his own. The compile-to-Flash feature is totally great since you can execute people’s Scratch projects in the browser.

What next? I’m trying to choose between Racket and Python via Phil Guo’s online Python tutor.

Fastest FizzBuzz

The always-entertaining FizzBuzz problem came up again on Hacker News today, and for no other reason than that I just got out from under a nasty deadline, I looked around on the net for interesting solutions, for which this Rosetta Code page is a gold mine. The Windows batch file and BSD make versions are hilarious, though I was a bit disappointed that Piet was not represented. Anyhow, here it is: [image of a FizzBuzz program in Piet] (reference)

Another wonderful FizzBuzz solution uses template metaprogramming to compute the answer at compile time. I was disappointed, however, to look at the code emitted by Clang++, G++, and Intel’s C++ compiler, which all generate a large amount of crap instead of boiling the program down to its essence: emitting a single string.

I also noticed that various sites that discuss the efficiency of FizzBuzz implementations, such as this one, failed to provide what I would consider the most efficient code:

#include <stdio.h>

int main (void)
{
  puts("1\n2\nFizz\n4\nBuzz\nFizz\n7\n8\nFizz\nBuzz\n11\nFizz\n13\n14\nFizzBuzz\n16\n17\nFizz\n19\nBuzz\nFizz\n22\n23\nFizz\nBuzz\n26\nFizz\n28\n29\nFizzBuzz\n31\n32\nFizz\n34\nBuzz\nFizz\n37\n38\nFizz\nBuzz\n41\nFizz\n43\n44\nFizzBuzz\n46\n47\nFizz\n49\nBuzz\nFizz\n52\n53\nFizz\nBuzz\n56\nFizz\n58\n59\nFizzBuzz\n61\n62\nFizz\n64\nBuzz\nFizz\n67\n68\nFizz\nBuzz\n71\nFizz\n73\n74\nFizzBuzz\n76\n77\nFizz\n79\nBuzz\nFizz\n82\n83\nFizz\nBuzz\n86\nFizz\n88\n89\nFizzBuzz\n91\n92\nFizz\n94\nBuzz\nFizz\n97\n98\nFizz\nBuzz");
  return 0;
}

It would perhaps be better style to use printf() instead of puts(), counting on the compiler to optimize the former to the latter, but the version of GCC that I use does not actually do that.

Using my timing code on a random Nehalem 64-bit Linux machine I get:

g++ -O3: 125,000 cycles
Intel C++ -fast: 103,000 cycles
Clang++ -O3: 103,000 cycles
gcc -O3: 29,000 cycles
gcc -O3 -static: 21,000 cycles

Here’s the C code and C++ code. For benchmarking purposes I piped their STDOUT to /dev/null.