Two Rules for Random Testing

When you run a test case on a piece of software, you’re conducting an experiment where the null hypothesis is “the system under test correctly executes this test.” The problem is that the value of each such experiment is small because there are so many possible inputs. Random testing has a stronger null hypothesis: “the system under test correctly executes as many generated test cases as we have time to run.” The idea is to explore parts of the space of possible system inputs that a human wouldn’t think to test. This leads us to:

Random Testing Rule 1: Err on the side of expressiveness by prohibiting the generation of as few (legal) inputs as possible.

In other words, a better random tester is one that will — given time — explore a larger part of the input space. Expressiveness, however, can be difficult to achieve:

  1. It is easy to fail to notice parts of the input space that should be explored. How many of us know all of Linux’s system calls? All of C’s trigraphs? All of GNU C’s inline assembly directives?
  2. Some valid inputs are very hard to distinguish from invalid inputs. For example, assume that we’re attempting to generate random JavaScript programs that terminate. See the problem? On the other hand, there are often shortcuts — maybe we can just kill the apparently non-terminating programs after a timeout, as sketched below. In other cases, the system under test can itself tell us whether an input is valid.
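
A minimal sketch of the timeout shortcut, in Python, assuming Node.js is installed as "node" (the generator that produces js_source is not shown):

    import os
    import subprocess
    import tempfile

    def terminates(js_source, timeout_secs=5):
        # Run a generated program under node; treat a timeout as
        # "apparently non-terminating" and discard the test case.
        with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
            f.write(js_source)
            path = f.name
        try:
            subprocess.run(["node", path], capture_output=True, timeout=timeout_secs)
            return True
        except subprocess.TimeoutExpired:
            return False
        finally:
            os.remove(path)

Note that a timeout can misclassify a slow-but-terminating program; that is the price of the shortcut.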

Of course, expressiveness isn’t the whole story for random testing. For example, we can generate all possible JavaScript programs by generating random bit strings. However, if we actually do this we’ll waste almost all of our testing time generating bit strings and getting them rejected by the same tiny part of the JS lexer. This observation leads us to:

Random Testing Rule 2: Restrict the expressiveness of generated programs as needed to achieve effective testing.

Trivially, the random bit string generator was too expressive to be an effective tester for a JavaScript implementation. Similarly, if I’m testing a new, buggy floating point optimization pass for a compiler, simply turning off integer variables in the test case generator may be the way to proceed. In general we must shape the inputs so that they exercise the right parts of the system under test.
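
A quick way to see just how badly the bit-string generator fares is to feed random byte strings to a real JS front end and count how many even parse. Here is a sketch in Python; it assumes Node.js is installed, and uses node --check, which parses a file without executing it:

    import os
    import random
    import subprocess
    import tempfile

    def parse_rate(trials=1000, length=64):
        # Fraction of random byte strings that the JS implementation
        # will even parse; expect a number indistinguishable from zero.
        ok = 0
        for _ in range(trials):
            blob = bytes(random.randrange(256) for _ in range(length))
            with tempfile.NamedTemporaryFile("wb", suffix=".js", delete=False) as f:
                f.write(blob)
                path = f.name
            try:
                result = subprocess.run(["node", "--check", path], capture_output=True)
                ok += (result.returncode == 0)
            finally:
                os.remove(path)
        return ok / trials

Essentially every sample dies in the lexer, so the interesting parts of the implementation never execute.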

Good taste is required to distinguish between parts of the input space that are germane and those that are irrelevant. For example, there is a large family of JavaScript programs differing only in choice of identifier names, and another large family differing only in choice of floating point constants. These sub-spaces of the input space probably do not need to be explored deeply because they are not tightly connected to the difficult logic in a JS VM. On the other hand, there is a well-known story about a magic FP constant that hangs PHP on certain platforms.
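
In generator terms, this kind of taste might look like drawing identifiers from a tiny fixed pool while deliberately mixing a few edge-case values into the floating point constants. A sketch; the pool contents and the 30% weight are invented, not tuned:

    import random

    # Identifier spellings are rarely connected to interesting VM logic,
    # so don't spend the randomness budget exploring them.
    IDENTIFIERS = ["a", "b", "x", "y", "tmp", "obj"]

    # FP constants occasionally are connected to interesting logic, so
    # mix known-nasty values in with uniformly chosen ones.
    NASTY_FLOATS = [0.0, -0.0, float("inf"), float("-inf"), float("nan"), 5e-324]

    def gen_identifier():
        return random.choice(IDENTIFIERS)

    def gen_float_constant():
        if random.random() < 0.3:
            return random.choice(NASTY_FLOATS)
        return random.uniform(-1e9, 1e9)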

Summary: Good random testing is achieved by balancing the tension between too much and too little expressiveness in the test case generator.

Probabilities in Random Testing

A typical real computer system has an extremely large input space and no testing method can cover more than an infinitesimal part of it. On the other hand, broad regions of the input space will trigger bugs. The problem is that we do not know the shapes or locations of the buggy parts of the space. Random testing is a shotgun approach to finding these regions; empirically, it works well.

I’ve written, or supervised the writing of, about 10 random test case generators. The most difficult and unsatisfactory part of engineering a good random tester is setting the probabilities properly. For example, if I’m generating JavaScript programs to stress-test a browser, I need to decide how often to create a new variable vs. referencing an existing one, how often to generate a loop vs. a conditional, etc. The shape of the eventual JS program depends strongly on dozens of little decisions like these.
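
Here is a sketch, in Python, of the kind of decision points being described. Every number below (the 0.7, the 6/1/2 weights) is invented for illustration; choosing such numbers well is exactly the hard part:

    import random

    def gen_expr(state):
        # Decision: reference an existing variable or create a new one?
        if state["vars"] and random.random() < 0.7:
            return random.choice(state["vars"])
        name = "v%d" % len(state["vars"])
        state["vars"].append(name)
        return name

    def gen_statement(state):
        # Decision: which statement kind, with hand-tuned weights.
        kind = random.choices(["assign", "loop", "if"], weights=[6, 1, 2])[0]
        if kind == "assign":
            return "%s = %d;" % (gen_expr(state), random.randrange(100))
        if kind == "loop":
            return "for (var i = 0; i < %d; i++) { %s }" % (
                random.randrange(10), gen_statement(state))
        return "if (%s > 5) { %s }" % (gen_expr(state), gen_statement(state))

    state = {"vars": []}
    program = "\n".join(gen_statement(state) for _ in range(10))

Nudge any of these numbers and the shape of every generated program changes, which is why the tuning matters so much.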

A few more observations:

  • If the probabilities are chosen poorly, the tester will be far less effective than it could have been, and may not find any bugs at all
  • Probabilities alone are an insufficient mechanism for engineering good test cases, and need to be combined with other mechanisms such as heuristic limits on the size of various parts of a test case
  • Equivalently, probabilities sometimes want to be a function of the test case being constructed; for example, the probability of creating a new function may decrease as the number of generated functions increases (see the sketch after this list)
  • The connection between probabilities and high-level properties of test cases may be very indirect
  • Alternatives to playing probability games, such as uniform sampling of the input space or bounded exhaustive testing, are generally harder or less effective than just getting the probabilities right
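
As a sketch of the third bullet, the probability of introducing a new function can be made a decreasing function of the number already generated, which behaves like a soft size limit (the halving schedule here is invented):

    import random

    def should_create_function(state, base_p=0.5):
        # state["functions"] is the list of functions generated so far;
        # halve the probability for each one, so test cases don't fill
        # up with function definitions.
        p = base_p / (2 ** len(state["functions"]))
        return random.random() < p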

In practice, probability engineering looks like this:

  1. Think hard about what kind of test cases will drive the system under test into parts of its state space where its logic is weak
  2. Tweak the probabilities to make the generated tests look more like they need to look
  3. Look at a lot of generated test cases to evaluate the success of the tweaking
  4. Go back to step 1

Random testing almost always has a big payoff, but only when combined with significant domain knowledge and a lot of hard work.

Outgunned

Several years ago I published my favorite kind of paper: it took a problem that was hard to solve by hand, and solved it using 20% cleverness and 80% brute force. The details don’t matter here, but the solution had scalability problems. Therefore, the next iteration of the work (done primarily by a very smart student) was more like 80% cleverness and 20% brute force. The student, Usit, disappeared into Microsoft after getting an MS degree, leaving me to give the talk about his work. After giving this talk I spent a while talking with a member of the group that produced Astree, a team that contains some of the heavy hitters in the program analysis world. Of course, they had encountered similar problems to ours in constructing good transfer functions, and they also had some ideas for solving these problems. Not long after this conversation, I decided to drop this line of research. My impression then — and five years later I still think it’s true — was that while I could have continued to publish in this area, my contributions would not have been very deep or interesting ones. Basically I realized that (1) the problems in this area were highly mathematical and (2) any math power I could bring to the table was insignificant compared to groups containing people like the guy I talked to.

I take a positive and a negative lesson from this little story. The negative one is about the state of math education in the US. Why am I so often outclassed by people (who usually don’t have a degree in math) educated outside the USA? It’s ridiculous, and I actually do have a degree in math. I’m not saying that I’m supposed to be a math genius because I took some classes (I chose to go to grad school in CS instead of math for a reason) but the difference is pretty noticeable. Looking back, I only learned real mathematical thinking in a handful of upper-division classes open only to math majors. In contrast, the math classes that were in service of an engineering degree emphasized cookie-cutter methods for predefined problems.

The positive lesson is that I found it very freeing to realize that a research area is covered by other people who are more capable than I am. This has happened to me a few more times since then and it’s always a relief. As a researcher, life is good as long as there exists just one important problem that I’m most capable of attacking.

The Day I Learned To Love Perl

Following my second year of grad school, I spent the summer of 1997 as an intern for Myricom, a company in the L.A. area that makes really fast local-area networks. It was a great place to work: small, and filled with super-smart ex-Caltech people.

One day I was hacking while the important people were in a meeting with a big customer who was having some sort of network problem. Bob Felderman, my supervisor, came out and told me they had a hypothesis about the problem but some data needed to be analyzed to make sure, and could I do it like right now? At the time I was basically a C hacker, and groveling through lots of complicated text files using a C program would have been pure hell. I probably gave Bob a skeptical look. He walked over to someone else’s desk, handed me a Perl book, and said this was the right tool for the job. I had never so much as touched a Perl interpreter, but I started digging and within probably 30 minutes I had solved the problem, went into the meeting, and handed them the smoking gun they’d been looking for.

I do not love Perl for all tasks; I definitely do not love it for large projects; and, I seldom love the Perl that anyone else writes. But that day I learned to love Perl and even now 95% of my LOC are in it. Of course this is partly because my students get to write all the fun code and I’m left parsing output files, graphing stuff, and otherwise stringing together long chains of UNIX processes.

Cryptocontributions, Blogs, and How Science Works

Most people — including many scientists — understand the process of science to be repeated application of the scientific method. In this model, a hypothesis is formulated, experiments are conducted to test the hypothesis, data is analyzed, and the results usually lead to a new hypothesis. This adequately captures the “99% perspiration” aspect of doing science, but misses out on the best part, where you synthesize something new that is not a direct consequence of the data.

In computer science, the writing of papers often works like this: A research group leader says something like “this year we’re going to submit three papers to ISCA / SOSP / SIGGRAPH / other top conference.” The group then chooses the most promising ideas, implements and evaluates them, and finally prepares papers that claim the ideas are good ones.

The goal-driven mode of writing papers has at least three problems. First, it leads to deadline-driven development and writing, which makes it very tempting to dismiss unexpected findings. Second, since the overhead of writing a paper is high, papers start to feel like supertankers that are hard to steer once they start heading in a certain direction. New developments and insights are just not easy to factor into the overall message. Third, the publish-or-perish factor (and also the fear of getting scooped) often leads us to publish too early, decreasing the odds that we’ll have developed enough perspective on the work to see its contributions clearly.

Even when interesting and unexpected results make it into a paper (as opposed to being dismissed outright either by the PI or by a student doing the work) the discussion of them is often buried deep in some subsection of the paper. When this happens — and the interesting development is not even mentioned in the abstract or conclusion — I call it a “cryptocontribution.” Sometimes these hidden gems are the most interesting parts of what are otherwise pretty predictable pieces of work. When authors are too focused on getting the thing submitted, it’s really easy to shove interesting findings under the rug. Certainly I’ve done it, though I try hard not to.

For years I have enjoyed other people’s cryptocontributions. However, since starting a blog I’ve tried to become more attuned to them in my own writing. Because blogs are different from papers — they’re not high overhead or goal-oriented — I often manage to notice before pressing “publish” that the most interesting part of a post is actually paragraph 17. When this happens I cut out the paragraph and save it in my blog todo list for later elaboration. A fair amount of the time, I end up scrapping the post too — basically it was just a writing exercise for me. (At least one of my colleagues, if he reads this, will be surprised to learn that I scrap anything.)

In summary:

  • The world-view promoted by the scientific method is one where progress is linear and incremental. This misses the equally important, orthogonal process of noticing something wrong, or something out of the ordinary, and leaping off in a new direction. By definition, these unexpected results are not direct consequences of the current hypothesis.
  • Escaping the constraints of the regular publication system, and instead writing about science in a short, non-goal-directed format such as a blog, makes it easier to bring cryptocontributions out of hiding. I would argue that these unlooked-for contributions are often as valuable as the ones being explicitly sought, and shedding light on them will improve the quality of the science being performed.