Skip to content

Irreducible Test Cases

Automated test case reduction is a search-based procedure for taking a large input that triggers a bug and turning it into a smaller input that triggers the same bug. I won’t go into the whole rationale for reducers but in general, for given bug, we always prefer to trigger it using a test case that is as small as possible. Although finding the actual smallest input is infeasible, reducers usually work well in practice; for example, the test cases in this post were all produced automatically. On the other hand, if reduction fails then we’re stuck with a large bug-triggering test case.

Sometimes an irreducible test case is a symptom of a failure in the reducer toolchain; the rest of the time, it provides interesting and useful information about the bug triggered by the test case. Here are some possibilities:

  1. The reducer is incapable of performing the transformations that are necessary to produce a small test case. For example, Berkeley Delta only supports removal of one or more contiguous lines of code and we’ve seen bug-triggering C++ files that it cannot reduce to less than 100 KB. This kind of reducer failure is not interesting, it only means the reducer needs to be improved. You can tell that it is happening if you are able to manually reduce the automatic reducer’s output.
  2. The reducer is producing invalid test cases. Some programs (PDF readers, etc.) take inputs across trust boundaries and therefore must gracefully handle all possible inputs. Others, such as file systems, databases, and C/C++ compilers, are allowed to misbehave when presented with invalid inputs. This is an obstacle to test case reduction. If the reducer is creating invalid test cases, the best solution is to invoke a validity checker from within the reducer’s “interestingness test.” If such a checker is not available (none is for C++, for example) then reduction is difficult.
  3. The system being tested is failing non-deterministically. In this case, the dynamic is that the reducer creates tests that have a smaller and smaller probability of triggering the bug until the final result is just about useless. This possibility is easy to check: just run the system on the test case a few times and make sure the bug always fires. Common sources of nondeterminism include race conditions, timeouts, memory exhaustion, and address space randomization. This reducer failure is interesting only because it tells us that a serious problem — nondeterminism — exists. Sometimes you can work around the problem (for example, by turning off ASLR) and other times you cannot. Reducing test cases for systems that are inherently nondeterministic is an open problem.
  4. The bug being triggered is related only loosely to the test case that triggers it. For example, a threading bug might not depend on the test input as much as it does on timing effects. Similarly, a bug in the expansion logic for a hash table doesn’t depend on the hash table contents as much as it does on the total number of entries. Worse, it is not uncommon to see memory corruption bugs that mostly depend on accidents of memory layout. As the person doing the debugging, this is the most interesting outcome, and also the most common one once problems 1-3 in the reducer toolchain are ironed out. This situation is basically a signal to debug the problem some other way. For example, use a profiler to debug a system that runs for too long, a heap analysis tool to debug a system that allocates too much, or a memory safety checker to debug a system that suffers from corruption.

I recently generated a large number of test cases that tried to crash an older version of GCC, and reduced each crash-triggering test case. Over five days, about 6 million test cases were run and the compiler crashed about 1,100 times. The median-sized crash trigger was quite small: 84 bytes. However, in three cases the reduced test case exceeded 1 KB and the largest reduced test case was 33 KB. Running the compiler on this large test case under Valgrind showed no errors; my best guess is that memory corruption or something similar was in fact happening, but it happened to not be detectable by Valgrind. I could be wrong (I’m happy to provide the test case if anyone’s interested).

{ 2 } Comments

  1. bcs | October 25, 2013 at 12:30 pm | Permalink

    For case 3 (truly no-deterministic tests) , it would be interesting to try some kind of “multi-metric” reducer where one of the metrics is the percentage of the time that the test passes; i.e. an acceptable step is one that fails more often (without getting much larger) or that is smaller (and doesn’t fail much more of the time).

    Also, have you tried running csmith to look for code that produces non-deterministic output? (Clearly that would require some work to scrub out things that are supposed to be non-deterministic like time-stamps etc.)

  2. regehr | October 25, 2013 at 3:57 pm | Permalink

    Hi bcs, I haven’t tried looking for non-determinism using Csmith, that’s a good idea though, when combined with ASLR, as it would spot nasty memory corruption with no false positives, and with not as much slowdown as Valgrind.

    I haven’t tried reducing nondeterministic programs yet, and in fact I hope I’m never forced to. I think your strategy would work but ugh.