Four Books on Debugging

It often seems like the ability to debug complex problems is not distributed very evenly among programmers. No doubt there’s some truth to this, but also people can become better over time. Although the main way to improve is through experience, it’s also useful to keep the bigger picture in mind by listening to what other people have to say about the subject. This post discusses four books about debugging that treat it as a generic process, as opposed to being about Visual Studio or something. I read them in order of increasing size.

Debugging, by David Agans

At 175 pages and less than $15, this is a book everyone ought to own. For example, Alex says:

Agans actually sort of says everything I believe about debugging, just tersely and without much nod in the direction of tools.

This is exactly right.

As the subtitle says, this book is structured around nine rules. I wouldn’t exactly say that following these rules makes debugging easy, but it’s definitely the case that ignoring them will make debugging a lot harder. Each rule is explained using real-world examples and is broken down into sub-rules. This is great material and reading the whole thing only takes a few hours. I particularly liked the chapter containing some longer debugging stories with specific notes about where each rule came into play (or should have). Another nice chapter covers debugging remotely—for example from the help desk. We teachers end up doing quite a bit of this kind of debugging, often late at night before an assignment is due. It truly is a difficult skill that is distinct from hands-on debugging.

This material is not specifically about computer programming, but rather covers debugging as a generic process for identifying the root cause of a broad category of engineering problems including stalled car engines and leaking roofs. However, most of Agans’ examples come from embedded computer systems. Unlike the other three books, there’s no advice here about particular tools, programming languages, or even programming-specific problems such as concurrency errors.

Debug It! by Paul Butcher

Butcher’s book contains a large amount of great material not found in Agans, such as specific advice about version control systems, logging frameworks, mocks and stubs, and release engineering. On the other hand, I felt that Butcher did not present the actual process of debugging as cleanly or clearly, possibly due to the lack of a clear separation between the core intellectual activity and the incidental features of dealing with particular software systems.

Why Programs Fail, by Andreas Zeller

Agans and Butcher approach debugging from the point of view of the practitioner. Zeller, on the other hand, approaches the topic both as a practitioner and as a researcher. This book, like the previous two, contains a solid explanation of the core debugging concepts and skills. In addition, we get an extensive list of tools and techniques surrounding the actual practice of debugging. A fair bit of this builds upon the kind of knowledge that we’d expect someone with a CS degree to have, especially from a compilers course. Absorbing this book requires a lot more time than Agans or Butcher does.

Debugging by Thinking, by Robert Metzger

I have to admit, after reading three books about debugging I was about ready to quit, and at 567 pages Metzger is the longest of the lot. The subtitle “a multidisciplinary approach” refers to six styles of thinking around which the book is structured:

  • the way of the detective is about deducing where the culprit is by following a trail of clues
  • the way of the mathematician observes that useful parallels can be drawn between methods for debugging and methods for creating mathematical proofs
  • the way of the safety expert treats bugs as serious failures and attempts to backtrack the sequence of events that made the bug possible, in order to prevent the same problem from occurring again
  • the way of the psychologist analyzes the kinds of flawed thinking that lead people to implement bugs: Was the developer unaware of a key fact? Did she get interrupted while implementing a critical function? Etc.
  • the way of the computer scientist embraces software tools and also mathematical tools such as formal grammars
  • the way of the engineer is mainly concerned with the process of creating bug-free software: specification, testing, etc.

I was amused that this seemingly exhaustive list omits my favorite debugging analogy: the way of the scientist, where we formulate hypotheses and test them with experiments (Metzger does discuss hypothesis testing, but the material is split across the detective and mathematician). Some of these chapters—the ways of the mathematician, detective, and psychologist in particular—are very strong, while others are weaker. Interleaved with the “six ways” are a couple of lengthy case studies where buggy pieces of code are iteratively improved.

Summary

All of these books are good. Each contains useful information and each of them is worth owning. On the other hand, you don’t need all four. Here are the tradeoffs:

  • Agans’ book is The Prince or The Art of War for debugging: short and sweet and sometimes profound, but not usually containing the most specific advice for your particular situation. It is most directly targeted toward people who work at the hardware/software boundary.
  • Butcher’s book is not too long and contains considerably more software-specific debugging information than Agans’ does.
  • Zeller’s book has a strong computer science focus and it is most helpful for understanding debugging methods and tools, instead of just building solid debugging skills.
  • Metzger’s book is the most detailed and specific about the various mental tools that one can use to attack software debugging problems.

I’m interested to hear feedback. Also, let me know if you have a favorite book that I left out (again, I’m most interested in language- and system-independent debugging).

Around Zion National Park

Zion NP makes a great destination for a quick weekend trip in winter (and not just for people living nearby—Zion is only about a 2.5 hour drive from Las Vegas). Since the main canyon is 800 m deep, some parts of it don’t get much sun in winter. On the other hand, the east side of the park has some nice south-facing drainages that typically warm up nicely on a sunny day in January or February.

Undefined Behavior Executed by Coq

I built a version of OCaml with some instrumentation for reporting errors in using the C language’s integers. Then I used that OCaml to build the Coq proof assistant. Here’s what happens when we start Coq:

[regehr@gamow ~]$ ~/z/coq/bin/coqtop
intern.c:617:10: runtime error: left shift of 255 by 56 places cannot be represented in type 'intnat' (aka 'long')
intern.c:617:10: runtime error: left shift of negative value -1
intern.c:167:13: runtime error: left shift of 255 by 56 places cannot be represented in type 'intnat' (aka 'long')
intern.c:167:13: runtime error: left shift of negative value -1
intern.c:173:13: runtime error: left shift of 234 by 56 places cannot be represented in type 'intnat' (aka 'long')
intern.c:173:13: runtime error: left shift of negative value -22
intern.c:173:13: runtime error: left shift of negative value -363571240
interp.c:978:43: runtime error: left shift of 2 by 62 places cannot be represented in type 'long'
interp.c:1016:19: runtime error: left shift of negative value -1
interp.c:1016:12: runtime error: signed integer overflow: -9223372036854775807 + -2 cannot be represented in type 'long'
interp.c:936:14: runtime error: left shift of negative value -1
ints.c:721:48: runtime error: left shift of 1 by 63 places cannot be represented in type 'intnat' (aka 'long')
ints.c:674:48: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long'
compare.c:307:10: runtime error: left shift of 1 by 63 places cannot be represented in type 'intnat' (aka 'long')
str.c:96:23: runtime error: left shift of negative value -1
compare.c:275:12: runtime error: left shift of negative value -1
interp.c:967:14: runtime error: left shift of negative value -427387904
interp.c:957:14: runtime error: left shift of negative value -4611686018
interp.c:949:14: runtime error: signed integer overflow: 31898978766825602 * 65599 cannot be represented in type 'long'
interp.c:949:14: runtime error: left shift of 8059027795813332990 by 1 places cannot be represented in type 'intnat' (aka 'long')
Welcome to Coq 8.4pl1 (February 2013)
unixsupport.c:257:20: runtime error: left shift of negative value -1

Coq < 

This output means that Coq---via OCaml---is executing a number of C's undefined behaviors before it even asks for any input from the user. The problem with undefined behaviors is that, according to the standard, they destroy the meaning of the program that executes them. We do not wish for Coq's meaning to be destroyed because a proof in Coq is widely considered to be a reliable indicator that the result is correct. (Also, see the update at the bottom of this post: the standalone verifier coqchk has the same problems.)

In principle all undefined behaviors are equally bad. In practice, some of them might only land us in purgatory ("Pointers that do not point into, or just beyond, the same array object are subtracted") whereas others (store to out-of-bounds array element) place us squarely into the ninth circle. To which category do the undefined behaviors above belong? To the best of my knowledge, the left shift problems are, at the moment, benign. What I mean by "benign" is that all compilers that I know of will take a technically undefined construct such as 0xffff << 16 (on a machine with 32-bit, two's complement integers) and compile it as if the arguments were unsigned, giving the intuitive result of 0xffff0000. This compilation strategy could change.

If we forget about the shift errors, we are still left with three signed integer overflows:

  • -9223372036854775807 + -2
  • -9223372036854775808 - 1
  • 31898978766825602 * 65599

Modern C compilers are known to exploit the undefinedness of such operations in order to generate more efficient code. Here's an example where two commonly-used C compilers evaluate (INT_MAX+1) > INT_MAX to both 0 and 1 in the same program, at the same optimization level:

#include <stdio.h>
#include <limits.h>

int foo (int x) {
  return (x+1) > x;
}

int main (void)
{
  printf ("%d\n", (INT_MAX+1) > INT_MAX);
  printf ("%d\n", foo(INT_MAX));
  return 0;
}

$ gcc -w -O2 overflow.c
$ ./a.out 
0
1
$ clang -O2 overflow.c
$ ./a.out 
0
1

Here's a longish explanation of the reasoning that goes into this kind of behavior. One tricky thing is that the effects of integer undefined behaviors can even affect statements that precede the undefined behavior.

Realistically, what kinds of consequences can we expect from signed integer overflows that do not map to trapping instructions, using today's compilers? Basically, as the example shows, we can expect inconsistent results from the overflowing operations. In principle a compiler could do something worse than this---such as deleting the entire statement or function which contains the undefined behavior---and I would not be too surprised to see that kind of thing happen in the future. But for now, we might reasonably hope that the effects are limited to returning wrong answers.

Now we come to the important question: Is Coq's validity threatened? The short answer is that this seems unlikely. The long answer requires a bit more work.

What do I mean by "Coq's validity threatened"? Of course I am not referring to its mathematical foundations. Rather, I am asking about the possibility that the instance of Coq that is running on my machine (and similar ones running on similar machines) may produce an incorrect result because a C compiler was given a license to kill.

Let's look at the chain of events that would be required for Coq to return an invalid result such as claiming that a theorem was proved when in fact it was not. First, it is not necessarily the case that the compiler is bright enough to exploit the undefined integer overflow and return an unexpected result. Second, an incorrect result, once produced, might never escape from the OCaml runtime. For example, maybe the overflow is in a computation that decides whether it's time to run the garbage collector, and the worst that can happen is that the error causes us to spend too much time in the GC. On the other hand, a wrong value may in fact propagate into Coq.

Let's look at the overflows in a bit more detail. The first, at line 1016 of interp.c, is in the implementation of the OFFSETINT instruction which adds a value to the accumulator. The second overflow, at line 674 of ints.c, is in a short function called caml_nativeint_sub(), which I assume performs subtraction of two machine-native integers. The third overflow, at line 949 of interp.c, is in the implementation of the MULINT instruction which, as far as I can tell, pops a value from the stack and multiplies it by the accumulator. All three of these overflows fit into a pattern I've seen many times where a higher-level language implementation uses C's signed math operators with insufficient precondition checks. In general, the intent is not to expose C's undefined semantics to programs in the higher-level language, but of course that is what happens sometimes. If any of these overflows is exploited by the compiler and returns strange results, OCaml will indeed misbehave. Thus, at present the correctness of OCaml programs such as Coq relies on a couple of things. First, we're hoping the compiler is not smart enough to exploit these overflows. Second, if a compiler is smart enough, we're counting on the fact that the resulting errors will be egregious enough that they will be noticed. That is, they won't just break Coq in subtle ways.

If these bugs are in the OCaml implementation, why am I picking on Coq? Because it makes a good example. The Coq developers have gone to significant trouble to create a system with a small trusted computing base, in order to approximate as nearly as possible the ideal of an unassailable system for producing and checking mathematical proofs. This example shows how even such a careful design might go awry if its chain of assumptions contains a weak link.

These results were obtained using OCaml 3.12.1 on a 64-bit Linux machine. The behavior of the latest OCaml from SVN is basically the same. In fact, running OCaml's test suite reveals signed overflows at 17 distinct locations, not just the three shown above; additionally, there is a division by zero and a shift by -1 bit positions. My opinion is that a solid audit of OCaml's integer operations should be performed, followed by some hopefully minor tweaking to avoid these undefined behaviors, followed by aggressive stress testing of the implementation when compiled with Clang's integer overflow checks.

UPDATE: psnively on twitter suggested something that I should have done originally, which is to look at Coqchk, the standalone proof verifier. Here's the output:

[regehr@gamow coq-8.4pl1]$ ./bin/coqchk
intern.c:617:10: runtime error: left shift of 255 by 56 places cannot be represented in type 'intnat' (aka 'long')
intern.c:617:10: runtime error: left shift of negative value -1
intern.c:167:13: runtime error: left shift of 255 by 56 places cannot be represented in type 'intnat' (aka 'long')
intern.c:167:13: runtime error: left shift of negative value -1
intern.c:173:13: runtime error: left shift of 234 by 56 places cannot be represented in type 'intnat' (aka 'long')
intern.c:173:13: runtime error: left shift of negative value -22
intern.c:173:13: runtime error: left shift of negative value -363571240
interp.c:978:43: runtime error: left shift of 2 by 62 places cannot be represented in type 'long'
interp.c:1016:19: runtime error: left shift of negative value -1
interp.c:1016:12: runtime error: signed integer overflow: -9223372036854775807 + -2 cannot be represented in type 'long'
interp.c:936:14: runtime error: left shift of negative value -1
ints.c:721:48: runtime error: left shift of 1 by 63 places cannot be represented in type 'intnat' (aka 'long')
ints.c:674:48: runtime error: signed integer overflow: -9223372036854775808 - 1 cannot be represented in type 'long'
compare.c:307:10: runtime error: left shift of 1 by 63 places cannot be represented in type 'intnat' (aka 'long')
Welcome to Coq 8.4pl1 (February 2013)
compare.c:275:12: runtime error: left shift of negative value -1

Ordered list:
  
Modules were successfully checked
[regehr@gamow coq-8.4pl1]$ 

Catching Integer Errors with Clang

Peng Li and I at Utah, along with our collaborators Will Dietz and Vikram Adve at UIUC, wrote an integer overflow checker for Clang which has found problems in most C/C++ codes that we have looked at. Do you remember how pervasive memory safety errors were before Valgrind came out? Integer overflows are that way right now.

An exciting recent development is that due to a ton of work done by Will, our checker is now in the Clang trunk. Taking a cue from the excellent address sanitizer, it is called the integer sanitizer. To use it, check out and build the latest Clang (you must also build Compiler-RT) and then compile your code using the -fsanitize=integer or -fsanitize=undefined option. The former will tell you about well-defined but possibly erroneous unsigned overflows, whereas the latter will only tell you about undefined behaviors (including some non-integer-related ones). For more details on these options, see Clang’s documentation for its code generation options. The integer sanitizer is not yet in a released version of Clang, but we expect it to be part of the 3.3 release.

One thing we realized very early on in this work is that integer overflows are surprisingly difficult to understand, particularly when they occur in the middle of complex expressions. For example, see this somewhat undignified interaction between me and the main PHP guy. As a result, we put a lot of work into emitting good error messages. Some examples follow.

The integer sanitizer not only checks for divide by zero, which is kind of boring, but also for INT_MIN / -1 and INT_MIN % -1. Real codes don’t seem to perform these operations when left alone, but see here.

#include <limits.h>

int main (void) {
  return INT_MIN / -1;
}

Result:

$ clang -fsanitize=integer div.c
$ ./a.out 
div.c:4:18: runtime error: division of -2147483648 by -1 cannot be represented in type 'int'
Floating point exception (core dumped)
$ 

Unsigned overflows are well-defined by C/C++ and are often intentional, particularly in bitsy codes like hash functions and crypto. On the other hand, unintentional unsigned overflows can be bugs, and we can detect them if you want:

int main (void) {
  return 0U - 1;
}

Result:

$ clang -fsanitize=integer unsigned.c 
$ ./a.out 
unsigned.c:2:13: runtime error: unsigned integer overflow: 0 - 1 cannot be represented in type 'unsigned int'
$ clang -fsanitize=undefined unsigned.c 
$ ./a.out 
$ 

Signed integer overflows are undefined by C/C++. Compilers used to provide 2’s complement wraparound for signed overflow, but this is no longer reliable. Therefore, signed overflow should always be avoided. One example commonly seen in real codes is negation of INT_MIN:

#include <limits.h>

int main (void) {
  return -INT_MIN;
}

Result:

$ clang -fsanitize=integer signed.c 
$ ./a.out
signed.c:4:10: runtime error: negation of -2147483648 cannot be represented in type 'int'; cast to an unsigned type to negate this value to itself
$ 

Another class of integer error occurs when the right operand to a shift operator is negative or is not less than the bitwidth of the promoted left operand. But these are kind of boring so let’s look at a more arcane kind of shift error: in C99 and later many kinds of signed left shift have undefined behavior, such as this one:

int main (void) {
  return 0xffff << 16;
}

Result:

$ clang -fsanitize=integer shift1.c
$ ./a.out 
shift1.c:2:17: runtime error: left shift of 65535 by 16 places cannot be represented in type 'int'
$ 

In the more recent versions of C/C++, it is not legal to shift a 1 into, out of, or past the sign bit. See 6.5.7.4 of the C99 standard for more details.

Anyway, I think this hits the high points. The slowdown due to integer checking is generally less than 50%. We believe that we can reduce this, but so far have mainly focused on usability and correctness. Will already did some very nice work which marks the trap handling code as cold.

We would appreciate usability feedback from early adopters. On our TODO list are a few things such as:

  • Perhaps dropping unsigned overflows from the set of checks enabled by -fsanitize=integer (these would be enabled by a separate flag).
  • Compiler directives for suppressing integer sanitizer errors where they are not wanted.
  • Redirecting the error stream to a file or to syslog.
  • Porting over a few additional checks from IOC such as detecting lossy truncations and sign conversions.

Let us know if you would find these to be useful.

How to Fuzz an ADT Implementation

Sometimes the standard libraries don’t meet your requirements. When this happens you might grab an open source b-tree, splay tree, Judy array, or whatever, or you might implement something yourself. This post is about the situation where you have some code for a data structure, you also have some unit tests, but even so you are not yet quite convinced it is trustworthy enough to be included in a larger project. The problem is how to gain confidence in it. Part of the solution is random testing, or fuzzing. Previously I wrote about the general case. This post gives a detailed example for C code, although the techniques are mostly independent of the language. I’ll repeat the exercise later for a Python library.

We’ll start with an open source red-black tree. Although you could look at the original code, it’s going to be more useful to peek into the Github repo that I set up to accompany this post. The baseline tag is the initial checkin after a bit of cleanup. As a side note, this is the first time I’ve tried to use Github to help tell a coding story in a blog post; hopefully it’ll work out.

The contents of this ADT include:

  • red_black_tree.h defines the API for a generic red-black tree; the interface is commented and looks pretty sane as long as we bear in mind that generics aren’t exactly C’s strong point
  • red_black_tree.c is the implementation, again it’s fine at first glance although there isn’t any checkRep() or repOK()
  • test_red_black_tree.c is an interactive test driver: it issues API calls to the red-black tree based on commands from STDIN
  • simple_test.sh contains the single unit test for this red-black tree, it works by passing commands to the test driver

Before writing any code let’s make sure we know what we’re doing here. We just want to answer a simple question:

Are we confident that this module will work properly in production code?

The answer right now is “don’t know yet” since we haven’t even seen the code run. Eventually we’ll either discover some nasty bugs in which case we’ll probably abandon the code, or else we’ll satisfy ourselves that it works as intended.

In some situations we might not need to run the code that we’re reusing. For example, we might perform a detailed code inspection and then conclude that the code is solid. Alternatively, if this red-black tree was included in a heavily-used version of Boost, we might be comfortable using it out of the box, trusting that other users have tested it sufficiently. But that’s not the case here. Finally, some friendly Frama-C user might prove once and for all that the red-black tree is correct, in which case we might feel kind of silly fuzzing it (I wouldn’t, but others might).

Step 0: Run the Unit Tests

The provided unit tests are extremely minimal: just simple_test.sh, which instantiates a red-black tree and does a few operations on it. What can we learn from running this test? All we know is that for this short sequence of operations:

  • The code doesn’t crash.
  • No assertions are violated.
  • The printed output is plausible.

There are two problems. First, we have no quantitative information about the adequacy of simple_test.sh. Second, we haven’t monitored the execution of the red-black tree very closely: it could be doing all sorts of things wrong and we’d never know it. We’ll deal with these concerns one after the other.

Step 1: Get Coverage Information

We’ll start with line coverage, which is usually the most basic kind of coverage that tells us anything useful. The diffs for step 1 have a modified makefile which tells GCC to record coverage information. If we recompile, run the test script, and ask gcov about the coverage of red_black_tree.c, it says that 76% of lines were covered. Looking a bit closer, there are big chunks of code that aren’t covered and one of the main red-black utility functions, RightRotate(), didn’t even get called.

What did we accomplish in step 1? First, we confirmed our suspicion that simple_test.sh is seriously inadequate. Second, we set a baseline for coverage that we had better be able to improve on.

Step 2: Improve Coverage Using a Fuzzer

Fuzzing an ADT implementation is never difficult, but here it’s particularly simple since test_red_black_tree.c already contains most of the code that we need. All we have to do is drive the method calls randomly instead of taking instructions from STDIN. I made a copy of this file called fuzz_red_black_tree.c and made the appropriate changes. The tag is step_2 and you can see the changes here. Running the fuzzer gives 99.6% line coverage of red_black_tree.c, a nice improvement for such a small amount of work.

What did we accomplish in step 2? We almost completely closed the gap between the baseline and maximal amounts of code coverage.

Step 3: Strengthen the Test Oracles

Just covering code isn’t enough: this code could be doing totally crazy stuff and we’d never know it. Here’s an incomplete list of things that we want to know:

  • Are there memory leaks?
  • Are any non-crashing undefined behaviors executed, such as integer errors or accessing memory that is uninitialized or just a little bit out of bounds?
  • Are the red-black tree’s invariants respected?
  • Is the red-black tree functionally correct? In other words, is it adding, deleting, and sorting elements as required?

Questions like these are answered by test oracles. An oracle tells us whether a test case worked or not. Since we’re doing random testing, we require oracles to be automated.

The first question above can be answered by Valgrind. The second question is partially answered using Valgrind and also Clang’s new -fsanitize=integer mode. We could add more tools to the mix but these are a good start. As it happens, neither of them reports any problems during a fuzzing run. This is a great sign that the code we’re dealing with here is pretty high quality. Lots of codes will have problems that are uncovered by these tools.

Checking the red-black invariants takes a bit more work. Traditionally, the function that accomplishes this is called checkRep() or repOK(). Basically any nontrivial data structure needs one of these and here we’re going to have to write it. However, even if this function existed already, we should give it a close look since many sloppy invariant checkers exist. Here are a few bits of advice on invariant checkers:

  • Don’t play games in the invariant checker such as trying to fix things up: just bomb the program with an assertion violation if anything is wrong.
  • Be absolutely sure the invariant checker does not modify the data structure.
  • In the fuzzer, call the invariant checker every time the data structure is at a stable point following a modification. In other words, check the invariants any time they are supposed to hold, but might not.
  • The invariant checker should traverse every piece of memory that has been allocated for the data structure.
  • All nontrivial fields of the data structure should be checked for consistency, even if it seems kind of stupid to do so.
  • It’s easy to accidentally write a vacuous invariant checker. To combat this, during development you can add conditions that are too strong and make sure they fail, and also print a few statistics such as the number of nodes visited. Actually, a better idea (suggested by Alex Groce in a comment) is to break the tree code, not the invariant checker.
  • BE ABSOLUTELY SURE THE INVARIANT CHECKER DOES NOT MODIFY THE DATA STRUCTURE.

Here’s the recursive helper function for checkRep(); the top-level checking function is just below. Running the fuzzer (after modifying it to call the checker) finds no problems.

Finally we’re ready to ask if the red-black tree actually works. Is it possible that it doesn’t work even if it has passed all of our tests so far? Certainly. For example, a logic error could just drop a node while leaving the tree in a consistent shape. This last oracle requires a bit more work than the others because we require a reference implementation for this kind of ADT. In a nice programming language, all sorts of container classes would be sitting around, but this is C. We’ll just hack up a quick and dirty container rather than trying to reuse something; this is simple if we can tolerate operations that have up to O(n log n) running time. The other thing we need to do is modify the fuzzer a bit. First, it must perform operations on the red-black tree and the reference implementation in parallel. Second, it must contain assertions ensuring that the results of these operations are the same. Here are the diffs accomplishing all this. This is all pretty straightforward, but one tricky thing we needed to do is modify the fuzzer to avoid adding multiple nodes with the same key. This is not great as far as fuzzing goes (we want to make sure that the red-black tree properly deals with duplicated keys). However, I could not think of a way to test functions such as treePredecessor() against the reference implementation in the presence of duplicated keys, which prevent the predecessor from being uniquely defined. Anyhow, we’ll fix this later.

What did we accomplish in step 3? We greatly strengthened the test oracles; now we can be sure that a fuzzing run does not violate various C language rules, it does not violate any red-black invariants, and it does not cause the red-black tree to get out of sync with a reference implementation of a sorted container.

Steps 4-9

The hard parts—writing the fuzzer, the checkRep(), and the functional correctness oracle—are finished, but we’re not quite done. At this point I’ll start to move faster and cover each point only briefly.

  • Step 4: Let’s take another look at the coverage information. There are two things we need to address. First, we missed line 591. This could be due to a flaw in the tree code, a flaw in our fuzzer, or we may have simply not run the fuzzer for long enough. The second problem is that a number of lines are executed just once. Let’s try to fix these both by wrapping the entire create/fuzz/destroy sequence in a loop. This is a common idiom seen in fuzzers. The desired effect is achieved: line 591 gets hit many times and also we cover the tree create and destroy code more times. Diffs for step 4.
  • Step 5: The extra loop nest has slowed down the fuzzer; it now takes about 7 seconds to execute even when its output is redirected to /dev/null. Let’s remove the output and just trust it to exit with a useful message when something goes wrong. After these changes it runs in 2.5 seconds with the existing compilation options and 1.2 seconds with optimization turned on. We no longer cover the RBTreePrint() function but we can probably live with that.
  • Step 6: Let’s return to the problem where the fuzzer is never adding elements to the tree that have duplicate keys. A reasonable solution is: for every tree instance, randomly decide if duplicate keys are permitted. If so, we’ll run a slightly weaker checker that doesn’t look at the results of the predecessor and successor queries. This turns out to be super easy.
  • Step 7: Another issue we might consider is what proportion of duplicated keys will lead to the best fuzzing results. This is hard to figure out ahead of time so we’ll just randomize it. Diffs here.
  • Step 8: Up until now we’ve only looked at the keys of tree nodes. We should also check that the info field (a void *) is retained across all tree operations. This requires a quick hack to the reference container and also a few additional assertions in the fuzzer. Again, we have to be careful about duplicate keys. Also in this step we’ll randomize the number of fuzzing steps per tree instance. Diffs here.
  • Step 9: Finally we’ll look at coverage once more and then re-run the fuzzer under valgrind and under the integer sanitizer. As far as we can tell this red-black tree is solid code and we’re going to call this job finished.

Our general strategy during these refinement steps is to keep iterating on the various aspects of the fuzzer until it feels done. Anything that might profitably be randomized should be randomized. We must bear in mind the various ways that bugs can hide from tests with good coverage. We should also remember that while poor code coverage indicates a bad random tester, good coverage does not imply that a fuzzer is any good.

What Have We Missed?

It turns out there are a few obvious things we haven’t yet tested:

  • Our red-black tree supports multiple instances but we have only tested one instance at a time. Thus, we’ll have missed any bugs where the tree code inadvertently refers to global state.
  • We never try negative key values.
  • Although we have fuzzed the API provided by the red-black tree, we haven’t fuzzed the APIs that it uses, such as malloc()/free().

In a real-world fuzzing situation we might do these things or we might simply inspect the code in order to convince ourselves that these factors don’t affect the behavior of the tree code.

Conclusion

This post has two points. First, you should write ADT fuzzers: for a given amount of your time, they are often more effective at finding obscure bugs than hand-written unit tests. Second, fuzzing an ADT is really easy. It took me around 90 minutes to do the work reported here, and it wouldn’t have taken even that long if I hadn’t needed to read up on red-black invariants. Besides being easy, writing a custom fuzzer also gives you a very solid feel for some important characteristics of the code you intend to reuse: Is it fast? Do the method calls perform reasonable error checking? Is the code tightly written? Furthermore, if you do this testing work yourself, as opposed to running someone else’s fuzzer, you get to tailor the fuzzer to emphasize the use cases that are important to your application. Testing is necessarily incomplete, but if you beat the crap out of the subset of the API that you actually intend to use, you can become reasonably confident that it will work in practice. Of course, if your use of the API changes over time, it might be a good idea to revisit the fuzzer.

I learned a couple of things while writing this post. First, Github is more awesome than I knew and I really enjoyed integrating it into a blog post about coding. Second, I probably have a little more data structure OCD than I had realized.

Simplest Program Showing the Difference Between Sequential Consistency and TSO

The effects of weak memory models on programs running on shared-memory multiprocessors can be very difficult to understand. Today I was looking for the simplest program that illustrates the difference between sequentially consistent execution and TSO (the memory model provided by x86 and x86-64). Based on an example from Section 8.2.3.4 of the big Intel reference manual, I came up with this:

#include <pthread.h>
#include <stdio.h>
#include <assert.h>

volatile int x, y, tmp1, tmp2;

void *t0 (void *arg)
{
  x = 1;
  tmp1 = y;
  return 0;
}

void *t1 (void *arg)
{
  y = 1;
  tmp2 = x;
  return 0;
}

int main (void)
{
  while (1) {
    int res;
    pthread_t thread0, thread1;
    x = y = tmp1 = tmp2 = 0;
    res = pthread_create (&thread0, NULL, t0, NULL);
    assert (res==0);
    res = pthread_create (&thread1, NULL, t1, NULL);
    assert (res==0);
    res = pthread_join (thread0, NULL);
    assert (res==0);
    res = pthread_join (thread1, NULL);
    assert (res==0);
    printf ("%d %d\n", tmp1, tmp2);
    if (tmp1==0 && tmp2==0) break;
  }
  return 0;
}

It is easy to see that this program cannot terminate on a sequentially consistent multiprocessor (assuming that all system calls succeed): if both loads returned 0, then each load must have executed before the other thread’s store; combined with program order (each store precedes its own thread’s load) this forms a cycle, which no interleaving allows. On the other hand, the program can and does terminate under TSO. Because individual processors are sequentially consistent, the program will not terminate on a multicore Linux machine if you pin it to a single core:

$ gcc -O tso.c -o tso -pthread
$ taskset -c 1 ./tso

It’s easy to “fix” this code so that it does not terminate in the multicore case by adding one memory fence per thread. In GCC on x86 and x86-64 a memory fence is:

asm volatile ("mfence");

In some cases this version is preferable since it also stops some kinds of reordering done by the compiler:

asm volatile ("mfence" ::: "memory");

The stronger version should be unnecessary here: since the relevant reads and writes are to volatile-qualified objects, the compiler is already not allowed to reorder them across sequence points.

Thinking abstractly about this stuff always messes with my head; it’s better to just write some code. I also recommend Section 8.2 of the Intel manual; it contains a lot of excellent material.

Use of Goto in Systems Code

The goto wars of the 1960s and 1970s are long over, and goto is dead—except that it isn’t. It has a number of legitimate uses in well-structured code including an idiom seen in systems code where a failure partway through a sequence of stateful operations necessitates unwinding the operations that have already completed. For example:

int init_device (void) {
  if (allocate_memory() != SUCCESS) goto out1;
  if (setup_interrupts() != SUCCESS) goto out2;
  if (setup_registers() != SUCCESS) goto out3;
  ... more logic ...
  return SUCCESS;
 out3:
  teardown_interrupts();
 out2: 
  free_memory();
 out1:
  return ERROR;
}

Is goto necessary to make this code work? Certainly not. We can write this instead:

int init_device (void) {
  if (allocate_memory() == SUCCESS) {
    if (setup_interrupts() == SUCCESS) {
      if (setup_registers() == SUCCESS) {
        ... more logic ...
        return SUCCESS;
      }
      teardown_interrupts();
    } 
    free_memory();
  }
  return ERROR;
}

And in fact a decent compiler will turn both of these into the same object code, or close enough. Even so, many people, including me, prefer the goto version, perhaps because it doesn’t result in as much unsightly indentation of the central part of the function.

Tonight’s mainline Linux kernel contains about 100,000 instances of the keyword “goto”. Here’s a nice clean goto chain of depth 10.

Here are the goto targets that appear more than 200 times:

out (23228 times)
error (4240 times)
err (4184 times)
fail (3250 times)
done (3179 times)
exit (1825 times)
bail (1539 times)
out_unlock (1219 times)
err_out (1165 times)
out_free (1053 times)
nla_put_failure (929 times)
failed (849 times)
out_err (841 times)
unlock (831 times)
cleanup (713 times)
drop (535 times)
retry (533 times)
again (486 times)
end (469 times)
bad (454 times)
errout (376 times)
err1 (362 times)
found (362 times)
error_ret (331 times)
error_out (276 times)
err2 (271 times)
fail1 (264 times)
err_free (262 times)
next (260 times)
out1 (242 times)
leave (240 times)
abort (228 times)
restart (224 times)
badframe (221 times)
out2 (218 times)
error0 (208 times)
fail2 (208 times)

“goto out;” is indeed a classic. This kind of code has been on my mind lately since I’m trying to teach my operating systems class about how in kernel code, you can’t just bail out and expect someone else to clean up the mess.