It’s hard to know when a piece of software has been sufficiently tested. Code coverage metrics are a partial answer. Basically, a coverage metric is a function that maps each execution to a set of coverage targets. For example, if we are measuring function coverage, then the universe of targets comprises the functions in the source code (or object code), and an execution covers every function in which it causes at least one statement or instruction to execute.
Can a bug hide from function coverage? This isn’t hard:
void foo (void) { if (!x) bug(); }
As long as foo() is called with x having a non-zero value, the function will be covered but the bug will not fire. We can fix this particular problem by tightening up the coverage metric, for example by insisting on statement coverage, where each statement in the program is a separate coverage target.
Let’s talk now about branch coverage, which insists that every conditional branch executes both ways. This is similar to statement coverage, but it closes a major hole: with statement coverage you can reach 100% without ever executing the false side of a conditional that has no else branch.
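A minimal illustration (the function and names here are mine, not from any particular codebase): a single test that passes a non-null pointer executes every statement of this function, so statement coverage is satisfied, but branch coverage still demands a test in which the condition is false.

#include <stdio.h>

void log_message (const char *msg) {
  if (msg != NULL)   /* conditional with no else branch */
    puts (msg);      /* one test with a non-null msg executes every
                        statement but never takes the false branch */
}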
Branch coverage is interesting because it’s about the strongest coverage metric that makes sense in practice. We can easily define metrics that have more coverage targets, such as path coverage, but achieving full coverage is extremely difficult, and it forces us to confront hard questions about the feasibility of paths. Here’s a page that provides a quick overview of the common and some not-so-common coverage metrics.
The rest of this piece is a partial list of ways that bugs can hide from branch coverage, and also some discussion of how we might go about finding these bugs.
Bug-Hiding Strategy #1: Concurrency
Hiding a bug using multithreading is really easy. For example:
void called_by_thread_1 () {
  p = NULL;
  /* ... do something quick ... */
  p = q;
  /* ... do something slow ... */
}

int called_by_thread_2 () {
  return *p;
}
Assume that p is an atomic variable and that q is never NULL. We can easily get full branch coverage without testing the case where called_by_thread_2() dereferences a null pointer, and in fact the odds are that this is what will happen. Concurrent programs want specialized coverage metrics; here’s a paper that I like.
Bug-Hiding Strategy #2: Math Overflow
This code looks innocent:
int average (int x, int y) { return (x+y)/2; }
But unfortunately, on a platform with 32-bit ints, average (1000000000, 2000000000) returns -647483648, despite the fact that the actual result is perfectly representable in 32-bit two’s complement. The problem is that the intermediate value x+y overflows. This is undefined behavior in C/C++. In practice we expect wrapping behavior from this kind of overflow, but even wrapping produces the wrong answer here.
Common test coverage metrics, including very difficult ones like path coverage, do not help us find math overflow bugs. One possibility would be to add overflow checks to the source code and then aim for branch coverage, but this is horrible. One could also add overflow checks using the compiler and then aim for branch coverage of the object code; this probably only makes sense in conjunction with test-case generation tools such as Klee.
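As a sketch of the first option (average_checked() and handle_overflow() are my names, not from any real codebase), explicit pre-checks for signed overflow look like this; note that each check adds a branch that a test suite aiming for branch coverage is then forced to exercise both ways:

#include <limits.h>

extern int handle_overflow (void);   /* hypothetical error path */

int average_checked (int x, int y) {
  /* x + y overflows iff one of these conditions holds */
  if ((y > 0 && x > INT_MAX - y) ||
      (y < 0 && x < INT_MIN - y))
    return handle_overflow ();
  return (x + y) / 2;
}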
The real solution to math overflow bugs is systematic documentation and enforcement of constraints on integer values, supported by a tool such as Frama-C. For example, the average() function would be preceded by this annotation:
/*@ requires INT_MIN <= x + y <= INT_MAX; */
Adding x and y inside the annotation does not cause overflow problems because integer operations in the annotation language are the mathematical ones, not the C ones. The problem with writing these annotations is that it is very hard work.
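Putting the pieces together, the annotated function might look roughly like this (a sketch in ACSL, Frama-C’s annotation language; the ensures clause is my addition):

#include <limits.h>

/*@ requires INT_MIN <= x + y <= INT_MAX;
    ensures \result == (x + y) / 2;
*/
int average (int x, int y) {
  return (x + y) / 2;
}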
Bug-Hiding Strategy #3: Loop Error
There is much potential for hiding bugs inside loops:
int factorial (int n) {
  assert (n < 13);
  int f = 1;
  do {
    f *= n;
    n--;
  } while (n > 0);
  return f;
}
This code looks pretty good and, in fact, it returns the correct result for 0<n<13. The assertion protects against integer overflow on 32-bit machines. The remaining interesting case is n=0, for which this function is wrong: it returns 0 instead of 1. Branch coverage doesn’t help find this kind of bug.
Test coverage for loops is tricky; some sources recommend forcing each loop to execute 0 times, 1 time, and more than once, when possible.
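Applying that advice here: rewriting the loop as a while loop permits a zero-iteration execution, and a test with n=0 then distinguishes the two versions (my corrected sketch, not the original code):

int factorial (int n) {
  assert (0 <= n && n < 13);
  int f = 1;
  while (n > 0) {   /* unlike do-while, this can execute zero times */
    f *= n;
    n--;
  }
  return f;         /* factorial(0) now correctly returns 1 */
}

Tests with n equal to 0, 1, and, say, 5 then execute the loop zero times, once, and several times.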
Bug-Hiding Strategy #4: Lookup Table
Certain kinds of codes get a good speedup by precomputing some sub-results and storing these in a lookup table. This is commonly seen for checksums, math library functions, and crypto codes. For example, this AES implementation has several lookup tables. They cannot be verified by casual inspection. If any entry in such a table is incorrectly computed or gets corrupted somehow, the resulting code will be wrong and branch coverage again does not help find the problem. The famous Pentium fdiv bug was precisely this kind of error.
We might insist that every entry in each lookup table is accessed during testing, or we might simply double-check that each table contains the proper values.
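One cheap way to do the latter is a startup self-test that recomputes every entry from first principles and compares it against the table. A sketch for a hypothetical precomputed CRC-32 table (the table name is mine):

#include <stdint.h>
#include <assert.h>

extern const uint32_t crc_table[256];   /* hypothetical precomputed table */

void check_crc_table (void) {
  for (int i = 0; i < 256; i++) {
    uint32_t c = (uint32_t)i;
    /* recompute entry i directly from the CRC-32 polynomial */
    for (int k = 0; k < 8; k++)
      c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : (c >> 1);
    assert (c == crc_table[i]);
  }
}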
Bug-Hiding Strategy #5: Array Index
Code like this is pretty easy to write:
int update (int index, value_t value) {
  if (index > incorrect_table_size)
    return ERROR;
  table[index] = value;
  return OK;
}
Here we’ll need to test the OK and ERROR conditions, but if we’re comparing against the wrong table size, RAM corruption can result from code that received 100% branch coverage. The table size is unlikely to be wrong for a statically sized table, but I’ve seen real programs where a piece of code failed to properly notice when a dynamically sized table changed size.
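For contrast, here is a sketch of the check without these holes, assuming table_size is kept in sync with the actual table; it also rejects negative indexes and treats index == table_size as out of bounds, two further bugs that commenters point out below:

int update (int index, value_t value) {
  if (index < 0 || index >= table_size)
    return ERROR;
  table[index] = value;
  return OK;
}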
Bug-Hiding Strategy #6: Path-Sensitive Conditionals
This code is highly contrived, but it illustrates the problem:
int silly (int a) {
  int *p = &x;
  if (a > 10)
    p = NULL;
  if (a < 20)
    return *p;
  return 0;
}
Assume x is a global int. If we test this code with parameter values of 5 and 25, branch coverage is satisfied, and yet we will not trigger the undefined behavior that happens when a==15. Besides going for path coverage, I don’t know of a good way to find these bugs.
Bug-Hiding Strategy #7: Heap Shape
Tons of problems can hide in heap data structures. For example, assume that we have a red-black tree implementation where the delete() operation leaves the tree in a broken state where subsequent insert() calls will fail. We could easily end up with 100% branch coverage of insert(), delete(), and the other operations without even actually testing the case where a node is deleted and then inserted.
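Concretely, the missing coverage targets are operation sequences. A test along the following lines (a sketch against a hypothetical tree API with an invariant checker; all names are mine) exercises the delete-then-insert interaction that per-operation branch coverage can miss:

#include <assert.h>
#include <stdbool.h>

typedef struct tree tree_t;             /* hypothetical red-black tree */
extern tree_t *tree_new (void);
extern void tree_insert (tree_t *t, int key);
extern void tree_delete (tree_t *t, int key);
extern bool tree_invariants_hold (const tree_t *t);

void test_delete_then_insert (void) {
  tree_t *t = tree_new ();
  tree_insert (t, 1);
  tree_insert (t, 2);
  tree_insert (t, 3);
  tree_delete (t, 2);                   /* the operation under suspicion */
  assert (tree_invariants_hold (t));    /* catch structural damage immediately */
  tree_insert (t, 2);                   /* fails if delete() broke the tree */
  assert (tree_invariants_hold (t));
}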
Conclusions
100% branch coverage is often attainable at the unit level, but almost never at the system level. Regardless, it is insufficient for finding many kinds of bugs (the list here is just off the top of my head; if you have any favorites that I missed, please let me know).
There’s no general-purpose way to find these bugs, but rather a variety of partial solutions. First, we can pay close attention to idioms such as lookup tables that can subvert branch coverage. Second, we can add extra paths to the code in an attempt to make implicit predicates explicit, such as integer overflow checks. Third, we can play with the coverage metric, for example by augmenting branch coverage with additional loop coverage targets. Finally, we can add annotations to our code to express necessary preconditions, postconditions, and loop invariants, and then use a tool to check these.
16 responses to “Hiding Bugs from Branch Coverage”
Or, we can remember Dijkstra’s adage: Testing can never prove the absence of bugs, only the presence.
Hi Joshua, that’s true enough, but it’s only by looking at how the bugs escape that we figure out how to test better.
I’ve seen taxonomies of bugs and haven’t liked them, but thought that maybe focusing things a bit would help.
Hi, thanks, that is a pretty good list from off the top of your head 🙂
I tried to think of some other classes, some ideas are:
== Bug-Hiding Strategy #x: Environment Dependence ==
Perhaps you’ve met your coverage goal – but not under all possible environments. The simplest way this might become a problem is when a program fails to check the return value of a system call. A program may rely on unstated preconditions which are usually true, such as %SystemRoot% being C:\WINDOWS or malloc returning non-NULL. An obvious and trivial version:
p = malloc(sizeof(struct whatever));
p->busy = 0;
…
Despite the apparent triviality, more complex manifestations can be difficult to find. Testing on a variety of platforms or system configurations is a good start; however, I am not aware of a comprehensive strategy for addressing the general case.
The compiler can catch many simple cases statically (e.g. warn_unused_result). Dynamically, one approach would be to force coverage of syscalls and arrange that each fails in turn, from the point of view of the program (a kind of structured fuzzing). Works best for finding crashes though, as more subtle misbehaviour can be hard to detect. Also not comprehensive.
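As a sketch of that dynamic approach (all names here are mine), a test harness might interpose on allocation and fail the Nth call:

#include <stdlib.h>

static int alloc_count;   /* allocations seen so far */
static int fail_at;       /* set by the harness; 0 means never fail */

void *test_malloc (size_t n) {
  if (fail_at != 0 && ++alloc_count == fail_at)
    return NULL;          /* inject a failure at this call */
  return malloc (n);
}

Running the suite repeatedly with fail_at = 1, 2, 3, … then forces each allocation-failure path in turn.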
== Bug-Hiding Strategy #x+1: COME FROM ==
Exception handlers, SEH handlers and interrupt handlers all function like the mythical “COME FROM” statement. Conceptually similar to concurrency issues, but also occurring in single-threaded programs when asynchronous events or nonlocal control transfers occur. An example might be a modified version of #1:
void does_something () {
  p = NULL;
  /* ... do something ... */
  p = q;
  /* ... do something else ... */
}

int handle_unexpected () { /* COME FROM does_something... somewhere... */
  printf ("Handling unexpected, state is: %s\n", p->state);
}
The problem is not so much ensuring coverage of handlers, but ensuring that a program works correctly when interrupted at every possible point. I don’t know of a good way of dealing with these bugs via testing, perhaps considering the multithreaded case would provide some insight.
Tools like Microsoft Pex find all math errors, overflows and array bounds problems that are actually reachable. A classic example is argument validation of a ProcessArraySlice(int[] array, int start, int count) function. A naive range check easily overflows.
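A C rendition of the kind of naive check meant here (my sketch; the names are mine, and Pex itself targets .NET code):

int process_array_slice (const int *array, int len, int start, int count) {
  /* naive: start + count can overflow and wrap negative,
     letting an out-of-range slice pass validation */
  if (start < 0 || count < 0 || start + count > len)
    return -1;
  /* ... process array[start] through array[start + count - 1] ... */
  return 0;
}

The robust form tests count > len - start instead, which cannot overflow once start and count have been checked for negativity.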
I think Pex aims for 100% path coverage which isn’t too bad because the tests are generated and no one has to type them.
That also means that it find collisions in hash tables, of course.
> I think Pex aims for 100% path coverage which isn’t too bad because the tests are generated and no one has to type them.
Well, whether someone types them or not, for large programs the number of tests is simply too big without some restrictions on what you mean by path coverage. Godefroid’s SMART idea of complete interprocedural paths only seems reasonable, I don’t recall what Pex does if anything (other than “not get there for large programs”).
In general all of these can be simply if uselessly categorized as:
branch coverage can be fooled if covering branch b exposes the bug only under predicate p, and p is not always true. I think that covers every single one of these.
“Obviously” the right coverage is to build the full set of interesting p and cover all branches under both truth values for each p. Unfortunately there are a lot of p’s out there.
Has anyone mentioned timing-related bugs? I can think of two cases:
1) the code checks for specific things to do with dates and times
http://apple.slashdot.org/story/13/01/03/2229228/adobe-and-apple-didnt-unit-test-for-forward-date-bugs-do-you
2) asynchronously occurring interrupts (which are, perhaps, based on timers)
Blair, thanks! Both of those are great bug patterns.
tobi, I believe Alex is right that path coverage for non-trivial codes is very much out of reach. But I think one could argue that any individual function which has more paths than you can test is probably poorly written.
Alex, I agree that the predicate view is the right general picture. I started to write about this but then it sort of detracted from the specificity of the examples so I nuked the text!
Hi Trevor, some kinds of timing bugs are covered by #1 above. The other case you mention, dealing with specific dates and times, is a whole ocean of bugs that’s probably worth dealing with separately. I’ve never understood why programs that process dates and times are so difficult.
Total agreement that “add all predicates” is silly. In theory, though, you could subsume some of integer overflow (if you took a wrongheaded testing approach vs. a wise static analysis approach), concurrency, and paths with smart predicate selection.
I buy the “individual function paths” argument which is why I always use path at the single-function level, not between functions. I credit Patrice Godefroid with making this seem obvious. It also simplifies understanding path results.
Some work in progress elsewhere suggests that single-function paths, modified to handle loops, is a useful code coverage measure. There’s also some stuff (not mine) coming up at ICSE suggesting that dataflow coverage may not have much gain on control flow.
I assume that function pointers deserve their own special place in hell. 🙂
Great article, thanks for summarizing this!
Also, just wanted to point out another bug/vulnerability in the “Bug-Hiding Strategy #5” code. It can be called with a negative index 🙂
Thanks for the succinct overview of the limitations of 100% branch coverage testing!
Some additional observations:
(1) Even though 100% branch coverage will not find every bug, it does generally find a great many more bugs than you’ll find without 100% branch coverage testing, or by doing just 95% branch coverage testing.
(2) In my experience, if you use 5% of your budget to get from 0% to 95% branch coverage and find N bugs in the process, then you use the other 95% of your budget and find an additional N bugs in going from 95% coverage up to 100%.
(3) It is not sufficient to merely have your test suite cause every branch to be executed once in each direction. You really need to show that each branch makes a difference in the outcome. In other words, you need to show that removing a branch causes some test to fail. That’s harder, and cannot be automatically measured (or at least not by means known to me). This point needs to be fully understood and appreciated by developers who are doing the branch coverage tests.
(4) C-preprocessor macros are useful for inserting extra branch operations to make sure that you test boundary conditions, and for hiding branch operations that are part of defensive code and ought never be taken in a working system. Examples: http://www.sqlite.org/src/artifact/a6b3f816df7?ln=278-302 and http://www.sqlite.org/src/artifact/a6b3f816df7?ln=232-251
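For readers who don’t follow the links, the idea is roughly this (my paraphrase of SQLite’s testcase() and ALWAYS() macros, not the verbatim source; coverage_hit() stands in for the real instrumentation):

#ifdef COVERAGE_TEST
  /* add an extra branch so both outcomes of boundary condition X
     must be separately covered */
# define testcase(X)  if (X) { coverage_hit (__LINE__); }
  /* defensive check believed always true: compile the branch away
     so it cannot count against coverage */
# define ALWAYS(X)    (1)
#else
# define testcase(X)
# define ALWAYS(X)    (X)
#endif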
Hi Richard, thanks for the input. As you know, you are one of my software testing heroes! I certainly agree that 100% branch coverage is a big thing for most codes. I like your point #3 and need to think about it more.
Richard: is (3) the “elided conditionals” problem?
http://cs.brown.edu/~er/papers/renieris-paste2004.pdf
I’ve always found that paper disturbing…
BTW, I’m looking at some testing results over SQLite right this moment — it’s the program where we found highest correlation between coverage for tests and their ability to kill mutants, for any C program. For a couple of big Java programs the correlation was even higher, though.
In example #5, I instantly spotted two bugs you didn’t mention!
The fact that there’s a test at all implies that the caller may pass arbitrary, possibly invalid index values. However, even if incorrect_table_size is the number of elements in table[], the code doesn’t check for index being negative, nor does it handle the boundary case where index == table_size.