Proposal to Stress-Test Implementations of C++11 Concurrency

I recently submitted a proposal to NSF’s “small” Computer Systems Research solicitation. The rest of this post contains the body of the proposal, edited a bit for formatting (citations are missing and there are no doubt some leftover latex-isms) and to remove some sections that are tedious to people not sitting on an NSF panel. If you’re curious, here’s the entire proposal in the form that the panel will see. I’m posting this material in hopes of getting useful feedback and also because I think requests for public money should be open. I wrote a bit more about this earlier.

Motivation and Overview of the Proposed Work

Many large and important concurrent software systems are written partly or entirely in C++98:[0. “C++98” refers to the 1998 version of the C++ standard. It is the dialect in which nearly all contemporary C++ code is written. The recently approved C++11 standard has not yet been entirely implemented by any compiler we are aware of.] [0. (link)]

  • Apple’s Mac OS X
  • Adobe Illustrator
  • Facebook
  • Google’s Chrome browser
  • Microsoft Windows 7 and Internet Explorer
  • Firefox
  • MySQL

Similarly, many large and important concurrent software systems are implemented in C, including almost every operating system kernel and hypervisor, most Java virtual machines and many other programming language runtimes, and many embedded systems including those that control critical transportation and medical care systems.

It is ironic, then, that these languages–in which massive amounts of mission- and safety-critical concurrent code are written–have no built-in support for concurrent execution. This omission can cause problems for software development efforts. First, the compiler–not designed for concurrency–can introduce race conditions into programs that appear correct at the source level. Second, consider this code which implements a two-process mutual exclusion scheme using the well-known Dekker algorithm:

volatile int flag[2], turn;

void dekker_unlock (int i)
  turn = 1-i;
  flag[i] = 0;
}
void dekker_lock (int i) {
  flag[i] = 1;
  while (flag[1-i]) {
    if (turn != i) {
      flag[i] = 0;
      while (turn != i) { }
      flag[i] = 1;
    }
  }
}

This code can be proved to be correct (that is, it provides progress, mutual exclusion, and bounded wait) as long as operations on volatile global variables are sequentially consistent.[0. According to Lamport, sequential consistency requires that “the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”] Therefore, this code works on uniprocessor machines (of course, a spinlock is not efficient on a uniprocessor, but at least it is correct). On the other hand, on most multiprocessor machines this Dekker implementation is incorrect: it fails to provide mutual exclusion. This is even the case on machines with a relatively strong memory model such as multicore Intel and AMD processors implementing the x86 and x86-64 architectures.

To fix the Dekker code, processor-specific memory fence operations must be added. Fences enforce ordering constraints on memory operations that would otherwise be reordered by hardware. Because memory system and fence semantics vary, code with fences is not portable, and also it can be extremely difficult to figure out where fences are required. If one errs on the side of adding too many fences, performance will suffer. If one errs on the side of adding too few, the Dekker code will fail to properly implement mutual exclusion.

Of course, the vast majority of programmers are not interested in writing platform-specific code for a weak memory model. The alternative is to target a specific concurrency library such as POSIX threads or Win32 threads. However, libraries are typically not universally available; POSIX threads are generally not found on non-UNIX platforms and Win32 is not found outside of Windows machines. Additionally, libraries such as POSIX threads fail to provide direct access to low-level hardware operations such as compare-and-swap instructions, preventing certain high-performance idioms from being expressed portably.

The C++11 Memory Model

The recently-adopted C++11 standard was designed to fix the problems discussed in the previous subsection. First, it standardizes C++ concurrency across platforms. The standard formalizes the notion of data race and prohibits compilers from introducing races. Second, it provides an extensive suite of powerful, low-level operations that are intended to make it possible for portable code to attain very high performance. It is believed that the next version of the C standard (the current draft is referred to as “C1X”) will incorporate a memory model very similar to the C++11 concurrent memory model.

Problem: Correctness of Implementations of the C++11 Memory Model

The problem that the proposed work aims to solve is:

  1. The C++11 memory model is large and complicated, containing dozens of classes and templates, and hundreds of functions. These interact closely with OS-specific synchronization calls and hardware-specific synchronization instructions.
  2. Today’s C++98 compilers will serve as the basis for all known C++11 implementations. These tools–which are large, complicated, and contain highly aggressive optimizations–were developed in the absence of any non-trivial memory model.
  3. Compiler optimizations and the C++11 memory model are intimately related. For example, some transformations that were previously permissible are now illegal. Although the full extent of the interactions between existing compiler implementations and the new memory model is not yet totally understood, it is clear that a lot of work will be required to iron out the details. Subtle bugs are likely to persist for some time.

We have talked to compiler developers who work on LLVM, GCC, and on several important commercial compilers. None of these developers expects that implementation of the C++11 memory model will be quick or easy. In fact, they all expressed concern about interactions between existing optimization passes and constraints imposed by the new model.

Intellectual Merit — Major Research Challenges and Solution Sketch

In a nutshell, the proposed work is to randomly generate concurrent C++11 programs, compile them, and then run them on an infrastructure designed to reveal concurrency errors. This is a logical extension of our previous work that has resulted in more than 400 previously unknown compiler bugs being identified, reported, and usually fixed. However, there are several significant challenges that must be overcome.

Challenge #1: Generating Valid Code

The major challenge in creating Csmith was to generate code that is expressive and well-defined. Expressive code exercises a large subset of language features, and well-defined code avoids executing any of the 191 undefined behaviors from the C standard (integer overflow, array and pointer errors, union errors, divide by zero, shift past bitwidth, etc.). In C++11, data races are an undefined behavior. The definition of data race is intricate, but the core idea is to determine if every access to each shared data item is sufficiently synchronized with respect to all other accesses to that item. Generating random, race-free programs is expected to be significantly non-trivial in the presence of arrays and pointers. Our solution will exploit the dataflow analysis that is already a part of Csmith. Essentially, we will implement a static race checker that runs in tandem with program generation.

Challenge #2: Generating Interesting Invariants

Automated random testing can succeed only when useful invariants are identified in the system under test. Our previous work has exploited three invariants. First, our testing of the volatile qualifier made use of the volatile invariant, which is implied by the C and C++ standards. It specifies that the number of load and store operations issued to each byte of a volatile-qualified object must not change across compilers or optimization options. Second, our wrong-code bug hunting using Csmith relied on the invariant that for a well-defined test program, a checksum taken over the global variables at the end of execution must not change across compilers or optimization options. Third, randomized compiler testing relies on the simple invariant that the compiler should not crash when presented with valid input.

Randomly testing implementations of the C++ memory model will only be possible if we can generate code with interesting concurrent invariants. Of course, concurrent programs written by humans have invariants (e.g., “the empty flag is set iff the list is empty”). We plan to use ideas from invariant-based program synthesis to create random programs with non-trivial invariants. A useful consequence of the invariant-based approach to compiler testing is that Csmith’s test cases will become self-checking–compiler bugs will cause them to fail with an assertion violation. In contrast, our compiler testing work up to this point has used differential testing where multiple compilers (or multiple optimization levels of a single compiler) were tested against each other.

Challenge #3: Exploring the Scheduling Space

The space of thread interleavings is large, and finding bugs in concurrent programs boils down to exploring as much of this space as possible. Our compiler stress testing work will explore three approaches. First, we hypothesize that early progress can be made using simple mechanisms to introduce scheduling noise, for example by introducing randomized nop instructions or sleep calls into a compiled program using binary rewriting, and by using a second machine to send randomized network interrupts to the machine that is executing test cases. Second, there is much existing work on designing runtime systems that attempt to make concurrency errors show themselves; we will reuse this work to the maximum extent possible. Third, we will explore the idea of implementing a hostile x86-64 simulator. Although we believe that it is probably not feasible to compute a pessimal schedule for a program, hostility is an attainable goal. It will be implemented using techniques such as triggering context switches near and inside (inferred) critical sections and delaying inter-processor communication as long as possible, in effect replacing the CPU’s write buffer with a “wrong buffer.”[0. The term “wrong buffer” appears to have originated with system developers at ARM; we know of no appearances of this term in print.]

The thesis of this proposal: Randomized stress testing of implementations of the C++11 memory model will significantly shorten the period of time during which compilers are flaky and immature. In the long run, randomized stress testing will lead to a better understanding of corner-case behaviors of the C++11 memory model and to test suites with improved coverage of interactions between the C++11 memory model, compiler optimizers, OS thread systems, and processor memory coherence mechanisms. The intellectual work required to produce concurrent test cases is a major increment upon our existing work on randomized test case generation which has had great success in finding compiler bugs, but which is limited to sequential code.

The State of the Art is Inadequate

Existing methods for testing compiler implementations break down into three broad methods:

  1. Compiling large, real applications. For example, a significant milestone in the development of any compiler is when it becomes self-hosting (can compile itself).
  2. Compiling collections of small, troublesome codes (e.g., GCC’s “torture test,” which contains about 2800 programs).
  3. Randomized testing. Our Csmith project is one example, but the history of this technique goes back about 50 years.

All three methods are useful and necessary. Compiler developers we have talked to, who are working on C++11 implementations, plan to use methods 1 and 2. We do not believe these methods will be sufficient to rapidly squash all of the relevant bugs (nor do the compiler developers appear to believe this). Consequently, we claim the proposed work is useful and important.

Why Testing Instead of Verification?

CompCert is an existence proof for a verified, usable, optimizing compiler for a realistic programming language. One might ask: Why should a testing proposal be funded? Why not be more ambitious and create a proved-correct C++ compiler? There are several reasons:

  • Testing is end-to-end, whereas formal verification typically addresses a narrow layer of the software stack. For example, in addition to finding compiler bugs, our compiler testing work has found linker bugs, assembler bugs, runtime library bugs, CPU simulator bugs, and even bugs caused by the fact that the compiler had itself been miscompiled. Formal verification cannot find these errors without specifically verifying the linker, assembler, runtime library, simulator, and the compiler that compiled the compiler under test. While these are all laudable goals, the combined effort to do these verifications is gigantic.
  • Even verified compilers require testing. For example, we have discovered and reported 13 CompCert bugs. The core issue is that creating a machine-checked mathematical proof is one thing, but proving the right thingis a separate and very hard problem. Typically it is not possible to once and for all “prove that one has proved the right thing.”
  • The C++11 language is dramatically more complex than C. CompCert does not even compile all of C, but rather only a useful subset. Creating a verified, optimizing compiler for a broadly useful subset of C++11 would be a massive undertaking, requiring (we guess) more than a few person-decades of effort.

Summary: A verified C++11 compiler is out of reach for now. But even if we had this compiler, we would still need to stress test it.

Specific Outcomes

The proposed work, if successful, will have several useful outcomes. First, and most obviously, we will create tools for stress-testing implementations of the C++11 concurrency model. These tools will be open-source, and consequently will benefit every team that is working on a C++11 implementation (in addition to transitively benefiting users of those compilers and users of systems built by those compilers). Second, as a side effect of developing and validating our tools, we will report bugs against open source compilers (GCC and LLVM, in particular) and also we will work with any commercial compiler vendors who seek us out.[0. Although we have built solid relationships with the GCC and LLVM teams, so far we have not had very good luck in building relationships with companies that produce compilers. The problem (we think) is that we represent a poor short-term value proposition: we generate zero revenue while producing bug reports that suck up valuable engineering resources. Furthermore, being outsiders, we would seem to repesent the potential for causing only bad publicity–to paraphrase Dijkstra, we can find bugs but we cannot certify their absence.] Third, we expect to be able to produce useful “litmus tests”–small test cases that are (empirically) difficult for compilers to translate correctly and that, taken as a group, provide good test coverage of interesting parts of the C++11 concurrency model. Finally, we hope to generate a better understanding of the C++11 concurrency model. By analogy, our Csmith work has (completely automatically) exposed some interesting corner cases in the C programming language, as described here: (link)

The PI’s Prior Work on Compiler Testing

This section briefly describes our work leading up to this proposal. It also explains why we are virtually certain that the proposed work will find bugs in production-quality compilers.

Although the C and C++98 languages do not have a notion of concurrent execution, they do have a simple memory model that is designed to support reliable access to memory-mapped hardware device registers. This model is provided by the volatile qualifier.[0. The rationale for volatile is described here: (link)] A volatile-qualified object must be accessed at the memory system level as it is by the C/C++ abstract machine. In other words, a read from a volatile-qualified variable in C/C++ must result in a load being issued at the machine level–the variable’s value must not be cached in a register. Similarly, a write to a volatile variable in C/C++ must result in a store at the machine level–the compiler must not suppress redundant stores, as an optimizer normally would. The volatile qualifier’s semantics in C++11 are the same as they are in C and C++98.

The PI, while teaching embedded systems courses in the early/mid 2000s, noticed that the compilers being used by students did not always respect the volatile qualifier. This was troubling. In joint work with Eric Eide, he developed a way to test compilers’ implementation of the volatile qualifier. For every compiler we tested–including commercial tools used to compile safety-critical embedded systems–we found errors in its treatment of volatile-qualified variables. Some of these results are reported in our 2008 paper, but since then we have tested quite a few more compilers, with the same result. In summary, no compiler that we tested was able to correctly implement the volatile qualifier, which is basically trivial when compared to the C++11 memory model.

Since volatile-qualified objects are not totally reliable, we developed a workaround where source code is re-written in such a way that every read from a volatile variable is changed into a call to a helper function, and every write to a volatile variable is changed into a different helper function call. We verify by hand that the helper functions perform the necessary memory operations. In a large fraction of our tests (96% of the time) this strategy was successful in working around bugs in compilers’ implementations of volatile. The workaround is successful because compilers tend to be able to perform function calls reliably, even when they cannot access volatiles reliably. Subsequently, we became interested in the 4% of cases where our workaround failed. We hypothesized that these cases were due to “regular” compiler bugs having nothing to do with the volatile qualifier. We began systematically hunting for these bugs.

Since 2008 the PI’s group has found and reported about 400 bugs in widely-used C compilers. These bugs were all found using Csmith, a random C program generator that is a (distant) descendant of our volatile testing tool. Most of the bugs were in the compilers’ middle-ends–the part of a compiler that is independent of both the language being compiled and the architecture being targeted. Despite the fact that most of these 400 bugs have been fixed, we can still crash and find wrong-code bugs in the latest version of all compilers that we routinely test (about six). CompCert is the lone exception. Csmith is open-source software.[0. (link)]

Brief Introduction to the C++11 Memory Model

Although the C++11 standard cannot be freely downloaded, a recent draft can be.[0. (link)] Material about the concurrency model is concentrated in Chapters 1, 29, and 30. However, a “less formal” description by Boehm and a paper by Boehm and Adve are probably better places to begin reading. Additionally, a book about the model is expected to be released in January 2012. On the more formal side, Batty et al. have formalized the C++11 memory model (but not the rest of the language). This kind of activity is essential to understanding the most subtle parts of the model.

Application Programming Interface

The API for the C++11 concurrency model divides into two major parts. The first is a thread support library. It is object oriented and uses the C++ exception system, but otherwise is not very different from Pthreads and other existing thread libraries. C++11 threads are intended to map onto OS-supported threads in a one-to-one fashion. The threads interface is contained in four header files, respectively supporting thread manipulation, mutual exclusion, condition variables, and futures.

A welcome feature of the mutex library is a facility for safely acquiring multiple locks:

std::unique_lock lck1(m1,std::defer_lock);
std::unique_lock lck2(m2,std::defer_lock);
std::unique_lock lck3(m3,std::defer_lock);
lock(lck1,lck2,lck3);
// manipulate shared data ...

The advantage of this interface is that the developer does not have to worry about deadlocking the program by acquiring locks in the wrong order.

The second major component of the C++11 API is an atomic operations library. Atomic operations are synchronization operations that are semantically equivalent to locking an object, reading and/or writing it, and then releasing the lock. The difference is that in some cases, the atomic operation may be significantly more efficient than the explicit locking approach. For example, on x86 and x86-64 processors an atomic increment of an integer value can be compiled into a single instruction.

The most interesting feature of the C++11 atomic API is that it provides portable accesses to weak memory models. The default memory ordering for all atomic operations is sequential consistency (memory_order_seq_cst), meaning that the sequentially consistent operation is part of a global total ordering of memory operations. This is intuitive; for example, the Dekker code from Section 1–which is incorrect on a non-sequentially-consistent multiprocessor–can be fixed in C++11 simply by declaring the synchronization variables as atomic. That is, instead of declaring them as:

volatile int flag[2], turn;

They should be declared as:

std::atomic<int> flag[2], turn;

The atomic template replaces the default integer operations (assignment, increment, etc.) with atomic and sequentially consistent code appropriate for the target platform.

The problem with sequential consistency is that it can be expensive. To support high-performance multiprocessor programming, C++11 defines five more memory orderings: memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, and memory_order_acq_rel.

The relaxed memory ordering is the most permissive: atomic operations using it are still atomic, but no ordering constraints are imposed across threads. This ordering is perhaps not very useful by itself, but C++11 provides explicit fence operations that can be combined with relaxed consistency to build useful and performant concurrent data structures.

Using memory_order_acquire and memory_order_release, high-performance lock acquisition and release primitives can be constructed. An acquire operation is a one-way memory fence that prevents other memory operations from being moved above the acquire; a release operation is a one-way memory fence that prevents other memory operations from being moved below the release. Together, they prevent memory operations from being moved outside of a critical section–since this would defeat the purpose of the critical section. An access with memory_order_acq_rel behaves as both an acquire and a release. The memory_order_consume is similar to memory_order_acquire, but weaker: only values that are dependent on the lock variable (via a dependency mechanism explained in the standard) are ordered by the consume/release sequence.

Obligations on the Developer and Compiler

The fundamental obligation placed upon a developer writing C++11 programs is that she must never permit a data race to occur. Informally, a program is free of data races if it contains adequate synchronization. In more detail, the C++11 standard says:

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other.

The definition of atomic is straightforward–operations on atomic types are always atomic. The definition of conflicting is also straightforward: “Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.” Alas, happens before is more intricate. It basically follows Lamport’s definition but accounts for synchronization that occurs through weak memory-model operations such as acquires and releases.

A data race, like any other undefined behavior in C++, destroys the meaning of the program. Avoiding races is relatively straightforward (if any concurrent programming problem can be called that) as long as the developer sticks with mutexes and sequentially consistent atomics. Static and dynamic race-detection tools are a fairly mature technology and will be able to help once they are adapted to support C++11.

On the other hand, the compiler cannot limit itself to the sequentially consistent case; its obligation to faithfully implement the C++11 standard is severe. The soundness of various compiler optimizations under weak memory models is a matter of current research. The next section explains some of the problems.

What Do C++11 Concurrency Bugs Look Like?

This section uses examples to briefly illustrate what could go wrong when a C++11 compiler translates a concurrent program. The best references about the interactions between optimizers and C++11 are by Boehm and by McKenney et al. (the latter document refers to a port of the C++11 memory model to C, but the discussion applies to C++ as well).

Example #1

Because C++98 and C were specified in the absence of a concurrency model, it is permissible for a compiler to introduce writes to objects that were not written to by the program. For example, consider this struct and accessor function:

struct foo { int a:8; int b:12; char c; } x;
void update_b (void) {
  x.b = 1;
}

It is perfectly legal for a C or C++98 implementation to translate update_b() into code that loads the entire 32-bit word containing x, appropriately modifies the 12 bits corresponding to x.b, and stores the word back into memory. In C++11 this translation is illegal because it may introduce data races with concurrent threads that are accessing field c. The current development version of the GNU C++ compiler, which is in a “bug fix only” state in preparation for the 4.7.0 release, miscompiles this code even if it is given compiler flags prohibiting it from introducing data races.[0. These flags are discussed in the GCC Wiki: (link)] On x86-64 the output is:

_Z8update_bv:
 movl x(%rip), %eax 
 andl $-1048321, %eax 
 orb $1, %ah 
 movl %eax, x(%rip) 
 ret

This code reads the entire struct into a register, uses a bitwise-AND and then a bitwise-OR instruction to modify the bits corresponding to b, and then stores the result back to memory. This is wrong in C++11.

Example #2

McKenney et al. provide an illustrative example concerning constant propagation: one of the simplest, most useful compiler optimizations. Given that x is an atomic variable and it is known to be either 0 or 1, how might this code be optimized?

if (x.load(memory_order_consume)) {
  ...
} else {
  ...
}
y = 42 * x / 13;

On platforms with expensive multiplication and division, the compiler would like to transform this code to:

if (x.load(memory_order_consume)) {
  ...
  y = 3;
} else {
  ...
  y = 0;
}

However, McKenney et al. state that “If the underlying machine preserves control-dependency ordering for writes, this optimization is perfectly legal. If the underlying machine does not preserve control-dependency ordering, then either this optimization must be avoided, a memory fence must be emitted after the load of x, or an artificial data dependency must be manufactured.” If compiler writers miss this subtlety, the generated code will either be suboptimal or wrong.

Details of the Proposed Approach

To recap, the proposed stress-testing work flow is to repeatedly (1) generate a random concurrent C++11 program, (2) compile it using a variety of possibly-buggy C++11 compilers and optimization flags, and (3) run these programs in hostile execution environments that are designed to expose interesting interleavings at the memory system level. We are not proposing to support all of C++11 in Csmith–that needs to be done, but it is well beyond the scope of this proposal. Rather, Csmith will generate C++11 that looks like C code with C++11 concurrency constructs added. We expect this to be sufficient to find many, but not all, bugs in C++11 concurrency implementations.

Generating Data-Race-Free Programs

Csmith already includes sophisticated machinery for generating programs that avoid undefined behaviors. Our PLDI 2011 paper explains this in detail, but the essence of the solution is to interleave static analysis with program generation. First, Csmith generates a statement (e.g., x = *y;) that is a candidate for addition to the current program. Second, Csmith performs a number of static, dataflow-driven safety checks on the candidate statement in the current program context. In this example, x and *y must be compatible types and y must not be null or refer to an out-of-scope memory location. If the safety checks pass, the candidate is committed to the current program; otherwise, the candidate is discarded. Progress is probabilistically guaranteed because Csmith’s random choices always include at least one “easy” alternative that never fails the safety check. Loops and function calls complicate this simple picture, but the basic structure of the solution, which we call “guess and check,” applies.

We will augment Csmith’s existing dataflow analyses to support static data race detection. Plenty of good starting points are available. Innovation may be required to perform race checking in the presence of atomics using C++11’s weak memory orderings. We do not know of a tool that checks for races in weakly ordered C++11, though it seems very likely that these will appear during the next year or two.

Given a race checker in Csmith, we believe that the guess and check approach will be sufficient to generate race-free programs. Even so, creating “good” random programs is as much an art as a science and we expect that additional refinements will be useful. First, accesses to sequentially consistent, atomic variables are always safe–the race detector does not need to be consulted before committing an access to this kind of memory. Second, Csmith’s safety checks can be simplified for data that is not shared between threads. We believe it will be useful to pre-partition variables into unshared ones that can be freely accessed by a single thread, and shared variables that must be synchronized appropriately. We expect unshared data to play an important role in compiler bug detection; for example, unshared data can expose compiler-introduced races like the example presented earlier. Also, code accessing unshared data can trigger optimization passes that may mangle nearby shared data accesses.

Generating Programs with Concurrent Invariants

Currently, just before terminating, a program generated by Csmith takes a checksum of the values held in all global variables and prints it. The checksum is invariant across all platforms making compatible choices for certain implementation-defined parameters (integer width and representation, mainly). Thus, a checksum difference indicates a compiler bug.

In general, the checksum approach will not work for concurrent programs. One approach would be to generate deterministic concurrent programs–where the checksum will again be invariant–but this seems quite restrictive, and will serve as a backup plan in case other proposed approaches do not work. Csmith currently does not track the values of scalar variables; its dataflow analysis will need to be extended to do this. We previously developed cXprop, a value-tracking abstract interpreter for concurrent C code; we do not expect it to be difficult to either repurpose cXprop or else re-implement the same techniques within Csmith. cXprop already knows how to generate assertions corresponding to the invariants that it computes. While we were debugging cXprop, an assertion violation generally signaled a cXprop bug–but of course these assertions can also catch compiler bugs. To ensure that invariants are not violated, Csmith will synthesize a separate thread that asserts that every invariant holds, in an infinite loop.

Thus far, we have discussed using the guess-and-check approach, combined with a concurrency-aware dataflow analysis, to generate concurrent code with invariants. We will also explore an alternative approach where invariants are the starting point for program synthesis, rather than a side effect of their construction. We will exploit a technique developed by Deng et al. for invariant-based synthesis of concurrent programs. This approach shows how to start with a coarse-grained synchronization solution and some invariants, and automatically refine it into a fine-grained solution. Our approach will be very similar: we will begin with randomized invariants and coarse-grained code (implemented using thread-level locks). Then, we will randomly apply transformations that make the solution more fine-grained, while preserving the invariants. First, critical sections will be split into smaller critical sections, then they will be replaced (when possible) with sequentially-consistent operations on atomic types, and finally we will attempt to introduce relaxed memory-model operations and memory barriers.

Finding Invariant Violations

Given an executable test case containing assertions, we need a way to determine if it was compiled correctly. The easiest (but least reliable) way to do this is to run the program and wait for an assertion violation to occur. This is not reliable because it is typically the case that running a concurrent program explores only a vanishingly small part of the space of all possible interleavings among memory operations in its threads. The hardest (but most reliable) way to find invariant violations is to run the compiled code through a formal methods-based tool that finds an execution where an invariant is violated, or else proves that such does not exist. Although a number of research efforts have attacked the problem of program verification on relaxed memory models, all such tools that we are aware of do not scale to even medium-sized codes. The problems that must be solved to make these tools scale are extremely difficult. Thus, while we plan to take advantage of any such tools that become available, we also plan to explore alternatives.

Executing on hardware

It is not difficult to perturb a program’s schedule in order to increase the probability that interesting interleavings are seen. One way is to use binary rewriting to insert randomized nop instructions into the compiled code. The inserted code can also use a random number generator so that delays are parameterized by a seed value. Another method is to cause interrupt handlers to fire on the machine that is doing the testing. This is easy to arrange by having a second machine send it small packets over a network link at short, random intervals. This method has the further benefit of causing substantial memory traffic due to DMA operations initiated by the network interface card.

Alglave et al.’s Litmus tool is based on similar ideas: it randomizes memory locations accessed by threads and also the threads’ CPU affinities. We will explore these techniques in addition to the ones described above. Although there are certainly limits on the effectiveness of techniques like these, we believe they are very much worth trying simply because they are low-cost, easily portable across architectures, and absolutely reliable in the sense that any behavior that is observed is real–there are no false alarms when executing on real hardware.

Yet another approach to executing test cases on real hardware is being explored by the GCC developers.[0. (link)] They plan to execute the code under test in one thread and an assertion checker in a separate thread (as we propose above). They will to single-step through the thread under test (in an automated fashion, using debugger support) running the full assertion check after every instruction. This will ensure that every intermediate memory state is checked. Drawbacks of this approach are that it will incur a significant slowdown and also the act of single-stepping the processor will flush the processor’s pipeline and store barrier–this could easily mask bugs.

Executing in a simulator

Running test codes in a simulator reduces testing throughput, but has other advantages such as being repeatable and making it possible to simulate platforms that are unavailable or inconvenient. Although we do not know of an open-source simulator for ARM, x86, or x86-64 that faithfully implements the relevant memory model, it is possible that one exists or will exist within the next year or two. If not, we do not believe it will be too difficult to add weak memory model support to an existing simulator.

Given a weak memory model simulator, the challenge is to make it actively hostile to the program under test. One approach would be to use state-space exploration to look at many possible interleavings of memory operations. However, we believe this will be too slow to test realistic programs. Rather, it should be possible to focus the simulator’s efforts on weak points in the code under test. By analogy with race-directed software testing, we will explore the idea of causing the simulator to delay a thread until another processor is ready to execute code that touches conflicting memory locations. Similarly, there is significant room for hostility in the implementation of the write buffer, which should act as a “wrong buffer” by delaying flushes to main memory for as long as possible.

Test Case Reduction

To keep the story simple, we have not yet mentioned one problem that we will need to work on: test case reduction–the process of turning a large failure-inducing program into a much smaller one. Reduction is an important part of root-cause analysis for any compiler bug, and in practice it needs to be automated. Although the results are not yet published, automated test case reduction has been an active area of research for us for several years–but only for the case of sequential code.

Automated test case reduction for concurrent codes is more difficult because automated reduction algorithms–generally based on Delta Debugging–require deterministic execution. For example, Choi and Zeller’s work on isolating failure-inducing thread schedules fundamentally depends on a capture/replay mechanism that makes threads execute deterministically. It is not clear how to avoid the need for determinism. It is also not clear how to get efficient deterministic replay on a weak-memory multicore; there are research solutions addressing this problem but they generally require hardware support or are very inefficient; Heydari and Azimi survey this work. In summary, replay on real multiprocessors is effectively a separate research problem and we do not plan to address it.

Determinism in a simulator is trivial. Even so, test case reduction remains challenging. The fundamental difficulty in reducing the size of failure-inducing C/C++ programs is avoiding undefined behaviors. For example, in our experience it is common for a test case reducer to delete the initializer for a variable, resulting in a test case with undefined semantics.

When reducing concurrent C++11 programs the problems are even worse because we must not introduce a data race during reduction.

Two solutions to avoiding undefined behaviors in general, and race conditions in particular, suggest themselves. First, we can use an external race detector to validate test cases during reduction, if one becomes available. Second, since Csmith will already know how to generate race-free (and otherwise well-defined) programs, we can leverage it to also generate smaller versions of the programs. We have already experimented with this approach for reducing sequential test cases in the C language, and it appears to be workable.

What PC-lint is Really For: Enforcing Language Dialects

John Carmack’s piece about using static analyzers is really nice, and he’s right that applying PC-lint to a significant piece of C code results in page after page of warnings. Faced with this problem on a code base I cared about, and not wanting to give up, I first created a config file for PC-lint that suppressed the vast majority of warnings and then, over a few weeks, cleaned up the remaining issues. Once the code base linted clean, I could turn on classes of warnings whenever I had the time and inclination.

As an example, PC-lint supports fine-grained application of stronger typing rules. So one day I might assume that all uses of “bool” or some other enumerated type are segregated from uses of integers. Of course, in any real code base, such an assumption is wrong and a pile of annoying warnings ensues. I’d either fix them or else decide that it wasn’t worth it until some major rearchitecting could be done.

The cool thing was that after a while I realized that I wasn’t really writing C code anymore, but rather a much more strongly typed dialect where I actually had a degree of (justifiable) confidence that certain kinds of errors were not being made. The C compiler, with its near-nonexistent error checking, was acting as a mere code generator. It wasn’t even the case that PC-lint was uncovering a lot of serious bugs in my code (though it did find a few). Rather, it gave me some peace of mind and assured me that what I thought was happening in the code was in fact happening.

Of course, we always write code in dialects, whether we realize it or not. In many cases, language dialects are formalized as coding conventions or standards so a group of people can all read or write the same dialect — this has enormous advantages. The things that I found charming about PC-lint are that (1) it supported a very flexible range of dialects and (2) it quickly and mechanically ensured that my code stayed within the dialect. When writing non-C code (especially Perl) I often wish for a similar dialect checker.

Book Beginnings

If a book starts out just right, I’ll keep going back and rereading the first few sentences long after I’ve finished the book. Here are a few that did that to me.

Fagles’ translation of the Odyssey:

Sing to me of the man, Muse, the man of twists and turns driven time and again off course, once he had plundered the hallowed heights of Troy.

Gravity’s Rainbow:

A screaming comes across the sky. It has happened before, but there is nothing to compare it to now.

Blood Meridian:

See the child. He is pale and thin, he wears a thin and ragged linen shirt. He stokes the scullery fire. Outside lie dark turned fields with rags of snow and darker woods beyond that harbor yet a few last wolves.

I think the last one is my favorite.

Bit Puzzle Results

This is the followup piece to Monday’s bit puzzle. First off, the spec for this problem wasn’t clear what to return when both arguments are the same. But let’s look at Carlos’s comment about where this code would be used:

If you think of a binary number as a sequence of binary tree decisions on the full range of possible numbers, this bit operation gives you the most specific common ancestor on this tree.

This use case makes it seem pretty clear that just returning the argument is the right result. So that’s what I’ve assumed here.

These are the basic approaches to the problem:

  1. Construct the output by iterating over bits.
  2. Compute the result using “standard bit tricks” built on operations found on all common processors.
  3. If the target processor has fast “find first set bit” instructions (e.g. modern x86/x64 has bsf/bsr), these can be applied to the XOR of the arguments, and after that the output can be generated straightforwardly.

Code for approach 1 can be found in my original post. Solutions using approach 2 are exemplified by code from Eddie Kohler, Andrew Hunter, and others:

uint64_t high_common_bits_portable(uint64_t a, uint64_t b)
{
  uint64_t x = a ^ b;
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  x |= x >> 32;
  return (a & ~x) | (x & ~(x >> 1));
}

uint64_t low_common_bits_portable(uint64_t a, uint64_t b)
{
  uint64_t x = (a ^ b) & -(a ^ b);
  return (a & (x - 1)) | x;
}

The first one is really nice.

Here is code implementing approach 3:

uint64_t high_common_bits_intrinsic(uint64_t a, uint64_t b)
{
  if (a == b) return a;
  uint64_t bit = 1ULL << 63 - __builtin_clzll(a ^ b);
  return (a & ~(bit - 1)) | bit;
}

uint64_t low_common_bits_intrinsic(uint64_t a, uint64_t b)
{
  if (a == b) return a;
  uint64_t bit = 1ULL << __builtin_ctzll(a ^ b);
  return (a & bit - 1) | bit;
}

These are mostly the code left by Benjamin Kramer, but modified to avoid invoking the builtin function with a zero argument, which the GCC documentation says has undefined behavior. This is the same approach that I used when solving this problem, though Benjamin did a better job with the masking code. This code doesn’t use the actual bsf/bsr instructions but rather GCC intrinsics that are supposed to compile to whatever efficient “find first set bit” primitive exists on the target platform.

Now that we have three implementations of each function, let’s work on testing them. We first need test vectors. It seems clear that making the inputs independently random 64-bit integers is wrong. My approach was to use a random 64-bit number for one of the inputs, and to construct the other input by flipping 0..63 random bits of the first argument. It’s possible that this isn’t great, I don’t know — but it’s not going to matter that much when benchmarking approaches 2 and 3 since they contain almost no data-dependent control.

My approach was to generate 1000 inputs, start a timer, run a function over all of the inputs, then stop the timer. 1000 values seemed reasonable since these arrays fit into the L1 of most any modern machine. Each function runs over the array five times and the fastest result is reported. The code is here.

Here are the results from a Core i7 920 running Linux in 64-bit mode:

[WRONG RESULTS REMOVED — SEE BELOW]

My i7 920 is based on the Nehalem architecture — one off from Intel’s latest. If anyone has a Sandy Bridge processor (Core i7 2600 or 2700, most likely) sitting around, please run the code and send me results, but please be careful to turn off processor power management and take other appropriate timing precautions. I’d also like to see results for Clang, GCC, and ARM’s compiler on a modern ARM processor.

Here are some statically linked executables for x86-64 Linux: GCC, Clang, Intel CC.

Finally, thanks, everyone, for the great responses!

UPDATE:

Argh, my first attempt failed to stop the compiler from inlining the bit functions into the test harness. (Thanks for pointing this out, Richard and Chris!) The results were not technically incorrect, but they didn’t correspond to a reasonable use case for these functions. In particular, the Intel compiler was able to use SIMD to massively optimize operations on sequential array elements. After turning off inlining, the results are:

[regehr@gamow code]$ clang -O3 -march=corei7 -mtune=corei7 bitdiff.c -o bitdiff ; ./bitdiff

high slow      : time =  52.7 (ignore this: 10649328621511951360)
high portable  : time =  16.6 (ignore this: 10649328621511951360)
high intrinsic : time =   9.5 (ignore this: 10649328621511951360)
low slow       : time =  53.4 (ignore this: 9949931644094715106)
low portable   : time =   7.0 (ignore this: 9949931644094715106)
low intrinsic  : time =   8.6 (ignore this: 9949931644094715106)

[regehr@gamow code]$ current-gcc -Ofast -march=corei7 -mtune=corei7 bitdiff.c -o bitdiff ; ./bitdiff

high slow      : time =  48.4 (ignore this: 8093869745297464062)
high portable  : time =  15.4 (ignore this: 8093869745297464062)
high intrinsic : time =   9.5 (ignore this: 8093869745297464062)
low slow       : time =  48.0 (ignore this: 8862089369885824930)
low portable   : time =   8.0 (ignore this: 8862089369885824930)
low intrinsic  : time =   8.5 (ignore this: 8862089369885824930)

[regehr@gamow code]$ icc -fast bitdiff.c -o bitdiff ; ./bitdiff

high slow      : time =  51.1 (ignore this: 15670527381357769456)
high portable  : time =  18.0 (ignore this: 15670527381357769456)
high intrinsic : time =  12.3 (ignore this: 15670527381357769456)
low slow       : time =  48.3 (ignore this: 17922617506918019178)
low portable   : time =   9.0 (ignore this: 17922617506918019178)
low intrinsic  : time =   9.9 (ignore this: 17922617506918019178)

Times are in CPU cycles per call. The answer appears to be: regardless of compiler, the intrinsic version for high_common_bits gives about a 50% speedup. For low_common_bits, it doesn’t matter which version you use.

Finally, here’s what Clang gives for the version of low_common_bits using intrinsics:

low_common_bits_intrinsic:
        movq    %rdi, %rax
        cmpq    %rsi, %rax
        je      .LBB5_2
        xorq    %rax, %rsi
        bsfq    %rsi, %rcx
        movl    $1, %edx
        shlq    %cl, %rdx
        leaq    -1(%rdx), %rcx
        andq    %rax, %rcx
        orq     %rdx, %rcx
        movq    %rcx, %rax
  .LBB5_2:
        ret

I think the final movq could be avoided, but otherwise this looks good.

A Bit Puzzle

I love a good bit-level puzzle. Today’s is one I learned about from a comment in a Google+ discussion:

Given two words a and b, return the high bits that they have in common, with the highest bit where they differ set, and all remaining bits clear. So 10101 and 10011 yields 10100. Then do the same thing, only flipped, so we keep low-order identical bits and mask out the high ones.

The problem comes from trie implementation. Here’s code that does it the slow way:

uint64_t high_common_bits_64_slow (uint64_t a, uint64_t b)
{
  uint64_t mask = 0x8000000000000000LLU;
  uint64_t output = 0;
  int i;
  for (i=63; i>=0; i--) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask >>= 1;
  }
 out:
  return output;
}

uint64_t low_common_bits_64_slow (uint64_t a, uint64_t b)
{
  uint64_t mask = 1;
  uint64_t output = 0;
  int i;
  for (i=0; i<64; i++) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask <<= 1;
  }
 out:
  return output;
}

The problem is to do much better than these. To keep things simple, let’s say we’re interested primarily in speed on newish x86-64 implementations. Choose any programming language, or just write assembly. In a few days I’ll post my solutions in addition to any good ones that show up here before then. Beware WordPress’s mangling of source code in the comments — feel free to just mail me.

A Different Approach to System Security

I enjoy it when science fiction has something useful to say about computer security. Towards the end of Iain M. Banks’ Matter, there’s a big space battle and we find this passage:

“Compromised,” Hippinse told him. “Taken over by the other side. Persuaded by a sort of thought-infection.”

“Does that happen a lot, sir?”

“It happens.” Hippinse signed. “Not to Culture ships, as a rule; they write their own individual OS as they grow up, so it’s like every human in a population being slightly different, almost their own individual species despite appearances; bugs can’t spread. The Morthanveld like a degree more central control and predictability in their smart machines. That has its advantages too, but it’s still a potential weakness. This Iln machine seems to have exploited it.”

Monoculture is an obvious and serious danger. For example, about the 2003 Slammer/Sapphire worm:

Propagation speed was Sapphire’s novel feature: in the first minute, the infected population doubled in size every 8.5 (±1) seconds. The worm achieved its full scanning rate (over 55 million scans per second) after approximately three minutes, after which the rate of growth slowed down somewhat because significant portions of the network did not have enough bandwidth to allow it to operate unhindered. Most vulnerable machines were infected within 10-minutes of the worm’s release.

Imagine the damage a similar worm could do today, or in 20 or 100 years, if properly weaponized.

I know there are automated approaches to diversity (ASLR, randomized instruction sets, etc.) but I found “they write their own individual OS as they grow up” to be a very charming idea, perhaps in part because it is so wildly impractical today.

Android Projects Retrospective

The Android Projects class I ran this semester has finished up, with students demoing their projects last week. It was a lot of fun because:

  • I find mobile stuff to be interesting
  • the students were excited about mobile
  • the students were good (and taught me a ton of stuff I didn’t know)
  • the course had 13 people enrolled (I seldom get to teach a class that small)

The structure of the class was simple: I handed each student a Samsung Galaxy 10.1 tab and told them that the goal was for each of them to do something cool by the end of the semester. I didn’t do any real lectures, but rather we used the class periods for discussions, code reviews, and a short whole-class software project (a sudoku app). Over the last several years I’ve become convinced that large lecture classes are a poor way for students to learn, and teaching a not-large, not-lecture class only reinforced that impression. If we believe all the stuff about online education, it could easily be the case that big lecture courses are going to die soon — and that wouldn’t be a bad thing (though I hope I get to keep my job during the transition period).

The course projects were:

  • A sheet music reader, using OpenCV — for the demo it played Yankee Doodle and Jingle Bells for us
  • A tool for creating animated GIFs or sprites
  • A “where’s my car” app (for the mall or airport) using Google maps
  • Finishing and polishing the sudoku app that the whole class did (we had to drop it before completion since I wanted to get going with individual projects)
  • An automated dice roller for Risk battles
  • A generic board game engine
  • A time card app
  • A change counting app, using OpenCV
  • A bluetooth-based joystick; the demo was using it to control Quake 2
  • A cross-platform infrastructure (across Android and iOS) for games written in C++

The total number of projects is less than the number of students since there were a few students who worked in pairs. I can’t easily count lines of code since there was a lot of reuse, but from browsing our SVN repo it’s clear that the students wrote a ton of code, on average.

Overall, my impression that desktops and laptops and dying was reinforced by this course. Not only can tablets do most of what most people want from a computer, but also the synergy between APIs like Google maps and the GPS, camera, and other sensors is obvious and there are a lot of possibilities. Given sufficient compute power, our tabs and phones will end up looking at and listening to everything, all the time. OpenCV’s Android port is pretty rough and it needs a lot of optimization before it will work well in real-time on tabs and phones, but the potential is big.

Android has a lot of neat aspects, but it’s an awful lot of work to create a decent app and it’s depressingly hard to make an app that works across very many different devices. This is expected, I guess — portable code is never easy. The most unexpected result of the class for me was to discover how much I dislike Java — the boilerplate to functional code ratio is huge. This hadn’t seemed like that much of a problem when I used to write standalone Java, and I had a great time hacking up extensions for the Avrora simulator, but somehow interacting with the Android framework (which seems pretty well done) invariably resulted in a lot of painful glue code. The students didn’t seem to mind it that much. It’s possible that I’m exaggerating the problem but I prefer to believe they’re suffering from the Stockholm syndrome.

Pluggable Domain-Specific Testcase Reducers

An important part of effective random testing is providing developers with actionable testcases, and a big part of creating actionable testcases is testcase reduction. Delta debugging, as exemplified by the delta tool, is a good generic reduction technique. However, in many cases, as I discussed the other day here, a domain-specific reducer can do a much better job.

Today’s question is: How to build a domain-specific testcase reduction tool? Of course, one option is to build this kind of tool from scratch, but that’s not very fun. After building a few testcase reducers from scratch, and after watching my students build a few more, I discovered a nice API for isolating the transformation part of a reducer — the main part that is actually domain specific — from the rest of the tool. But first, let’s look at what has already been accomplished in terms of factoring a testcase reducer into reusable parts.

The original delta debugging algorithm, ddmin, mixes all of its concerns together. To reuse ddmin, you reimplement from scratch. The delta tool improves the situation by factoring the “interestingness test” out into an external script. This test decides whether a freshly produced delta variant should be accepted or discarded. This is really useful and I’ve written probably well over a dozen different interestingness tests. However, the transformation part of ddmin remains hard-coded in the delta tool. The only thing it knows how to do is remove one or more contiguous lines from a text file. This is nice but not nearly good enough for demanding reduction tasks.

This post is about factoring out the reduction transformations. This is necessary because you need one set of transformations to reduce C++ code (template expansion, namespace flattening, etc.) and different ones to reduce XML or Java or whatever. Furthermore, even if we consider testcase reduction for a single language, some transformations can be expressed as simple string manipulations — there’s no real need to factor these out of a reduction tool — whereas other transformations (inlining a C function or performing copy propagation) are far more involved. We want to maintain some separation between these compiler-like tools and the program reducer.

The transformation tool API is very simple; it provides a single point of entry:

reduce (filename, command, number)

It works like this:

  • filename refers to the current test case, which the transformation tool modifies in place
  • command specifies which transformation to perform; it doesn’t matter for tools that only perform one kind of transformation
  • number identifies which instance of the transformation to perform
  • the return value of a reduce call is either success, indicating that the transformation succeeded, or failure, indicating that number did not correspond to any transformation

The only trick here is properly enumerating transformation opportunities. For any given test case, it is required that there exists an n such that number in 0 .. n-1 will result in successful transformations and number ≥ n will fail. In my experience, assigning numbers to transformation opportunities is straightforward. For example, consider a transformation that tries to replace a constant in a test case with the value “1”. We could number constants by their order of appearance in the token stream, we could number them by their order of occurrence in a preorder traversal of the AST, or whatever — it doesn’t matter as long as the above property holds.

It is not required that the transformation produces a valid testcase, nor is it required that the transformation always makes the test case smaller. There are not, strictly speaking, any requirements at all on the transformation; it could randomly change the test case or even generate a whole new test case from scratch. On the other hand, in practice, it’s nice if the transformation:

  • is deterministic,
  • for any given input, eventually runs out of transformation opportunities when repeatedly applied, and
  • produces a valid, improved test case reasonably often.

The first property makes testcase reduction repeatable. The second property means that the reducer will eventually terminate when it reaches a fixpoint, as opposed to requiring a timeout or other termination criterion. To return to the example of a transformation that replaces constants with one, we would need to put a special case in the transformer code that prevents it from replacing one with one in order to satisfy this property. In practice it is often the case that a successful transformation decreases the number of possible transformations by one — but it’s easy to find examples where this does not hold (and it doesn’t need to). The third property means that the reducer is reasonably effective and fast.

Other parts of the testcase reducer — its search strategy, interestingness test, and fitness function — do not matter here. Let’s look at an example of plugging the “replace constant with one” transformer into a generic reducer. Let’s pretend we have a compiler that crashes with a math exception when it tries to constant-fold the division by zero. We’ll begin with this unreduced test case:

int foo (void) {
  int x = 33;
  int y = x / 0;
  return y + 66;
}

We invoke the reducer and it first issues the call “reduce (foo.c, REPLACE_WITH_ONE, 0)”. The call succeeds and gives:

int foo (void) {
  int x = 1;
  int y = x / 0;
  return y + 66;
}

This test case is still interesting (it still triggers the compiler crash) and it is smaller than the previous one, so the reducer accepts it as the current test case. The reducer heuristically assumes that the successful transformation eliminated an opportunity; therefore it again issues “reduce (foo.c, REPLACE_WITH_ONE, 0)”. Again the call succeeds and gives:

int foo (void) {
  int x = 1;
  int y = x / 1;
  return y + 66;
}

This test case is not interesting — it will not trigger a divide-by-zero bug in the compiler. This variant is discarded and the reducer now issues “reduce (foo.c, REPLACE_WITH_ONE, 1)”. Recall that constants with value 1 are skipped over when the transformer enumerates its possibilities. The transformation succeeds and the result is:

int foo (void) {
  int x = 1;
  int y = x / 0;
  return y + 1;
}

This is again interesting. However, the reducer’s next call “reduce (foo.c, REPLACE_WITH_ONE, 1)” fails because transformation opportunity 1 does not now exist. To be sure that it has reached a fixpoint, the reducer will need to again test “reduce (foo.c, REPLACE_WITH_ONE, 0)” which again succeeds but leaves an uninteresting testcase. At this point the REPLACE_WITH_ONE transformation cannot make progress. If the reducer has no other transformations to try, it terminates.

It would be fairly straightforward to reimplement the delta tool in this framework; I haven’t bothered yet because delta works perfectly well. Something interesting about the delta tool is that it doesn’t really reach a fixpoint: if you run it twice using the same parameters, the second run sometimes improves the testcase. Basically the authors decided to go for quicker termination at the expense of testcase quality. Another quirk is that it iterates backwards through its list of transformations whereas in the example above, we iterated forwards. Presumably the delta authors observed that backwards iteration is faster; I haven’t done the experiments to confirm this. To support backwards iteration, we’d have to add another call to the API:

get_count (filename, command)

This call returns the total number of possible instances of a transformation there are in a test case. To go backwards we’d start by issuing a reduce call at one less than this number and work down to zero.

A minor variant of the reduce interface that is used by my c_delta tool replaces the number argument with a byte count. The tool then doesn’t even bother searching for transformation opportunities in bytes preceding the specified starting point. Revising c_delta’s regex-based transformations to use numbers instead of byte counts would not be hard but I don’t have plans to do this soon.

The idea of decomposing testcase reduction into a collection of modular parts is one that I’ve explored before. This post takes a detailed look at one aspect of the overall design of testcase reducers: the reduction transformation. The API is simple and (I think) generically useful. I’m embarrassed to say that it took me a few years of dinking around with various testcase reducers before I discovered it. My guess is that it’s not sufficiently complicated or researchy to be publishable, so the writeup here will have to suffice.

Something that I hope comes out of this line of work is that after factoring reducers into reusable parts, we’ll be able to do a better job improving those parts. For example, comments on my previous post about reduction for C code indicated some transformations that I hadn’t thought of. Once we implement these, for example as Clang or Rose modules, then they can be trivially plugged into any testcase reducer supporting this API. Similarly, now that the overall reducer doesn’t need to understand interestingness tests or transformations, we’re free to start evaluating better search strategies, figuring out how to parallelize the reducer, etc.

Help Wanted: If you have a transformation tool for C or C++ that can be modified to fit the interface described in this post, and if it implements some transformation that is useful (or if you are willing to implement one), please leave a comment or drop me a line. I’m currently working on a “best of breed” reduction tool for C/C++ programs and am willing to try out any plausible transformation. I report a lot of compiler bugs and expect this tool to be super useful in practice.

How Integers Should Work (In Systems Programming Languages)

My last post outlined some of the possibilities for integer semantics in programming languages, and asked which option was correct. This post contains my answers. Just to be clear: I want practical solutions, but I’m not very interested by historical issues or backwards compatibility with any existing language, and particularly not with C and C++.

We’ll start with:

Premise 1: Operations on default integer types return the mathematically correct result or else trap.

This is the same premise that as-if infinitely ranged begins with, but we’ll perhaps end up with some different consequences. The things I like about this premise are:

  • It appears to be consistent with the principle of least astonishment.
  • Since it is somewhat harsh — ruling out many semantics such as 2’s complement wraparound, undefined overflow, saturation, etc. — it gives a manageably constrained design space.
  • I just don’t like two’s complement wraparound. It’s never what I want.

Of course there are specialized domains like DSP that want saturation and crypto codes that want wraparound; there’s nothing wrong with having special datatypes to support those semantics. However, we want the type system to make it difficult for those specialized integers to flow into allocation functions, array indices, etc. That sort of thing is the root of a lot of serious bugs in C/C++ applications.

Despite the constraints of premise 1, we have considerable design freedom in:

  • When to trap vs. giving a correct result?
  • Do traps occur early (as soon as an overflow happens) or late (when an overflowed value is used inappropriately)?

For high-level languages, and scripting languages in particular, traps should only occur due to resource exhaustion or when an operation has no sensible result, such as divide by zero. In other words, these language should provide arbitrary precision integers by default.

The tricky case, then, is systems languages — those used for implementing operating systems, virtual machines, embedded systems, etc. It is in these systems where overflows are potentially the most harmful, and also they cannot be finessed by punting to a bignum library. This leads to:

Premise 2: The default integer types in systems languages are fixed size.

In other words, overflow traps will occur. These will raise exceptions that terminate the program if uncaught. The rationale for fixed-size integers is that allocating more storage on demand has unacceptable consequences for time and memory predictability. Additionally, the possibility that allocation may happen at nearly arbitrary program points is extraordinarily inconvenient when implementing the lowest levels of a system — interrupt handlers, locks, schedulers, and similar.

Side note: The Go language provides arbitrary precision math at compile time. This is a good thing.

Overflows in intermediate results are not fun to deal with. For example, check out the first few comments on a PHP bug report I filed a while ago. The code under discussion is:

result = result * 10 + cursor - '0';

It occurs in an atoi() kind of function. This function — like many atoi functions (if you’ve written one in C/C++, please go run it under IOC) — executes an integer overflow while parsing 2147483647. The overflow is a bit subtle; the programmer seems to have wanted to write this code:

result * 10 + (cursor - '0')

However, the code actually evaluates as:

(result * 10 + cursor) - '0'

and this overflows.

How can we design a language where programmers don’t need to worry about transient overflows? It’s easy:

Premise 3: Expression evaluation always returns the mathematically correct result. Overflow can only occur when the result of an expression is stored into a fixed-size integer type.

This is more or less what Mattias Engdegård said in a comment to the previous post. Clearly, eliminating transient overflows is desirable, but how costly is it? I believe that in most cases, mathematical correctness for intermediate results is free. In a minority of cases math will need to be performed at a wider bitwidth, for example promoting 32-bit operands to 64 bits before performing operations. Are there pathological cases that require arbitrary precision? This depends on what operators are available inside the expression evaluation context. If the language supports a builtin factorial operator and we evaluate y = x!; there’s no problem — once the factorial result is larger than can fit into y, we can stop computing and trap. On the other hand, evaluating b = !!(x! % 1000000); is a problem — there’s no obvious shortcut in the general case (unless the optimizer is especially smart). We can sidestep this kind of problem by making factorial into a library function that returns a fixed-width integer.

A consequence of premise 3 that I like is that it reduces the number of program points where math exceptions may be thrown. This makes it easier to debug complex expressions and also gives the optimizer more freedom to rearrange code.

The next question is, should overflows trap early or late? Early trapping — where an exception is thrown as soon as an overflow occurs — is simple and predictable; it may be the right answer for a new language. But let’s briefly explore the design of a late-trapping overflow system for integer math. Late trapping makes integers somewhat analogous to pointers in C/C++/Java where there’s an explicit null value.

We require a new signed integer representation that has a NaN (not-a-number) value. NaN is contagious in the sense that any math operation which inputs a NaN produces NaN as its result. Integer NaNs can originate explicitly (e.g., x = iNaN;) and also implicitly due to uninitialized data or when an operation overflows or otherwise incurs a value loss (e.g., a narrowing type cast).

Integer NaN values propagate silently unless they reach certain program operations (array indexing and pointer arithmetic, mainly) or library functions (especially resource allocation). At that point an exception is thrown.

The new integer representation is identical to two’s complement except that the bit pattern formerly used to represent INT_MIN now represents NaN. This choice has the side effect of eliminating the inconvenient asymmetry where the two’s complement INT_MIN is not equal to -INT_MAX. Processor architectures will have to add a family of integer math instructions that properly propagate NaN values and also a collection of addressing modes that trap when an address register contains a NaN. These should be simple ALU hacks.

A few examples:

int x; // x gets iNaN 
char c = 1000; // c gets iNaN 
cout << iNaN; // prints "iNaN" 
p = &x + iNaN; // throws exception 
x = a[iNaN]; // throws exception

Good aspects of this design are that an explicit “uninitialized” value for integers would be useful (see here for example), execution would be efficient given hardware support, programs would tolerate certain kinds of benign overflows, and the stupid -INT_MAX thing has always bothered me. It’s also possible that a more relaxed overflow system leads to some nice programming idioms. For example, iterating over the full range is awkward in C. But now that we have a sentinel value we can write:

for (x = INT_MIN; x != iNaN; x++) { use (x); }

The main disadvantage of late trapping is that it would sometimes mask errors and make debugging more difficult in the same way that null pointers do. Type system support for “valid integers”, analogous to non-null pointers, would mitigate these problems. On the other hand, perhaps we’d have been better off with early trapping.

Finally, we could design an integer system using the iNaN integer representation and also early trapping. In this system, iNaN would exclusively be used to represent uninitialized or garbage values. Any operation on iNaN other than simple copying would result in a trap.