Proposal to Stress-Test Implementations of C++11 Concurrency

I recently submitted a proposal to NSF’s “small” Computer Systems Research solicitation. The rest of this post contains the body of the proposal, edited a bit for formatting (citations are missing and there are no doubt some leftover latex-isms) and to remove some sections that are tedious to people not sitting on an NSF panel. If you’re curious, here’s the entire proposal in the form that the panel will see. I’m posting this material in hopes of getting useful feedback and also because I think requests for public money should be open. I wrote a bit more about this earlier.

Motivation and Overview of the Proposed Work

Many large and important concurrent software systems are written partly or entirely in C++98:[0. “C++98” refers to the 1998 version of the C++ standard. It is the dialect in which nearly all contemporary C++ code is written. The recently approved C++11 standard has not yet been entirely implemented by any compiler we are aware of.] [0. (link)]

  • Apple’s Mac OS X
  • Adobe Illustrator
  • Facebook
  • Google’s Chrome browser
  • Microsoft Windows 7 and Internet Explorer
  • Firefox
  • MySQL

Similarly, many large and important concurrent software systems are implemented in C, including almost every operating system kernel and hypervisor, most Java virtual machines and many other programming language runtimes, and many embedded systems including those that control critical transportation and medical care systems.

It is ironic, then, that these languages–in which massive amounts of mission- and safety-critical concurrent code are written–have no built-in support for concurrent execution. This omission causes real problems for software development. First, the compiler–not designed with concurrency in mind–can introduce race conditions into programs that appear correct at the source level. Second, correctly synchronized code cannot be expressed portably; consider, for example, this code, which implements a two-process mutual exclusion scheme using the well-known Dekker algorithm:

volatile int flag[2], turn;

void dekker_unlock (int i) {
  turn = 1-i;
  flag[i] = 0;
}

void dekker_lock (int i) {
  flag[i] = 1;
  while (flag[1-i]) {
    if (turn != i) {
      flag[i] = 0;
      while (turn != i) { }
      flag[i] = 1;
    }
  }
}

This code can be proved to be correct (that is, it provides progress, mutual exclusion, and bounded wait) as long as operations on volatile global variables are sequentially consistent.[0. According to Lamport, sequential consistency requires that “the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”] Therefore, this code works on uniprocessor machines (of course, a spinlock is not efficient on a uniprocessor, but at least it is correct). On the other hand, on most multiprocessor machines this Dekker implementation is incorrect: it fails to provide mutual exclusion. This is even the case on machines with a relatively strong memory model such as multicore Intel and AMD processors implementing the x86 and x86-64 architectures.

To fix the Dekker code, processor-specific memory fence operations must be added. Fences enforce ordering constraints on memory operations that would otherwise be reordered by hardware. Because memory system and fence semantics vary, code with fences is not portable, and also it can be extremely difficult to figure out where fences are required. If one errs on the side of adding too many fences, performance will suffer. If one errs on the side of adding too few, the Dekker code will fail to properly implement mutual exclusion.

Of course, the vast majority of programmers are not interested in writing platform-specific code for a weak memory model. The alternative is to target a specific concurrency library such as POSIX threads or Win32 threads. However, libraries are typically not universally available; POSIX threads are generally not found on non-UNIX platforms and Win32 is not found outside of Windows machines. Additionally, libraries such as POSIX threads fail to provide direct access to low-level hardware operations such as compare-and-swap instructions, preventing certain high-performance idioms from being expressed portably.

The C++11 Memory Model

The recently-adopted C++11 standard was designed to fix the problems discussed in the previous subsection. First, it standardizes C++ concurrency across platforms. The standard formalizes the notion of data race and prohibits compilers from introducing races. Second, it provides an extensive suite of powerful, low-level operations that are intended to make it possible for portable code to attain very high performance. It is believed that the next version of the C standard (the current draft is referred to as “C1X”) will incorporate a memory model very similar to the C++11 concurrent memory model.

Problem: Correctness of Implementations of the C++11 Memory Model

The problem that the proposed work aims to solve is:

  1. The C++11 memory model is large and complicated, containing dozens of classes and templates, and hundreds of functions. These interact closely with OS-specific synchronization calls and hardware-specific synchronization instructions.
  2. Today’s C++98 compilers will serve as the basis for all known C++11 implementations. These tools–which are large, complicated, and contain highly aggressive optimizations–were developed in the absence of any non-trivial memory model.
  3. Compiler optimizations and the C++11 memory model are intimately related. For example, some transformations that were previously permissible are now illegal. Although the full extent of the interactions between existing compiler implementations and the new memory model is not yet totally understood, it is clear that a lot of work will be required to iron out the details. Subtle bugs are likely to persist for some time.

We have talked to compiler developers who work on LLVM, GCC, and on several important commercial compilers. None of these developers expects that implementation of the C++11 memory model will be quick or easy. In fact, they all expressed concern about interactions between existing optimization passes and constraints imposed by the new model.

Intellectual Merit — Major Research Challenges and Solution Sketch

In a nutshell, the proposed work is to randomly generate concurrent C++11 programs, compile them, and then run them on an infrastructure designed to reveal concurrency errors. This is a logical extension of our previous work that has resulted in more than 400 previously unknown compiler bugs being identified, reported, and usually fixed. However, there are several significant challenges that must be overcome.

Challenge #1: Generating Valid Code

The major challenge in creating Csmith was to generate code that is expressive and well-defined. Expressive code exercises a large subset of language features, and well-defined code avoids executing any of the 191 undefined behaviors from the C standard (integer overflow, array and pointer errors, union errors, divide by zero, shift past bitwidth, etc.). In C++11, data races are an undefined behavior. The definition of data race is intricate, but the core idea is to determine if every access to each shared data item is sufficiently synchronized with respect to all other accesses to that item. Generating random, race-free programs is expected to be significantly non-trivial in the presence of arrays and pointers. Our solution will exploit the dataflow analysis that is already a part of Csmith. Essentially, we will implement a static race checker that runs in tandem with program generation.
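
To make the race-freedom requirement concrete, here is a minimal hand-written sketch (ours, not Csmith output) of the distinction the static race checker must draw:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// If `shared` were a plain int, these two unsynchronized writes would be a
// data race (undefined behavior in C++11); making it atomic, with the
// default sequentially consistent ordering, synchronizes every access, so
// a race checker running alongside program generation would accept both
// statements.
std::atomic<int> shared(0);

void thread0() { shared.store(1); }
void thread1() { shared.store(2); }
```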

Challenge #2: Generating Interesting Invariants

Automated random testing can succeed only when useful invariants are identified in the system under test. Our previous work has exploited three invariants. First, our testing of the volatile qualifier made use of the volatile invariant, which is implied by the C and C++ standards. It specifies that the number of load and store operations issued to each byte of a volatile-qualified object must not change across compilers or optimization options. Second, our wrong-code bug hunting using Csmith relied on the invariant that for a well-defined test program, a checksum taken over the global variables at the end of execution must not change across compilers or optimization options. Third, randomized compiler testing relies on the simple invariant that the compiler should not crash when presented with valid input.

Randomly testing implementations of the C++ memory model will only be possible if we can generate code with interesting concurrent invariants. Of course, concurrent programs written by humans have invariants (e.g., “the empty flag is set iff the list is empty”). We plan to use ideas from invariant-based program synthesis to create random programs with non-trivial invariants. A useful consequence of the invariant-based approach to compiler testing is that Csmith’s test cases will become self-checking–compiler bugs will cause them to fail with an assertion violation. In contrast, our compiler testing work up to this point has used differential testing where multiple compilers (or multiple optimization levels of a single compiler) were tested against each other.

Challenge #3: Exploring the Scheduling Space

The space of thread interleavings is large, and finding bugs in concurrent programs boils down to exploring as much of this space as possible. Our compiler stress testing work will explore three approaches. First, we hypothesize that early progress can be made using simple mechanisms to introduce scheduling noise, for example by introducing randomized nop instructions or sleep calls into a compiled program using binary rewriting, and by using a second machine to send randomized network interrupts to the machine that is executing test cases. Second, there is much existing work on designing runtime systems that attempt to make concurrency errors show themselves; we will reuse this work to the maximum extent possible. Third, we will explore the idea of implementing a hostile x86-64 simulator. Although we believe that it is probably not feasible to compute a pessimal schedule for a program, hostility is an attainable goal. It will be implemented using techniques such as triggering context switches near and inside (inferred) critical sections and delaying inter-processor communication as long as possible, in effect replacing the CPU’s write buffer with a “wrong buffer.”[0. The term “wrong buffer” appears to have originated with system developers at ARM; we know of no appearances of this term in print.]
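
As a source-level illustration of the first approach (the proposal itself would inject delays into compiled binaries by rewriting; the helper name below is our own):

```cpp
#include <cassert>
#include <chrono>
#include <random>
#include <thread>

// A helper that a test harness can call at interesting program points: it
// randomly does nothing, yields, or sleeps briefly, perturbing the thread
// interleavings that a test case explores.
thread_local std::mt19937 rng(std::random_device{}());

void scheduling_noise() {
  switch (rng() % 3) {
    case 0:
      break;                                       // no delay
    case 1:
      std::this_thread::yield();                   // scheduler hint
      break;
    case 2:
      std::this_thread::sleep_for(std::chrono::microseconds(rng() % 50));
      break;
  }
}
```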

The thesis of this proposal: Randomized stress testing of implementations of the C++11 memory model will significantly shorten the period of time during which compilers are flaky and immature. In the long run, randomized stress testing will lead to a better understanding of corner-case behaviors of the C++11 memory model and to test suites with improved coverage of interactions between the C++11 memory model, compiler optimizers, OS thread systems, and processor memory coherence mechanisms. The intellectual work required to produce concurrent test cases is a major increment upon our existing work on randomized test case generation which has had great success in finding compiler bugs, but which is limited to sequential code.

The State of the Art is Inadequate

Existing methods for testing compiler implementations break down into three broad methods:

  1. Compiling large, real applications. For example, a significant milestone in the development of any compiler is when it becomes self-hosting (can compile itself).
  2. Compiling collections of small, troublesome codes (e.g., GCC’s “torture test,” which contains about 2800 programs).
  3. Randomized testing. Our Csmith project is one example, but the history of this technique goes back about 50 years.

All three methods are useful and necessary. Compiler developers we have talked to, who are working on C++11 implementations, plan to use methods 1 and 2. We do not believe these methods will be sufficient to rapidly squash all of the relevant bugs (nor do the compiler developers appear to believe this). Consequently, we claim the proposed work is useful and important.

Why Testing Instead of Verification?

CompCert is an existence proof for a verified, usable, optimizing compiler for a realistic programming language. One might ask: Why should a testing proposal be funded? Why not be more ambitious and create a proved-correct C++ compiler? There are several reasons:

  • Testing is end-to-end, whereas formal verification typically addresses a narrow layer of the software stack. For example, in addition to finding compiler bugs, our compiler testing work has found linker bugs, assembler bugs, runtime library bugs, CPU simulator bugs, and even bugs caused by the fact that the compiler had itself been miscompiled. Formal verification cannot find these errors without specifically verifying the linker, assembler, runtime library, simulator, and the compiler that compiled the compiler under test. While these are all laudable goals, the combined effort to do these verifications is gigantic.
  • Even verified compilers require testing. For example, we have discovered and reported 13 CompCert bugs. The core issue is that creating a machine-checked mathematical proof is one thing, but proving the right thing is a separate and very hard problem. Typically it is not possible to once and for all “prove that one has proved the right thing.”
  • The C++11 language is dramatically more complex than C. CompCert does not even compile all of C, but rather only a useful subset. Creating a verified, optimizing compiler for a broadly useful subset of C++11 would be a massive undertaking, requiring (we guess) more than a few person-decades of effort.

Summary: A verified C++11 compiler is out of reach for now. But even if we had this compiler, we would still need to stress test it.

Specific Outcomes

The proposed work, if successful, will have several useful outcomes. First, and most obviously, we will create tools for stress-testing implementations of the C++11 concurrency model. These tools will be open-source, and consequently will benefit every team that is working on a C++11 implementation (in addition to transitively benefiting users of those compilers and users of systems built by those compilers). Second, as a side effect of developing and validating our tools, we will report bugs against open source compilers (GCC and LLVM, in particular) and also we will work with any commercial compiler vendors who seek us out.[0. Although we have built solid relationships with the GCC and LLVM teams, so far we have not had very good luck in building relationships with companies that produce compilers. The problem (we think) is that we represent a poor short-term value proposition: we generate zero revenue while producing bug reports that suck up valuable engineering resources. Furthermore, being outsiders, we would seem to represent the potential for causing only bad publicity–to paraphrase Dijkstra, we can find bugs but we cannot certify their absence.] Third, we expect to be able to produce useful “litmus tests”–small test cases that are (empirically) difficult for compilers to translate correctly and that, taken as a group, provide good test coverage of interesting parts of the C++11 concurrency model. Finally, we hope to generate a better understanding of the C++11 concurrency model. By analogy, our Csmith work has (completely automatically) exposed some interesting corner cases in the C programming language, as described here: (link)

The PI’s Prior Work on Compiler Testing

This section briefly describes our work leading up to this proposal. It also explains why we are virtually certain that the proposed work will find bugs in production-quality compilers.

Although the C and C++98 languages do not have a notion of concurrent execution, they do have a simple memory model that is designed to support reliable access to memory-mapped hardware device registers. This model is provided by the volatile qualifier.[0. The rationale for volatile is described here: (link)] A volatile-qualified object must be accessed at the memory system level as it is by the C/C++ abstract machine. In other words, a read from a volatile-qualified variable in C/C++ must result in a load being issued at the machine level–the variable’s value must not be cached in a register. Similarly, a write to a volatile variable in C/C++ must result in a store at the machine level–the compiler must not suppress redundant stores, as an optimizer normally would. The volatile qualifier’s semantics in C++11 are the same as they are in C and C++98.
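
A minimal sketch of the intent of volatile; on real hardware status_reg would be a memory-mapped device register at a fixed address, but here it is modeled with ordinary memory so the example is runnable:

```cpp
#include <cassert>

// Stand-in for a device status register.
volatile int status_reg = 0;

// The compiler must issue a fresh load of status_reg on every iteration;
// it may not cache the value in a register and spin forever.
int wait_for_ready(int simulated_updates) {
  int polls = 0;
  while (status_reg == 0) {
    ++polls;
    if (polls == simulated_updates)
      status_reg = 1;  // stand-in for the device setting its ready bit
  }
  return polls;
}
```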

The PI, while teaching embedded systems courses in the early/mid 2000s, noticed that the compilers being used by students did not always respect the volatile qualifier. This was troubling. In joint work with Eric Eide, he developed a way to test compilers’ implementation of the volatile qualifier. For every compiler we tested–including commercial tools used to compile safety-critical embedded systems–we found errors in its treatment of volatile-qualified variables. Some of these results are reported in our 2008 paper, but since then we have tested quite a few more compilers, with the same result. In summary, no compiler that we tested was able to correctly implement the volatile qualifier, which is basically trivial when compared to the C++11 memory model.

Since volatile-qualified objects are not totally reliable, we developed a workaround where source code is re-written in such a way that every read from a volatile variable is changed into a call to a helper function, and every write to a volatile variable is changed into a different helper function call. We verify by hand that the helper functions perform the necessary memory operations. In a large fraction of our tests (96% of the time) this strategy was successful in working around bugs in compilers’ implementations of volatile. The workaround is successful because compilers tend to be able to perform function calls reliably, even when they cannot access volatiles reliably. Subsequently, we became interested in the 4% of cases where our workaround failed. We hypothesized that these cases were due to “regular” compiler bugs having nothing to do with the volatile qualifier. We began systematically hunting for these bugs.
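
A sketch of the rewriting, with helper names of our own choosing. Each helper is small enough that we can verify by hand (or by inspecting its object code) that it performs exactly one volatile access:

```cpp
#include <cassert>

// Hand-verified helpers: one volatile load or store per call.
int vol_read_int(volatile int *p) { return *p; }
void vol_write_int(volatile int *p, int v) { *p = v; }

volatile int device_reg;

int exercise(void) {
  vol_write_int(&device_reg, 7);     // was: device_reg = 7;
  return vol_read_int(&device_reg);  // was: return device_reg;
}
```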

Since 2008 the PI’s group has found and reported about 400 bugs in widely-used C compilers. These bugs were all found using Csmith, a random C program generator that is a (distant) descendant of our volatile testing tool. Most of the bugs were in the compilers’ middle-ends–the part of a compiler that is independent of both the language being compiled and the architecture being targeted. Despite the fact that most of these 400 bugs have been fixed, we can still crash and find wrong-code bugs in the latest version of all compilers that we routinely test (about six). CompCert is the lone exception. Csmith is open-source software.[0. (link)]

Brief Introduction to the C++11 Memory Model

Although the C++11 standard cannot be freely downloaded, a recent draft can be.[0. (link)] Material about the concurrency model is concentrated in Chapters 1, 29, and 30. However, a “less formal” description by Boehm and a paper by Boehm and Adve are probably better places to begin reading. Additionally, a book about the model is expected to be released in January 2012. On the more formal side, Batty et al. have formalized the C++11 memory model (but not the rest of the language). This kind of activity is essential to understanding the most subtle parts of the model.

Application Programming Interface

The API for the C++11 concurrency model divides into two major parts. The first is a thread support library. It is object oriented and uses the C++ exception system, but otherwise is not very different from Pthreads and other existing thread libraries. C++11 threads are intended to map onto OS-supported threads in a one-to-one fashion. The threads interface is contained in four header files, respectively supporting thread manipulation, mutual exclusion, condition variables, and futures.

A welcome feature of the mutex library is a facility for safely acquiring multiple locks:

std::mutex m1, m2, m3;
std::unique_lock<std::mutex> lck1(m1,std::defer_lock);
std::unique_lock<std::mutex> lck2(m2,std::defer_lock);
std::unique_lock<std::mutex> lck3(m3,std::defer_lock);
std::lock(lck1,lck2,lck3); // acquires all three locks without deadlock
// manipulate shared data ...

The advantage of this interface is that the developer does not have to worry about deadlocking the program by acquiring locks in the wrong order.

The second major component of the C++11 API is an atomic operations library. Atomic operations are synchronization operations that are semantically equivalent to locking an object, reading and/or writing it, and then releasing the lock. The difference is that in some cases, the atomic operation may be significantly more efficient than the explicit locking approach. For example, on x86 and x86-64 processors an atomic increment of an integer value can be compiled into a single instruction.
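
For instance, an atomic counter needs no mutex at all (a sketch; the single-instruction observation applies to typical x86/x86-64 compilers):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// fetch_add is one atomic read-modify-write operation; on x86-64 it
// typically compiles to a single lock-prefixed instruction.
std::atomic<long> counter(0);

void add_events(int n) {
  for (int i = 0; i < n; ++i)
    counter.fetch_add(1);  // sequentially consistent by default
}
```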

The most interesting feature of the C++11 atomic API is that it provides portable access to weak memory models. The default memory ordering for all atomic operations is sequential consistency (memory_order_seq_cst), meaning that the operation takes part in a global total ordering of memory operations. This is intuitive; for example, the Dekker code from Section 1–which is incorrect on a non-sequentially-consistent multiprocessor–can be fixed in C++11 simply by declaring the synchronization variables as atomic. That is, instead of declaring them as:

volatile int flag[2], turn;

They should be declared as:

std::atomic<int> flag[2], turn;

The atomic template replaces the default integer operations (assignment, increment, etc.) with atomic and sequentially consistent code appropriate for the target platform.
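
For illustration, here is the Dekker code from Section 1 rewritten with sequentially consistent atomics (a sketch; the loads and stores are made explicit, and yields are added to the spin loops only as a scheduling courtesy):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Dekker's algorithm with C++11 atomics; every operation defaults to
// memory_order_seq_cst, restoring the sequential consistency that the
// volatile version wrongly assumed.
std::atomic<int> flag[2];
std::atomic<int> turn;
int shared_counter = 0;  // plain data, protected by the Dekker lock

void dekker_lock(int i) {
  flag[i].store(1);
  while (flag[1 - i].load()) {
    if (turn.load() != i) {
      flag[i].store(0);
      while (turn.load() != i)
        std::this_thread::yield();
      flag[i].store(1);
    }
  }
}

void dekker_unlock(int i) {
  turn.store(1 - i);
  flag[i].store(0);
}

void worker(int i, int iters) {
  for (int n = 0; n < iters; ++n) {
    dekker_lock(i);
    ++shared_counter;  // non-atomic, but race-free inside the critical section
    dekker_unlock(i);
  }
}
```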

The problem with sequential consistency is that it can be expensive. To support high-performance multiprocessor programming, C++11 defines five more memory orderings: memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, and memory_order_acq_rel.

The relaxed memory ordering is the most permissive: atomic operations using it are still atomic, but no ordering constraints are imposed across threads. This ordering is perhaps not very useful by itself, but C++11 provides explicit fence operations that can be combined with relaxed consistency to build useful and performant concurrent data structures.
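
A sketch of that combination: a message-passing idiom built from relaxed operations plus explicit fences (the release fence before the relaxed store, paired with the acquire fence after the relaxed load that observes it, orders the accesses to the non-atomic payload):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready(false);

void producer() {
  payload = 42;
  std::atomic_thread_fence(std::memory_order_release);
  ready.store(true, std::memory_order_relaxed);
}

int consumer() {
  while (!ready.load(std::memory_order_relaxed))
    std::this_thread::yield();
  std::atomic_thread_fence(std::memory_order_acquire);
  return payload;                // guaranteed to read 42
}
```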

Using memory_order_acquire and memory_order_release, high-performance lock acquisition and release primitives can be constructed. An acquire operation is a one-way memory fence that prevents other memory operations from being moved above the acquire; a release operation is a one-way memory fence that prevents other memory operations from being moved below the release. Together, they prevent memory operations from being moved outside of a critical section–since this would defeat the purpose of the critical section. An access with memory_order_acq_rel behaves as both an acquire and a release. memory_order_consume is similar to memory_order_acquire, but weaker: only values that are dependent on the loaded value (via a dependency mechanism explained in the standard) are ordered with respect to the corresponding release.
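
A sketch of such a primitive: a test-and-set spinlock built from an acquire exchange and a release store:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// The acquire keeps critical-section accesses from moving above the lock;
// the release keeps them from moving below the unlock.
std::atomic<bool> locked(false);
int shared_data = 0;             // protected by the spinlock

void spin_lock() {
  while (locked.exchange(true, std::memory_order_acquire))
    std::this_thread::yield();
}

void spin_unlock() {
  locked.store(false, std::memory_order_release);
}

void worker(int iters) {
  for (int i = 0; i < iters; ++i) {
    spin_lock();
    ++shared_data;
    spin_unlock();
  }
}
```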

Obligations on the Developer and Compiler

The fundamental obligation placed upon a developer writing C++11 programs is that she must never permit a data race to occur. Informally, a program is free of data races if it contains adequate synchronization. In more detail, the C++11 standard says:

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other.

The definition of atomic is straightforward–operations on atomic types are always atomic. The definition of conflicting is also straightforward: “Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.” Alas, happens before is more intricate. It basically follows Lamport’s definition but accounts for synchronization that occurs through weak memory-model operations such as acquires and releases.

A data race, like any other undefined behavior in C++, destroys the meaning of the program. Avoiding races is relatively straightforward (if any concurrent programming problem can be called that) as long as the developer sticks with mutexes and sequentially consistent atomics. Static and dynamic race-detection tools are a fairly mature technology and will be able to help once they are adapted to support C++11.

On the other hand, the compiler cannot limit itself to the sequentially consistent case; its obligation to faithfully implement the C++11 standard is severe. The soundness of various compiler optimizations under weak memory models is a matter of current research. The next section explains some of the problems.

What Do C++11 Concurrency Bugs Look Like?

This section uses examples to briefly illustrate what could go wrong when a C++11 compiler translates a concurrent program. The best references about the interactions between optimizers and C++11 are by Boehm and by McKenney et al. (the latter document refers to a port of the C++11 memory model to C, but the discussion applies to C++ as well).

Example #1

Because C++98 and C were specified in the absence of a concurrency model, it is permissible for a compiler to introduce writes to objects that were not written to by the program. For example, consider this struct and accessor function:

struct foo { int a:8; int b:12; char c; } x;
void update_b (void) {
  x.b = 1;
}
It is perfectly legal for a C or C++98 implementation to translate update_b() into code that loads the entire 32-bit word containing x, appropriately modifies the 12 bits corresponding to x.b, and stores the word back into memory. In C++11 this translation is illegal because it may introduce data races with concurrent threads that are accessing field c. The current development version of the GNU C++ compiler, which is in a “bug fix only” state in preparation for the 4.7.0 release, miscompiles this code even if it is given compiler flags prohibiting it from introducing data races.[0. These flags are discussed in the GCC Wiki: (link)] On x86-64 the output is:

 movl x(%rip), %eax 
 andl $-1048321, %eax 
 orb $1, %ah 
 movl %eax, x(%rip) 

This code reads the entire struct into a register, uses a bitwise-AND and then a bitwise-OR instruction to modify the bits corresponding to b, and then stores the result back to memory. This is wrong in C++11.
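
This example can be turned into a small litmus test (our own construction). In C++11, field c is a memory location distinct from the a/b bit-field group, so a correct compiler must let these two threads run without interfering:

```cpp
#include <cassert>
#include <thread>

// A compiler that implements update_b() as a full-word read-modify-write
// can lose concurrent stores to x.c; a C++11-conforming compiler must not.
struct foo { int a : 8; int b : 12; char c; } x;

void update_b(void) {
  for (int i = 0; i < 100000; ++i)
    x.b = 1;
}

void update_c(void) {
  for (int i = 0; i < 100000; ++i)
    x.c = 'z';
}
```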

Example #2

McKenney et al. provide an illustrative example concerning constant propagation: one of the simplest, most useful compiler optimizations. Given that x is an atomic variable and it is known to be either 0 or 1, how might this code be optimized?

y = 42 * x.load(memory_order_consume) / 13;

On platforms with expensive multiplication and division, the compiler would like to transform this code to:

if (x.load(memory_order_consume)) {
  y = 3;
} else {
  y = 0;
}

However, McKenney et al. state that “If the underlying machine preserves control-dependency ordering for writes, this optimization is perfectly legal. If the underlying machine does not preserve control-dependency ordering, then either this optimization must be avoided, a memory fence must be emitted after the load of x, or an artificial data dependency must be manufactured.” If compiler writers miss this subtlety, the generated code will either be suboptimal or wrong.

Details of the Proposed Approach

To recap, the proposed stress-testing work flow is to repeatedly (1) generate a random concurrent C++11 program, (2) compile it using a variety of possibly-buggy C++11 compilers and optimization flags, and (3) run these programs in hostile execution environments that are designed to expose interesting interleavings at the memory system level. We are not proposing to support all of C++11 in Csmith–that needs to be done, but it is well beyond the scope of this proposal. Rather, Csmith will generate C++11 that looks like C code with C++11 concurrency constructs added. We expect this to be sufficient to find many, but not all, bugs in C++11 concurrency implementations.

Generating Data-Race-Free Programs

Csmith already includes sophisticated machinery for generating programs that avoid undefined behaviors. Our PLDI 2011 paper explains this in detail, but the essence of the solution is to interleave static analysis with program generation. First, Csmith generates a statement (e.g., x = *y;) that is a candidate for addition to the current program. Second, Csmith performs a number of static, dataflow-driven safety checks on the candidate statement in the current program context. In this example, x and *y must be compatible types and y must not be null or refer to an out-of-scope memory location. If the safety checks pass, the candidate is committed to the current program; otherwise, the candidate is discarded. Progress is probabilistically guaranteed because Csmith’s random choices always include at least one “easy” alternative that never fails the safety check. Loops and function calls complicate this simple picture, but the basic structure of the solution, which we call “guess and check,” applies.
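
A toy model of the guess-and-check loop (our own simplification, not Csmith's actual code):

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// A candidate statement is generated at random, a safety check may veto
// it, and only safe candidates are committed to the program.
struct Stmt { int dst, src; };  // models the statement "v[dst] = v[src];"

bool safety_check(const Stmt &s, const std::vector<bool> &initialized) {
  return initialized[s.src];    // reject reads of uninitialized variables
}

std::vector<Stmt> generate_program(int nvars, int nstmts, unsigned seed) {
  srand(seed);
  std::vector<bool> initialized(nvars, false);
  initialized[0] = true;        // the "easy" alternative: v[0] is always safe
  std::vector<Stmt> prog;
  while ((int)prog.size() < nstmts) {
    Stmt cand = { rand() % nvars, rand() % nvars };  // guess
    if (!safety_check(cand, initialized))            // check
      continue;                                      // discard
    initialized[cand.dst] = true;                    // commit
    prog.push_back(cand);
  }
  return prog;
}
```

Progress is guaranteed here for the same reason as in Csmith: the always-safe alternative (reading v[0]) is eventually guessed.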

We will augment Csmith’s existing dataflow analyses to support static data race detection. Plenty of good starting points are available. Innovation may be required to perform race checking in the presence of atomics using C++11’s weak memory orderings. We do not know of a tool that checks for races in weakly ordered C++11, though it seems very likely that these will appear during the next year or two.

Given a race checker in Csmith, we believe that the guess and check approach will be sufficient to generate race-free programs. Even so, creating “good” random programs is as much an art as a science and we expect that additional refinements will be useful. First, accesses to sequentially consistent, atomic variables are always safe–the race detector does not need to be consulted before committing an access to this kind of memory. Second, Csmith’s safety checks can be simplified for data that is not shared between threads. We believe it will be useful to pre-partition variables into unshared ones that can be freely accessed by a single thread, and shared variables that must be synchronized appropriately. We expect unshared data to play an important role in compiler bug detection; for example, unshared data can expose compiler-introduced races like the example presented earlier. Also, code accessing unshared data can trigger optimization passes that may mangle nearby shared data accesses.

Generating Programs with Concurrent Invariants

Currently, just before terminating, a program generated by Csmith takes a checksum of the values held in all global variables and prints it. The checksum is invariant across all platforms making compatible choices for certain implementation-defined parameters (integer width and representation, mainly). Thus, a checksum difference indicates a compiler bug.
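
A sketch of the idea (the FNV-1a hash shown is an illustrative stand-in, not Csmith's actual checksum function):

```cpp
#include <cassert>
#include <cstdint>

// Globals standing in for the state of a generated test program.
uint32_t g1 = 0, g2 = 0;
int32_t g3 = 0;

// Checksum over every global, byte by byte; any wrong-code bug that
// corrupts a final value changes the checksum.
uint32_t checksum(void) {
  uint32_t h = 2166136261u;  // FNV-1a offset basis
  uint32_t vals[] = { g1, g2, (uint32_t)g3 };
  for (uint32_t v : vals)
    for (int i = 0; i < 4; ++i) {
      h ^= (v >> (8 * i)) & 0xff;
      h *= 16777619u;        // FNV prime
    }
  return h;
}
```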

In general, the checksum approach will not work for concurrent programs. One approach would be to generate deterministic concurrent programs–where the checksum will again be invariant–but this seems quite restrictive, and will serve as a backup plan in case other proposed approaches do not work. Csmith currently does not track the values of scalar variables; its dataflow analysis will need to be extended to do this. We previously developed cXprop, a value-tracking abstract interpreter for concurrent C code; we do not expect it to be difficult to either repurpose cXprop or else re-implement the same techniques within Csmith. cXprop already knows how to generate assertions corresponding to the invariants that it computes. While we were debugging cXprop, an assertion violation generally signaled a cXprop bug–but of course these assertions can also catch compiler bugs. To ensure that invariants are not violated, Csmith will synthesize a separate thread that asserts that every invariant holds, in an infinite loop.
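
A hand-written sketch of such a self-checking arrangement, with an invariant (a + b == 100) of our own choosing; a generated test would loop forever, while this sketch stops when the workload finishes:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Every update happens inside one critical section, so the invariant
// a + b == 100 holds whenever the mutex is free.
std::mutex m;
int a = 60, b = 40;
std::atomic<bool> done(false);

void worker(int iters) {
  for (int i = 0; i < iters; ++i) {
    std::lock_guard<std::mutex> g(m);
    ++a;
    --b;
  }
  done.store(true);
}

// The synthesized checker thread: repeatedly assert the invariant.
void checker() {
  while (!done.load()) {
    std::lock_guard<std::mutex> g(m);
    assert(a + b == 100);
  }
}
```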

Thus far, we have discussed using the guess-and-check approach, combined with a concurrency-aware dataflow analysis, to generate concurrent code with invariants. We will also explore an alternative in which invariants are the starting point for program synthesis rather than a byproduct of it. We will exploit a technique developed by Deng et al. for invariant-based synthesis of concurrent programs, which shows how to start with a coarse-grained synchronization solution and some invariants, and automatically refine it into a fine-grained solution. Our approach will be very similar: we will begin with randomized invariants and coarse-grained code (implemented using thread-level locks), and then randomly apply invariant-preserving transformations that make the solution more fine-grained. First, critical sections will be split into smaller critical sections; next, they will be replaced (when possible) with sequentially consistent operations on atomic types; finally, we will attempt to introduce relaxed memory-model operations and memory barriers.

Finding Invariant Violations

Given an executable test case containing assertions, we need a way to determine if it was compiled correctly. The easiest (but least reliable) way to do this is to run the program and wait for an assertion violation to occur. This is not reliable because running a concurrent program typically explores only a vanishingly small part of the space of possible interleavings among the memory operations in its threads. The hardest (but most reliable) way to find invariant violations is to run the compiled code through a formal methods-based tool that either finds an execution where an invariant is violated or proves that no such execution exists. Although a number of research efforts have attacked the problem of program verification on relaxed memory models, none of the tools we are aware of scales to even medium-sized codes. The problems that must be solved to make these tools scale are extremely difficult. Thus, while we plan to take advantage of any such tools that become available, we also plan to explore alternatives.

Executing on hardware

It is not difficult to perturb a program’s schedule in order to increase the probability that interesting interleavings are seen. One way is to use binary rewriting to insert randomized nop instructions into the compiled code. The inserted code can also use a random number generator so that delays are parameterized by a seed value. Another method is to cause interrupt handlers to fire on the machine that is doing the testing. This is easy to arrange by having a second machine send it small packets over a network link at short, random intervals. This method has the further benefit of causing substantial memory traffic due to DMA operations initiated by the network interface card.
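A seed-parameterized delay of the kind described above might look like the sketch below; a binary rewriter would insert a call like this between memory operations. The xorshift64 generator and the 0..255 iteration bound are illustrative choices, not anything the proposal prescribes.

```c
#include <stdint.h>

static uint64_t delay_state = 1;

/* Seed the delay generator so a failing schedule can be reproduced. */
void delay_seed(uint64_t seed) { delay_state = seed ? seed : 1; }

/* Spin for a pseudorandom number of iterations (standing in for inserted
   nop instructions); returns the count so runs can be audited. */
uint64_t random_delay(void) {
  /* xorshift64: a cheap, deterministic PRNG */
  delay_state ^= delay_state << 13;
  delay_state ^= delay_state >> 7;
  delay_state ^= delay_state << 17;
  uint64_t n = delay_state & 0xff;
  for (volatile uint64_t i = 0; i < n; i++)
    ;
  return n;
}
```

Because the delays are a pure function of the seed, the same perturbation pattern can be replayed across runs even though the resulting thread interleavings still vary.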

Alglave et al.’s Litmus tool is based on similar ideas: it randomizes memory locations accessed by threads and also the threads’ CPU affinities. We will explore these techniques in addition to the ones described above. Although there are certainly limits on the effectiveness of techniques like these, we believe they are very much worth trying simply because they are low-cost, easily portable across architectures, and absolutely reliable in the sense that any behavior that is observed is real–there are no false alarms when executing on real hardware.

Yet another approach to executing test cases on real hardware is being explored by the GCC developers.[0. (link)] They plan to execute the code under test in one thread and an assertion checker in a separate thread (as we propose above). They will single-step through the thread under test (in an automated fashion, using debugger support), running the full assertion check after every instruction. This will ensure that every intermediate memory state is checked. Drawbacks of this approach are that it will incur a significant slowdown, and that the act of single-stepping the processor will flush the processor’s pipeline and store buffer–this could easily mask bugs.

Executing in a simulator

Running test codes in a simulator reduces testing throughput, but has other advantages such as being repeatable and making it possible to simulate platforms that are unavailable or inconvenient. Although we do not know of an open-source simulator for ARM, x86, or x86-64 that faithfully implements the relevant memory model, it is possible that one exists or will exist within the next year or two. If not, we do not believe it will be too difficult to add weak memory model support to an existing simulator.

Given a weak memory model simulator, the challenge is to make it actively hostile to the program under test. One approach would be to use state-space exploration to look at many possible interleavings of memory operations. However, we believe this will be too slow to test realistic programs. Rather, it should be possible to focus the simulator’s efforts on weak points in the code under test. By analogy with race-directed software testing, we will explore the idea of causing the simulator to delay a thread until another processor is ready to execute code that touches conflicting memory locations. Similarly, there is significant room for hostility in the implementation of the write buffer, which should act as a “wrong buffer” by delaying flushes to main memory for as long as possible.

Test Case Reduction

To keep the story simple, we have not yet mentioned one problem that we will need to work on: test case reduction–the process of turning a large failure-inducing program into a much smaller one. Reduction is an important part of root-cause analysis for any compiler bug, and in practice it needs to be automated. Although the results are not yet published, automated test case reduction has been an active area of research for us for several years–but only for the case of sequential code.

Automated test case reduction for concurrent codes is more difficult because automated reduction algorithms–generally based on Delta Debugging–require deterministic execution. For example, Choi and Zeller’s work on isolating failure-inducing thread schedules fundamentally depends on a capture/replay mechanism that makes threads execute deterministically. It is not clear how to avoid the need for determinism. It is also not clear how to get efficient deterministic replay on a weak-memory multicore; there are research solutions addressing this problem but they generally require hardware support or are very inefficient; Heydari and Azimi survey this work. In summary, replay on real multiprocessors is effectively a separate research problem and we do not plan to address it.

Determinism in a simulator is trivial. Even so, test case reduction remains challenging. The fundamental difficulty in reducing the size of failure-inducing C/C++ programs is avoiding undefined behaviors. For example, in our experience it is common for a test case reducer to delete the initializer for a variable, resulting in a test case with undefined semantics.
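The hazard can be illustrated with an invented example: the function below is well defined, but if a reducer deletes the “= 0” initializer, the first read of sum is of an indeterminate value and the reduced test case means nothing at all.

```c
/* A well-defined test case before reduction. */
int sum_first_ten(void) {
  int sum = 0;            /* the initializer a naive reducer likes to delete;
                             without it, reading sum below is undefined */
  for (int i = 0; i < 10; i++)
    sum += i;
  return sum;
}
```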

When reducing concurrent C++11 programs the problems are even worse because we must not introduce a data race during reduction.

Two solutions to avoiding undefined behaviors in general, and race conditions in particular, suggest themselves. First, we can use an external race detector to validate test cases during reduction, if one becomes available. Second, since Csmith will already know how to generate race-free (and otherwise well-defined) programs, we can leverage it to also generate smaller versions of the programs. We have already experimented with this approach for reducing sequential test cases in the C language, and it appears to be workable.

What PC-lint is Really For: Enforcing Language Dialects

John Carmack’s piece about using static analyzers is really nice, and he’s right that applying PC-lint to a significant piece of C code results in page after page of warnings. Faced with this problem on a code base I cared about, and not wanting to give up, I first created a config file for PC-lint that suppressed the vast majority of warnings and then, over a few weeks, cleaned up the remaining issues. Once the code base linted clean, I could turn on classes of warnings whenever I had the time and inclination.

As an example, PC-lint supports fine-grained application of stronger typing rules. So one day I might assume that all uses of “bool” or some other enumerated type are segregated from uses of integers. Of course, in any real code base, such an assumption is wrong and a pile of annoying warnings ensues. I’d either fix them or else decide that it wasn’t worth it until some major rearchitecting could be done.
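An invented illustration of the kind of thing involved: everything below is legal C and compiles silently, but under the stronger dialect the flagged line is an error that PC-lint’s strong-typing checks can be configured to report.

```c
/* A home-grown boolean-ish enum, as often found in pre-C99 code bases. */
typedef enum { NO = 0, YES = 1 } answer_t;

/* Dialect-clean: the enum is only ever compared with enum values. */
int tally_yes(const answer_t *votes, int n) {
  int yes = 0;
  for (int i = 0; i < n; i++) {
    if (votes[i] == YES)
      yes++;
  }
  return yes;
}

/* Same behavior, but the stronger dialect rejects it. */
int tally_yes_sloppy(const answer_t *votes, int n) {
  int yes = 0;
  for (int i = 0; i < n; i++)
    yes += votes[i];   /* flagged: enum silently used as an integer */
  return yes;
}
```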

The cool thing was that after a while I realized that I wasn’t really writing C code anymore, but rather a much more strongly typed dialect where I actually had a degree of (justifiable) confidence that certain kinds of errors were not being made. The C compiler, with its near-nonexistent error checking, was acting as a mere code generator. It wasn’t even the case that PC-lint was uncovering a lot of serious bugs in my code (though it did find a few). Rather, it gave me some peace of mind and assured me that what I thought was happening in the code was in fact happening.

Of course, we always write code in dialects, whether we realize it or not. In many cases, language dialects are formalized as coding conventions or standards so a group of people can all read or write the same dialect — this has enormous advantages. The things that I found charming about PC-lint are that (1) it supported a very flexible range of dialects and (2) it quickly and mechanically ensured that my code stayed within the dialect. When writing non-C code (especially Perl) I often wish for a similar dialect checker.

Book Beginnings

If a book starts out just right, I’ll keep going back and rereading the first few sentences long after I’ve finished the book. Here are a few that did that to me.

Fagles’ translation of the Odyssey:

Sing to me of the man, Muse, the man of twists and turns driven time and again off course, once he had plundered the hallowed heights of Troy.

Gravity’s Rainbow:

A screaming comes across the sky. It has happened before, but there is nothing to compare it to now.

Blood Meridian:

See the child. He is pale and thin, he wears a thin and ragged linen shirt. He stokes the scullery fire. Outside lie dark turned fields with rags of snow and darker woods beyond that harbor yet a few last wolves.

I think the last one is my favorite.

Bit Puzzle Results

This is the followup piece to Monday’s bit puzzle. First off, the spec for this problem wasn’t clear about what to return when both arguments are the same. But let’s look at Carlos’s comment about where this code would be used:

If you think of a binary number as a sequence of binary tree decisions on the full range of possible numbers, this bit operation gives you the most specific common ancestor on this tree.

This use case makes it seem pretty clear that just returning the argument is the right result. So that’s what I’ve assumed here.

These are the basic approaches to the problem:

  1. Construct the output by iterating over bits.
  2. Compute the result using “standard bit tricks” built on operations found on all common processors.
  3. If the target processor has fast “find first set bit” instructions (e.g. modern x86/x64 has bsf/bsr), these can be applied to the XOR of the arguments, and after that the output can be generated straightforwardly.

Code for approach 1 can be found in my original post. Solutions using approach 2 are exemplified by code from Eddie Kohler, Andrew Hunter, and others:

uint64_t high_common_bits_portable(uint64_t a, uint64_t b) {
  uint64_t x = a ^ b;
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  x |= x >> 32;
  return (a & ~x) | (x & ~(x >> 1));
}

uint64_t low_common_bits_portable(uint64_t a, uint64_t b) {
  uint64_t x = (a ^ b) & -(a ^ b);
  return (a & (x - 1)) | x;
}

The first one is really nice.

Here is code implementing approach 3:

uint64_t high_common_bits_intrinsic(uint64_t a, uint64_t b) {
  if (a == b) return a;
  uint64_t bit = 1ULL << (63 - __builtin_clzll(a ^ b));
  return (a & ~(bit - 1)) | bit;
}

uint64_t low_common_bits_intrinsic(uint64_t a, uint64_t b) {
  if (a == b) return a;
  uint64_t bit = 1ULL << __builtin_ctzll(a ^ b);
  return (a & (bit - 1)) | bit;
}

This is mostly the code left by Benjamin Kramer, but modified to avoid invoking the builtin functions with a zero argument, which the GCC documentation says has undefined behavior. This is the same approach that I used when solving this problem, though Benjamin did a better job with the masking code. This code doesn’t use the actual bsf/bsr instructions but rather GCC intrinsics that are supposed to compile to whatever efficient “find first set bit” primitive exists on the target platform.

Now that we have three implementations of each function, let’s work on testing them. We first need test vectors. It seems clear that making the inputs independently random 64-bit integers is wrong. My approach was to use a random 64-bit number for one of the inputs, and to construct the other input by flipping 0..63 random bits of the first argument. It’s possible that this isn’t great, I don’t know — but it’s not going to matter that much when benchmarking approaches 2 and 3 since they contain almost no data-dependent control.
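The scheme can be sketched like this; rand() is spliced together only because it is portable (it may yield as few as 15 random bits per call), and the real harness can use any PRNG.

```c
#include <stdint.h>
#include <stdlib.h>

/* Build a full 64-bit random value from several rand() calls. */
static uint64_t rand64(void) {
  uint64_t r = 0;
  for (int i = 0; i < 5; i++)
    r = (r << 15) ^ (uint64_t)rand();
  return r;
}

/* One test vector: a is uniformly random; b is a copy of a with 0..63
   randomly chosen bit positions flipped (positions may repeat, which
   re-flips a bit). */
void make_input_pair(uint64_t *a, uint64_t *b) {
  *a = rand64();
  *b = *a;
  int flips = rand() % 64;
  for (int i = 0; i < flips; i++)
    *b ^= 1ULL << (rand() % 64);
}
```

This biases the pairs toward sharing high and low bits, which is the interesting case for these functions, rather than differing near the top bit almost every time as two independent random words would.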

My approach was to generate 1000 inputs, start a timer, run a function over all of the inputs, then stop the timer. 1000 values seemed reasonable since these arrays fit into the L1 cache of almost any modern machine. Each function runs over the array five times and the fastest result is reported. The code is here.

Here are the results from a Core i7 920 running Linux in 64-bit mode:


My i7 920 is based on the Nehalem architecture — one off from Intel’s latest. If anyone has a Sandy Bridge processor (Core i7 2600 or 2700, most likely) sitting around, please run the code and send me results, being careful to turn off processor power management and take other appropriate timing precautions. I’d also like to see results for Clang, GCC, and ARM’s compiler on a modern ARM processor.

Here are some statically linked executables for x86-64 Linux: GCC, Clang, Intel CC.

Finally, thanks, everyone, for the great responses!


Argh, my first attempt failed to stop the compiler from inlining the bit functions into the test harness. (Thanks for pointing this out, Richard and Chris!) The results were not technically incorrect, but they didn’t correspond to a reasonable use case for these functions. In particular, the Intel compiler was able to use SIMD to massively optimize operations on sequential array elements. After turning off inlining, the results are:

[regehr@gamow code]$ clang -O3 -march=corei7 -mtune=corei7 bitdiff.c -o bitdiff ; ./bitdiff

high slow      : time =  52.7 (ignore this: 10649328621511951360)
high portable  : time =  16.6 (ignore this: 10649328621511951360)
high intrinsic : time =   9.5 (ignore this: 10649328621511951360)
low slow       : time =  53.4 (ignore this: 9949931644094715106)
low portable   : time =   7.0 (ignore this: 9949931644094715106)
low intrinsic  : time =   8.6 (ignore this: 9949931644094715106)

[regehr@gamow code]$ current-gcc -Ofast -march=corei7 -mtune=corei7 bitdiff.c -o bitdiff ; ./bitdiff

high slow      : time =  48.4 (ignore this: 8093869745297464062)
high portable  : time =  15.4 (ignore this: 8093869745297464062)
high intrinsic : time =   9.5 (ignore this: 8093869745297464062)
low slow       : time =  48.0 (ignore this: 8862089369885824930)
low portable   : time =   8.0 (ignore this: 8862089369885824930)
low intrinsic  : time =   8.5 (ignore this: 8862089369885824930)

[regehr@gamow code]$ icc -fast bitdiff.c -o bitdiff ; ./bitdiff

high slow      : time =  51.1 (ignore this: 15670527381357769456)
high portable  : time =  18.0 (ignore this: 15670527381357769456)
high intrinsic : time =  12.3 (ignore this: 15670527381357769456)
low slow       : time =  48.3 (ignore this: 17922617506918019178)
low portable   : time =   9.0 (ignore this: 17922617506918019178)
low intrinsic  : time =   9.9 (ignore this: 17922617506918019178)

Times are in CPU cycles per call. The answer appears to be: regardless of compiler, the intrinsic version for high_common_bits gives about a 50% speedup. For low_common_bits, it doesn’t matter which version you use.

Finally, here’s what Clang gives for the version of low_common_bits using intrinsics:

        movq    %rdi, %rax
        cmpq    %rsi, %rax
        je      .LBB5_2
        xorq    %rax, %rsi
        bsfq    %rsi, %rcx
        movl    $1, %edx
        shlq    %cl, %rdx
        leaq    -1(%rdx), %rcx
        andq    %rax, %rcx
        orq     %rdx, %rcx
        movq    %rcx, %rax

I think the final movq could be avoided, but otherwise this looks good.

A Bit Puzzle

I love a good bit-level puzzle. Today’s is one I learned about from a comment in a Google+ discussion:

Given two words a and b, return the high bits that they have in common, with the highest bit where they differ set, and all remaining bits clear. So 10101 and 10011 yields 10100. Then do the same thing, only flipped, so we keep low-order identical bits and mask out the high ones.

The problem comes from trie implementation. Here’s code that does it the slow way:

uint64_t high_common_bits_64_slow (uint64_t a, uint64_t b) {
  uint64_t mask = 0x8000000000000000LLU;
  uint64_t output = 0;
  int i;
  for (i = 63; i >= 0; i--) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask >>= 1;
  }
 out:
  return output;
}

uint64_t low_common_bits_64_slow (uint64_t a, uint64_t b) {
  uint64_t mask = 1;
  uint64_t output = 0;
  int i;
  for (i = 0; i < 64; i++) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask <<= 1;
  }
 out:
  return output;
}

The problem is to do much better than these. To keep things simple, let’s say we’re interested primarily in speed on newish x86-64 implementations. Choose any programming language, or just write assembly. In a few days I’ll post my solutions in addition to any good ones that show up here before then. Beware WordPress’s mangling of source code in the comments — feel free to just mail me.

A Different Approach to System Security

I enjoy it when science fiction has something useful to say about computer security. Towards the end of Iain M. Banks’ Matter, there’s a big space battle and we find this passage:

“Compromised,” Hippinse told him. “Taken over by the other side. Persuaded by a sort of thought-infection.”

“Does that happen a lot, sir?”

“It happens.” Hippinse signed. “Not to Culture ships, as a rule; they write their own individual OS as they grow up, so it’s like every human in a population being slightly different, almost their own individual species despite appearances; bugs can’t spread. The Morthanveld like a degree more central control and predictability in their smart machines. That has its advantages too, but it’s still a potential weakness. This Iln machine seems to have exploited it.”

Monoculture is an obvious and serious danger. For example, about the 2003 Slammer/Sapphire worm:

Propagation speed was Sapphire’s novel feature: in the first minute, the infected population doubled in size every 8.5 (±1) seconds. The worm achieved its full scanning rate (over 55 million scans per second) after approximately three minutes, after which the rate of growth slowed down somewhat because significant portions of the network did not have enough bandwidth to allow it to operate unhindered. Most vulnerable machines were infected within 10 minutes of the worm’s release.

Imagine the damage a similar worm could do today, or in 20 or 100 years, if properly weaponized.

I know there are automated approaches to diversity (ASLR, randomized instruction sets, etc.) but I found “they write their own individual OS as they grow up” to be a very charming idea, perhaps in part because it is so wildly impractical today.