Skip to content

Tricking a Whitebox Testcase Generator

The awesome but apparently discontinued Underhanded C Contest invited us to write C code that looks innocent but acts malicious. Today’s post is an alternate take on the same idea: we don’t really care what the malicious code looks like, but it needs to escape detection by an automated whitebox testcase generator. In some respects this is much easier than tricking a person, since whitebox tools are (at best) idiot savants. On the other hand, a number of tried-and-true methods for tricking humans will be stopped cold by this kind of tool.

As a simple example, we’ll start with two solutions to my bit puzzle from last month: the slow reference implementation and a much faster optimized version. We’ll also include a little test harness.

uint64_t high_common_bits_reference(uint64_t a, uint64_t b)
{
  uint64_t mask = 0x8000000000000000LLU;
  uint64_t output = 0;
  int i;
  for (i = 63; i >= 0; i--) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask >>= 1;
  }
 out:
  return output;
}

uint64_t high_common_bits_portable(uint64_t a, uint64_t b)
{
  uint64_t x = a ^ b;
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  x |= x >> 32;
  return (a & ~x) | (x & ~(x >> 1));
}

int main(void)
{
  srand(time(NULL));
  int i;
  for (i=0; i < 100000000; i++) {
    uint64_t a = rand64();
    uint64_t b = mutate(a);
    assert (high_common_bits_reference (a,b) ==
            high_common_bits_portable (a,b));
  }
  return 0;
}

The code in main() is just a trivial fuzz tester: rand64() gives uniformly distributed 64-bit random integers and mutate() randomly flips 0..63 random bits in its argument (making both arguments independently random does not do a very good job testing this code).

Can we add malicious code that this fuzzer will not detect? Of course this is easy, for example we can add this code to the first line of either function:

if (a==0x12345678 && b==0x9abcdef0) return -77;

Why return -77? Let’s assume that the return value from this function (which is guaranteed to be in the range 0..63) is used as an array index, and offset -77 lets us turn an array write into a stack smash.

First, will our fuzz tester discover this modification? Nope… even with 100 million random trials, the probability of detection is on the order of 10-30.

Next, let’s try to discover the modification using Klee, an excellent open source whitebox test case generator. We’ll change the test harness to this:

int main(void)
{
  uint64_t a,b;
  klee_make_symbolic (&a, sizeof (a), "a");
  klee_make_symbolic (&b, sizeof (b), "b");
  assert (high_common_bits_reference (a,b) == high_common_bits_portable (a,b));
  return 0;
}

As expected, Klee discovers the problem almost immediately:

KLEE: ERROR: /mnt/z/svn/code/bitdiff-klee.c:129: ASSERTION FAIL: high_common_bits_reference (a,b) == high_common_bits_portable (a,b)

How did it find these inputs? The point of a whitebox testcase generator is to take advantage of the structure of the code while generating test cases. So here, it simply asked “What values of the arguments lead to the assertion-failed branch being taken?” The answer, in this case, is trivial.

Do tools like Klee pose a serious problem for our effort to insert a backdoor? Let’s try to stop Klee from seeing our bug. The canonical way to fool a whitebox testcase generator is to exploit the fact that these tools solve hard inverse problems — so we insert a computation that is hard to invert. A checksum computation should do nicely. Our bug code now reads:

if (obfuscate_crc(a)==0xdf590898 && obfuscate_crc(b)==0x970b5f84) return -77;

where:

uint64_t obfuscate_crc (uint64_t x)
{
  return crc32 ((unsigned char *)&x, sizeof (x));
}

crc32() is simply a standard 32-bit checksum function that my stupid WordPress plugin for rendering C code cannot render, but you can find it in the full source code for this example.

Now we run Klee again and — damn! — after about 30 minutes (on my slowish home machine) it finds an input that triggers our buggy code. This is impressive; I did not expect Klee to be able to invert a couple of CRC32 operations. While it’s clear that we could push this line of attack further, and eventually confuse Klee to the point where it gives up, let’s try a different tactic. Our new bug code is:

 if (obfuscate_pipe(a)==0x12345678 && obfuscate_pipe(b)==0x9abcdef0) return -77;

where:

uint64_t obfuscate_pipe (uint64_t x)
{
  int pipefd[2];
  int res = pipe (pipefd);
  // assert (res==0);
  res = write (pipefd[1], &x, sizeof(x));
  // assert (res==sizeof(x));
  uint64_t ret;
  res = read (pipefd[0], &ret, sizeof(x));
  // assert (res==sizeof(x));
  close (pipefd[0]);
  close (pipefd[1]);
  return ret;
}

This time, rather than trying to overcome Klee using brute force, we’re counting on its inability to follow data through certain system calls. This time we get the desired result: not only does Klee not find anything amiss with our code, but it reports the same number of total paths (65) as it reports for the unmodified code.

A third way to trick Klee is to exploit a weakness in its goal of achieving full path coverage of the system under test. It is not hard to create a code path that is malicious for particular values of variables, but not for other choices. For example, Klee does not see the problem in this code:

int indices[] = { 0, 2, 1, 5, 3, 0, -77, 0, 3, 6 };

value_t storage[7];

void stash (int i, value_t v)
{
  if (i<0 || i>=10) {
     error();
  } else {
    storage[indices[i]] = v;
  }
}

When Klee tests the path containing the array store, it won’t notice the malicious index unless it happens to generate inputs that result in stash() being invoked with i==6 (it doesn’t, at least for me).

I feel certain there are additional ways to fool this kind of tool and will be interested to hear people’s thoughts.

Note that Microsoft does massive whitebox testing of its products at the binary level (Klee, in contrast, operates on LLVM bitcodes). Anyone who wants to put a backdoor into these products had better be covering their tracks carefully using techniques like the ones in this post (of course other strategies, such as ensuring that the affected modules are never tested by SAGE, would seem simpler and more reliable).

Finally, I might be able to come up with a new underhanded C contest where the goal is to fool both humans and a whitebox tester; let me know if that sounds interesting.

[Notes: Source code is available here. While preparing this piece I used this self-contained Klee distribution which is super easy to use and is described in a bit more detail on the Klee "getting started" page.]

Update from Jan 25:

As arrowdodger points out below, Klee should find the problem with the third method in this post. The fact that I was unable to get it to do so stemmed from a silly mistake I made: giving Klee the “–optimize” flag, which runs LLVM’s optimizers, which deleted the array write as dead code. Thanks to the people on the Klee mailing list for setting me straight!

Discovering New Instructions

Sometimes I wonder what instruction sets are supposed to look like. That is, what instructions would there be if computers were redesigned by smart people who understood our fabrication capabilities and who knew what we wanted to accomplish using computers, but who didn’t care about backwards compatibility and who haven’t seen our architectures? We can get little glimpses into that world by looking at network processors, DSPs, GPUs, ISA extensions like SSE4 and NEON, extensible processors like Tensilica’s, and others. But still, these are too rooted in the past.

Although the machine code emitted by our compilers is inextricably tied to our processors, perhaps this code can still be leveraged to discover new instructions. As a thought experiment, let’s start with a collection of executables whose performance we care about. Preferably, some of these will have been compiled from programs in Haskell, OCaml, and other languages that are not well-represented in today’s benchmark suites. We’ll run these programs in a heavily instrumented execution environment that creates a dynamic dataflow graph for the computation; the excellent Redux paper shows how this can be done. Next, we’ll need to clean up the dataflow graphs. First, we rewrite processor-specific operations (condition code dependencies, CISC instructions, etc.) into a simpler, portable form. Next, we optimize away as much dead, redundant, and vacuous code as possible; including, hopefully, all irrelevancies such as stack frame manipulation, dynamic linking, and garbage collection. The result — perhaps — will be something beautiful: the essence of the original computation, stripped of all sequential constraints, processor-specific operations, and bookkeeping. Of course this dataflow graph has some limitations. First, it only encodes the meaning of a single execution of the program. Second, it encodes a lot of incidental facts about the computation such as the bitwidths of all integer operations, the specific hashing methods used, etc. We’ll just have to live with these problems. The Redux paper contains a great example where factorial codes written in C, in Haskell, and in a stack machine are shown to all produce basically the same dataflow graph.

So now we have a collection of giant dataflow graphs: one for each execution of each program that we’re interested in. Our goal is to design an instruction set that can compute these dataflow graphs. Trivially, this can be done by partitioning the graphs into very small units of computation corresponding to a RISC instruction set. But that ISA is boring and won’t show any performance wins. To do better we’ll use a search-based strategy to find subgraphs that:

  • occur a large number of times across the collection of dataflow graphs — these are operations that are executed frequently by our workloads
  • contain a lot of parallelism — making them good candidates for hardware acceleration
  • contain a lot of internal symmetry — supporting SIMD-style execution
  • have a small number of dynamic inputs and outputs
  • rely on a small number of constants
  • do not contain dependency chains that are too long — we don’t want to create instructions that are too slow

I think this can be done; none of these properties seems particularly difficult to test for. The major problem necessitating cleverness will be the huge size of the dataflow graphs. We’ll end up with a list of candidate instructions ranked by some fitness function, such as performance or code density. We can build as many of these into our new ISA as we have hardware budget for.

Would this method discover saturating arithmetic instructions when applied to signal processing codes? Would it find clever ways to optimize bounds checks and exception handling in high-level programming programming languages? It’s possible (though I’d be super disappointed) that the new machines are just incidental variations on existing RISC and CISC designs. If this happened, I would suspect that we had failed to abstract away a sufficient number of processor artifacts. Or, perhaps it was a mistake to compile our computations to an existing machine architecture before building the dataflow graphs. Rather, perhaps we should start with a source-level program and its operational semantics, unrolling it into a dataflow graph without any compilation to machine code. This avoids ties to our existing processors, but also risks coming up with graphs that are very hard to map back onto actual hardware. Of course, many languages don’t even have a real semantics, but researchers are diligently working on that sort of thing. An even more difficult option would build up the functional representation of a source program (or executable) without running it, but this has the disadvantage of losing the “profile data” that is built into a dynamic dataflow graph — we’d need to add that in separately.

An aspect of this exercise that I find interesting is that gives insight into what our processors really do. Many years ago (I don’t have a reference handy, unfortunately) a study showed that computers spend most of their time in sorting algorithms. That cannot be true anymore — but what does the average mobile phone do? What about the average supercomputer chip? The average machine in Google or Amazon’s cloud? Of course we know the answers at a high level, but are there surprises lurking inside the computations? I would expect so — it’s tough to take a close look at a complicated system and not learn something new. Are there new instructions out there, waiting to be discovered, that can help these workloads execute more efficiently? I have to think so, at least for for workloads that are not well-represented in Intel’s, AMD’s, and ARM’s benchmark suites.

Wanted: Epitaphs for Hot Topics

Any given research community always has a few hot topics that attract an inordinate number of paper submissions. Sometimes these are flashes in the pan, other times they mature into full-fledged areas having their own workshops and such — but most often they endure for a few years, result in a pile of PhDs, and then slowly die off. After this happens I often find myself with unanswered questions:

  • Were the underlying research problems solved or were they too hard?
  • Did the solutions migrate into practice or are they withering on the academic vine?
  • What lessons were learned?

For example, consider software-based distributed shared memory, where MMU tricks are used to provide a shared address space for a cluster of machines. My former colleague John Carter wrote some of the basic papers in this area and they are hugely cited: more than 2000 citations for just the top four papers (including the patent). Clearly, a lot of followup papers were written. But then, as far as I know, software DSM is not widely used today and certainly it’s not an area where papers are being written. My understanding (which could easily be flawed) is that DSM just abstracted away a few too many things, such as performance and node failures, and this made DSM hard to use compared to explicit messaging. On the other hand, the DSM community created a detailed understanding of memory coherence models and (I believe) this directly fed into the formalization of the Java and C++ memory models. Thus, the practical impact of this research area is large, even if a bit oblique.

Couple more examples:

My experience is that the people who were there — writing papers about DSM or DHTs or P2P or whatever — always know the answers to my questions above. The problem is, by the time the answers become apparent everyone is working on the next hot topic and going up for tenure or whatever, and has zero time to write a proper retrospective. Furthermore, in many cases an honest retrospective would need to be a bit brutal, for example to indicate which papers really just were not good ideas (of course some of these will have won “best paper” awards). In the old days, these retrospectives would have required a venue willing to publish them, perhaps ACM Computing Surveys or similar, but today they could be uploaded to arXiv. I would totally read and cite these papers if they existed, and would make my students do the same.

Updates:

If you know of a good research epitaph, please send a pointer.

  • Michael Hind’s paper Pointer Analysis: Haven’t We Solved This Problem Yet? isn’t quite what I’m looking for, but it’s not too far off either.
  • David Andersen sent a link to this excellent epitaph for QoS.
  • Bhaskar Krishnamachari sent a pointer to this paper about the state of simulation studies in wireless ad hoc networks; it’s not really an epitaph but serves a similar role — capturing lessons that need to be learned.

It’s All About Interfaces

The Frank system — see also this recent post – is intended to reduce the amount of code needed to create a usable desktop software stack by about 1000x. I’m pretty sure that this goal is sufficient but not necessary. In other words, if we can reduce software size to 0.1% of its current size then that’s great and a lot of current software problems will be far less severe. However, the 1000x reduction is probably unnecessarily difficult — there may be easier ways to achieve similar benefits.

To support my claim that “1000x reduction in software size” is too strong of a goal, consider this question: Would a reasonable person object to running Frank on a modern Intel processor containing more than two billion transistors? In other words, does the gigantic complexity of one of these chips somehow detract from Frank’s simplicity? Of course not. The x86-64 ISA forms a fairly leak-proof abstraction boundary between Frank and all those transistors.1 I believe the same argument can be made about other interfaces. Let’s say that we have a compiler or an operating system that we trust, perhaps because it has been proved correct. In this case, does it really matter how much code implements the proved-correct interface? I don’t think so.

What we need, then, are more interfaces that are:

  • durable
  • non-leaky
  • beautiful (or, at least, extremely workable)

A durable interface lasts. For example, the UNIX system call interface is reasonably durable — it is often the case that a statically linked UNIX executable will run on many years’ worth of systems, whereas a dynamically linked program rots fairly quickly. x86 has proved to be highly durable.

A non-leaky interface reveals little or nothing about the interface’s implementation. Trivially, a non-leaky interface requires some sort of language-based or OS-based memory isolation — otherwise peeks and pokes (and their unintentional equivalents) cause leaks. Non-leaky interfaces require a lot of error-checking code to ensure that each side adheres to its part of the contract. Performance leaks, alas, are nearly impossible to avoid, unless we give up on implementing optimizations. Bugs are perhaps the ultimate abstraction leak, forcing evil workarounds in code using the interface. Formal verification and automatically-inserted contract checks constitute partial solutions.

Beautiful interfaces do not expose too much or too little functionality, nor are they more complex than they need to be. I’m not sure that I’ve seen a lot of specific examples of beautiful interfaces lately, but at a slightly more abstract level I would consider regular expressions, database transactions, stream sockets, SMT queries, RISC instruction sets, functional programming languages, and hierarchical filesystems to all have a certain amount of beauty. A lot of work and a lot of iteration is required to create a beautiful interface — so this conflicts somewhat with durability.

The worthwhile research problem, then, is to create interfaces having these properties in order that the code living in between interface layers can be developed, understood, debugged, and maintained in a truly modular fashion. To some extent this is a pipe dream since the “non-leaky” requirement requires both correctness and performance opacity: both extremely hard problems. Another problem with this idea — from the research point of view — is that a grant proposal “to create durable, non-leaky, beautiful interfaces” is completely untenable, nor is it even clear that most of this kind of work belongs in academia. On the other hand, it seems clear that we don’t want to just admit defeat either. If we disregard the people who are purely chasing performance and those who are purely chasing correctness, a substantial sub-theme in computer systems research can be found where people are chasing beautiful interfaces. There are probably a lot of good examples I could give here, but Matthew Flatt (example) and Eddie Kohler (example) come to mind.

In summary, I’ve tried to argue that creating a really small software stack, as in Frank, is a good goal, but not necessarily the very best one.

  1. Is the ISA exported by a modern Intel chip really leak-proof? Pretty much so as far as the functional semantics goes. However, the abstraction is incredibly leaky with respect to performance — but this only matters for real-time systems and performance-sensitive codes — neither of which is in Frank’s target domain.

Fixing Computers

Pretty often, a friend or relative asks if I can fix something that’s wrong with their computer. Because I’m very irritable in some respects, my first impulse is always to say something like “Sure! And I bet a brain surgeon would love to put a band-aid on that ouchie, too.” Usually, I manage to resist. Over the years I’ve found that fixing random PC problems is a surprisingly entertaining form of puzzle solving. Also, when successful, it makes people really happy. This post contains a few things I’ve learned.

Most often the problem is easy to diagnose and also easy to fix, like a flaky fan, a full disk, not enough RAM, flaky RAM, a cable or I/O card is loose, or a peripheral has died.

A noisy fan should always be fixed. Even if the noise isn’t especially irritating, it probably stems from a fan that’s about to fail because it’s rubbing against something or its bearings are wearing out. Failed fans lead to nasty secondary consequences, so it’s best to just nip problems in the bud. Identifying the noisy fan is super easy: use a pencil or similar to stop each externally-reachable fan, one at a time. If this fails, crack open the case and repeat for each internal fan.

Some problems are easiest to diagnose by swapping out parts. Unfortunately, most people don’t keep a reserve of spare parts, but even so, common stuff like keyboard and mouse, monitor and video card, DVD drive, etc. are not hard to deal with. When in doubt, throwing a bit of money at the problem often works.

Some problems are very hard to diagnose. One of the worst I’ve seen stemmed from a slightly flaky DIMM and also a slightly flaky CDROM drive. Swapping out either part made the problem better, but did not make it go away.

Some fraction of the things that go wrong with Windows machines — malware, DLL hell, disk corruption, and similar — simply cannot be fixed. Luckily, these problems often only become serious fairly late in a computer’s useful operating life. Thus, people don’t mind too much being told they’re stuck with the problem unless they feel like re-installing the OS. Other Windows problems can be fixed by a mixture of Googling, registry editing, judicious uninstallation of crapware, etc.

It’s best to err on the side of caution, especially for old machines that are held together using duct tape and baling wire. Not too long ago I permanently killed my wife’s primary work machine either by defragmenting the hard disk or by uninstalling a few programs that I suspected were slowing down the boot process. Neither of these actions should have stopped the machine from booting — but one of them did, possibly by exposing some latent HDD corruption. Of course the dead machine was my fault and I had to dig myself out by building her a new, fast machine.

Finally, people are generally miserably poor about backups and computer problems present an excellent excuse to lecture them on backup discipline. These days it couldn’t be easier to perform proper backups, even if it’s just copying the “my documents” folder to a USB dongle or Dropbox disk every day or two.

Can Simplicity Scale?

Software has gotten really big, with many systems — even, apparently, cars — running into the hundreds of millions of lines of code. The drawbacks of code bases this large are numerous: they are hard to understand, hard to modify, hard to test, and virtually guaranteed to contain huge numbers of bugs. My understanding is that up to this point, we have survived principally by burying mountains of code under abstraction layers, hiding much of the complexity from the next layer in the software stack. This is fine but it only gets us so far: large codes tend to export quirky, unclean abstractions and of course bugs cause even more abstraction leaks. We should be exploring alternative approaches that might be able to radically reduce the size of our code bases.

This annual report describes an ongoing effort to implement Frank, a full software stack for an interactive desktop machine, using less than 20,000 lines of code. This is a great project and I love the fact that they’ve made an annual report — typically an internal-only document required by a funding agency — publicly available. Another nice aspect of this project is that creating an innovative GUI is explicitly not a goal; they simply want to mimic existing systems using 1000 times fewer lines of code.

The technical approach, in a nutshell, is to design a number of domain-specific programming languages, each suited to concisely expressing some part of the system. For example, consider this text from the report which describes the core of Frank’s windowing system, which is written in the Nile language:

The several dozen standard compositing rules, shading, stroking, gradients, sampling, shadowing, etc.—457 lines in total—were written in Nile and debugged to make a working graphics system, strong enough to be used for Frank, and to do all the graphics required for personal computing (and hence this report).

The Nile code can be found here. What is really being talked about — thought the report doesn’t use the term — is executable specifications. Specifications describe how a system should operate; they are written in English, mathematics, pseudocode, etc. and in general they can’t be executed efficiently (or at all). On the other hand, if the domain and the specification language are constrained, it should be possible to create executable specifications. Even in cases where the simple version of the specification cannot be executed efficiently, this approach supports the separation of optimizations from the core specification, making it easier to reason about their correctness.

Although I love this research project, at some point we have to ask ourselves if the basic goal — a 1000x reduction in lines of code, or a full system in 20 KLOC — will survive contact with the enemy. To play devil’s advocate for a moment, I suspect that a minimal OS + GUI + applications could be written in 20,000 lines of code even in C++ if the design was careful, the feature set limited, and the implementers talented. To continue with that line of thought: How much of Frank’s elegance will survive performance tuning, porting to many platforms, feature creep, internationalization and related hacks, mediocre developers, security hardening, and all of the other things that happen to software as it is pushed into production and then used by real people to accomplish complex tasks?

My guess is that a significant fraction of Frank’s gains go away in the face of real-world engineering concerns. However, even if we are left with a 10x or 100x reduction in code size, instead of the original 1000x, the exercise is worthwhile. The thing that worries me most about pushing a system like Frank into the real world is that the hidden costs are larger than we might hope. Let’s make an analogy with microkernels. Speaking abstractly, there is much to love about, for example, a 10 KLOC kernel that has been proved correct. However, people whose judgement I trust have said that (1) debugging code running on a microkernel can be extremely hard and (2) creating high performance code requires very talented developers. The first problem stems mainly from concurrency, the second from the fact that in a microkernel-based system, an arbitrary piece of data is no longer a single pointer indirection away. It seems that a Frank-like system is likely to have similar problems. Creating a clean, coherent, and efficient system using a big pile of little languages, some of them highly mathematical, probably requires serious talent. In contrast, if I’m a medium-grade developer and you give me a big, crappy mountain of C++, I can just start hacking and probably I’ll eventually get the desired effect — even if it is fragile and impossible to maintain. The is a classic worse-is-better situation – which has nothing to do with “worse” or “better”, but rather is about the conflict between two value systems: one which prizes cleanliness and clarity, the other which values pragmatic code that works efficiently in practice and can be easily evolved to solve new problems.

In summary, I desperately hope that we won’t be seeing, in a few years, 50 billion lines of code on the desktop and in the automobile. This future seems all too plausible and we should be actively working to avoid it. Approaches such as Frank are promising. The Racket project has a significant amount of overlap with Frank (though its top-level goals are different, or at least differently stated); see for example the recent CACM article by my colleague Matthew Flatt.

NSF Data Management Plans

As of a year ago, all grant proposals submitted to NSF must be accompanied by a data management plan. Basically, the PIs must explain:

  • how sensitive data (for example, data that contains personal information about experimental subjects) will be managed
  • how data resulting from the research will be archived and how access to it will be preserved

The requirements seem reasonable and certainly people asking for public funding should have thought about these issues. The problem is that little guidance is given about what constitutes an appropriate plan, or how the quality of the plan will play into the evaluation of the proposal. Since I (happily!) do not work with sensitive data, this post will deal primarily with the second item — preserving public access to data.

The NSF policy on Dissemination and Sharing of Research Results makes it clear that researchers are expected to share the results of their research in three ways. First, research results must be published. Second, peripheral data must be shared “at no more than incremental cost and within a reasonable time.” Third, NSF grantees are permitted to “retain principal legal rights to intellectual property developed under NSF grants to provide incentives for development and dissemination of inventions, software and publications that can enhance their usefulness, accessibility and upkeep.” In other words, it is permissible to create proprietary products based on technologies developed using NSF money.

It is the second item in the previous paragraph that concerns us here, and which seems to be a main focus of the data management plan. The problems that PIs need to solve when writing this plan include determining:

  • How much data will need to be archived? A few megabytes of code or a few petabytes of imaging data?
  • Over what time period must data be stored? The duration of the grant or 75 years?
  • What kind of failures must be tolerated? The failure of a single disk or tape? A major, region-wide disaster (like the one Salt Lake City is certain to get, eventually)?
  • In the absence of disasters, how many bit flips in archived data are permissible per year?

Of course the cost of implementing such a plan might vary over many orders of magnitude depending on the answers. Furthermore, there is the difficult question of how to pay for archiving services after the grant expires. I’m personally a bit cynical about the ability of an educational organization to make a funding commitment that will weather unforeseeable financial problems and changes in administration. Also, I’m guessing that data management plans along the lines of “we’ll mail a couple of TB disks to our buddies across the country” aren’t going to fly anymore.

For large-scale data, a serious implementation of the NSF’s requirements seems to be beyond the capabilities of most individual PIs and probably also out of reach of smallish organizations like academic departments. Rather, it will make sense to pool resources at least across an entire institution and probably across several of them. Here’s an example where this is happening for political and social research. Here’s some guidance provided by my institution, which includes the Uspace repository. I don’t believe they are (yet) equipped to handle very large data sets, and I’d guess that similar repositories at other institutions are in the same boat.

This post was motivated by the fact that researchers like me, who haven’t ever had to worry about data management — because we don’t produce much data — now need to care about it, at least to the level of producing a plausible plan. This has lead to a fair amount of confusion among people I’ve talked to. I’d be interested to hear what’s going on at other US CS departments to address this.

Proposal to Stress-Test Implementations of C++11 Concurrency

I recently submitted a proposal to NSF’s “small” Computer Systems Research solicitation. The rest of this post contains the body of the proposal, edited a bit for formatting (citations are missing and there are no doubt some leftover latex-isms) and to remove some sections that are tedious to people not sitting on an NSF panel. If you’re curious, here’s the entire proposal in the form that the panel will see. I’m posting this material in hopes of getting useful feedback and also because I think requests for public money should be open. I wrote a bit more about this earlier.

Motivation and Overview of the Proposed Work

Many large and important concurrent software systems are written partly or entirely in C++98:1 2

  • Apple’s Mac OS X
  • Adobe Illustrator
  • Facebook
  • Google’s Chrome browser
  • Microsoft Windows 7 and Internet Explorer
  • Firefox
  • MySQL

Similarly, many large and important concurrent software systems are implemented in C, including almost every operating system kernel and hypervisor, most Java virtual machines and many other programming language runtimes, and many embedded systems including those that control critical transportation and medical care systems.

It is ironic, then, that these languages–in which massive amounts of mission- and safety-critical concurrent code are written–have no built-in support for concurrent execution. This omission can cause problems for software development efforts. First, the compiler–not designed for concurrency–can introduce race conditions into programs that appear correct at the source level. Second, consider this code which implements a two-process mutual exclusion scheme using the well-known Dekker algorithm:

volatile int flag[2], turn;

void dekker_unlock (int i)
  turn = 1-i;
  flag[i] = 0;
}
void dekker_lock (int i) {
  flag[i] = 1;
  while (flag[1-i]) {
    if (turn != i) {
      flag[i] = 0;
      while (turn != i) { }
      flag[i] = 1;
    }
  }
}

This code can be proved to be correct (that is, it provides progress, mutual exclusion, and bounded wait) as long as operations on volatile global variables are sequentially consistent.3 Therefore, this code works on uniprocessor machines (of course, a spinlock is not efficient on a uniprocessor, but at least it is correct). On the other hand, on most multiprocessor machines this Dekker implementation is incorrect: it fails to provide mutual exclusion. This is even the case on machines with a relatively strong memory model such as multicore Intel and AMD processors implementing the x86 and x86-64 architectures.

To fix the Dekker code, processor-specific memory fence operations must be added. Fences enforce ordering constraints on memory operations that would otherwise be reordered by hardware. Because memory system and fence semantics vary, code with fences is not portable, and also it can be extremely difficult to figure out where fences are required. If one errs on the side of adding too many fences, performance will suffer. If one errs on the side of adding too few, the Dekker code will fail to properly implement mutual exclusion.

Of course, the vast majority of programmers are not interested in writing platform-specific code for a weak memory model. The alternative is to target a specific concurrency library such as POSIX threads or Win32 threads. However, libraries are typically not universally available; POSIX threads are generally not found on non-UNIX platforms and Win32 is not found outside of Windows machines. Additionally, libraries such as POSIX threads fail to provide direct access to low-level hardware operations such as compare-and-swap instructions, preventing certain high-performance idioms from being expressed portably.

The C++11 Memory Model

The recently-adopted C++11 standard was designed to fix the problems discussed in the previous subsection. First, it standardizes C++ concurrency across platforms. The standard formalizes the notion of data race and prohibits compilers from introducing races. Second, it provides an extensive suite of powerful, low-level operations that are intended to make it possible for portable code to attain very high performance. It is believed that the next version of the C standard (the current draft is referred to as “C1X”) will incorporate a memory model very similar to the C++11 concurrent memory model.

Problem: Correctness of Implementations of the C++11 Memory Model

The problem that the proposed work aims to solve is:

  1. The C++11 memory model is large and complicated, containing dozens of classes and templates, and hundreds of functions. These interact closely with OS-specific synchronization calls and hardware-specific synchronization instructions.
  2. Today’s C++98 compilers will serve as the basis for all known C++11 implementations. These tools–which are large, complicated, and contain highly aggressive optimizations–were developed in the absence of any non-trivial memory model.
  3. Compiler optimizations and the C++11 memory model are intimately related. For example, some transformations that were previously permissible are now illegal. Although the full extent of the interactions between existing compiler implementations and the new memory model is not yet totally understood, it is clear that a lot of work will be required to iron out the details. Subtle bugs are likely to persist for some time.

We have talked to compiler developers who work on LLVM, GCC, and on several important commercial compilers. None of these developers expects that implementation of the C++11 memory model will be quick or easy. In fact, they all expressed concern about interactions between existing optimization passes and constraints imposed by the new model.

Intellectual Merit — Major Research Challenges and Solution Sketch

In a nutshell, the proposed work is to randomly generate concurrent C++11 programs, compile them, and then run them on an infrastructure designed to reveal concurrency errors. This is a logical extension of our previous work that has resulted in more than 400 previously unknown compiler bugs being identified, reported, and usually fixed. However, there are several significant challenges that must be overcome.

Challenge #1: Generating Valid Code

The major challenge in creating Csmith was to generate code that is expressive and well-defined. Expressive code exercises a large subset of language features, and well-defined code avoids executing any of the 191 undefined behaviors from the C standard (integer overflow, array and pointer errors, union errors, divide by zero, shift past bitwidth, etc.). In C++11, data races are an undefined behavior. The definition of data race is intricate, but the core idea is to determine if every access to each shared data item is sufficiently synchronized with respect to all other accesses to that item. Generating random, race-free programs is expected to be significantly non-trivial in the presence of arrays and pointers. Our solution will exploit the dataflow analysis that is already a part of Csmith. Essentially, we will implement a static race checker that runs in tandem with program generation.

Challenge #2: Generating Interesting Invariants

Automated random testing can succeed only when useful invariants are identified in the system under test. Our previous work has exploited three invariants. First, our testing of the volatile qualifier made use of the volatile invariant, which is implied by the C and C++ standards. It specifies that the number of load and store operations issued to each byte of a volatile-qualified object must not change across compilers or optimization options. Second, our wrong-code bug hunting using Csmith relied on the invariant that for a well-defined test program, a checksum taken over the global variables at the end of execution must not change across compilers or optimization options. Third, randomized compiler testing relies on the simple invariant that the compiler should not crash when presented with valid input.

Randomly testing implementations of the C++ memory model will only be possible if we can generate code with interesting concurrent invariants. Of course, concurrent programs written by humans have invariants (e.g., “the empty flag is set iff the list is empty”). We plan to use ideas from invariant-based program synthesis to create random programs with non-trivial invariants. A useful consequence of the invariant-based approach to compiler testing is that Csmith’s test cases will become self-checking–compiler bugs will cause them to fail with an assertion violation. In contrast, our compiler testing work up to this point has used differential testing where multiple compilers (or multiple optimization levels of a single compiler) were tested against each other.

Challenge #3: Exploring the Scheduling Space

The space of thread interleavings is large, and finding bugs in concurrent programs boils down to exploring as much of this space as possible. Our compiler stress testing work will explore three approaches. First, we hypothesize that early progress can be made using simple mechanisms to introduce scheduling noise, for example by introducing randomized nop instructions or sleep calls into a compiled program using binary rewriting, and by using a second machine to send randomized network interrupts to the machine that is executing test cases. Second, there is much existing work on designing runtime systems that attempt to make concurrency errors show themselves; we will reuse this work to the maximum extent possible. Third, we will explore the idea of implementing a hostile x86-64 simulator. Although we believe that it is probably not feasible to compute a pessimal schedule for a program, hostility is an attainable goal. It will be implemented using techniques such as triggering context switches near and inside (inferred) critical sections and delaying inter-processor communication as long as possible, in effect replacing the CPU’s write buffer with a “wrong buffer.”4

The thesis of this proposal: Randomized stress testing of implementations of the C++11 memory model will significantly shorten the period of time during which compilers are flaky and immature. In the long run, randomized stress testing will lead to a better understanding of corner-case behaviors of the C++11 memory model and to test suites with improved coverage of interactions between the C++11 memory model, compiler optimizers, OS thread systems, and processor memory coherence mechanisms. The intellectual work required to produce concurrent test cases is a major increment upon our existing work on randomized test case generation which has had great success in finding compiler bugs, but which is limited to sequential code.

The State of the Art is Inadequate

Existing methods for testing compiler implementations break down into three broad methods:

  1. Compiling large, real applications. For example, a significant milestone in the development of any compiler is when it becomes self-hosting (can compile itself).
  2. Compiling collections of small, troublesome codes (e.g., GCC’s “torture test,” which contains about 2800 programs).
  3. Randomized testing. Our Csmith project is one example, but the history of this technique goes back about 50 years.

All three methods are useful and necessary. Compiler developers we have talked to, who are working on C++11 implementations, plan to use methods 1 and 2. We do not believe these methods will be sufficient to rapidly squash all of the relevant bugs (nor do the compiler developers appear to believe this). Consequently, we claim the proposed work is useful and important.

Why Testing Instead of Verification?

CompCert is an existence proof for a verified, usable, optimizing compiler for a realistic programming language. One might ask: Why should a testing proposal be funded? Why not be more ambitious and create a proved-correct C++ compiler? There are several reasons:

  • Testing is end-to-end, whereas formal verification typically addresses a narrow layer of the software stack. For example, in addition to finding compiler bugs, our compiler testing work has found linker bugs, assembler bugs, runtime library bugs, CPU simulator bugs, and even bugs caused by the fact that the compiler had itself been miscompiled. Formal verification cannot find these errors without specifically verifying the linker, assembler, runtime library, simulator, and the compiler that compiled the compiler under test. While these are all laudable goals, the combined effort to do these verifications is gigantic.
  • Even verified compilers require testing. For example, we have discovered and reported 13 CompCert bugs. The core issue is that creating a machine-checked mathematical proof is one thing, but proving the right thingis a separate and very hard problem. Typically it is not possible to once and for all “prove that one has proved the right thing.”
  • The C++11 language is dramatically more complex than C. CompCert does not even compile all of C, but rather only a useful subset. Creating a verified, optimizing compiler for a broadly useful subset of C++11 would be a massive undertaking, requiring (we guess) more than a few person-decades of effort.

Summary: A verified C++11 compiler is out of reach for now. But even if we had this compiler, we would still need to stress test it.

Specific Outcomes

The proposed work, if successful, will have several useful outcomes. First, and most obviously, we will create tools for stress-testing implementations of the C++11 concurrency model. These tools will be open-source, and consequently will benefit every team that is working on a C++11 implementation (in addition to transitively benefiting users of those compilers and users of systems built by those compilers). Second, as a side effect of developing and validating our tools, we will report bugs against open source compilers (GCC and LLVM, in particular) and also we will work with any commercial compiler vendors who seek us out.5 Third, we expect to be able to produce useful “litmus tests”–small test cases that are (empirically) difficult for compilers to translate correctly and that, taken as a group, provide good test coverage of interesting parts of the C++11 concurrency model. Finally, we hope to generate a better understanding of the C++11 concurrency model. By analogy, our Csmith work has (completely automatically) exposed some interesting corner cases in the C programming language, as described here: (link)

The PI’s Prior Work on Compiler Testing

This section briefly describes our work leading up to this proposal. It also explains why we are virtually certain that the proposed work will find bugs in production-quality compilers.

Although the C and C++98 languages do not have a notion of concurrent execution, they do have a simple memory model that is designed to support reliable access to memory-mapped hardware device registers. This model is provided by the volatile qualifier.6 A volatile-qualified object must be accessed at the memory system level as it is by the C/C++ abstract machine. In other words, a read from a volatile-qualified variable in C/C++ must result in a load being issued at the machine level–the variable’s value must not be cached in a register. Similarly, a write to a volatile variable in C/C++ must result in a store at the machine level–the compiler must not suppress redundant stores, as an optimizer normally would. The volatile qualifier’s semantics in C++11 are the same as they are in C and C++98.

The PI, while teaching embedded systems courses in the early/mid 2000s, noticed that the compilers being used by students did not always respect the volatile qualifier. This was troubling. In joint work with Eric Eide, he developed a way to test compilers’ implementation of the volatile qualifier. For every compiler we tested–including commercial tools used to compile safety-critical embedded systems–we found errors in its treatment of volatile-qualified variables. Some of these results are reported in our 2008 paper, but since then we have tested quite a few more compilers, with the same result. In summary, no compiler that we tested was able to correctly implement the volatile qualifier, which is basically trivial when compared to the C++11 memory model.

Since volatile-qualified objects are not totally reliable, we developed a workaround where source code is re-written in such a way that every read from a volatile variable is changed into a call to a helper function, and every write to a volatile variable is changed into a different helper function call. We verify by hand that the helper functions perform the necessary memory operations. In a large fraction of our tests (96% of the time) this strategy was successful in working around bugs in compilers’ implementations of volatile. The workaround is successful because compilers tend to be able to perform function calls reliably, even when they cannot access volatiles reliably. Subsequently, we became interested in the 4% of cases where our workaround failed. We hypothesized that these cases were due to “regular” compiler bugs having nothing to do with the volatile qualifier. We began systematically hunting for these bugs.

Since 2008 the PI’s group has found and reported about 400 bugs in widely-used C compilers. These bugs were all found using Csmith, a random C program generator that is a (distant) descendant of our volatile testing tool. Most of the bugs were in the compilers’ middle-ends–the part of a compiler that is independent of both the language being compiled and the architecture being targeted. Despite the fact that most of these 400 bugs have been fixed, we can still crash and find wrong-code bugs in the latest version of all compilers that we routinely test (about six). CompCert is the lone exception. Csmith is open-source software.7

Brief Introduction to the C++11 Memory Model

Although the C++11 standard cannot be freely downloaded, a recent draft can be.8 Material about the concurrency model is concentrated in Chapters 1, 29, and 30. However, a “less formal” description by Boehm and a paper by Boehm and Adve are probably better places to begin reading. Additionally, a book about the model is expected to be released in January 2012. On the more formal side, Batty et al. have formalized the C++11 memory model (but not the rest of the language). This kind of activity is essential to understanding the most subtle parts of the model.

Application Programming Interface

The API for the C++11 concurrency model divides into two major parts. The first is a thread support library. It is object oriented and uses the C++ exception system, but otherwise is not very different from Pthreads and other existing thread libraries. C++11 threads are intended to map onto OS-supported threads in a one-to-one fashion. The threads interface is contained in four header files, respectively supporting thread manipulation, mutual exclusion, condition variables, and futures.

A welcome feature of the mutex library is a facility for safely acquiring multiple locks:

std::unique_lock lck1(m1,std::defer_lock);
std::unique_lock lck2(m2,std::defer_lock);
std::unique_lock lck3(m3,std::defer_lock);
lock(lck1,lck2,lck3);
// manipulate shared data ...

The advantage of this interface is that the developer does not have to worry about deadlocking the program by acquiring locks in the wrong order.

The second major component of the C++11 API is an atomic operations library. Atomic operations are synchronization operations that are semantically equivalent to locking an object, reading and/or writing it, and then releasing the lock. The difference is that in some cases, the atomic operation may be significantly more efficient than the explicit locking approach. For example, on x86 and x86-64 processors an atomic increment of an integer value can be compiled into a single instruction.

The most interesting feature of the C++11 atomic API is that it provides portable accesses to weak memory models. The default memory ordering for all atomic operations is sequential consistency (memory_order_seq_cst), meaning that the sequentially consistent operation is part of a global total ordering of memory operations. This is intuitive; for example, the Dekker code from Section 1–which is incorrect on a non-sequentially-consistent multiprocessor–can be fixed in C++11 simply by declaring the synchronization variables as atomic. That is, instead of declaring them as:

volatile int flag[2], turn;

They should be declared as:

std::atomic<int> flag[2], turn;

The atomic template replaces the default integer operations (assignment, increment, etc.) with atomic and sequentially consistent code appropriate for the target platform.

The problem with sequential consistency is that it can be expensive. To support high-performance multiprocessor programming, C++11 defines five more memory orderings: memory_order_relaxed, memory_order_consume, memory_order_acquire, memory_order_release, and memory_order_acq_rel.

The relaxed memory ordering is the most permissive: atomic operations using it are still atomic, but no ordering constraints are imposed across threads. This ordering is perhaps not very useful by itself, but C++11 provides explicit fence operations that can be combined with relaxed consistency to build useful and performant concurrent data structures.

Using memory_order_acquire and memory_order_release, high-performance lock acquisition and release primitives can be constructed. An acquire operation is a one-way memory fence that prevents other memory operations from being moved above the acquire; a release operation is a one-way memory fence that prevents other memory operations from being moved below the release. Together, they prevent memory operations from being moved outside of a critical section–since this would defeat the purpose of the critical section. An access with memory_order_acq_rel behaves as both an acquire and a release. The memory_order_consume is similar to memory_order_acquire, but weaker: only values that are dependent on the lock variable (via a dependency mechanism explained in the standard) are ordered by the consume/release sequence.

Obligations on the Developer and Compiler

The fundamental obligation placed upon a developer writing C++11 programs is that she must never permit a data race to occur. Informally, a program is free of data races if it contains adequate synchronization. In more detail, the C++11 standard says:

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other.

The definition of atomic is straightforward–operations on atomic types are always atomic. The definition of conflicting is also straightforward: “Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.” Alas, happens before is more intricate. It basically follows Lamport’s definition but accounts for synchronization that occurs through weak memory-model operations such as acquires and releases.

A data race, like any other undefined behavior in C++, destroys the meaning of the program. Avoiding races is relatively straightforward (if any concurrent programming problem can be called that) as long as the developer sticks with mutexes and sequentially consistent atomics. Static and dynamic race-detection tools are a fairly mature technology and will be able to help once they are adapted to support C++11.

On the other hand, the compiler cannot limit itself to the sequentially consistent case; its obligation to faithfully implement the C++11 standard is severe. The soundness of various compiler optimizations under weak memory models is a matter of current research. The next section explains some of the problems.

What Do C++11 Concurrency Bugs Look Like?

This section uses examples to briefly illustrate what could go wrong when a C++11 compiler translates a concurrent program. The best references about the interactions between optimizers and C++11 are by Boehm and by McKenney et al. (the latter document refers to a port of the C++11 memory model to C, but the discussion applies to C++ as well).

Example #1

Because C++98 and C were specified in the absence of a concurrency model, it is permissible for a compiler to introduce writes to objects that were not written to by the program. For example, consider this struct and accessor function:

struct foo { int a:8; int b:12; char c; } x;
void update_b (void) {
  x.b = 1;
}

It is perfectly legal for a C or C++98 implementation to translate update_b() into code that loads the entire 32-bit word containing x, appropriately modifies the 12 bits corresponding to x.b, and stores the word back into memory. In C++11 this translation is illegal because it may introduce data races with concurrent threads that are accessing field c. The current development version of the GNU C++ compiler, which is in a “bug fix only” state in preparation for the 4.7.0 release, miscompiles this code even if it is given compiler flags prohibiting it from introducing data races.9 On x86-64 the output is:

_Z8update_bv:
 movl x(%rip), %eax 
 andl $-1048321, %eax 
 orb $1, %ah 
 movl %eax, x(%rip) 
 ret

This code reads the entire struct into a register, uses a bitwise-AND and then a bitwise-OR instruction to modify the bits corresponding to b, and then stores the result back to memory. This is wrong in C++11.

Example #2

McKenney et al. provide an illustrative example concerning constant propagation: one of the simplest, most useful compiler optimizations. Given that x is an atomic variable and it is known to be either 0 or 1, how might this code be optimized?

if (x.load(memory_order_consume)) {
  ...
} else {
  ...
}
y = 42 * x / 13;

On platforms with expensive multiplication and division, the compiler would like to transform this code to:

if (x.load(memory_order_consume)) {
  ...
  y = 3;
} else {
  ...
  y = 0;
}

However, McKenney et al. state that “If the underlying machine preserves control-dependency ordering for writes, this optimization is perfectly legal. If the underlying machine does not preserve control-dependency ordering, then either this optimization must be avoided, a memory fence must be emitted after the load of x, or an artificial data dependency must be manufactured.” If compiler writers miss this subtlety, the generated code will either be suboptimal or wrong.

Details of the Proposed Approach

To recap, the proposed stress-testing work flow is to repeatedly (1) generate a random concurrent C++11 program, (2) compile it using a variety of possibly-buggy C++11 compilers and optimization flags, and (3) run these programs in hostile execution environments that are designed to expose interesting interleavings at the memory system level. We are not proposing to support all of C++11 in Csmith–that needs to be done, but it is well beyond the scope of this proposal. Rather, Csmith will generate C++11 that looks like C code with C++11 concurrency constructs added. We expect this to be sufficient to find many, but not all, bugs in C++11 concurrency implementations.

Generating Data-Race-Free Programs

Csmith already includes sophisticated machinery for generating programs that avoid undefined behaviors. Our PLDI 2011 paper explains this in detail, but the essence of the solution is to interleave static analysis with program generation. First, Csmith generates a statement (e.g., x = *y;) that is a candidate for addition to the current program. Second, Csmith performs a number of static, dataflow-driven safety checks on the candidate statement in the current program context. In this example, x and *y must be compatible types and y must not be null or refer to an out-of-scope memory location. If the safety checks pass, the candidate is committed to the current program; otherwise, the candidate is discarded. Progress is probabilistically guaranteed because Csmith’s random choices always include at least one “easy” alternative that never fails the safety check. Loops and function calls complicate this simple picture, but the basic structure of the solution, which we call “guess and check,” applies.

We will augment Csmith’s existing dataflow analyses to support static data race detection. Plenty of good starting points are available. Innovation may be required to perform race checking in the presence of atomics using C++11′s weak memory orderings. We do not know of a tool that checks for races in weakly ordered C++11, though it seems very likely that these will appear during the next year or two.

Given a race checker in Csmith, we believe that the guess and check approach will be sufficient to generate race-free programs. Even so, creating “good” random programs is as much an art as a science and we expect that additional refinements will be useful. First, accesses to sequentially consistent, atomic variables are always safe–the race detector does not need to be consulted before committing an access to this kind of memory. Second, Csmith’s safety checks can be simplified for data that is not shared between threads. We believe it will be useful to pre-partition variables into unshared ones that can be freely accessed by a single thread, and shared variables that must be synchronized appropriately. We expect unshared data to play an important role in compiler bug detection; for example, unshared data can expose compiler-introduced races like the example presented earlier. Also, code accessing unshared data can trigger optimization passes that may mangle nearby shared data accesses.

Generating Programs with Concurrent Invariants

Currently, just before terminating, a program generated by Csmith takes a checksum of the values held in all global variables and prints it. The checksum is invariant across all platforms making compatible choices for certain implementation-defined parameters (integer width and representation, mainly). Thus, a checksum difference indicates a compiler bug.

In general, the checksum approach will not work for concurrent programs. One approach would be to generate deterministic concurrent programs–where the checksum will again be invariant–but this seems quite restrictive, and will serve as a backup plan in case other proposed approaches do not work. Csmith currently does not track the values of scalar variables; its dataflow analysis will need to be extended to do this. We previously developed cXprop, a value-tracking abstract interpreter for concurrent C code; we do not expect it to be difficult to either repurpose cXprop or else re-implement the same techniques within Csmith. cXprop already knows how to generate assertions corresponding to the invariants that it computes. While we were debugging cXprop, an assertion violation generally signaled a cXprop bug–but of course these assertions can also catch compiler bugs. To ensure that invariants are not violated, Csmith will synthesize a separate thread that asserts that every invariant holds, in an infinite loop.

Thus far, we have discussed using the guess-and-check approach, combined with a concurrency-aware dataflow analysis, to generate concurrent code with invariants. We will also explore an alternative approach where invariants are the starting point for program synthesis, rather than a side effect of their construction. We will exploit a technique developed by Deng et al. for invariant-based synthesis of concurrent programs. This approach shows how to start with a coarse-grained synchronization solution and some invariants, and automatically refine it into a fine-grained solution. Our approach will be very similar: we will begin with randomized invariants and coarse-grained code (implemented using thread-level locks). Then, we will randomly apply transformations that make the solution more fine-grained, while preserving the invariants. First, critical sections will be split into smaller critical sections, then they will be replaced (when possible) with sequentially-consistent operations on atomic types, and finally we will attempt to introduce relaxed memory-model operations and memory barriers.

Finding Invariant Violations

Given an executable test case containing assertions, we need a way to determine if it was compiled correctly. The easiest (but least reliable) way to do this is to run the program and wait for an assertion violation to occur. This is not reliable because it is typically the case that running a concurrent program explores only a vanishingly small part of the space of all possible interleavings among memory operations in its threads. The hardest (but most reliable) way to find invariant violations is to run the compiled code through a formal methods-based tool that finds an execution where an invariant is violated, or else proves that such does not exist. Although a number of research efforts have attacked the problem of program verification on relaxed memory models, all such tools that we are aware of do not scale to even medium-sized codes. The problems that must be solved to make these tools scale are extremely difficult. Thus, while we plan to take advantage of any such tools that become available, we also plan to explore alternatives.

Executing on hardware

It is not difficult to perturb a program’s schedule in order to increase the probability that interesting interleavings are seen. One way is to use binary rewriting to insert randomized nop instructions into the compiled code. The inserted code can also use a random number generator so that delays are parameterized by a seed value. Another method is to cause interrupt handlers to fire on the machine that is doing the testing. This is easy to arrange by having a second machine send it small packets over a network link at short, random intervals. This method has the further benefit of causing substantial memory traffic due to DMA operations initiated by the network interface card.

Alglave et al.’s Litmus tool is based on similar ideas: it randomizes memory locations accessed by threads and also the threads’ CPU affinities. We will explore these techniques in addition to the ones described above. Although there are certainly limits on the effectiveness of techniques like these, we believe they are very much worth trying simply because they are low-cost, easily portable across architectures, and absolutely reliable in the sense that any behavior that is observed is real–there are no false alarms when executing on real hardware.

Yet another approach to executing test cases on real hardware is being explored by the GCC developers.10 They plan to execute the code under test in one thread and an assertion checker in a separate thread (as we propose above). They will to single-step through the thread under test (in an automated fashion, using debugger support) running the full assertion check after every instruction. This will ensure that every intermediate memory state is checked. Drawbacks of this approach are that it will incur a significant slowdown and also the act of single-stepping the processor will flush the processor’s pipeline and store barrier–this could easily mask bugs.

Executing in a simulator

Running test codes in a simulator reduces testing throughput, but has other advantages such as being repeatable and making it possible to simulate platforms that are unavailable or inconvenient. Although we do not know of an open-source simulator for ARM, x86, or x86-64 that faithfully implements the relevant memory model, it is possible that one exists or will exist within the next year or two. If not, we do not believe it will be too difficult to add weak memory model support to an existing simulator.

Given a weak memory model simulator, the challenge is to make it actively hostile to the program under test. One approach would be to use state-space exploration to look at many possible interleavings of memory operations. However, we believe this will be too slow to test realistic programs. Rather, it should be possible to focus the simulator’s efforts on weak points in the code under test. By analogy with race-directed software testing, we will explore the idea of causing the simulator to delay a thread until another processor is ready to execute code that touches conflicting memory locations. Similarly, there is significant room for hostility in the implementation of the write buffer, which should act as a “wrong buffer” by delaying flushes to main memory for as long as possible.

Test Case Reduction

To keep the story simple, we have not yet mentioned one problem that we will need to work on: test case reduction–the process of turning a large failure-inducing program into a much smaller one. Reduction is an important part of root-cause analysis for any compiler bug, and in practice it needs to be automated. Although the results are not yet published, automated test case reduction has been an active area of research for us for several years–but only for the case of sequential code.

Automated test case reduction for concurrent codes is more difficult because automated reduction algorithms–generally based on Delta Debugging–require deterministic execution. For example, Choi and Zeller’s work on isolating failure-inducing thread schedules fundamentally depends on a capture/replay mechanism that makes threads execute deterministically. It is not clear how to avoid the need for determinism. It is also not clear how to get efficient deterministic replay on a weak-memory multicore; there are research solutions addressing this problem but they generally require hardware support or are very inefficient; Heydari and Azimi survey this work. In summary, replay on real multiprocessors is effectively a separate research problem and we do not plan to address it.

Determinism in a simulator is trivial. Even so, test case reduction remains challenging. The fundamental difficulty in reducing the size of failure-inducing C/C++ programs is avoiding undefined behaviors. For example, in our experience it is common for a test case reducer to delete the initializer for a variable, resulting in a test case with undefined semantics.

When reducing concurrent C++11 programs the problems are even worse because we must not introduce a data race during reduction.

Two solutions to avoiding undefined behaviors in general, and race conditions in particular, suggest themselves. First, we can use an external race detector to validate test cases during reduction, if one becomes available. Second, since Csmith will already know how to generate race-free (and otherwise well-defined) programs, we can leverage it to also generate smaller versions of the programs. We have already experimented with this approach for reducing sequential test cases in the C language, and it appears to be workable.

  1. “C++98” refers to the 1998 version of the C++ standard. It is the dialect in which nearly all contemporary C++ code is written. The recently approved C++11 standard has not yet been entirely implemented by any compiler we are aware of.
  2. (link)
  3. According to Lamport, sequential consistency requires that “the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”
  4. The term “wrong buffer” appears to have originated with system developers at ARM; we know of no appearances of this term in print.
  5. Although we have built solid relationships with the GCC and LLVM teams, so far we have not had very good luck in building relationships with companies that produce compilers. The problem (we think) is that we represent a poor short-term value proposition: we generate zero revenue while producing bug reports that suck up valuable engineering resources. Furthermore, being outsiders, we would seem to repesent the potential for causing only bad publicity–to paraphrase Dijkstra, we can find bugs but we cannot certify their absence.
  6. The rationale for volatile is described here: (link)
  7. (link)
  8. (link)
  9. These flags are discussed in the GCC Wiki: (link)
  10. (link)

What PC-lint is Really For: Enforcing Language Dialects

John Carmack’s piece about using static analyzers is really nice, and he’s right that applying PC-lint to a significant piece of C code results in page after page of warnings. Faced with this problem on a code base I cared about, and not wanting to give up, I first created a config file for PC-lint that suppressed the vast majority of warnings and then, over a few weeks, cleaned up the remaining issues. Once the code base linted clean, I could turn on classes of warnings whenever I had the time and inclination.

As an example, PC-lint supports fine-grained application of stronger typing rules. So one day I might assume that all uses of “bool” or some other enumerated type are segregated from uses of integers. Of course, in any real code base, such an assumption is wrong and a pile of annoying warnings ensues. I’d either fix them or else decide that it wasn’t worth it until some major rearchitecting could be done.

The cool thing was that after a while I realized that I wasn’t really writing C code anymore, but rather a much more strongly typed dialect where I actually had a degree of (justifiable) confidence that certain kinds of errors were not being made. The C compiler, with its near-nonexistent error checking, was acting as a mere code generator. It wasn’t even the case that PC-lint was uncovering a lot of serious bugs in my code (though it did find a few). Rather, it gave me some peace of mind and assured me that what I thought was happening in the code was in fact happening.

Of course, we always write code in dialects, whether we realize it or not. In many cases, language dialects are formalized as coding conventions or standards so a group of people can all read or write the same dialect — this has enormous advantages. The things that I found charming about PC-lint are that (1) it supported a very flexible range of dialects and (2) it quickly and mechanically ensured that my code stayed within the dialect. When writing non-C code (especially Perl) I often wish for a similar dialect checker.

Book Beginnings

If a book starts out just right, I’ll keep going back and rereading the first few sentences long after I’ve finished the book. Here are a few that did that to me.

Fagles’ translation of the Odyssey:

Sing to me of the man, Muse, the man of twists and turns driven time and again off course, once he had plundered the hallowed heights of Troy.

Gravity’s Rainbow:

A screaming comes across the sky. It has happened before, but there is nothing to compare it to now.

Blood Meridian:

See the child. He is pale and thin, he wears a thin and ragged linen shirt. He stokes the scullery fire. Outside lie dark turned fields with rags of snow and darker woods beyond that harbor yet a few last wolves.

I think the last one is my favorite.