Testing Dataflow Analyses for Precision and Soundness

[This piece is co-authored by Jubi Taneja, Zhengyang Liu, and John Regehr; it’s a summary of some of the findings from a paper that we just recently completed the camera ready copy for, that is going to be published at CGO (Code Generation and Optimization) 2020.]

Update from Jan 12 2020: Looks like there’s a patch in the works fixing some imprecisions we found!

Optimizing compilers use dataflow analyses to compute properties that are true in every execution of the program being compiled. For example, if an array access is provably in-bounds in every execution, the bounds check can be elided. A dataflow analysis should be:

  • Sound: It must never erroneously claim that a property holds. Unsound dataflow analyses typically lead to miscompilations, for example by helping eliminate a bounds check that could fail.
  • Precise: When the property holds, the dataflow analysis should be able to see this. The halting problem tells us that this is not always possible, but we’d like the analysis to mostly work pretty well in practice.
  • Fast: We don’t want dataflow analysis to slow down the overall compilation very much.

Meeting these constraints isn’t easy, and moreover compiler developers generally attempt to do this in a seat-of-the-pants fashion, without any kind of help from formal-methods-based tools. Our paper is based on the premise that formal methods can help! Using an SMT solver, we can create dataflow analyses that are sound and maximally precise: they never erroneously claim that the property holds and they always conclude that the desired property holds when this can be done at all by the analysis in question. (But keep in mind these claims require an ideal SMT solver: solver bugs and solver timeouts respectively defeat them.)

Our work is only about soundness and precision: it does not help make dataflow analyses faster. Moreover, since it involves lots of solver calls, it is slow: we can use it offline to find places where the compiler might want to be improved, but it is not possible to just drop it into the compiler as a replacement for existing dataflow analyses. (We do have some preliminary work on synthesizing dataflow implementations, and will write about that some other time.)

The first dataflow analysis we’ll look at is one of LLVM’s workhorses: known bits. This analysis tries to prove that individual bits of SSA values are either 0 or 1 in all executions. Many cases handled by this analysis are easy to understand. For example, a value that has been zero-extended obviously contains zeroes in its high-order bits. Similarly, a value that has been orred with a bitmask will have some ones in it. Other cases are harder. As an exercise, you might try to work out the known bits in the result of a subtraction or a signed division.

Known bits has a variety of customers in LLVM middle-ends and backends. For example, the “simplify CFG” pass computes the known bits in the argument to a switch instruction in order to optimize away dead switch cases; a switch case is provably dead if at least one bit of its value conflicts with a known bit in the value being switched on.

Ok, it’s time for an example. Consider this LLVM IR which shifts the value four to the right by the number of bit positions specified in its argument value:

define i32 @foo(i32) {
  %a = lshr i32 4, %0
  ret i32 %a

As of January 2020, LLVM doesn’t return any known bits for %a. If you would like to see this for yourself, first build a fresh LLVM and then second build this out-of-tree pass which calls LLVM’s dataflow analysis and pretty-prints its results. Run it like so:

$ opt -load ./LLVMDataflowInfo.so -dataflow-info ../test/test1.ll -o /dev/null
function: foo
  %a = lshr i32 4, %0
    known: ????????????????????????????????
    isKnownToBePowerOfTwo = 0
    LVI: [0,5)
    SCEV: %a  full-set  full-set
    demanded: 11111111111111111111111111111111

It is easy to see that a better answer is possible: LLVM could have returned 00000000000000000000000000000???. In other words, the result will always have 29 zeroes in the high end. This is the best possible result: a sound implementation of known bits cannot prove anything about the last three bits. This is because, for each of these bits, there exists an execution in which the bit is zero and also an execution in which the bit is one.

Now let’s switch gears and look at a different dataflow analysis from LLVM: integer ranges. This analysis lives in the “Lazy Value Info” analysis pass and it computes one of four results for an SSA value; the one we care about right now is a “regular range” [a, b) where a is smaller (using unsigned comparison) than b. This result means that in every execution of the program, the SSA value will be at least a and also less than b, again using unsigned comparisons.

For example, consider this IR which zero-extends its 8-bit arguments to 64 bits, xors the extended values, and returns the result:

define i64 @foo(i8, i8) {
  %x = zext i8 %0 to i64
  %y = zext i8 %1 to i64
  %z = xor i64 %x, %y
  ret i64 %z

It is obvious that [0, 256) is the result we want for %z: no smaller range captures all of the possibilities, and there are no possible executions that require a larger range. LLVM, on the other hand, returns a special “full set” result for %z, meaning that it is unable to compute any useful information about it. This happens because the code for modeling the effect of binary operators on integer ranges does not handle xor at all.

In our paper we deal with some additional dataflow analyses, such as “demanded bits,” which is LLVM’s funny name for a don’t-care analysis that looks for bits whose value does not affect subsequent computations. Demanded bits is interesting because it propagates information backwards along SSA edges. We won’t cover demanded bits or any of the others in this post.

Ok, now let’s get to the good stuff: computing maximally precise dataflow results. We build on Souper, since it already knows how to ask a solver questions about LLVM code, but besides that we don’t make interesting use of Souper’s capabilities. The bad news is that we don’t know how to just plug in a formal definition of a dataflow analysis and get answers directly from that. Rather, we need to spend a bit of time studying the structure of each abstract domain that we target, in order to create a solver-based algorithm for computing maximally precise results reasonably efficiently.

For known bits, we can deal with each bit individually. For a given bit, we’ll first ask the solver whether there exists any valuation of the inputs that can force the bit to be zero. If this returns UNSAT, then we have just proven that this bit is one in every execution. If this query is SAT, we issue a second, analogous, query asking whether this bit can ever be one. If not, it is known to be zero. If neither query returns UNSAT, we know that this bit can be both 0 and 1, and we move on to the next bit. This procedure gives the most precise result using at most 2n solver queries (for a value n bits wide). The gist of the argument for maximal precision is that known bits are independent of each other. This is a nice result because a naive search of all elements of the known bits lattice would require 3n solver calls and this is clearly infeasible.

Looking back at the first example above, we can issue these commands:

$ llvm-as test1.ll
$ souper $SOUPER_SOLVER --infer-known-bits test1.bc 
; Listing valid replacements.
; Using solver: Z3 + internal cache

; Function: foo
%0:i32 = var
%1:i32 = lshr 4:i32, %0
infer %1
; knownBits from souper: 00000000000000000000000000000xxx

Integer ranges, it turns out, are more difficult. First, due to a “wrapped range” feature that we’ll not discuss here (again, see our paper) this abstract domain doesn’t form a lattice, so the definition of “maximal precision” becomes problematic. We pave over this issue by relaxing the notion of maximal precision a bit. Instead of insisting on the unique maximally precise result that would have to exist if the domain had been a lattice (but does not, since it isn’t), we’ll accept any result as maximally precise if there does not exist an integer range that excludes a larger number of values than is excluded by the given range. (If you find static analysis using non-lattice abstract domains interesting, this very nice paper contains more details.) The second problem with integer ranges is that there’s no obvious bit-level divide and conquer strategy for finding the maximally precise result. We’ll need a different approach.

Ignoring a special “empty set” value which indicates dead code or unconditional undefined behavior, we’re looking for an integer range [x, x+c) such that c is minimized. For example, we can start with c=1, corresponding to the hypothesis that there is a valid dataflow result [x, x+1), which excludes all integer values but one. In other words, this hypothesis corresponds to the SSA value being a constant, x. But how can we find the concrete value of x in a large search space such as 264? We do this using solver-based constant synthesis. If synthesis succeeds, we have found the maximally precise result. If no x exists such that [x, x+1) is a valid dataflow result, we launch a binary search to find the smallest c for which a suitable x does exist. This range will satisfy the maximal precision condition from the previous paragraph. The details of synthesizing an integer value meeting constraints don’t matter here as long as we have a procedure that works.

Going back to the second example:

$ llvm-as test2.ll
$ souper $SOUPER_SOLVER --infer-range test2.bc 

... a few irrelevant lines elided ...

; Function: foo
%0:i8 = var
%1:i64 = zext %0
%2:i8 = var
%3:i64 = zext %2
%4:i64 = xor %1, %3
infer %4
; range from souper: [0,256)

So in both of these cases we get the obvious maximally precise result. However, these examples are contrived. What code should we look at, when doing this work for real? We looked at all of the bitcode from compiling the C and C++ programs in SPEC CPU 2017, but really any large collection of LLVM IR should be fine. For each SSA value, we use our code and LLVM’s to compute dataflow results and then compare them. The possibilities are:

  • Our result is more precise than LLVM’s: lost precision, flag for human inspection.
  • LLVM’s result is more precise than ours: LLVM is unsound, flag for human inspection.
  • Results have the same precision: ignore.

Across the LLVM IR from compiling SPEC CPU 2017, we find many examples where LLVM’s dataflow analyses leave some precision on the floor. Is it worth making the analyses more precise? This is subjective, and can only be answered by compiler developers. Overall — despite the fact that we find a lot of imprecision — it looks to us like LLVM’s dataflow analyses are quite mature and provide a lot of precision where it matters. This would not have been the case around 2010, but these analyses have gotten a lot of tweaks since then. Even so, there is still plenty of room for improvement.

We thought that while doing this work we might find examples where LLVM’s dataflow result is “more precise” than our maximally precise one. This would indicate soundness errors in LLVM. We did not find any such examples, but we did re-introduce some old soundness bugs in LLVM into the version we used for testing, in order to verify that we can automatically detect these. Details are in the paper.

Since our method requires concrete code fragments as the basis for comparing the precision of LLVM’s analyses and ours, we might ask: is this is an advantage or a disadvantage? On the plus side, we’ll find imprecisions and soundness errors that occur when compiling real programs, and presumably these are the ones that should be a higher priority for fixing. On the other side, we’ll miss cases not exercised by the test programs.

In summary, we have presented several solver-based algorithms for computing maximally precise dataflow results, with the goal of spotting soundness errors and lost precision in a real compiler, LLVM. This is part of a larger research program outlined here. The idea is to move compiler construction away from the current situation, where compilers are huge collections of bespoke code, and towards a situation where the mathematical foundations of compilation can be directly used to support their construction.

Helping Generative Fuzzers Avoid Looking Only Where the Light is Good, Part 1

Let’s take a second to recall this old joke:

A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, “this is where the light is.”

Using a generative fuzzer — which creates test cases from scratch, rather than mutating a collection of seed inputs — feels to me a lot like being the drunk guy in the joke: we’re looking for bugs that can be triggered by inputs that the generator is likely to generate, because we don’t have an obviously better option, besides doing some hard work in improving the generator. This problem has bothered me for a long time.

One way to improve the situation is swarm testing, which exploits coarse-grained configuration options in the fuzzer to reach diverse parts of the search space with higher probability. Swarm is a good idea in the sense that it costs almost nothing and seems to give good results, but it’s not remotely the end of the story. Today we’ll work on a different angle.

Let’s start out with a high-level view of the situation: we have a system under test, and its inputs come from a vast, high-dimensional space. If we confine ourselves to generating inputs up to N bits long, it is often the case that only a very small fraction of the 2N possibilities is useful. (Consider testing Microsoft Word by generating uniform random bit strings of 100 KB. Very nearly 0% of these would be enough like a Word document to serve as an effective test case.) This is the reason generative fuzzers exist: they avoid creating invalid inputs. (Of course mutation-based fuzzers work on a related principle, but this piece isn’t about them.)

Now we can ask: out of the space of inputs to the system under test, which parts of the space will trigger bugs? I would argue that unless we are doing some sort of focused testing, for example to evaluate the hypothesis “formulas containing trig functions often make Excel fall over,” we should assume that bugs are distributed uniformly. (And even if we are restricting our attention to formulas containing a trig function, we probably want to assume that this subset of Excel formulas contains uniformly distributed bugs.) It follows that a good generative fuzzer should explore the space of inputs with uniform probability, or at least it should be capable of doing this. Another way to state the desired property is: if the fuzzer is capable of producing X different test cases, each of them should be generated with probability 1/X. I don’t know that this property is optimal in any meaningful sense, but I do believe it to be a good and reasonable design goal for a generative fuzzer. Alas, a non-trivial generative fuzzer is very unlikely to have this property by accident.

Why is a generative fuzzer not going to naturally emit each of its X outputs with probability 1/X? This follows from how these tools are implemented: as decision trees. A generative fuzzer makes a series of decisions, each determined by consulting a PRNG, on the way to emitting an actual test case. In other words, this kind of tool starts at the root of a decision tree and walks to a randomly chosen leaf. Choices made near the root of the decision tree are effectively given more weight than choices made near the leaves, making it possible for some leaves to have arbitrarily low probabilities of being emitted by a decision tree. Let’s consider this very simple decision tree:

If we use it to generate output a bunch of times, we’ll end up getting outputs with percentages something like this:

A 49.979
B 25.0018
C 12.509
D 6.26272
E 3.1265
F 3.12106

If you are thinking to yourself “this example is highly contrived” then please consider how structurally similar it is to the core of a simple LLVM IR generator that I wrote, which has been used to find a number of optimizer bugs. (At the moment it only works as an enumerator, I ripped out its randomized mode because of the problems described in this post.)

So what can we do about this? How can we hack our decision tree to generate its leaves uniformly? A few solutions suggest themselves:

  • Enumerate all leaves and then perform uniform sampling of the resulting list.
  • Skew the decision tree’s probabilities so that they are biased towards larger subtrees.
  • Structure the generator so that its decision tree is flattened, to stop this problem from happening in the first place.

Unfortunately, the first solution fails when the tree is too large to enumerate, the second fails when we don’t know subtree sizes, and the third presents software engineering difficulties. In Part 2 we’ll look at more sophisticated solutions. Meanwhile, if you’d like to play with a trivial generative fuzzer + leaf enumerator based on this decision tree, you can find a bit of code here.

Finally I want to briefly explain why generative fuzzers are structured as decision trees: it’s because they are effectively inverted parsers. In other words, while a parser takes lexical elements and builds them into a finished parse tree, a generative fuzzer — using the same grammar — starts at the root node and recursively walks down to the leaves, resulting in test cases that always parse successfully.

Write Fuzzable Code

Fuzzing is sort of a superpower for locating vulnerabilities and other software defects, but it is often used to find problems baked deeply into already-deployed code. Fuzzing should be done earlier, and moreover developers should spend some effort making their code more amenable to being fuzzed.

This post is a non-comprehensive, non-orthogonal list of ways that you can write code that fuzzes better. Throughout, I’ll use “fuzzer” to refer to basically any kind of randomized test-case generator, whether mutation-based (afl, libFuzzer, etc.) or generative (jsfunfuzz, Csmith, etc.). Not all advice will apply to every situation, but a lot of it is sound software engineering advice in general. I’ve bold-faced a few points that I think are particularly important.

Invest in Oracles

A test oracle decides whether a test case triggered a bug or not. By default, the only oracle available to a fuzzer like afl is provided by the OS’s page protection mechanism. In other words, it detects only crashes. We can do much better than this.

Assertions and their compiler-inserted friends — sanitizer checks — are another excellent kind of oracle. You should fuzz using as many of these checks as possible. Beyond these easy oracles, many more possibilities exist, such as:

  • function-inverse pairs: does a parse-print loop, compress-decompress loop, encrypt-decrypt loop, or similar, work as expected?
  • differential: do two different implementations, or modes of the same implementation, show the same behavior?
  • metamorphic: does the system show the same behavior when a test case is modified in a semantics-preserving way, such as adding a layer of parentheses to an expression?
  • resource: does the system consume a reasonable amount of time, memory, etc. when processing an input?
  • domain specific: for example, is a lossily-compressed image sufficiently visually similar to its uncompressed version?

Strong oracles are worth their weight in gold, since they tend to find application-level logic errors rather than the lower-level bugs that are typically caught by looking for things like array bounds violations.

I wrote a bit more about this topic a few years ago. Finally, a twitter user suggested “If you’re testing a parser, poke at the object it returns, don’t just check if it parses.” This is good advice.

Interpose on I/O and State

Stateless code is easier to fuzz. Beyond that, you will want APIs for taking control of state and for interposing on I/O. For example, if your program asks the OS for the number of cores, the current date, or the amount of disk space remaining, you should provide a documented method for setting these values. It’s not that we necessarily want to randomly change the number of cores, but rather that we might want to fuzz our code when set to single-core mode and then separately fuzz it in 128-core mode. Important special cases of taking control of state and I/O include making it easy to reset the state (to support persistent-mode fuzzing) and avoiding hidden inputs that lead to non-deterministic execution. We want as much determinism as possible while fuzzing our code.

Avoid or Control Fuzzer Blockers

Fuzzer blockers are things that make fuzzing gratuitously difficult. The canonical fuzzer blocker is a checksum included somewhere in the input: the random changes made to the input by a mutation-based fuzzer will tend to cause the checksum to not validate, resulting in very poor code coverage. There are basically two solutions. First, turn off checksum validation in builds intended for fuzzing. Second, have the fuzzer generate inputs with valid checksums. A generation-based fuzzer will have this built in; with a mutation-based fuzzer we would write a little utility to patch up the test case with a valid checksum after it is generated and before it is passed to the program being fuzzed. afl has support for this.

Beyond checksums, hard-to-satisfy validity properties over the input can be a serious problem. For example, if you are fuzzing a compiler for a strongly typed programming language, blind mutation of compiler inputs may not result in valid compiler inputs very often. I like to think of validity constraints as being either soft (invalid inputs waste time, but are otherwise harmless) or hard (the system behaves arbitrarily when processing an invalid input, so they must be avoided for fuzzing to work at all). When we fuzz a C++ compiler to look for wrong code bugs, we face a hard validity constraint because compiler inputs that have UB will look like wrong code bugs. There is no simple, general-purpose solution to this kind of problem, but rather a family of techniques for explicitly taking validity properties into account. The most obvious solution — but often not the right one — is to write a new generational fuzzer. The problem is that if you do this, you cannot take advantage of modern coverage-driven fuzzing techniques, which are amazing. To fit into a coverage-driven fuzzing framework you have a couple of options. First, write a custom mutator that respects your validity constraints. Second, structure-aware fuzzing, which basically means taking the mutated data from the fuzzer and translating it into something like what the program being fuzzed expects to see. There’s a lot of research left to be done in making coverage-driven fuzzers work well in the presence of validity constraints without requiring a lot of manual effort. There are some significant subtleties here, maybe I’ll go into them another time. Putting something like a SAT solver into the fuzzer is not, generally speaking, the answer here, first because some validity constraints like checksums are specifically difficult for solvers, and second because some validity constraints (such as UB-freedom in a C++ program) are implicit and cannot be inferred, even in principle, by looking at the system being fuzzed.

A lot of code in a typical system cannot be fuzzed effectively by feeding input to public APIs because access is blocked by other code in the system. For example, if you use a custom memory allocator or hash table implementation, then fuzzing at the application level probably does not result in especially effective fuzzing of the allocator or hash table. These kinds of APIs should be exposed to direct fuzzing. There is a strong synergy between unit testing and fuzzing: if one of these is possible and desirable, then the other one probably is too. You typically want to do both.

Sanitizers and fuzzers often require tweaks or even significant changes to the build process. To make this easier, keep the build process as clean and simple as possible. Make it easy to switch out the compiler and modify the compiler options. Depend on specific tools (and versions of tools) as little as possible. Routinely build and test your code with multiple compilers. Document special build system requirements.

Finally, some fuzzer blockers are sort of silly and easy to avoid. If your code leaks memory or terminates its process with a deep call stack, it will be painful to test using a persistent-mode fuzzer, so don’t do these things. Avoid handling SIGSEGV or, if you really must do this, have a way to disable the handler for fuzzer builds. If your code is not compatible with ASan or UBSan, then these extremely useful oracles are harder to use. In particular, if your code uses a custom memory allocator you should consider turning it off for fuzzer builds, or else adapting it to work with ASan, or else you’ll miss important bugs.

Unblock Coverage-Driven Fuzzers

Because coverage-driven fuzzers refocus their effort to try to hit uncovered branches, they can be blocked in certain specific ways. For example, if a coverage-driven fuzzer is presented with too many uncoverable branches, it can spend so much time on them that it becomes less likely to hit more interesting branches elsewhere in the program. For example, one time I compared afl’s coverage on a program compiled with and without UBSan, and found that (in whatever time limit I used) it covered quite a lot less of the sanitized program, compared to the unsanitized build. On the other hand, we definitely want our fuzzer to look for sanitizer failures. My advice is to fuzz both sanitized and unsanitized builds of your program. I don’t know how to budget fuzzing resources for these different activities and don’t know of any principled work on that problem. It may not matter that much, since fuzzing is all about overkill.

Sometimes your program will call branchy, already-heavily-fuzzed code early in its execution. For example, you might decompress or decrypt input before processing it. This is likely to distract the coverage-driven fuzzer, causing it to spend a lot of time trying to fuzz the crypto or compression library. If you don’t want to do this, provide a way to disable crypto or compression during fuzzing.

Any interpreter in your program is likely to make life difficult for a coverage-driven fuzzer, since the relevant program paths are now encoded in the data being interpreted, which is generally opaque to the fuzzer. If you want maximum mileage out of coverage-driven fuzzers, you may want to try to avoid writing interpreters, or at least keep them extremely simple. An obvious way to deal with embedded interpreters — which someone must have thought of and tried, but I don’t have any pointers — would be to have an API for teaching the fuzzer how to see coverage of the language being interpreted.

Enable High Fuzzing Throughput

Fuzzing is most effective when throughput is very high; this seems particularly the case for feedback-driven fuzzers that may take a while to learn how to hit difficult coverage targets. An easy throughput hack is to make it possible to disable slow code (detailed logging, for example) when it is not germane to the fuzzing task. Similarly, interposing on I/O can help us avoid speed hacks such as running the fuzzer in a ramdisk.

“But I Want Fuzzing My Code to be Harder, Not Easier”

I don’t have a lot of sympathy for this point of view. Instead of aiming for security through obscurity, we would do better to:

  • fuzz early and thoroughly, eliminating fuzzable defects before releasing code into the wild
  • write code in a programming language with as strong of a type system as we are willing to tolerate — this will statically eliminate classes of bad program behaviors, for example by stopping us from putting the wrong kind of thing into a hashmap
  • aggressively use assertions and sanitizers to get dynamic checking for properties that the type system can’t enforce statically

Anti-fuzzing techniques are a thing, but I don’t think it represents a useful kind of progress towards better software.


Randomized testing is incredibly powerful and there’s no point avoiding it: if you don’t fuzz your code, someone else will. This piece has described some ways for you, the software developer, to make fuzzing work better. Of course there are plenty of other aspects, such as choosing a good corpus and writing a good fuzzer driver, that are not covered here.

Acknowledgments: Pascal Cuoq and Alex Groce provided feedback on a draft of this piece, and it also benefited from suggestions I received on Twitter. You can read the conversation here; it contains some suggestions and nuances that I did not manage to capture.

Design and Evolution of C-Reduce (Part 2)

Part 1 of this series introduced C-Reduce and showed how it combines a domain-independent core with a large collection of domain-specific passes in order to create a highly effective test-case reducer for C and C++ code. This part tells the rest of the story and concludes.

Parallel Test-Case Reduction

C-Reduce’s second research contribution is to perform test-case reduction in parallel using multiple cores. The parallelization strategy, based on the observation that most variants are uninteresting, is to speculate along the uninteresting branch of the search tree. Whenever C-Reduce discovers an interesting variant, all outstanding interestingness tests are killed and a new line of speculation is launched. This is the same approach that was subsequently rediscovered by the creator of halfempty.

C-Reduce has a policy choice between taking the first interesting variant returned by any CPU, which provides a bit of extra speedup but makes parallel reduction non-deterministic, or only taking an interesting variant when all interestingness tests earlier in the search tree have reported that their variants are uninteresting. We tested both alternatives and found that the speedup due to non-deterministic choice of variants was minor. Therefore, C-Reduce currently employs the second option, which always follows the same path through the search tree that non-parallel C-Reduce would take. The observed speedup due to parallel C-Reduce is variable, and is highest when the interestingness test is relatively slow. Speedups of 2-3x vs. sequential reduction are common in practice.

Applicability and Moving Towards Domain-Independence

Something that surprised us is how broadly applicable C-Reduce is to test cases in languages other than C and C++. Our users have found it effective in reducing Java, Rust, Julia, Haskell, Racket, Python, SMT-LIB, and a number of other languages. When used in this mode, the highly C/C++-specific C-Reduce passes provide no benefit, but since they fail quickly they don’t do much harm. (C-Reduce also has a --not-c command line option that avoids running these passes in the first place.)

One might ask why C-Reduce is able to produce good results for languages other than C and C++; the answer appears to be based on the common structural elements found across programming languages in the Algol and Lisp families. In contrast, in our limited experience, C-Reduce does a very poor job reducing test cases in binary formats such as PDF and JPEG. Other reducers, such as the afl-tmin tool that ships with afl-fuzz, work well for binary test cases.

A consequence of the modular structure of C-Reduce is that while its transformation passes are aimed at reducing C and C++ code, the C-Reduce core is completely domain-independent. C-Reduce has been forked to create a highly effective reducer for OpenCL programs. We believe it would be relatively straightforward to do something similar for almost any other programming language simply by tailoring the passes that C-Reduce uses.


When used for its intended purpose, when does C-Reduce work badly? We have seen two main classes of failure. First, C-Reduce can be annoyingly slow. This typically happens when the passes early in the phase ordering, which are intended to remove a lot of code quickly, fail to do this. Second, highly templated C++ sometimes forces C-Reduce to terminate with an unacceptably large (say, >1 KB) final result. Of course this is better than nothing, and subsequent manual reduction is usually not too difficult, but it is frustrating to have written 69 different Clang-based passes and to still find effectively irreducible elements in test cases. The only solution — as far as we know — is to strengthen our existing transformation passes and to create more such passes.

A small minority of compiler bugs appears to be impossible to trigger using a small test case. These bugs are exceptions to the small scope hypothesis. They typically stem from resource-full bugs in the compiler. For example, a bug in register spilling code requires the test case to use enough registers that spilling is triggered; a bug in logic for emitting long-offset jumps requires the test case to contain enough code that a long offset is required; etc. These test cases are just difficult to work with, and it is not clear to us that there’s anything we can do to make it easier to debug the issues that they trigger.

C-Reduce Design Principles

In summary, C-Reduce was designed and implemented according to the following principles:

  1. Be aggressive: make the final reduced test case as small as possible.
  2. Make the reducer fast, for example using parallelism, careful phase ordering of passes, and avoiding unnecessary I/O traffic, when this can be done without compromising the quality of the final output.
  3. Make it as easy as possible to implement new passes, so that many domain-specific passes can be created.
  4. Keep the C-Reduce core domain-independent.
  5. Focus only on reduction, delegating all other criteria to the user-supplied interestingness test.

Directions for Future Test-Case Reduction Research

Although perhaps a few dozen papers have been written about test-case reduction since Hildebrandt and Zeller’s initial paper 19 years ago, I believe that this area is under-studied relative to its practical importance. I’ll wrap up this article with a collection of research questions suggested by our experience in over a decade of work on creating a highly aggressive reducer for C and C++.

What is the role of domain-specificity in test case reduction? Researchers who cite C-Reduce appear to enjoy pointing out that it is very domain-specific (nobody seems to notice that the C-Reduce core is domain-independent, and that the pass schedule is easy to modify). The implication is that domain-specific hacks are undesirable and, of course, an argument against such hacks would be forceful if backed up by a test-case reducer that produced smaller final C and C++ code than C-Reduce does, without using domain knowledge. So far, such an argument has not been made.

The open research question is whether domain knowledge is actually necessary, or if a domain-independent test-case reducer can beat C-Reduce at its own game. The most impressive such effort that we are aware of is David MacIver’s structureshrink, which uses relatively expensive search techniques to infer structural elements of test cases that can be used to create variants. Anecdotally, we have seen structureshrink produce reduced versions of C++ files that are smaller than we would have guessed was possible without using domain knowledge.

What is the role of non-greedy search in test-case reduction? In many cases, the order in which C-Reduce runs its passes has little or no effect on the final test case. In other words, the search is often diamond-shaped, terminating at the same point regardless of the path taken through the search space. On the other hand, this is not always the case, and when the search is not diamond-shaped, a greedy algorithm like C-Reduce’s risks getting stopped at a local minimum that is worse than some other, reachable minimum. The research question is how to get the benefit of non-greedy search algorithms without making test-case reduction too much slower.

What other parallelization methods are there? C-Reduce’s parallelization strategy is pretty simple, gives a modest speedup in practice, and always returns the same result as sequential reduction. There must be other parallel test-case reduction strategies that hit other useful points in the design space. This is, of course, related to the previous research question. That is, if certain branchings in the search tree can be identified as being worth exploring in both directions, this could be done on separate cores or machines.

What is the role of canonicalization in test-case reduction? A perfectly canonicalizing reducer — that reduces every program triggering a given bug to the same reduced test case — would be extremely useful. Although it seems impossible to achieve this in all cases, there are many relatively simple strategies that can be employed to increase the degree of canonicalization, such as putting arithmetic expressions into a canonical form, assigning canonical names to identifiers, etc. C-Reduce has a number of transformations that are aimed at canonicalization rather than reduction. For example, the reduced test case at the top of part 1 of this piece has four variables a, b, c, and d, which appear in that order. I believe that more work in this direction would be useful.

Can we avoid bug hijacking? Test reduction sometimes goes awry when the bug targeted by the reduction is “hijacked” by a different bug. In other words, the reduced test case triggers a different bug than the one triggered by the original. During a compiler fuzzing campaign this doesn’t matter much since one bug is as good as another, but hijacking can be a problem when the original bug is, for example, blocking compilation of an application of interest. Hijacking is particularly common when the interestingness test is looking for a non-specific behavior such as a null pointer dereference. C-Reduce pushes the problem of avoiding hijacking onto the user who can, for example, add code to the interestingness test looking for specific elements in a stack trace. The research question here is whether there are better, more automated ways to prevent bug hijacking.

Obtaining C-Reduce

Binary C-Reduce packages are available as part of software distributions including Ubuntu, Fedora, FreeBSD, OpenBSD, MacPorts, and Homebrew. Source code can be found here.


C-Reduce was initially my own project, but by lines of code the largest contributor by a factor of two is my former student Yang Chen, now a software engineer at Microsoft. Yang wrote effectively all of the Clang-based source-to-source transformation code, more than 50,000 lines in total. Eric Eide, a research professor in computer science at the University of Utah, is the other major C-Reduce contributor (and gave useful feedback on a draft of this piece). Our colleagues Pascal Cuoq, Chucky Ellison, and Xuejun Yang also contributed to the project, and we have gratefully received patches from a number of external contributors. Someone created a fun visualization of the part of C-Reduce’s history that happened on Github.

Design and Evolution of C-Reduce (Part 1)

[This piece is posted in parallel on the IEEE Software blog. Karim Ali copyedited.]

Since 2008, my colleagues and I have developed and maintained C-Reduce, a tool for programmatically reducing the size of C and C++ files that trigger compiler bugs. C-Reduce also usually does a credible job reducing test cases in languages other than C and C++; we’ll return to that later.

Why Reduce Test Cases?

Here’s a typical C-Reduce output, as found in the LLVM bug database:

int a[1];
int b;
void c() {
  void *d = b = 0;
  for (;; b++)
    a[b] = (int) d++;

Compiling this code at the -O2 optimization level causes LLVM to crash. The bug report doesn’t contain the original, unreduced test case, but most likely it was larger.

A reduced test case is preferable because:

  • it usually gets the compiler to misbehave quickly, reducing the number of function calls, memory allocations, etc. that the compiler developer has to step through and reason about while debugging
  • the reduced file contains little code not directly related to triggering the bug, reducing the likelihood that compiler developers will be distracted by extraneous features of the test case
  • reduced test cases for the same bug often look similar to each other, whereas this is not normally true for unreduced files that trigger the same bug
  • there is often little or no discernible similarity between an unreduced source file and its reduced version, making it easier for compiler bugs triggered by proprietary code to be reported externally

The minimum-sized input triggering any particular compiler bug can be found using a trivial algorithm: exhaustive search of text strings of increasing size. This is, of course, almost always intractable. In practice, test case reduction proceeds in the other direction: starting with a large, failure-inducing test case, incrementally making it smaller until a local minimum is reached.

A Bit of Background

The history of automated test case reduction does not seem to be well documented, but several examples can be found in software testing papers from the 1990s, such as Differential Testing for Software and Massive Stochastic Testing of SQL. Test-case reduction was first studied in its own right in 2000 when Hildebrandt and Zeller introduced Delta Debugging: a general-purpose technique for test case reduction. Their algorithm uses a greedy search where a series of “variants” (our term, not theirs, for partially-reduced test case candidates) is produced by removing chunks of the test case. As reduction progresses, the chunk size is reduced, until it reaches some minimum-sized unit, such as a single line, token, or character. When no minimum-sized chunk can be removed from the test case without breaking the property that it triggers the bug, the Delta Debugger terminates. Almost all subsequent test-case reduction work, including ours, builds upon this work.

Towards C-Reduce

I became interested in test case reduction when my colleagues and I began to find a lot of bugs in C compilers using random testing. We found so many bugs that reporting them became bottlenecked on reducing the bug-triggering programs. Since I was the one reporting the bugs we found, I was the one who felt the pain of manual test-case reduction, and it quickly got old. I eventually reported around 500 compiler bugs and I could not have done this without first creating C-Reduce.

At the time, the best open-source implementation of Delta Debugging, from UC Berkeley, was line-based and contained a significant innovation over the original algorithm: it could reorganize a file in such a way that all nested curly braces deeper than a configurable level would appear on a single line. Thus, at level zero, entire functions would be placed on the same line, enabling the line-based reducer to remove an entire function at once. At higher nesting levels, functions would be split across lines, enabling finer-grained reduction. This worked well but the tool ended up being inadequate for my needs: it got stuck at local minima that were often orders of magnitude larger than what could be achieved when reducing by hand.

The limiting factor in this existing tool (“Delta” from now on) was obvious: it was not able to exploit enough of the structure of the file being reduced. For example, it could usually not do much to simplify arithmetic expressions. These sorts of simplifications tend to have a cascading effect: eliminating the last use of a variable allows its definition to be eliminated, etc. The obvious path forward was to write a new tool that solved a reduction problem that Delta could not solve, and then to alternate running this tool and Delta until a global fixpoint was reached. I did this, adding more and more reduction techniques over time. At some point I implemented a line-elimination pass in my new reducer, at which point Delta was subsumed and could be dropped.

We ended up keeping two elements of Delta’s design. First, the configurable hierarchical reformatting of a test case based on curly brace nesting. This technique, followed by removing contiguous chunks of code, is still one of C-Reduce’s most useful first lines of attack on a test case. Second, Delta’s mechanism for determining whether a given variant is “interesting.” An interesting variant is used as the basis for further reduction steps; an uninteresting variant is a dead end, and is discarded. Delta determined interestingness by invoking a user-supplied program — typically a shell script — whose process exit code determines the interestingness of the current variant. The flexibility afforded by this small element of user extensibility ends up being extremely useful. For example, the interestingness test can discard test cases that trigger certain compiler warnings, it can attempt to disambiguate different crash bugs, etc.

It is more challenging to reduce test cases that cause the compiler to emit incorrect object code than it is to reduce test cases that merely cause the compiler to crash. C-Reduce itself is agnostic about the character of the bug of interest: we push all of the difficulties in reducing miscompilation triggers into the interestingness test, which should try to answer questions such as:

  • is the variant well-defined by the C or C++ standard?
  • does the variant avoid depending on behaviors that are unspecified by the C or C++ standard?
  • does the buggy compiler turn the variant into an executable?
  • does this executable terminate within a specified time?
  • does the reference compiler (assumed to not contain the bug of interest) turn the variant into an executable?
  • does this executable also terminate within a specified time?
  • does the behavior of the two executables differ in a way that indicates that a miscompilation occurred?

The variant is interesting if the answer to all of these questions is “yes.”

The hardest part of reducing programs that trigger miscompilation bugs is ensuring that variants avoid undefined behaviors (such as invalid pointer operations) and do not rely on unspecified behaviors (such as the order of evaluation of function arguments). A test case doing one of these things is ill-formed and can accomplish nothing beyond annoying compiler developers. Empirically, if undefined behavior is not actively avoided during test-case reduction, C-Reduce will almost certainly introduce it. The practical solution is to employ suitable static and dynamic analysis tools, in order to avoid these ill-formed variants. Since no single tool detecting all undefined behaviors in C and C++ programs exists, a hybrid approach involving multiple tools is typically used in practice. This is not completely satisfying, but it works well enough that C-Reduce can reliably produce useful reduced test cases for miscompilation bugs in C and C++ compilers.

Writing good interestingness tests for miscompilations takes a bit of practice. More than one user has described C-Reduce as being something like the sorcerer’s apprentice: it does an excellent job reducing according to the criteria it is given, but if these criteria contain any kind of loophole, C-Reduce is likely to find it. For example it is easy to accidentally write a test which believes that the empty file is interesting. Additionally, a good interestingness test is written in such a way that its expected-case runtime is minimized by asking the quickest and most-likely-to-fail questions first.

From the start, C-Reduce was designed to produce a very small final reduced test case, even when this would make C-Reduce run for longer than we liked. This is based on the premise that we should burn cycles instead of human time, and that reporting a compiler bug is rarely on the critical path; we can often afford to wait for a better result. The consequences of this decision can be seen in Tables 1 and 2 of this paper that evaluates several test-case reduction methods: C-Reduce produces the smallest final output, but takes more time to do so.

A Modular, Domain-Independent Reducer Core

Although C-Reduce started out as a pet project solving a specific problem, it evolved into a research project involving a number of my colleagues, whose top-level goal was to produce an effective and usable reducer for C and C++ code as found in the wild. The first research contribution to come out of this effort was a way to achieve a clean mechanism/policy separation in a test case reducer. Previous reduction techniques had all baked specific transformations into the overall search strategy. This impedes extensibility, which we believe to be crucial. The structure that we ended up with is a small core that invokes a collection of pluggable transformation passes until a global fixpoint is reached.

The API for C-Reduce passes is simple but — like many simple things — required a lot of iterations before it felt finished. It is based on the ideas that transformation passes should be stateless and that every pass should implement a linear sequence of transformations, each of which results in a variant that may or may not be interesting. The interface is as follows:

state new(filename, option) : Return a fresh state object. Each pass uses this state to keep track of where it is in the sequence of transformations that it is capable of performing. These states may contain arbitrary data items; the C-Reduce core treats them as opaque. A typical pass stores some kind of cursor — a byte offset, token offset, line number, position in a tree traversal, etc. — in the state object.

The file referred to by filename is logically part of the state object even though it resides in the filesystem instead of memory. Of course it would not be difficult to pass the contents of the file around as a memory object but this would be slow when these objects are large (C-Reduce is frequently invoked on multi-megabyte preprocessed C++ files).

The “option” is used to select among different behaviors implemented by a composite pass.

state advance(filename, option, state) : Return a new state object referring to the next transformation opportunity following the one referenced by the state object passed as a parameter.

result transform(filename, option, state) : Modify the file in-place, selecting the transformation instance referred to by the state object. The result takes one of three values:

  • OK : the transformation succeeded
  • STOP : no more transformation instances remain for this pass
  • ERROR : something went wrong; for example, an external tool crashed, a working file or directory could not be created, etc.

(The API contains one additional method, which checks whether a pass’s external dependencies are satisfied, that doesn’t matter here.)

Our experience has been that every transformation pass that we wanted has been easy to implement behind this API.

The C-Reduce core implements this algorithm:

current = original_test_case
  size_at_start = size(current)
  foreach (p, option) in pass_list
    state = p::new(current, option)
      variant = current         // this is a file copy operation
      result = p::transform(variant, option, state)
      if result == ERROR
	report_problem_in_pass(p, option)
      if result == OK
        if is_interesting(variant)
          current = variant     // also a file copy
          state = p::advance(current, option, state)
    while result == OK
while size(current) < size_at_start

The termination argument for C-Reduce is:

  1. Since the outermost loop requires the size of the test case to decrease monotonically, it can only execute as many times as the size (in bytes) of the unreduced test case. In practice it executes many fewer times than this.
  2. The loop over passes terminates because the pass list is immutable once C-Reduce is initialized.
  3. Each iteration of the innermost loop either advances the state object or else (by selecting an interesting variant) removes one transformation opportunity. Either way, the number of remaining transformations decreases by one.
  4. The interestingness test is, at worst, terminated (using OS support for killing all processes in a process group) after a configurable timeout.

In practice the weak link in this argument is case 3, which is vulnerable to bugs in passes. C-Reduce terminates robustly by abandoning passes when they appear to be behaving unreasonably.

The C-Reduce core does not insist that transformations make the test case smaller, and in fact quite a few of its passes potentially increase the size of the test case, with the goal of eliminating sources of coupling within the test case, unblocking progress in other passes.

The sequence of transformation passes is carefully orchestrated such that passes that are likely to give the biggest wins -- such as those that remove entire functions -- run first; otherwise the tool would end up spending days or weeks doing silly things such as trying to shrink numeric constants in a huge source file. Shrinking numbers is useful, and it should be done, but only after many other reduction mechanisms have run to completion.

C-Reduce's collection of cooperating passes, with heavy phase-ordering constraints, is highly reminiscent of how a modern optimizing compiler works. However, only a small proportion of the transformation passes is intended to be semantics-preserving in the sense that a compiler's optimization passes must be. In this domain, we only want to preserve enough semantics that we can probabilistically avoid breaking whatever property makes a test case interesting.

A consequence of writing a modular reducer is that once we came up with the right API for writing passes, we were free to write a lot of passes. My colleagues and I spent several years doing this and we ended up with:

  • 35 passes, implemented in Perl, that include heuristics such as removing lines, removing various kinds of matched delimiters (and perhaps also the text between them), and shrinking integer values
  • 6 passes that invoke external utilities such as unifdef, a partial evaluator for the C preprocessor language, a lexer for C and C++ that supports various token-level reduction transformations, and pretty-printing utilities that make the reduced test case more pleasant to look at
  • 69 passes, implemented in C++, that use LLVM's Clang front end as a library for source-to-source transformation of C and C++ code; these include function inlining, partial template instantiation, scalar replacement of aggregates, copy propagation, and eliminating levels of a class hierarchy.

The actual number of dynamic passes is larger than the total of these numbers since some passes can be invoked in different modes using the "option" parameter mentioned above.

In this piece, we looked at why we had to create C-Reduce and at the modular structure that was key to making it solve the problems that we wanted to solve. In Part 2, I'll describe how C-Reduce improves reduction times using multiple cores and why C-Reduce usually does a good job reducing test cases in languages other than C and C++; finally, I'll discuss a few open research problems in test case reduction.

It’s Time for a Modern Synthesis Kernel

Alexia Massalin’s 1992 PhD thesis has long been one of my favorites. It promotes the view that operating systems can be much more efficient than then-current operating systems via runtime code generation, lock-free synchronization, and fine-grained scheduling. In this piece we’ll only look at runtime code generation, which can be cleanly separated from the other aspects of this work.

Runtime Code Generation in Ring 0

The promise of kernel-mode runtime code generation is that we can have very fast, feature-rich operating systems by, for example, not including code implementing generic read() and write() system calls, but rather synthesizing code for these operations each time a file is opened. The idea is that at file open time, the OS has a lot of information that it can use to generate highly specialized code, eliding code paths that are provably not going to execute.

Runtime code generation was a well-known idea in 1992, but it wasn’t used nearly as widely as it is today. In 2019, of course, just-in-time compilers are ubiquitous. However, operating system kernels still do not use runtime code generation very much, with a few exceptions such as:

  • several OS kernels, including Linux, have a simple JIT compiler in their BPF implementation
  • VMware used to use dynamic code generation to performantly virtualize OS kernels on x86 chips that lacked hardware virtualization extensions; I doubt that this is commonly used any longer
  • pre-NT Windows kernels would dynamically generate bitblit code. I learned this in a talk by a VMware employee; this code generation was apparently a debugging issue for VMware since it would fight with their own runtime code generator. Some details can be found in this post. The old paper about the origins of this technique in the Xerox Alto is a classic.
  • TempleOS, as explained in this nice writeup, made heavy use of dynamic code generation

Anyway, back to Synthesis. The OS and code generators were all written, from scratch, in 68020 assembly language. How do we translate Massalin’s ideas to 2019? Most likely by reusing an existing code generator and OS. For most of this piece I’ll assume that that’s what we want to do, but we’ll also briefly touch on customized alternatives.

Code Generator Requirements

The particular technology that we use for runtime code generation isn’t that important, but for now let’s imagine using LLVM. This means that the pieces of the kernel that we wish to specialize will need to be shipped as bitcode, and then we’ll ask LLVM to turn it into object code as needed. LLVM has lots of great optimization passes, from which we could pick a useful subset, and it is not hard to use in JIT mode. On the other hand, LLVM isn’t as fast as we’d like and also it has a large footprint. In production we’d need to think carefully whether we wanted to include a big chunk of non-hardened code in the kernel.

What optimizations are we expecting the code generator to perform? Mostly just the basic ones: function inlining, constant propagation, and dead code elimination, followed by high-quality instruction selection and register allocation. The hard part, as we’re going to see, is convincing LLVM that it is OK to perform these optimizations as aggressively as we want. This is an issue that Massalin did not need to confront: her kernel was designed in such a way that she knew exactly what could be specialized and when. Linux, on the other hand, was obviously not created with staged compilation in mind, and we’re going to have to improvise somewhat if we want this to work well.

My guess is that while LLVM would be great for prototyping purposes, for deployment we’d probably end up either reusing a lighter-weight code generator or else creating a new one that is smaller, faster, and more suitable for inclusion in the OS. Performance of runtime code generation isn’t just a throughput issue, there’ll also be latency problems if we’re not careful. We need to think about the impact on security, too.

Example: Specializing write() in Linux

Let’s assume that we’ve created a version of Linux that is capable of generating a specialized version of the write() system call for a pipe. This OS needs — but we won’t discuss — a system call dispatch mechanism to rapidly call the specialized code when it is available. In Synthesis this was done by giving each process its own trap vector.

Before we dive into the code, let’s be clear about what we’re doing here: we are pretending to be the code generator that is invoked to create a specialized write() method. Probably this is done lazily at the time the system call is first invoked using the new file descriptor. The specialized code can be viewed as a cached computation, and as a bonus this cache is self-invalidating: it should be valid as long as the file descriptor itself is valid. (But later we’ll see that we can do a better job specializing the kernel if we support explicit invalidation of runtime-generated code.)

If you want to follow along at home, I’m running Linux 5.1.14 under QEMU, using these instructions to single-step through kernel code, and driving the pipe logic using this silly program.

Skipping over the trap handler and such, ksys_write() is where things start to happen for real:

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);

	return ret;

At this point the “fd” parameter can be treated as a compile-time constant, but of course “buf” and “count” cannot. If we turn “fd” into a constant, will LLVM be able to propagate it through the remaining code? It will, as long as:

  1. We inline all function calls.
  2. Nobody takes the address of “fd”.

It’s not that calls and pointers will always block the optimizer, but they complicate things by bringing interprocedural analysis and pointer analysis into the picture.

Our goal is going to be to see whether the code generator can infer the contents of the struct returned from fdget_pos(). (You might wonder why performance-sensitive code is returning a “struct fd” by value. Turns out this struct only has two members: a pointer and an integer.)

The call to fdget_pos() goes to this code:

static inline struct fd fdget_pos(int fd)
	return __to_fd(__fdget_pos(fd));

and then here:

unsigned long __fdget_pos(unsigned int fd)
	unsigned long v = __fdget(fd);
	struct file *file = (struct file *)(v & ~3);

	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
		if (file_count(file) > 1) {
	return v;

and then (via a trivial helper that I’m not showing) here:

static unsigned long __fget_light(unsigned int fd, fmode_t mask)
	struct files_struct *files = current->files;
	struct file *file;

	if (atomic_read(&files->count) == 1) {
		file = __fcheck_files(files, fd);
		if (!file || unlikely(file->f_mode & mask))
			return 0;
		return (unsigned long)file;
	} else {
		file = __fget(fd, mask, 1);
		if (!file)
			return 0;
		return FDPUT_FPUT | (unsigned long)file;

Keep in mind that up to here, we haven’t seen any optimization blockers. In __fdget_light(), we run into our first interesting challenge: “current” is a macro that returns a pointer to the running process’s PCB (in Linux the PCB, or process control block, is a “task_struct” but I’ll continue using the generic term). The current macro ends up being a tiny bit magical, but its end result can be treated as a constant within the context of a given process. There is no way a code generator like LLVM will be able to reach this conclusion, so we’ll need to give it some help, perhaps by annotating certain functions, macros, and struct fields as returning values that are constant over a given scope. This is displeasing but it isn’t clear there’s any easier or better way to achieve our goal here. The best we can hope for is that the annotation burden is close to proportional to the number of data types in the kernel; if it ends up being proportional to the total amount of code then our engineering effort goes way up.

Now, assuming that we can treat “current” as a compile-time constant, we’re immediately faced with a similar question: is the “files” field of the PCB constant? It is (once the process is initialized) but again there’s not going to be any easy way for our code generator to figure this out; we’ll need to rely on another annotation.

Continuing, the “count” field of files is definitely not a constant: this is a reference count on the process’s file descriptor table. A single-threaded Linux process will never see count > 1, but a multi-threaded process will. (Here we need to make the distinction between open file instances, which are shared following a fork, and the file descriptor table, which is not.) The fast path here is exploiting the insight that if our process is single-threaded we don’t need to worry about locking the file descriptor table, and moreover the process is not going to stop being single-threaded during the period where we rely on that invariant, because we trust the currently running code to not do the wrong thing.

Here our specializing compiler has a fun policy choice to make: should it specialize for the single threaded case? This will streamline the code a bit, but it requires the generated code to be invalidated later on if the process does end up becoming multithreaded — we’d need some collection of invalidation hooks to make that happen.

Anyhow, let’s continue into __fcheck_files():

static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
	struct fdtable *fdt = rcu_dereference_raw(files->fdt);

	if (fd < fdt->max_fds) {
		fd = array_index_nospec(fd, fdt->max_fds);
		return rcu_dereference_raw(fdt->fd[fd]);
	return NULL;

At this point we’re in deep “I know what I’m doing” RCU territory and I’m going to just assume we can figure out a way for the code generator to do what we want, which is to infer that this function returns a compile-time-constant value. I think this’ll work out in practice, since even if the open file instance is shared across processes, the file cannot be truly closed until its reference count goes to zero. Anyway, let’s move forward.

Next, we’re back in __fget_light() and then __fdget_pos(): our code generator should be able to easily fold away the remaining branches in these functions. Finally, we return to line 4 of ksys_write() and we know what the struct fd contains, making it possible to continue specializing aggressively. I don’t think making this example any longer will be helpful; hopefully the character of the problems we’re trying to solve are now apparent.

In summary, we saw four kinds of variables in this exercise:

  1. Those such as the “fd” parameter to write() that the code generator can see are constant at code generation time.
  2. Those such as the “current” pointer that are constant, but where the code generator cannot see this fact for one reason or another. To specialize these, we’ll have to give the compiler extra information, for example using annotations.
  3. Those such as the “count” field of the “files_struct” that are not actually constant, but that seem likely enough to remain constant that we may want to create a specialized version treating them as constants, and then be ready to invalidate this code if the situation changes.
  4. Those that are almost certainly not worth trying to specialize. For example, the “count” parameter to write() is not likely to remain constant over a number of calls.

Writing one byte to a pipe from a single-threaded process executes about 3900 instructions on Linux 5.1.14 (this is just in ksys_write(), I didn’t measure the trapping and untrapping code). The Synthesis thesis promises an order of magnitude performance improvement. Can specialization reduce the fast path on this system call to 390 instructions? It would be fun to find out.

I’ll finish up this example by observing that even though I chose to present code from the filesystem, I think it’s network stack code that will benefit from specialization the most.


I have some experience with OS kernels other than Linux, and my belief is that attempting to dynamically specialize any mainstream, production-grade OS other than Linux would run into the same issues we just saw above. At the level the code generator cares about, there just isn’t much effective difference between these OSes: they’re all big giant blobs of C with plentiful indirection and domain-specific hacks.

If our goal is only to create a research-grade prototype, it would be better to start with something smaller than Linux/Windows/Darwin so that we can refactor specialization-unfriendly parts of the OS in a reasonable amount of time. xv6 is at the other extreme: it is super easy to hack on, but it is so incredibly over-simplified that it could not be used to test the hypothesis “a realistic OS can be made much faster using specialization.” Hilariously, an xv6+LLVM system would be about 0.15% OS code and 99.85% compiler code. Perhaps there’s a middle ground that would be a better choice, Minix or OpenBSD or whatever.

Given two developers, one who knows LLVM’s JIT interfaces and one who’s a good Linux kernel hacker, how long would it take to bring up a minimally ambitious dynamically specializing version of Linux? I would guess this could be done in a week or two, there’s not really anything too difficult about it (it’s easy to say this while blogging, of course). The problem is that this would not give good results: only the very easiest specialization opportunities will get spotted by the runtime code generator. But perhaps this would generate enough interest that people would keep building on it.

Do we want to do specialization work on C code? No, not really, it’s just that every one of our production-grade kernels is already written in it. A fun but engineering-intensive alternative would be to create a new, specialization-friendly kernel in whatever programming language looks most suitable. Functional languages should offer real advantages here, but of course there are issues in using these languages to create a performant OS kernel. Perhaps Mirage is a good starting point here, it is already all about specialization — but at system build time, not at runtime.

An ideal programming environment for a modern Synthesis kernel would provide tool and/or language support for engineering specialization-friendly kernel code. For example, we would identify a potential specialization point and then the tools would use all of our old friends — static analysis, dynamic analysis, symbolic execution, etc. — to show us what data items fall into each of the four categories listed in the last section, and provide us with help in refactoring the system so that specialization can work better. A tricky thing here is taking into account the different kinds of concurrency and synchronization that happen in a sophisticated OS.

Some useful questions to ask (and of course we’re always asking these same things when doing OS and compiler research) are: How are we supposed to think about a dynamically specializing OS kernel? What are the new abstractions, if any? Specialization could really benefit from some sort of first-class “code region over which these values are effectively constant” and then also “but the constant-ness is invalidated by this set of events.”

Why Now?

The literature on dynamic specialization of OS code is interesting: it looks like there was a flurry of interest inspired by Synthesis in the mid/late 90s. Many of these papers had Calton Pu, Massalin’s thesis supervisor, on the author list. Not a whole lot has happened in this area since then, as far as I know. The only paper I can think of about optimistic OS specialization is this one; it’s a nice paper, I recommend it. Static OS specialization, on the other hand, is what unikernels are all about, so there’s been quite a bit of work done on this.

It seems like time to revive interest in dynamic OS specialization because:

  • Most of the processor speed wins lately are application specific; the cores that execute OS code are not getting noticeably faster each year, nor do they seem likely to. In fact, way back in 1989 John Ousterhout argued that increases in processor speed weren’t benefiting OS code as much as other kinds of code.
  • OSes have slowed down recently to mitigate side channel attacks. Maybe we can get some of that speed back using dynamic specialization.
  • OSes are way bloatier than they were in the 90s, increasing the potential benefits due to specialization.
  • Compiler technology is far ahead of where it was in the 90s, with off-the-shelf toolkits like LLVM providing high-quality solutions to many of the problems we’d run into while prototyping this work.

I’d like to thank Perry Metzger who suggested this piece and also provided feedback on a draft of it. Perry worked with Alexia back in the day and hopefully he’ll also write about this topic.

Finally, I don’t want to give the impression that I’m summarizing a research proposal or an in-progress project. This is the kind of thing I love to think about, is all.

Floating the Dirty Devil River

Packrafts are tough, light, individual-sized inflatable boats that people use to put together really amazing wilderness trips by combining rafting, hiking, and sometimes even biking. I’ve had sort of a low-grade obsession with packrafting since 2009 when I ran into some people in Alaska who were on their way to hike up and over the Brooks Range and paddle down to the Arctic Ocean. (I’m 99.9% certain it was these folks.) But anyway, I never managed to actually do any packrafting until spring break this year when — along with my friends Brian and Tim — I floated a section of the Dirty Devil River in southern Utah.

The Dirty Devil is a little tricky: it’s a seasonal desert river that, on average, only has enough flow for floating during February and March — and February can be really cold. Luckily the university’s spring break is in March, usually a superb time to be in the desert: the days are getting longer and warmer, the nights are still cold, and there’s not a single bug to be found. Usually you get snowed on for a couple hours, but the snow doesn’t stick around.

None of us had so much as touched a packraft, but all of us had canoed or kayaked before, so we figured a class 1-2 river should be doable. The Dirty Devil flows through a canyon 1500 feet (450 meters) deep and, once we started, there was no realistic possibility of escape except by finishing the trip 36 miles later.

We started out on a spectacular morning:

To set up a shuttle we drove down the rough but super fun Poison Springs Road; here Brian’s cramming a few last things into my vehicle with the DD running narrow and fast in the background:

There was quite a bit more water in the channel than we’d expected; it turned out that Happy Canyon, about eight miles upstream, was in the middle of a minor flash flood.

Next we returned to the paved road, drove a few miles, then down to the top of the western end of the Angel Trail (one of a number of trails with that name). Nobody else was there, and we didn’t see anyone until the last day of our trip. At this point the weather was deteriorating:

And soon we here hiking in the snow:

The snow didn’t last long, but it remained windy and not far above freezing; we were happy to have wetsuits and puffy jackets. We put in across the river from Robbers Roost Canyon, where Butch Cassidy and his gang used to hide out:

We paddled to Angel Cove, which has good camping and a nice spring. With the braided river and the wind it took forever to go just a couple of miles. We were pretty happy to get there, get into dry clothes, make some dinner. Also, the rest of our trip lacked reliable water sources (water from the DD is nasty) so we purified enough water to last the rest of the trip.

The next morning it was still cold, but nice and sunny:

We saw both sides of a ridge called the Saw Tooth, which the river meanders around:

We also passed No Man’s Canyon and Larry Canyon, both of which would have been great to spend a few days hiking, but we hadn’t been able to budget time for that.

The opaque water was hard to read and we did an awful lot of pulling our boats around the river trying to find the channel, which moved from bank to bank every few hundred yards:

This was exhausting and there was plenty of quicksand too. Time flew by as we moved down the river at less than a walking pace and it was around 3pm before we noticed that we should stop for lunch.

The scenery continued to develop:

We pushed late into the day:

And finally stopped at a nice camp site on the river bank where we watched the last of the sun on the canyon walls:

There was copious driftwood so we had a fire:

Morning was good:

Since we hadn’t made good time the previous day I got us up early, but it was just too cold to get into the water and anyway our wetsuits were frozen stiff. We had another cup of coffee and waited for the sun:

It ended up being warmer than the previous day — just about perfect weather — and for whatever reason we made much better time than we had earlier. At this innocuous-looking rest stop I hit some ugly quicksand and told the others to land somewhere else:

Brian landed a little ways to one side and just walked up the bank but Tim got mired much worse than I had, it probably took him 15 minutes to get out. We couldn’t even help.

Over the course of the day we passed Twin Corral Box Canyon and Sam’s Mesa Box Canyon, both huge and seldom-visited places I’d have been happy to spend a couple days exploring, but again we wanted to keep on schedule. Travel was generally easy and also the river gained some fun riffles as it started to drop more rapidly through the sandstone layers. It wasn’t possible to avoid hitting rocks since we couldn’t see through the water, but packrafts seem to be very tough. They’re also quite stable, I only worried about flipping once or twice, and never did.

By late afternoon we reached Happy Canyon — named, apparently, by old-time cowboys who were happy to find a way out of it — which was still running from its flash three days earlier:

No shortage of quicksand here either, ugh!

We didn’t take much time to explore, but rather found a place to camp nearby. By this point we were only eight miles from our takeout and I finally stopped worrying about getting trapped in the canyon if one of the boats popped or got lost. For whatever reason this night was extra cold, my water bottle was more than half frozen in the morning, meaning it had probably gotten below 25 degrees.

Nice light on the walls:

Happy Canyon has some of the finest walk-through narrows anywhere on the Colorado Plateau:

After this excursion it was time to pack up camp and float some more. We immediately hit a series of rock gardens:

Progress was overall slower than the previous day, but we moved along and the scenery kept being good.

And suddenly we reached the takeout, where we had adult beverages and loaded a lot of extremely sandy gear into Brian’s truck:

And then back to pick up my truck, which we reached just in time for sunset:

Here in the middle-right you can see the sawtooth ridge that we had floated past a few days earlier:

And finally, as tradition dictates, we stopped for burgers at Ray’s in Green River. This was a super good trip, but fairly difficult physically and perhaps a bit much to bite off for a group of 40-something first-time packrafters.

Verifying Popcount

Popcount is the function that returns the number of set bits in its argument. Showing that a popcount implementation does what it claims to do has become one of my favorite examples to use when I need to quickly show students how we can reason about programs mathematically. Something like a selection sort is probably the more standard example here, but sorting gets into some subtleties that I’d rather reserve for a second example. Popcount is so easy that this whole example can be done in a half hour.

The programming language and the particular algorithm don’t matter too much, here we’ll use what is perhaps the simplest popcount implementation in C or C++:

int popcount(uint64_t x) {
  int c = 0;
  for (int i = 0; i < 64; i++) {
    c += x & 1;
    x >>= 1;
  return c;

This works, of course, because it scans the entire argument, incrementing the counter each time it finds a set bit. To show that it works in a more formal fashion, we can use methods more or less out of A Discipline of Programming, which allow us to combine obvious effects of individual statements into (hopefully) the desired high-level property, which we might as well annotate the program with right now:

int popcount(uint64_t x) {
  int c = 0;
  for (int i = 0; i < 64; i++) {
    c += x & 1;
    x >>= 1;
  // c = POPCOUNT(orig_x)
  return c;

Here we’ve used a couple of tricks that need to be fully explained when giving this example to students who aren’t used to this kind of reasoning. First, since x is modified as the program executes, we’ve used orig_x as a shorthand for the original value of x when it was passed as an argument. (This syntax is a trick I learned from ACSL.) Second, here we’ve used the POPCOUNT function, which is the mathematical specification of popcount, which is a completely different thing than the C code we are trying to verify. Consistently applying the correct rules to mathematical and programmatic elements isn’t as straightforward as we would hope and in fact quite a few results from the bad old days of non-mechanized program verification are wrong in subtle ways.

To make our proof go, we’ll need to find the right loop invariant: a statement that is true before the loop executes and also true after every loop iteration. The trick when explaining this task to students is helping them realize that (for a simple example like this) if they understand how the algorithm works, they effectively already know the loop invariant. They just need some help in teasing it out into a usable form. One of the stumbling blocks is that every loop has many invariants, but most of them (for example c ≥ 0) aren’t helpful. We can avoid a lot of silly loop invariants by observing that the right invariant will be strong enough to imply the correctness criterion for the program while also being weak enough to already be true before the first loop iteration, but I’ve found that negative constraints like these are not very helpful for students. The trick, of course, is to capture the abstract fact that is being held in balance as the loop modifies pieces of program state, and to make a lecture like this work, we need to avoid the appearance of pulling this fact out of thin air. It defeats the entire purpose of this discussion if students walk away feeling like there was a magical step in the middle.

So anyway, by hook or by crook we need to get the audience to realize that the key to this algorithm is the loop invariant c + POPCOUNT(x) == POPCOUNT(orig_x). At this point we’re over the hump and we should mark up the code some more:

int popcount(uint64_t x) {
  int c = 0;
  // c + POPCOUNT(x) == POPCOUNT(orig_x) ???
  for (int i = 0; i < 64; i++) {
    // c + POPCOUNT(x) == POPCOUNT(orig_x)
    c += x & 1;
    x >>= 1;
    // c + POPCOUNT(x) == POPCOUNT(orig_x) ???
  // c + POPCOUNT(x) == POPCOUNT(orig_x) => c = POPCOUNT(orig_x) ???
  return c;

Here I’ve used question marks to indicate a proposition that is still in doubt. Now we’ve broken up the problem into manageable pieces and can visit them in turn:

  • The loop invariant is true before the loop executes because c = 0 and x = orig_x.
  • To see that the loop invariant is preserved by the loop body, we analyze two cases: x & 1 is either zero or one. If it is zero, then c is unchanged and also, clearly, shifting x right by one place does not change POPCOUNT(x). If it is 1 then c is incremented by one and shifting x right by one place has decreased its popcount by one. So we can see that either way, the invariant is maintained.
  • To prove that the loop invariant implies the overall correctness condition, we must observe that x = 0 after being shifted right 64 times. This is the sort of situation where we better be clear about the rules for shifting signed vs. unsigned values in C or C++.

Of course we’ve skipped over a lot of smaller steps, but this argument captures the essence of why the algorithm works. When presenting this particular algorithm, we can safely skip the termination proof since i is not modified in the loop body.

An entertaining variant is Kernighan’s popcount:

int popcount(uint64_t x) {
  int c = 0;
  while (x) {
    x &= x - 1;
  return c;

I didn’t use it in this post because I feel like it’s useful to show the analysis by cases in the simpler algorithm. Wikipedia has some faster and much harsher algorithms, which are the ones used in practice on processors that do not have a native popcount instruction.

Anyway, I presented this material in a software engineering class yesterday, and realized that I’d used this example a couple of times before, so it seemed worth writing up here.

Explaining Code using ASCII Art

People tend to be visual: we use pictures to understand problems. Mainstream programming languages, on the other hand, operate in an almost completely different kind of abstract space, leaving a big gap between programs and pictures. This piece is about pictures drawn using a text character set and then embedded in source code. I love these! The other day I asked around on Twitter for more examples and the responses far exceeded expectations (thanks everyone!). There are a ton of great examples in the thread; here I’ve categorized a few of them. Click on images go to the repositories.

Data Structures

One of the most common kinds of ASCII art in code is illustrating the shape of a data structure.

The example I started with comes from LLVM:

The layout of a data structure in the Jikes RVM:

A tree rotate in Musl:

Double-ended queue from the Rust library:

Swift compiler internals:

Malloc header layout:

State Machines

JavaScript profiling:

RPCs in Cloud Spanner:

I/O stream states:

Logical Structure in the Problem Domain

Control flow in an NWScript program being decompiled:

ECC internals:

Formatting numbers:

A quantum circuit:

Balancing memory management objectives in an OS kernel:

Subtyping relations (this is a very cool special case where the ASCII art is also code):

The format of a DBF file:

A lookup table for image processing:

Shape of a color function:

Structure of a URI:

A very quick tutorial on undo systems from emacs:


Attitude control in the Apollo Guidance Computer (!!!):

Image tiling:

Boomerang trajectories in Nethack:

Rendering CSS borders:


Speed control in a milling machine:

Scrolling web pages:

I hope you enjoyed these as much as I did!

Walking or Biking to NSF

Since the National Science Foundation funds a large fraction of academic computer science research in the USA, we often end up traveling to Washington to visit the NSF. This post is just to say that if you are traveling light, if you need some exercise, if you have a bit of free time, and if the weather is good, it is feasible and pleasant to walk or bike from Reagan Washington National Airport (DCA) to the new NSF location in Alexandria.

The short version is:

  • Don’t follow Google Maps in pedestrian mode, which finds a not-great route
  • Do follow the Mount Vernon Trail, which connects DCA and Old Town Alexandria in a pretty direct fashion

I’ve walked this route in both directions; it’s roughly five miles and takes me 80 or 90 minutes (I’m a pretty fast walker so ymmv). I have not biked it, but Capital Bikeshare has several stations near DCA and quite a few stations scattered around Alexandria — so this seems like a great option, assuming that your luggage is in a backpack or something else hands-free.

Here’s an overview of the whole route:

To help orient yourself, the beltway runs across the bottom of the image, the Potomac River runs north to south, you can see DCA’s runways towards the top, and Old Town Alexandria is in the triangular area in between the Potomac, the beltway, and the red walking route.

The only potentially tricky bit is locating the Mt Vernon Trail. Starting anywhere in the airport, walk to Terminal A and go outside to the semicircle-shaped pickup/dropoff area, and then find a sidewalk going roughly southwest. The road you want to walk along is South Smith Boulevard. Don’t cross under the Metro tracks and also don’t go down a level to Thomas Avenue, which gives access to some of the buildings right next to the airport terminal. This may be helpful:

In less than half a mile, as you reach the end of the airport-related buildings, the Mount Vernon Trail, paved with a dotted yellow center line, is easy to see and you should move over to it here (open kmz files in Google Earth). This is all pretty obvious on the ground, and there are some helpful signs. Whatever you do, don’t cross the George Washington Parkway — the busy divided highway that runs nearby.

Follow the Mount Vernon Trail past Daingerfield Island to Alexandria, at which point the trail forks and the Mount Vernon Trail heads over towards the Alexandria waterfront. This is a nice walk if you have time, but it’s out of the way and you can get to NSF more quickly by moving to East Abingdon Drive here and then winding your way through residential Alexandria towards King Street and then NSF.

As long as you have some sort of map this whole route is very easy to follow, as is the reverse route back to the airport.