Alive2 Part 2: Tracking miscompilations in LLVM using its own unit tests

[This piece is co-authored by Nuno P. Lopes and John Regehr.]

Alive2 is a formal verification framework for LLVM optimizations. It comprises several tools, including a plugin for ‘opt’ that verifies whether the optimizations just run are correct. We gave an introduction to Alive2 in a previous post.

A few years ago we used an old version of Alive to verify small, automatically generated test cases. These test cases had functions with three instructions which were then piped through different optimizations and their outputs verified with Alive for correctness. This approach found several bugs in the then-current version of LLVM.

But we want more — we want to identify and hopefully help fix all miscompilations that can be triggered in practice. To this end, we created Alive2, which can reason about a wide variety of intraprocedural optimizations including those that work with control flow, memory, vectors, and floating point types. We then started running Alive2 regularly on LLVM’s own unit tests. This has already resulted in 39 bug reports.

Working on the unit tests makes a lot of sense:

  • The behaviors being tested are clearly important, or at least interesting, or else the LLVM developers wouldn’t have made them part of the unit test suite in the first place.
  • The tests are usually small and only involve one or a few optimizations, making it much easier to fix problems than it is when we observe miscompilations in the field.

In this post we describe the process of running Alive2 on LLVM’s unit tests, and give an overview of the problems we’ve found in LLVM that still need to be fixed.

Running Alive2 on LLVM’s unit tests

LLVM’s unit tests are run by a tool called lit, which makes it easy to interpose on calls to opt, the standalone optimizer for LLVM IR. We simply run a subset of the unit tests like this:

$ ~/llvm/build/bin/llvm-lit -vv -Dopt=$HOME/alive2/scripts/ ~/llvm/llvm/test/Transforms

Our script then loads the Alive2 opt plugin and fails the test case if Alive2 can prove that it was optimized incorrectly. The script has a list of passes that Alive2 doesn’t support, so it can avoid false positives (tests that look wrong to Alive2 but are actually correct). We take this issue very seriously, to avoid wasting people’s time. The passes that aren’t currently supported are those that perform interprocedural optimizations.

In addition to unsupported passes, Alive2 doesn’t support all of LLVM’s IR, including some function attributes and many intrinsics and library functions. If a test case uses an unsupported feature, Alive2 still tries to analyze it to find miscompilations involving only supported IR features, while conservatively approximating the remaining parts.

Running the above lit command takes around 15 hours of CPU time, or less than 2 hours on an 8-core machine.

Tracking bugs over time

We built a simple dashboard to track the evolution of bugs found by Alive2 in LLVM’s unit tests over time.

The plot shows two lines: the number of failures per day, and the number of failures if ‘undef’ values were to disappear from LLVM. At the time of writing, Alive2 reported a total of 95 bugs in the unit tests, 25 of which would become correct if undef were removed. Since ‘undef’ values still exist, these optimizations are incorrect. We plot both lines as an incentive to get rid of ‘undef’ values: over a quarter of the reports would disappear.

The page has further information for the last five runs, including the reports that were fixed and which are new in each run. Clicking on a run gives the list of all failed tests, as well as a short manually written description of the failure cause.

We use a timeout in Alive2 for the work done by the theorem prover underneath. Unfortunately, some reports are only triggered very close to the timeout and therefore sometimes there is churn, with a test failing only in some runs.

Current issues in LLVM

Many of the bugs we’ve found and reported have already been fixed. Most were InstCombine patterns that were incorrect for some inputs. Other issues are trickier to solve: some bugs are hard to fix without causing performance regressions, as they may require further changes in other optimizations, or even in SelectionDAG, to recognize a different IR canonicalization. In some cases there is not yet agreement on the right semantics for the IR; these can’t be fixed right away without weighing the pros and cons of different semantics and getting the community to reach a consensus. We now give a few examples of open issues in LLVM.

Introduction of branch on poison/undef

Branching on a poison or undef value is immediate UB, like dividing by zero. Optimizations that hoist branches and/or combine multiple branches can introduce branches on values that were not observed in the original program. This issue is prevalent across multiple optimizations like SimplifyCFG, loop unswitch, simple loop unswitch, and induction variable simplification. A simple example:

if (x) {
  if (y) {
    ...
  }
}

  =>

if (x & y) {
  ...
}

The original program would not branch on y if x was false. The optimized code, however, always branches on x and y. Therefore, when x=false and y=poison, the source program is well defined, but the optimized program is UB. This is a failure of refinement and the transformation is wrong.

One possible solution is to stop performing these transformations, but in general this is not considered acceptable because it loses performance. A different solution is to add freeze instructions, which we introduced in a paper and which have been supported in LLVM since version 10.0. Freezing a regular value has no effect, while freezing undef or poison results in an arbitrary value being chosen.
A correct version of the wrong optimization above is:

if (x & freeze(y)) {

We know where freeze instructions need to be introduced — this is not too difficult. However, adding too many freezes can lead, and has led, to performance regressions. These regressions usually arise not because freeze is inherently expensive, but because freezes can inhibit optimizations, particularly those involving loops. Fixing the regressions is a matter of putting in the engineering effort to do things like hoisting freeze out of loops, eliminating unnecessary freezes, and teaching optimizations not to be blocked by freezes. Work is underway, but several parts of the compiler need to be touched, so it will take some time.
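As a concrete illustration, here is a toy Python model (ours, not LLVM’s actual machinery) of these semantics: poison is a sentinel value, ‘and’ propagates it, branching on it is UB, and freeze replaces it with an arbitrary concrete value.

```python
POISON = "poison"

class UB(Exception):
    """A modeled execution reached undefined behavior."""

def and_(a, b):
    # poison propagates unconditionally through 'and': unlike undef,
    # even false & poison is poison
    return POISON if POISON in (a, b) else (a and b)

def branch(cond):
    # branching on poison is immediate UB, like dividing by zero
    if cond is POISON:
        raise UB("branch on poison")
    return cond

def freeze(v):
    # freeze replaces poison by an arbitrary concrete value; we pick
    # False for determinism
    return False if v is POISON else v

def src(x, y):
    # if (x) { if (y) { ... } }
    if branch(x):
        if branch(y):
            pass
    return "defined"

def tgt_wrong(x, y):
    # if (x & y) { ... }  -- the incorrect hoisted form
    if branch(and_(x, y)):
        pass
    return "defined"

def tgt_fixed(x, y):
    # if (x & freeze(y)) { ... }
    if branch(and_(x, freeze(y))):
        pass
    return "defined"

print(src(False, POISON))        # defined: the inner branch never runs
try:
    tgt_wrong(False, POISON)
except UB as e:
    print("UB:", e)              # UB: branch on poison
print(tgt_fixed(False, POISON))  # defined again, thanks to freeze
```

For x = false, y = poison, the source is well defined, the hoisted form hits UB, and the frozen form is well defined again, matching the refinement argument above.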

Introduction of undef uses

Generally speaking, it is not legal to add a new use of an undef value in LLVM. This is a problem in practice because it is easy for compiler developers to forget that almost any value is potentially undef. For example, shl %x, 1 and add %x, %x are not equivalent if %x might be undef. The former always returns an even number, while the latter could also yield an odd number if %x is undef. In other words, different uses of an undef value can resolve to different concrete values. Alive2 reports 25 test cases that are incorrect if we consider some of the variables to be undef, but are correct otherwise.
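To see this concretely, here is a small 8-bit model (ours, not part of Alive2) of how undef behaves: each use of an undef value may resolve independently, so we enumerate the possible results of each expression.

```python
from itertools import product

MASK = 0xFF  # model i8 so exhaustive enumeration stays cheap

def possible_shl_x_1():
    # shl %x, 1: the single use of undef %x resolves to one value v
    return {(v << 1) & MASK for v in range(256)}

def possible_add_x_x():
    # add %x, %x: the two uses of undef %x may resolve independently
    return {(v1 + v2) & MASK for v1, v2 in product(range(256), repeat=2)}

shl = possible_shl_x_1()
add = possible_add_x_x()
print(all(v % 2 == 0 for v in shl))  # True: shl %x, 1 is always even
print(any(v % 2 == 1 for v in add))  # True: add %x, %x can be odd
print(shl < add)                     # True: shl's results are a strict
                                     # subset of add's
```

Because shl’s result set is a strict subset of add’s, rewriting add %x, %x to shl %x, 1 is a refinement, while the reverse rewrite is not.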

Some cases are easy to fix. For the example above, we can simply choose to canonicalize the IR to the shift instruction. We could also freeze %x, since all uses of the frozen value will resolve to the same concrete value.

This canonicalization has no obvious alternative solution other than using freeze or removing undef entirely from LLVM:

sext(x < 0 ? 0 : x)
  =>
zext(x < 0 ? 0 : x)

If x is undef, the expression x < 0 ? 0 : x may yield a negative number. Therefore, replacing the sign-extension with a zero-extension is not correct. Freezing x or eliminating undef values would both work.
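This claim can be checked with the same kind of toy 8-bit model (ours, not Alive2’s machinery): the comparison’s use of %x and the select arm’s use of %x resolve independently, so the select can yield a negative value.

```python
def to_signed(v):
    # interpret an 8-bit value as signed
    return v - 256 if v >= 128 else v

def select_results():
    # x < 0 ? 0 : x with %x undef: the comparison's %x and the select
    # arm's %x are separate uses, so they may resolve to different values
    out = set()
    for x_cmp in range(256):       # resolution of %x in the comparison
        for x_val in range(256):   # resolution of %x in the select arm
            out.add(0 if to_signed(x_cmp) < 0 else x_val)
    return out

res = select_results()
print(any(to_signed(v) < 0 for v in res))  # True: e.g. x_cmp = 0, x_val = 255
# 0xFF sign-extends to 0xFFFF but zero-extends to 0x00FF, so replacing
# sext with zext changes the result.
```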

Pointer comparison semantics

LLVM pointers track object provenance. That is, a pointer derived from a malloc() call can only point within the buffer allocated by that call, even if the underlying machine value of the pointer happens to point to another buffer. For example:

int *p = malloc(4);
int *q = malloc(4);
p[x] = 2; // this implies x = 0; cannot write to q, for example

The reason for this semantics is that it enables cheap, local alias analysis. If provenance tracking didn’t exist, alias analysis would often need to do whole program analysis where today it can get away with easy, local reasoning. This is nice; we all love fast compilation times and good object code!

The goal of provenance tracking is to prevent the program from learning the layout of objects in memory. If the program could learn that q is placed right after p in memory in the example above, it could potentially change their contents without the compiler realizing it. One way the memory layout can leak is through pointer comparison. For example, what should p < q return? If it’s a well-defined operation, it leaks memory layout information. An alternative is to return a non-deterministic value to prevent layout leaks. Both semantics for comparing pointers to different objects have pros and cons:

| | Comparison yields a deterministic value | Comparison yields a non-deterministic result |
|---|---|---|
| p = obj1 + x; q = obj2 + y; x, y unknown | Cannot simplify the comparison; depends on the values of x & y | Can fold p < q to false |
| Equivalence between pointer and integer comparison? | Yes; can simplify (ptrtoint p) < (ptrtoint q) to p < q | |
| Assume all pointers are > null | Yes; can simplify (p <= q) or (p == null) to (p <= q) | Maybe (can special case?) |
| inttoptr simplification | Harder; pointer comparisons leak memory layout | |

Unfortunately LLVM currently wants the best of both worlds, with different optimizations assuming different semantics, which can lead to miscompilations. We need to settle on one semantics and stick to it throughout the compiler.

Load type punning

In order to respect pointer provenance semantics, LLVM’s memory is typed. We need at least to be able to distinguish whether a byte in memory holds an integer/float or a pointer. However, LLVM sometimes violates this rule and loads a pointer with an integer load operation or vice-versa. This implies that any load instruction could potentially do an implicit inttoptr or ptrtoint operation, thus leaking information about the memory layout. This is not good: as we have seen before, it undermines alias analyses.

Introducing type punning instructions is common in LLVM. Many of these cases arise when converting small memcpy operations to load/store instructions. While memcpy copies memory as-is, load/store instructions are typed. Additionally, LLVM currently has no way of expressing C’s char or C++’s std::byte. These types are allowed to store any type of data, including pointers and integers.

A suggestion to fix this issue is to introduce a ‘byte’ type in LLVM IR, a sort of union of all LLVM types. This type couldn’t be used in arithmetic operations, but only in memory operations and appropriate casts (yielding poison when casting byte to the wrong type). An alternative solution is to accept that load instructions may do implicit casts and fix alias analyses to take this fact into account (likely paying a performance price).

Select to Arithmetic

LLVM canonicalizes select instructions to arithmetic when it can do so. All of these transformations are wrong because select blocks a poison value from its not-chosen input, but arithmetic does not. The most obvious fix, changing the meaning of select so that it no longer blocks poison, is unacceptable because this renders unsound a different class of transformation that LLVM wants to do: canonicalizing simple conditional expressions to select. This kind of tricky semantic balancing act is found again and again in LLVM IR. Appropriate fixes would be to avoid these transformations (which requires strengthening support for select in backends, and in fact some work in this direction has been done) or else freezing values before performing arithmetic on them.

Other issues

This post is already long, so we list here a few further open issues and a short description:

  • FP bitcast semantics. What’s the value of (int)(float)x, where x is an integer and the casts are bitcasts? Is it always x? The corner case is when x matches one of the NaN bit-patterns: can the CPU canonicalize the NaN bit-pattern, so that converting the float back to an integer yields a result different from x?
  • Load widening. LLVM combines multiple consecutive loads into a single load by widening the loaded type. This is not correct: if any of the loaded values is poison, all of them become poison. A solution is to use loads with vector types, as poison elements in a vector don’t contaminate each other.
  • inttoptr(ptrtoint(x)) => x. This is not correct, as there can be two valid pointers with the same integer value (beginning of an object and one past the end of another). The solution is to enable this transformation only when a static analysis determines it’s safe to do so. Currently, LLVM performs this optimization unconditionally.
  • Shufflevector semantics. When the mask is out of bounds, the current semantics are that the resulting element is undef. If shufflevector is used to remove an element from the input vector, then the instruction cannot be removed, as the removed input element might have been poison, and poison does not refine the undef the current semantics promise. The solution is to switch the semantics to yield poison instead.
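The load-widening item above can be sketched with a small Python model (ours, using a sentinel for poison) showing why widening is wrong and why a vector load is fine:

```python
POISON = "poison"

def load_two_i8(mem):
    # two separate i8 loads: each result is poison only if its own
    # byte is poison
    return mem[0], mem[1]

def load_widened_i16(mem):
    # one widened i16 load: if any byte is poison, the whole loaded
    # value is poison, so both extracted bytes become poison
    if POISON in mem:
        return POISON, POISON
    value = mem[0] | (mem[1] << 8)
    return value & 0xFF, value >> 8

def load_vector(mem):
    # a <2 x i8> vector load: poison elements don't contaminate
    # each other
    return mem[0], mem[1]

mem = [42, POISON]            # the second byte in memory is poison
print(load_two_i8(mem))       # (42, 'poison')
print(load_widened_i16(mem))  # ('poison', 'poison') -- not a refinement!
print(load_vector(mem))       # (42, 'poison') -- same as the original
```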


Our short-term goal is to get LLVM to a point where Alive2 detects zero miscompilations in the unit test suite. Then we hope that an Alive2 buildbot can continuously monitor commits to help ensure that no new miscompilation bugs are committed to the code base. This is a powerful baseline, but we believe there are additional miscompilation errors lurking in LLVM that will only be exposed by more thorough testing. We will accomplish this by using Alive2 to monitor LLVM while it optimizes real applications and also randomly generated programs.

We thank everyone involved in fixing LLVM bugs as well as discussing semantics and fixing LangRef!

Sid’s Mountain Backpacking Loop

Last fall my friend Brian and I went on a short backpacking trip in the San Rafael Swell. We left SLC early, drove to Ferron, Utah, and then followed a high-clearance dirt road to the rim of North Salt Wash, a wide canyon that feeds the San Rafael River. We dropped into this open canyon using an old road and then a foot trail:

There’s an extensive area of rock art near where we reached the canyon floor, some of it fairly fresh-looking (so probably less than 1,000 years old):

We particularly enjoyed the turtle.

Other petroglyphs in the area are faded, meaning they are most likely much older:

We walked down North Salt Wash for about five miles until it ended at the San Rafael River:

We had lunch here and then walked downstream, sometimes on trails, other times just walking in the water. This hike would be difficult or impossible in spring when the river has a lot of flow, but in fall it was slow and shallow. About three miles later we reached Virgin Springs Canyon, which also has some nice rock art:

A short distance past the rock art, this canyon has a beautiful perennial pool, and we stopped for a while there to filter a lot of water — about nine liters each — since we weren’t going to run into reliable water for the next two days. October is an idyllic time in canyon country with moderate temperatures, stable weather, and no bugs, but it’s not generally a good time to plan on finding water in intermittent sources such as potholes and weak springs.

Above the pool, Virgin Springs Canyon is impassable to cattle, making it immeasurably more pleasant than if it was all cowed up. It has no trails of any sort and some of the walking was fast and easy, but other times we had to climb through boulder piles and shove our way through thick scrub. Our packs had started the day nice and light (25 or 30 pounds probably) but adding 20 pounds of water made them a lot less pleasant to carry.

Soon enough it was late afternoon and we found my favorite kind of desert campsite: an area of flat slickrock. These are perfect for low-impact camping and also for walking around barefoot without getting all dirty or stepping on a cactus. The weird white thing is my bivy bag; I don’t bother with a tent when the weather is good and the bugs aren’t bad.

In the morning we continued walking up Virgin Springs Canyon. First order of business was bypassing a big dryfall by working through a cliff band:

This left us with only the final headwall to surmount; there’s a slightly scrambly route out of the canyon in the shadowed area in the left-center of this picture:

Looking back at upper Virgin Springs Canyon from just below the exit:

Overall this canyon exit ended up being pretty easy and straightforward, as these things go.

At this point we were up on Sid’s Mountain, a remote plateau. We were only a short distance from Swazy, the local high point, so we climbed it and got good views in all directions.

Our next destination was an old cabin; I think the metal lids on top of the stilts are to keep mice out.

Next we had a longish walk up on the flats. Despite the good scenery, this part of the hike was not too inspiring: the afternoon was a bit too warm, plenty of cows had been in the area, and those yellow-green bushes Brian is standing next to are full of hellish little sharpies that try to break off and get into your socks. I was wearing running shoes and had to stop and pick those things out every few minutes.

Towards evening we reached the rim of Saddle Horse Canyon where we found another nice slickrock campsite. The upside of carrying water is you can camp anywhere! I didn’t take pictures but we had a spectacular sunset and again it was warm enough to walk around barefoot, a welcome change after being in dusty sneakers all day.

I got up early and walked around shivering, waiting to take a few pictures as the sun came up.

In a little while the Earth’s shadow started dropping towards the skyline.

First light on the Wasatch Plateau.

The rugged central San Rafael Swell to the south of us.

And finally full sun.

We only had about eight miles of walking to get back to my truck, so didn’t worry about getting an early start. The descent down into Saddle Horse Canyon was on an angle-of-repose rubble slope that looked grim, but on fresh legs it wasn’t bad. A big arch lives nearby:

Lower down in Saddle Horse Canyon there were areas of running water. Though there were a few sections of brush and soft sand, overall it was pretty easy walking.

Saddle Horse Canyon ends at North Salt Wash basically right at the bottom of the trail we had used to enter the canyon on the first day, so it was a quick walk out from there. Obligatory post-trip selfie:

Alive2 Part 1: Introduction

[This piece is co-authored by Nuno P. Lopes and John Regehr.]

Compiler bugs threaten the correctness of almost any computer system that uses compiled code. Translation validation is a path towards reliably correct compilation that works by checking that an individual execution of the compiler did the right thing. We created a tool, Alive2, that takes an unoptimized function in LLVM intermediate representation (IR) and either proves that it is refined by the optimized version of the function, or else shows a set of inputs that illustrate a failure of refinement.

We work in terms of refinement, rather than equivalence, because compilers often change the meaning of the code they are optimizing — but in allowed ways. For example, adding two variables of “int” type in a C or C++ program means something like “add the two quantities, returning the mathematical result as long as it is within an implementation-specified range that is no smaller than -32767..32767.” When integer addition is lowered from C or C++ to LLVM IR, this very restrictive range for validity will typically be refined to be -2147483648..2147483647. Then, when LLVM IR is lowered to a target platform such as ARM or x86-64, the behavior of additions whose results fall out of bounds is refined from “undefined” to “two’s complement wraparound.” Both of these refinements are perfectly legal, but would be incorrect if the transformation went in the other direction. Non-trivial refinement is ubiquitous during compilation and we have to take it into account in a translation validation tool.

Let’s look at an example of refinement and also learn how to invoke Alive2 and interpret its results. Here’s a pair of functions in LLVM IR:

define i16 @src(i16 %x, i16 %y) {
    %r = sdiv i16 %x, %y
    ret i16 %r
}

define i16 @tgt(i16 %x, i16 %y) {
    %z = icmp eq i16 %y, 0
    br i1 %z, label %zero, label %nonzero

zero:
    ret i16 8888

nonzero:
    %q = sdiv i16 %x, %y
    ret i16 %q
}
The first function, src, simply returns the quotient of its arguments. The second function, tgt, returns the quotient of its arguments unless the second argument is zero, in which case it returns 8888. It turns out that src is refined by tgt. This works because division by zero in src has undefined behavior: LLVM makes no promise whatsoever about the behavior of a program that divides a number by zero. tgt, on the other hand, has a specific behavior when its second argument is zero, and behaves the same as src otherwise. This is the essence of a refinement relationship. If we run the refinement check in the other direction, Alive2 will tell us that src does not refine tgt.
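To make the refinement relation concrete, here is a sketch (ours, not Alive2) that checks both directions exhaustively, shrunk from i16 to i8 so that trying every input is fast; the constant 8888 becomes 88 so it fits in 8 bits.

```python
UB = "UB"

def sdiv(x, y):
    # LLVM's sdiv: division by zero is UB, and so is signed overflow
    # (-128 / -1 does not fit in 8 bits)
    if y == 0 or (x == -128 and y == -1):
        return UB
    return int(x / y)  # C-style division, truncating toward zero

def src(x, y):
    return sdiv(x, y)

def tgt(x, y):
    # 88 stands in for the 8888 of the i16 original (it must fit in i8)
    return 88 if y == 0 else sdiv(x, y)

def refines(f_src, f_tgt):
    # f_tgt refines f_src iff, whenever f_src is defined, f_tgt is
    # defined and produces the same result; UB in f_src allows anything
    for x in range(-128, 128):
        for y in range(-128, 128):
            s = f_src(x, y)
            if s != UB and f_tgt(x, y) != s:
                return False
    return True

print(refines(src, tgt))  # True:  src is refined by tgt
print(refines(tgt, src))  # False: the reverse fails when y == 0
```

The asymmetry at y == 0 is exactly the counterexample alive-tv reports below: where src is UB, tgt may do anything, but not vice versa.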

To see these results for yourself, you can build Alive2 (making sure to enable translation validation at CMake time) and then invoke our alive-tv tool like this:

alive-tv foo.ll

Alternatively, you can simply visit our Compiler Explorer (CE) instance. Let’s talk about this user interface for a second. When you load the page you should see something like this:

The large pane on the left is where you type or paste LLVM IR. The large pane on the right shows alive-tv’s output in the case where it is able to parse the LLVM code successfully. The “share” pulldown at the top provides a short link to what you are currently seeing: this is an excellent way to show other people your results. If your LLVM code contains an error, the right pane will say “Compilation failed” and the text at the bottom of the right pane will change from “Output (0/0)” to something like “Output (0/3)”. Click on this text and a third pane will open showing the error output. You should then see something like this:

After fixing the error you can close this new pane.

Returning to our example, here’s Compiler Explorer with the example already loaded. We’ve also given alive-tv the --bidirectional command line option, which causes it to run the refinement check in both directions. The output therefore has two parts; this is the first one:

define i16 @src(i16 %x, i16 %y) {
  %r = sdiv i16 %x, %y
  ret i16 %r
}
=>
define i16 @tgt(i16 %x, i16 %y) {
  %z = icmp eq i16 %y, 0
  br i1 %z, label %zero, label %nonzero

nonzero:
  %q = sdiv i16 %x, %y
  ret i16 %q

zero:
  ret i16 8888
}
Transformation seems to be correct!

“Transformation seems to be correct!” indicates that Alive2 thinks that src is refined by tgt. The arrow (=>) points in the direction of the refinement relation. The second part of the output is:

define i16 @tgt(i16 %x, i16 %y) {
  %z = icmp eq i16 %y, 0
  br i1 %z, label %zero, label %nonzero

nonzero:
  %q = sdiv i16 %x, %y
  ret i16 %q

zero:
  ret i16 8888
}
=>
define i16 @src(i16 %x, i16 %y) {
  %r = sdiv i16 %x, %y
  ret i16 %r
}
Reverse transformation doesn't verify!
ERROR: Source is more defined than target

i16 %x = undef
i16 %y = #x0000 (0)

i1 %z = #x1 (1)
i16 %q = #xffff (65535, -1)	[based on undef value]

i16 %r = #xffff (65535, -1)

Notice that now the arrow points from tgt to src. This time Alive2 says:

Reverse transformation doesn't verify!
ERROR: Source is more defined than target

(Here in the reverse direction, “source” unfortunately refers to “tgt” and “target” refers to “src”.)

Interpreting a counterexample from Alive2 isn’t always easy. Part of the blame falls on LLVM’s tricky undefined behavior semantics, but also there are some UX details that we hope to improve over time. The main thing to remember is that a counterexample is an example: it represents a single execution which illustrates the failure of refinement. The most important part of a counterexample is the set of input values to the IR being checked:

i16 %x = undef
i16 %y = #x0000 (0)

These are, of course, the arguments to the functions. Given a set of input values, you should ideally be able to execute both the unoptimized and optimized functions in your head or on paper, in order to see the difference. Alas, this depends on the functions being fairly small and also on you having a good understanding of LLVM’s undefined behavior model.

The next part of the output is designed to help you think through the execution of the code; it provides intermediate values produced as the source and target execute the counterexample:

i1 %z = #x1 (1)
i16 %q = #xffff (65535, -1)	[based on undef value]

i16 %r = #xffff (65535, -1)

Now that we’ve seen all of the parts of Alive2’s output, let’s work through this specific counterexample. The key is passing 0 as the argument to the %y parameter; the value of %x doesn’t matter at all. Then, tgt branches to %zero and returns 8888. src, on the other hand, divides by zero and becomes undefined. This breaks the rule that optimizations can make code more defined, but not less defined. We plan to augment Alive2’s counterexamples with some indication of where control flow went and at what point the execution became undefined. Also, we should suppress printing of values that occur in dead code. Alternatively, and perhaps preferably, we could feed Alive2’s counterexample to a separate UB-aware LLVM interpreter (a tool that doesn’t yet exist, but should).

In general, when refinement fails, it means that there exists at least one valuation of the inputs to the function such that one of these conditions holds:

  1. the source and target return different concrete answers
  2. the source function returns a concrete answer but the target function is undefined
  3. there is a failure of refinement in the memory locations touched by the function

We’ll explain memory refinement in a later post in this series.

Alive2’s predecessor, Alive, was also a refinement checker. It was based on the idea that a user-friendly domain-specific language could make it easier for compiler developers to try out new optimizations. This language supported omitting some details found in real LLVM IR, such as how wide, in bits, an operation being optimized is. Alive2 still supports this domain-specific language, but much of our recent work has focused on handling LLVM IR directly. This means that we are working at a slightly lower level of abstraction than before, but it allows us to run our tool on large amounts of code found in the wild.

One of our main targets for Alive2 has been LLVM’s unit test suite: a collection of about 22,000 files containing LLVM IR. Each file specifies that one or more LLVM optimization passes should be run, and then checks that the file gets optimized as expected. When we run Alive2 over the test suite, we ignore the expected result and instead simply check that the optimized code refines the original code. Perhaps surprisingly, this has caught quite a few bugs: 27 so far. Why would the test suite itself, when all tests are in a passing state, be capable of triggering bugs in LLVM? This can happen for several reasons, but the most common one is that LLVM’s test oracles — which decide whether the compiler behaved correctly or not — are generally coarse-grained. For example, a unit test about optimizing division by power-of-two into right shift might simply test for the presence of a shift instruction in the optimized code. Alive2, on the other hand, acts as a fine-grained oracle: if the optimized code does not preserve the meaning of the original code for even one choice of input values, this fact will be detected and reported. Here’s an example of a bug we found this way.

Alive2 handles a large subset of LLVM, but not all of it. Right now there are three main limitations. First, loops are only partially analyzed. Alive2 unrolls loops once, so any miscompilation that only manifests when the loop runs for more than one iteration cannot be detected. Second, interprocedural optimizations are not supported. Although Alive2 can handle function calls in the code being analyzed, it does not look across functions. So a function inlining transformation cannot be analyzed, nor can something like removing an unused global variable. Third, some kinds of code will reliably cause the underlying SMT solver, Z3, to take too long to return a result. Floating point operations, wide integer divisions, and complex memory operations are common culprits.

This post has introduced Alive2 and shown off some of its basic capabilities. Subsequent posts in this series will cover more detailed technical topics such as Alive2’s memory model, verifying vector and floating point optimizations, and finding bugs in LLVM backends and also binary lifters that turn object code into LLVM IR.

Our Compiler Explorer instance is running on a machine sitting in John’s office. He will make a good-faith effort to keep it available, but no doubt it will be down sometimes. Feel free to drop him a line if you find it not working. We plan to look for a more permanent solution.

Precision Opportunities for Demanded Bits in LLVM

[Although this post was written to stand by itself, it builds on the previous one. It is authored by Jubi Taneja, Zhengyang Liu, and John Regehr.]

When designing computer systems, it can be useful to avoid specifying behaviors too tightly. For example, we might specify that a math library function only needs to return a result that is within some distance of the true answer, in order to avoid paying the cost of a fully precise answer when that precision is not needed. Similarly, when a processor boots we might specify that some of its registers or memory are in unknown configurations, to avoid burdening the processor designers with bringing up all hardware in known states. In hardware design we often refer to don’t-care terms in a functional specification. Undefined behavior in C and C++ has a similar rationale: by avoiding specification of an implementation’s behaviors in numerous inconvenient situations, more efficient implementations become possible.

A fun thing that optimizing compilers can do is to automatically infer when the full power of some operation is not needed, in which case it may be that the operation can be replaced by a cheaper one. This happens in several different ways; the method we care about today is driven by LLVM’s demanded bits static analysis, whose purpose is to prove that certain bits of an SSA value are irrelevant. For example, if a 32-bit value is truncated to 8 bits, and if that value has no other uses, then the 24 high bits of the original value clearly do not matter.

Here is a demanded-bits-based optimization that looks for a call to the bswap intrinsic where only a single byte of the result is demanded, and then simply shifts that one byte into the necessary position. This isn’t an important optimization for targets that have a fast bswap instruction, but it would be useful for targets that require bswap to be open-coded. In this example, a full-power 64-bit bswap for 32-bit ARM (foo1) requires more than a dozen instructions but a bswap where only the low byte is demanded (foo2) requires just two instructions.

In this post we’ll present a few examples where the precision of LLVM’s demanded bits analysis compares unfavorably with a highly precise demanded bits that we created using an SMT solver. Some of these situations may be worth fixing. The solver-based algorithm is pretty straightforward. For every bit that is an input to a computation, it asks two questions:

  1. If this bit is forced to zero, does the resulting computation have the same behavior as the original, for all valuations of the remaining input bits?
  2. If this bit is forced to one, does the resulting computation have the same behavior as the original, for all valuations of the remaining input bits?

If both answers are “yes” then that input bit is not demanded. Otherwise it is demanded.

Side note: “demanded bits” is a slightly confusing name for this analysis. Since static analyses always approximate, we can say that they have a may direction and a must direction. If demanded bits says that a bit is demanded, this actually means “it may be demanded, or perhaps not: this analysis is incapable of saying for sure.” On the other hand, if it says that a bit is not-demanded, then this should be interpreted as “this bit’s value must not matter, and we’re confident that we could provide a formal proof of this.” The problem here is that it is more intuitive to name an analysis after its must direction, since this is the reliable information that we can directly use to drive optimizations. So, in other words, this analysis should have been called a don’t-care analysis, as is traditional. We are getting off of our soapbox now.

Ok, now on to the examples. The code will be in Souper syntax, but hopefully it is close enough to LLVM IR that nobody will have trouble seeing what’s going on. All of these program fragments are ones that we encounter while compiling SPEC CPU 2017. The LLVM that we’re comparing against is from January 20, 2020.

When a number is multiplied by a constant, some of its high bits may not affect the answer:

%0:i32 = var
%1:i32 = mul 40:i32, %0

demanded-bits from Souper:   for %0 : 00011111111111111111111111111111
demanded-bits from compiler: for %0 : 11111111111111111111111111111111

Let’s pause here for a second and assume that you don’t believe us. And why should you? This material can be counterintuitive and we make mistakes all the time, like anyone else. So here’s a simple program you can use to check this result using exhaustive testing instead of a solver. With a bit of effort you should be able to use it to double-check all of the results in this post.
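In that spirit, here is a minimal exhaustive checker in Python. It works on 8-bit values so the search stays tiny (the results above are for 32 bits, where exhaustive checking per bit would be far slower); the constants are scaled-down analogues of the examples, not the exact fragments from SPEC. A bit is demanded iff flipping it can change the function's output.

```python
def demanded_bits(f, width):
    """Exhaustive demanded-bits: an input bit is demanded iff flipping it
    changes f's output for at least one valuation of the other bits."""
    demanded = 0
    for i in range(width):
        if any(f(x) != f(x ^ (1 << i)) for x in range(1 << width)):
            demanded |= 1 << i
    return format(demanded, "0{}b".format(width))

# 8-bit analogues of two of the examples in this post:
print(demanded_bits(lambda x: (40 * x) & 0xFF, 8))  # mul by 40: 00011111
print(demanded_bits(lambda x: (x * x) & 0xFF, 8))   # squaring:  01111111
```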

It is also the case that some low bits of a number may not matter, when it is going to be divided by a constant:

%0:i32 = var
%1:i32 = udiv %0, 1000:i32

demanded-bits from Souper:   for %0 : 11111111111111111111111111111000
demanded-bits from compiler: for %0 : 11111111111111111111111111111111

This one is cute:

%0:i32 = var
%1:i32 = srem %0, 2:i32

demanded-bits from Souper:   for %0 : 10000000000000000000000000000001
demanded-bits from compiler: for %0 : 11111111111111111111111111111111

And so is this one:

%0:i32 = var
%1:i32 = mul %0, %0

demanded-bits from Souper:   for %0 : 01111111111111111111111111111111
demanded-bits from compiler: for %0 : 11111111111111111111111111111111

Another flavor of lost precision comes from comparison instructions:

%0:i8 = var
%1:i1 = ult %0, 8:i8

demanded-bits from Souper:   for %0 : 11111000
demanded-bits from compiler: for %0 : 11111111

This one is fun: only the sign bit matters:

%0:i8 = var
%1:i1 = slt %0, 0:i8

demanded-bits from Souper:   for %0 : 10000000
demanded-bits from compiler: for %0 : 11111111

The above represent some low-hanging fruit. We’re leaving out some UB-related imprecision since that stuff is fiddly to work with, and also a lot of imprecision that requires looking at multiple instructions together. (Unhappily, abstract transfer functions are non-compositional: even when a collection of maximally precise functions is applied to a collection of instructions, the result is often not maximally precise.) Here are two quick examples:

%0:i16 = var
%1:i16 = add 64:i16, %0
%2:i16 = and 448:i16, %1

demanded-bits from Souper:   for %0 : 0000000111000000
demanded-bits from compiler: for %0 : 0000000111111111

And second:

%0:i32 = var
%1:i1 = eq 0:i32, %0
%2:i32 = select %1, 1:i32, %0
%3:i1 = ne 0:i32, %2

demanded-bits from Souper:   for %0 : 00000000000000000000000000000000
demanded-bits from compiler: for %0 : 11111111111111111111111111111111

Here we can see that %2 can never be zero, since %2 takes the value of %0 when %0 is not zero, and takes the value 1 otherwise. This analysis result is particularly satisfying because the special case of no demanded bits means that some instructions become dead. In other words, as far as this little collection of instructions is concerned, there wasn’t any need to compute %0 in the first place, because every one of the 2^32 values that it could take results in %3 being true.

Which of these imprecisions are worth fixing? That is harder to answer. The potential benefit is obvious: it may allow additional optimizations to be performed. The costs of increasing precision include increased compile time, the overhead of maintaining the new code, and the risk of miscompilation if someone writes an unsound transfer function.

Testing Dataflow Analyses for Precision and Soundness

[This piece is co-authored by Jubi Taneja, Zhengyang Liu, and John Regehr; it summarizes some of the findings from a paper whose camera-ready copy we recently completed, and which will be published at CGO (Code Generation and Optimization) 2020.]

Update from Jan 12 2020: Looks like there’s a patch in the works fixing some imprecisions we found!

Optimizing compilers use dataflow analyses to compute properties that are true in every execution of the program being compiled. For example, if an array access is provably in-bounds in every execution, the bounds check can be elided. A dataflow analysis should be:

  • Sound: It must never erroneously claim that a property holds. Unsound dataflow analyses typically lead to miscompilations, for example by helping eliminate a bounds check that could fail.
  • Precise: When the property holds, the dataflow analysis should be able to see this. The halting problem tells us that this is not always possible, but we’d like the analysis to mostly work pretty well in practice.
  • Fast: We don’t want dataflow analysis to slow down the overall compilation very much.

Meeting these constraints isn’t easy, and moreover compiler developers generally attempt to do this in a seat-of-the-pants fashion, without any kind of help from formal-methods-based tools. Our paper is based on the premise that formal methods can help! Using an SMT solver, we can create dataflow analyses that are sound and maximally precise: they never erroneously claim that the property holds and they always conclude that the desired property holds when this can be done at all by the analysis in question. (But keep in mind these claims require an ideal SMT solver: solver bugs and solver timeouts respectively defeat them.)

Our work is only about soundness and precision: it does not help make dataflow analyses faster. Moreover, since it involves lots of solver calls, it is slow: we can use it offline to find places where the compiler might want to be improved, but it is not possible to just drop it into the compiler as a replacement for existing dataflow analyses. (We do have some preliminary work on synthesizing dataflow implementations, and will write about that some other time.)

The first dataflow analysis we’ll look at is one of LLVM’s workhorses: known bits. This analysis tries to prove that individual bits of SSA values are either 0 or 1 in all executions. Many cases handled by this analysis are easy to understand. For example, a value that has been zero-extended obviously contains zeroes in its high-order bits. Similarly, a value that has been orred with a bitmask will have some ones in it. Other cases are harder. As an exercise, you might try to work out the known bits in the result of a subtraction or a signed division.

Known bits has a variety of customers in LLVM middle-ends and backends. For example, the “simplify CFG” pass computes the known bits in the argument to a switch instruction in order to optimize away dead switch cases; a switch case is provably dead if at least one bit of its value conflicts with a known bit in the value being switched on.

Ok, it’s time for an example. Consider this LLVM IR which shifts the value four to the right by the number of bit positions specified in its argument value:

define i32 @foo(i32) {
  %a = lshr i32 4, %0
  ret i32 %a
}

As of January 2020, LLVM doesn’t return any known bits for %a. If you would like to see this for yourself, first build a fresh LLVM, then build this out-of-tree pass, which calls LLVM’s dataflow analysis and pretty-prints its results. Run it like so:

$ opt -load ./ -dataflow-info ../test/test1.ll -o /dev/null
function: foo
  %a = lshr i32 4, %0
    known: ????????????????????????????????
    isKnownToBePowerOfTwo = 0
    LVI: [0,5)
    SCEV: %a  full-set  full-set
    demanded: 11111111111111111111111111111111

It is easy to see that a better answer is possible: LLVM could have returned 00000000000000000000000000000???. In other words, the result will always have 29 zeroes in the high end. This is the best possible result: a sound implementation of known bits cannot prove anything about the last three bits. This is because, for each of these bits, there exists an execution in which the bit is zero and also an execution in which the bit is one.

Now let’s switch gears and look at a different dataflow analysis from LLVM: integer ranges. This analysis lives in the “Lazy Value Info” analysis pass and it computes one of four results for an SSA value; the one we care about right now is a “regular range” [a, b) where a is smaller (using unsigned comparison) than b. This result means that in every execution of the program, the SSA value will be at least a and also less than b, again using unsigned comparisons.

For example, consider this IR which zero-extends its 8-bit arguments to 64 bits, xors the extended values, and returns the result:

define i64 @foo(i8, i8) {
  %x = zext i8 %0 to i64
  %y = zext i8 %1 to i64
  %z = xor i64 %x, %y
  ret i64 %z
}

It is obvious that [0, 256) is the result we want for %z: no smaller range captures all of the possibilities, and there are no possible executions that require a larger range. LLVM, on the other hand, returns a special “full set” result for %z, meaning that it is unable to compute any useful information about it. This happens because the code for modeling the effect of binary operators on integer ranges does not handle xor at all.

In our paper we deal with some additional dataflow analyses, such as “demanded bits,” which is LLVM’s funny name for a don’t-care analysis that looks for bits whose value does not affect subsequent computations. Demanded bits is interesting because it propagates information backwards along SSA edges. We won’t cover demanded bits or any of the others in this post.

Ok, now let’s get to the good stuff: computing maximally precise dataflow results. We build on Souper, since it already knows how to ask a solver questions about LLVM code, but besides that we don’t make interesting use of Souper’s capabilities. The bad news is that we don’t know how to just plug in a formal definition of a dataflow analysis and get answers directly from that. Rather, we need to spend a bit of time studying the structure of each abstract domain that we target, in order to create a solver-based algorithm for computing maximally precise results reasonably efficiently.

For known bits, we can deal with each bit individually. For a given bit, we’ll first ask the solver whether there exists any valuation of the inputs that can force the bit to be zero. If this returns UNSAT, then we have just proven that this bit is one in every execution. If this query is SAT, we issue a second, analogous, query asking whether this bit can ever be one. If not, it is known to be zero. If neither query returns UNSAT, we know that this bit can be both 0 and 1, and we move on to the next bit. This procedure gives the most precise result using at most 2n solver queries (for a value n bits wide). The gist of the argument for maximal precision is that known bits are independent of each other. This is a nice result because a naive search of all elements of the known bits lattice would require 3^n solver calls, which is clearly infeasible.
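As a brute-force stand-in for those two solver queries, here is a Python sketch that answers “can this bit be 0?” and “can this bit be 1?” by exhaustive search over 8-bit inputs. Note one liberty taken: the shift amount is masked so every execution is defined, sidestepping LLVM’s rule that an out-of-range shift produces poison.

```python
def known_bits(f, width):
    """For each output bit, ask the two questions from the text by
    exhaustive search: can it ever be 0? can it ever be 1?"""
    out = []
    for i in reversed(range(width)):
        m = 1 << i
        can0 = any(f(x) & m == 0 for x in range(1 << width))
        can1 = any(f(x) & m != 0 for x in range(1 << width))
        out.append("?" if can0 and can1 else "0" if can0 else "1")
    return "".join(out)

# An 8-bit analogue of "lshr 4, %0", with the shift amount masked so
# that it always stays in range.
print(known_bits(lambda x: 4 >> (x & 7), 8))  # 00000???
```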

Looking back at the first example above, we can issue these commands:

$ llvm-as test1.ll
$ souper $SOUPER_SOLVER --infer-known-bits test1.bc 
; Listing valid replacements.
; Using solver: Z3 + internal cache

; Function: foo
%0:i32 = var
%1:i32 = lshr 4:i32, %0
infer %1
; knownBits from souper: 00000000000000000000000000000xxx

Integer ranges, it turns out, are more difficult. First, due to a “wrapped range” feature that we’ll not discuss here (again, see our paper) this abstract domain doesn’t form a lattice, so the definition of “maximal precision” becomes problematic. We pave over this issue by relaxing the notion of maximal precision a bit. Instead of insisting on the unique maximally precise result that would have to exist if the domain had been a lattice (but does not, since it isn’t), we’ll accept any result as maximally precise if there does not exist an integer range that excludes a larger number of values than is excluded by the given range. (If you find static analysis using non-lattice abstract domains interesting, this very nice paper contains more details.) The second problem with integer ranges is that there’s no obvious bit-level divide and conquer strategy for finding the maximally precise result. We’ll need a different approach.

Ignoring a special “empty set” value which indicates dead code or unconditional undefined behavior, we’re looking for an integer range [x, x+c) such that c is minimized. For example, we can start with c=1, corresponding to the hypothesis that there is a valid dataflow result [x, x+1), which excludes all integer values but one. In other words, this hypothesis corresponds to the SSA value being a constant, x. But how can we find the concrete value of x in a large search space such as 2^64? We do this using solver-based constant synthesis. If synthesis succeeds, we have found the maximally precise result. If no x exists such that [x, x+1) is a valid dataflow result, we launch a binary search to find the smallest c for which a suitable x does exist. This range will satisfy the maximal precision condition from the previous paragraph. The details of synthesizing an integer value meeting constraints don’t matter here as long as we have a procedure that works.
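Here is a brute-force analogue of that search in Python. It enumerates concrete output values instead of calling a solver, and it only tries candidate ranges starting at observed values, which suffices for a minimal covering range; the example function is a small 4-bit stand-in for the 8-bit-to-64-bit xor example above.

```python
def smallest_range(f, in_width, out_width):
    """Find [lo, lo+c) mod 2^out_width of minimal size c containing every
    value f can produce -- a stand-in for the solver-based constant
    synthesis plus binary search described above."""
    M = 1 << out_width
    vals = {f(x) for x in range(1 << in_width)}
    best = None
    for lo in vals:
        # Smallest c such that every value lands in [lo, lo+c) mod M.
        c = max((v - lo) % M for v in vals) + 1
        if best is None or c < best[1]:
            best = (lo, c)
    return best

def xor_of_zexts(x):
    """Zero-extend two 4-bit halves of x to 8 bits and xor them."""
    return (x & 0xF) ^ ((x >> 4) & 0xF)

print(smallest_range(xor_of_zexts, 8, 8))  # (0, 16)
```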

Going back to the second example:

$ llvm-as test2.ll
$ souper $SOUPER_SOLVER --infer-range test2.bc 

... a few irrelevant lines elided ...

; Function: foo
%0:i8 = var
%1:i64 = zext %0
%2:i8 = var
%3:i64 = zext %2
%4:i64 = xor %1, %3
infer %4
; range from souper: [0,256)

So in both of these cases we get the obvious maximally precise result. However, these examples are contrived. What code should we look at, when doing this work for real? We looked at all of the bitcode from compiling the C and C++ programs in SPEC CPU 2017, but really any large collection of LLVM IR should be fine. For each SSA value, we use our code and LLVM’s to compute dataflow results and then compare them. The possibilities are:

  • Our result is more precise than LLVM’s: lost precision, flag for human inspection.
  • LLVM’s result is more precise than ours: LLVM is unsound, flag for human inspection.
  • Results have the same precision: ignore.
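For known bits, that comparison can be sketched as follows (Python; the four-way classification is our own framing, and since our result is maximally precise, a conflict would pin a soundness bug on LLVM’s side):

```python
def compare_precision(ours, theirs):
    """Compare two known-bits strings over {'0','1','?'} for one value.
    Returns which side is strictly more precise, 'equal', 'incomparable',
    or 'conflict' (a definite soundness bug on one side)."""
    pairs = list(zip(ours, theirs))
    if any("?" not in (a, b) and a != b for a, b in pairs):
        return "conflict"
    ours_extra = any(a != "?" and b == "?" for a, b in pairs)
    theirs_extra = any(b != "?" and a == "?" for a, b in pairs)
    if ours_extra and theirs_extra:
        return "incomparable"
    if ours_extra:
        return "ours_more_precise"
    if theirs_extra:
        return "theirs_more_precise"
    return "equal"

print(compare_precision("00000???", "????????"))  # ours_more_precise
```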

Across the LLVM IR from compiling SPEC CPU 2017, we find many examples where LLVM’s dataflow analyses leave some precision on the floor. Is it worth making the analyses more precise? This is subjective, and can only be answered by compiler developers. Overall — despite the fact that we find a lot of imprecision — it looks to us like LLVM’s dataflow analyses are quite mature and provide a lot of precision where it matters. This would not have been the case around 2010, but these analyses have gotten a lot of tweaks since then. Even so, there is still plenty of room for improvement.

We thought that while doing this work we might find examples where LLVM’s dataflow result is “more precise” than our maximally precise one. This would indicate soundness errors in LLVM. We did not find any such examples, but we did re-introduce some old LLVM soundness bugs into the version we used for testing, in order to verify that we can automatically detect them. Details are in the paper.

Since our method requires concrete code fragments as the basis for comparing the precision of LLVM’s analyses and ours, we might ask: is this an advantage or a disadvantage? On the plus side, we’ll find imprecisions and soundness errors that occur when compiling real programs, and presumably these are the ones that should be a higher priority for fixing. On the other hand, we’ll miss cases not exercised by the test programs.

In summary, we have presented several solver-based algorithms for computing maximally precise dataflow results, with the goal of spotting soundness errors and lost precision in a real compiler, LLVM. This is part of a larger research program outlined here. The idea is to move compiler construction away from the current situation, where compilers are huge collections of bespoke code, and towards a situation where the mathematical foundations of compilation can be directly used to support their construction.

Helping Generative Fuzzers Avoid Looking Only Where the Light is Good, Part 1

Let’s take a second to recall this old joke:

A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, “this is where the light is.”

Using a generative fuzzer — which creates test cases from scratch, rather than mutating a collection of seed inputs — feels to me a lot like being the drunk guy in the joke: we’re looking for bugs that can be triggered by inputs that the generator is likely to generate, because we don’t have an obviously better option, besides doing some hard work in improving the generator. This problem has bothered me for a long time.

One way to improve the situation is swarm testing, which exploits coarse-grained configuration options in the fuzzer to reach diverse parts of the search space with higher probability. Swarm is a good idea in the sense that it costs almost nothing and seems to give good results, but it’s not remotely the end of the story. Today we’ll work on a different angle.

Let’s start out with a high-level view of the situation: we have a system under test, and its inputs come from a vast, high-dimensional space. If we confine ourselves to generating inputs up to N bits long, it is often the case that only a very small fraction of the 2^N possibilities is useful. (Consider testing Microsoft Word by generating uniform random bit strings of 100 KB. Very nearly 0% of these would be enough like a Word document to serve as an effective test case.) This is the reason generative fuzzers exist: they avoid creating invalid inputs. (Of course mutation-based fuzzers work on a related principle, but this piece isn’t about them.)

Now we can ask: out of the space of inputs to the system under test, which parts of the space will trigger bugs? I would argue that unless we are doing some sort of focused testing, for example to evaluate the hypothesis “formulas containing trig functions often make Excel fall over,” we should assume that bugs are distributed uniformly. (And even if we are restricting our attention to formulas containing a trig function, we probably want to assume that this subset of Excel formulas contains uniformly distributed bugs.) It follows that a good generative fuzzer should explore the space of inputs with uniform probability, or at least it should be capable of doing this. Another way to state the desired property is: if the fuzzer is capable of producing X different test cases, each of them should be generated with probability 1/X. I don’t know that this property is optimal in any meaningful sense, but I do believe it to be a good and reasonable design goal for a generative fuzzer. Alas, a non-trivial generative fuzzer is very unlikely to have this property by accident.

Why is a generative fuzzer not going to naturally emit each of its X outputs with probability 1/X? This follows from how these tools are implemented: as decision trees. A generative fuzzer makes a series of decisions, each determined by consulting a PRNG, on the way to emitting an actual test case. In other words, this kind of tool starts at the root of a decision tree and walks to a randomly chosen leaf. Choices made near the root of the decision tree are effectively given more weight than choices made near the leaves, making it possible for some leaves to have arbitrarily low probabilities of being emitted by a decision tree. Let’s consider this very simple decision tree:

If we use it to generate output a bunch of times, we’ll end up getting outputs with percentages something like this:

A 49.979
B 25.0018
C 12.509
D 6.26272
E 3.1265
F 3.12106
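Those numbers can be reproduced with a few lines of Python. (The shape of the tree is assumed here, since the figure isn’t reproduced: a chain in which each internal node flips a fair coin between emitting a leaf and descending, with the last node choosing between E and F.)

```python
import random
from collections import Counter

def generate(rng):
    """Walk a chain-shaped decision tree: 50% chance of stopping at each
    of A..D, then a final coin flip between E and F."""
    for leaf in "ABCD":
        if rng.random() < 0.5:
            return leaf
    return "E" if rng.random() < 0.5 else "F"

rng = random.Random(0)
n = 100_000
counts = Counter(generate(rng) for _ in range(n))
for leaf in "ABCDEF":
    print(leaf, 100.0 * counts[leaf] / n)
# Percentages come out near 50, 25, 12.5, 6.25, 3.125, 3.125.
```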

If you are thinking to yourself “this example is highly contrived” then please consider how structurally similar it is to the core of a simple LLVM IR generator that I wrote, which has been used to find a number of optimizer bugs. (At the moment it only works as an enumerator, I ripped out its randomized mode because of the problems described in this post.)

So what can we do about this? How can we hack our decision tree to generate its leaves uniformly? A few solutions suggest themselves:

  • Enumerate all leaves and then perform uniform sampling of the resulting list.
  • Skew the decision tree’s probabilities so that they are biased towards larger subtrees.
  • Structure the generator so that its decision tree is flattened, to stop this problem from happening in the first place.

Unfortunately, the first solution fails when the tree is too large to enumerate, the second fails when we don’t know subtree sizes, and the third presents software engineering difficulties. In Part 2 we’ll look at more sophisticated solutions. Meanwhile, if you’d like to play with a trivial generative fuzzer + leaf enumerator based on this decision tree, you can find a bit of code here.
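When subtree sizes can be computed, the second idea can be sketched quite simply. In this Python example the tuple-encoded tree is a made-up stand-in for a real generator’s decision structure; weighting each choice by the number of leaves beneath it makes every leaf equally likely.

```python
import random
from collections import Counter

# A leaf is a string; an internal node is a tuple of children.
tree = ("A", ("B", ("C", ("D", ("E", "F")))))

def leaf_count(t):
    """Count the leaves in a subtree."""
    return 1 if isinstance(t, str) else sum(leaf_count(c) for c in t)

def sample_uniform(t, rng):
    """Descend the tree, weighting each child by its number of leaves,
    so every leaf is emitted with probability 1/X."""
    while not isinstance(t, str):
        t = rng.choices(t, weights=[leaf_count(c) for c in t])[0]
    return t

rng = random.Random(0)
counts = Counter(sample_uniform(tree, rng) for _ in range(60_000))
print(counts)  # each of A..F appears roughly 10,000 times
```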

Finally I want to briefly explain why generative fuzzers are structured as decision trees: it’s because they are effectively inverted parsers. In other words, while a parser takes lexical elements and builds them into a finished parse tree, a generative fuzzer — using the same grammar — starts at the root node and recursively walks down to the leaves, resulting in test cases that always parse successfully.

Write Fuzzable Code

Fuzzing is sort of a superpower for locating vulnerabilities and other software defects, but it is often used to find problems baked deeply into already-deployed code. Fuzzing should be done earlier, and moreover developers should spend some effort making their code more amenable to being fuzzed.

This post is a non-comprehensive, non-orthogonal list of ways that you can write code that fuzzes better. Throughout, I’ll use “fuzzer” to refer to basically any kind of randomized test-case generator, whether mutation-based (afl, libFuzzer, etc.) or generative (jsfunfuzz, Csmith, etc.). Not all advice will apply to every situation, but a lot of it is sound software engineering advice in general. I’ve bold-faced a few points that I think are particularly important.

Invest in Oracles

A test oracle decides whether a test case triggered a bug or not. By default, the only oracle available to a fuzzer like afl is provided by the OS’s page protection mechanism. In other words, it detects only crashes. We can do much better than this.

Assertions and their compiler-inserted friends — sanitizer checks — are another excellent kind of oracle. You should fuzz using as many of these checks as possible. Beyond these easy oracles, many more possibilities exist, such as:

  • function-inverse pairs: does a parse-print loop, compress-decompress loop, encrypt-decrypt loop, or similar, work as expected?
  • differential: do two different implementations, or modes of the same implementation, show the same behavior?
  • metamorphic: does the system show the same behavior when a test case is modified in a semantics-preserving way, such as adding a layer of parentheses to an expression?
  • resource: does the system consume a reasonable amount of time, memory, etc. when processing an input?
  • domain specific: for example, is a lossily-compressed image sufficiently visually similar to its uncompressed version?
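As a concrete instance of the first kind, a compress-decompress oracle takes just a few lines (Python, with zlib standing in for whatever codec you are actually testing):

```python
import random
import zlib

def round_trip_ok(data: bytes) -> bool:
    """Function-inverse oracle: decompress(compress(x)) must equal x."""
    return zlib.decompress(zlib.compress(data)) == data

# Drive the oracle with random blobs; any assertion failure is a bug.
rng = random.Random(1)
for _ in range(1000):
    blob = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
    assert round_trip_ok(blob)
```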

Strong oracles are worth their weight in gold, since they tend to find application-level logic errors rather than the lower-level bugs that are typically caught by looking for things like array bounds violations.

I wrote a bit more about this topic a few years ago. Finally, a twitter user suggested “If you’re testing a parser, poke at the object it returns, don’t just check if it parses.” This is good advice.

Interpose on I/O and State

Stateless code is easier to fuzz. Beyond that, you will want APIs for taking control of state and for interposing on I/O. For example, if your program asks the OS for the number of cores, the current date, or the amount of disk space remaining, you should provide a documented method for setting these values. It’s not that we necessarily want to randomly change the number of cores, but rather that we might want to fuzz our code when set to single-core mode and then separately fuzz it in 128-core mode. Important special cases of taking control of state and I/O include making it easy to reset the state (to support persistent-mode fuzzing) and avoiding hidden inputs that lead to non-deterministic execution. We want as much determinism as possible while fuzzing our code.

Avoid or Control Fuzzer Blockers

Fuzzer blockers are things that make fuzzing gratuitously difficult. The canonical fuzzer blocker is a checksum included somewhere in the input: the random changes made to the input by a mutation-based fuzzer will tend to cause the checksum to not validate, resulting in very poor code coverage. There are basically two solutions. First, turn off checksum validation in builds intended for fuzzing. Second, have the fuzzer generate inputs with valid checksums. A generation-based fuzzer will have this built in; with a mutation-based fuzzer we would write a little utility to patch up the test case with a valid checksum after it is generated and before it is passed to the program being fuzzed. afl has support for this.
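Such a patch-up utility can be tiny. Here is a sketch for a hypothetical format whose last four bytes are a little-endian CRC32 of everything before them; the format is made up, so adapt the offset and checksum algorithm to your own input structure.

```python
import zlib

def fix_checksum(buf: bytearray) -> bytearray:
    """Recompute the trailing CRC32 so a mutated input still validates."""
    body = bytes(buf[:-4])
    buf[-4:] = zlib.crc32(body).to_bytes(4, "little")
    return buf

def validates(buf: bytes) -> bool:
    """The consumer's (hypothetical) checksum check."""
    return zlib.crc32(buf[:-4]).to_bytes(4, "little") == buf[-4:]
```

Run `fix_checksum` on each mutated test case just before handing it to the program under test.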

Beyond checksums, hard-to-satisfy validity properties over the input can be a serious problem. For example, if you are fuzzing a compiler for a strongly typed programming language, blind mutation of compiler inputs may not result in valid compiler inputs very often. I like to think of validity constraints as being either soft (invalid inputs waste time, but are otherwise harmless) or hard (the system behaves arbitrarily when processing an invalid input, so they must be avoided for fuzzing to work at all). When we fuzz a C++ compiler to look for wrong code bugs, we face a hard validity constraint because compiler inputs that have UB will look like wrong code bugs.

There is no simple, general-purpose solution to this kind of problem, but rather a family of techniques for explicitly taking validity properties into account. The most obvious solution — but often not the right one — is to write a new generational fuzzer. The problem is that if you do this, you cannot take advantage of modern coverage-driven fuzzing techniques, which are amazing. To fit into a coverage-driven fuzzing framework you have a couple of options. First, write a custom mutator that respects your validity constraints. Second, structure-aware fuzzing, which basically means taking the mutated data from the fuzzer and translating it into something like what the program being fuzzed expects to see. There’s a lot of research left to be done in making coverage-driven fuzzers work well in the presence of validity constraints without requiring a lot of manual effort. There are some significant subtleties here, maybe I’ll go into them another time.

Putting something like a SAT solver into the fuzzer is not, generally speaking, the answer here, first because some validity constraints like checksums are specifically difficult for solvers, and second because some validity constraints (such as UB-freedom in a C++ program) are implicit and cannot be inferred, even in principle, by looking at the system being fuzzed.

A lot of code in a typical system cannot be fuzzed effectively by feeding input to public APIs because access is blocked by other code in the system. For example, if you use a custom memory allocator or hash table implementation, then fuzzing at the application level probably does not result in especially effective fuzzing of the allocator or hash table. These kinds of APIs should be exposed to direct fuzzing. There is a strong synergy between unit testing and fuzzing: if one of these is possible and desirable, then the other one probably is too. You typically want to do both.

Sanitizers and fuzzers often require tweaks or even significant changes to the build process. To make this easier, keep the build process as clean and simple as possible. Make it easy to switch out the compiler and modify the compiler options. Depend on specific tools (and versions of tools) as little as possible. Routinely build and test your code with multiple compilers. Document special build system requirements.

Finally, some fuzzer blockers are sort of silly and easy to avoid. If your code leaks memory or terminates its process with a deep call stack, it will be painful to test using a persistent-mode fuzzer, so don’t do these things. Avoid handling SIGSEGV or, if you really must do this, have a way to disable the handler for fuzzer builds. If your code is not compatible with ASan or UBSan, then these extremely useful oracles are harder to use. In particular, if your code uses a custom memory allocator you should consider turning it off for fuzzer builds, or else adapting it to work with ASan, or else you’ll miss important bugs.

Unblock Coverage-Driven Fuzzers

Because coverage-driven fuzzers refocus their effort to try to hit uncovered branches, they can be blocked in certain specific ways. For example, if a coverage-driven fuzzer is presented with too many uncoverable branches, it can spend so much time on them that it becomes less likely to hit more interesting branches elsewhere in the program. In one experiment, I compared afl’s coverage on a program compiled with and without UBSan, and found that (within whatever time limit I used) it covered quite a lot less of the sanitized program than of the unsanitized build. On the other hand, we definitely want our fuzzer to look for sanitizer failures. My advice is to fuzz both sanitized and unsanitized builds of your program. I don’t know how to budget fuzzing resources for these different activities and don’t know of any principled work on that problem. It may not matter that much, since fuzzing is all about overkill.

Sometimes your program will call branchy, already-heavily-fuzzed code early in its execution. For example, you might decompress or decrypt input before processing it. This is likely to distract the coverage-driven fuzzer, causing it to spend a lot of time trying to fuzz the crypto or compression library. If you don’t want to do this, provide a way to disable crypto or compression during fuzzing.

Any interpreter in your program is likely to make life difficult for a coverage-driven fuzzer, since the relevant program paths are now encoded in the data being interpreted, which is generally opaque to the fuzzer. If you want maximum mileage out of coverage-driven fuzzers, you may want to try to avoid writing interpreters, or at least keep them extremely simple. An obvious way to deal with embedded interpreters — which someone must have thought of and tried, but I don’t have any pointers — would be to have an API for teaching the fuzzer how to see coverage of the language being interpreted.

Enable High Fuzzing Throughput

Fuzzing is most effective when throughput is very high; this seems particularly the case for feedback-driven fuzzers that may take a while to learn how to hit difficult coverage targets. An easy throughput hack is to make it possible to disable slow code (detailed logging, for example) when it is not germane to the fuzzing task. Similarly, interposing on I/O can help us avoid speed hacks such as running the fuzzer in a ramdisk.

“But I Want Fuzzing My Code to be Harder, Not Easier”

I don’t have a lot of sympathy for this point of view. Instead of aiming for security through obscurity, we would do better to:

  • fuzz early and thoroughly, eliminating fuzzable defects before releasing code into the wild
  • write code in a programming language with as strong of a type system as we are willing to tolerate — this will statically eliminate classes of bad program behaviors, for example by stopping us from putting the wrong kind of thing into a hashmap
  • aggressively use assertions and sanitizers to get dynamic checking for properties that the type system can’t enforce statically

Anti-fuzzing techniques are a thing, but I don’t think they represent a useful kind of progress towards better software.


Randomized testing is incredibly powerful and there’s no point avoiding it: if you don’t fuzz your code, someone else will. This piece has described some ways for you, the software developer, to make fuzzing work better. Of course there are plenty of other aspects, such as choosing a good corpus and writing a good fuzzer driver, that are not covered here.

Acknowledgments: Pascal Cuoq and Alex Groce provided feedback on a draft of this piece, and it also benefited from suggestions I received on Twitter. You can read the conversation here; it contains some suggestions and nuances that I did not manage to capture.

Design and Evolution of C-Reduce (Part 2)

Part 1 of this series introduced C-Reduce and showed how it combines a domain-independent core with a large collection of domain-specific passes in order to create a highly effective test-case reducer for C and C++ code. This part tells the rest of the story and concludes.

Parallel Test-Case Reduction

C-Reduce’s second research contribution is to perform test-case reduction in parallel using multiple cores. The parallelization strategy, based on the observation that most variants are uninteresting, is to speculate along the uninteresting branch of the search tree. Whenever C-Reduce discovers an interesting variant, all outstanding interestingness tests are killed and a new line of speculation is launched. This is the same approach that was subsequently rediscovered by the creator of halfempty.

C-Reduce has a policy choice between taking the first interesting variant returned by any CPU, which provides a bit of extra speedup but makes parallel reduction non-deterministic, or only taking an interesting variant when all interestingness tests earlier in the search tree have reported that their variants are uninteresting. We tested both alternatives and found that the speedup due to non-deterministic choice of variants was minor. Therefore, C-Reduce currently employs the second option, which always follows the same path through the search tree that non-parallel C-Reduce would take. The observed speedup due to parallel C-Reduce is variable, and is highest when the interestingness test is relatively slow. Speedups of 2-3x vs. sequential reduction are common in practice.
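The deterministic strategy can be sketched in a few lines. This is not C-Reduce’s actual implementation (which is in Perl and kills outstanding tests); it is a minimal illustration of the policy: run the interestingness test on the next several variants concurrently, betting that most will be uninteresting, but commit only to the earliest interesting one so the search follows the same path sequential reduction would take.

```python
# A minimal sketch (not C-Reduce's actual implementation) of speculating
# along the uninteresting branch of the search tree. Threads suffice here
# because a real interestingness test is an external process; results are
# consumed strictly in variant order to keep reduction deterministic.
from concurrent.futures import ThreadPoolExecutor

def speculate(variants, is_interesting, workers=4):
    variants = list(variants)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(is_interesting, variants)  # runs ahead, in parallel
        for variant, ok in zip(variants, results):
            if ok:
                # Earliest interesting variant wins; results from later
                # in-flight tests are simply discarded.
                return variant
    return None
```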

Applicability and Moving Towards Domain-Independence

Something that surprised us is how broadly applicable C-Reduce is to test cases in languages other than C and C++. Our users have found it effective in reducing Java, Rust, Julia, Haskell, Racket, Python, SMT-LIB, and a number of other languages. When used in this mode, the highly C/C++-specific C-Reduce passes provide no benefit, but since they fail quickly they don’t do much harm. (C-Reduce also has a --not-c command line option that avoids running these passes in the first place.)

One might ask why C-Reduce is able to produce good results for languages other than C and C++; the answer appears to be based on the common structural elements found across programming languages in the Algol and Lisp families. In contrast, in our limited experience, C-Reduce does a very poor job reducing test cases in binary formats such as PDF and JPEG. Other reducers, such as the afl-tmin tool that ships with afl-fuzz, work well for binary test cases.

A consequence of the modular structure of C-Reduce is that while its transformation passes are aimed at reducing C and C++ code, the C-Reduce core is completely domain-independent. C-Reduce has been forked to create a highly effective reducer for OpenCL programs. We believe it would be relatively straightforward to do something similar for almost any other programming language simply by tailoring the passes that C-Reduce uses.


When used for its intended purpose, when does C-Reduce work badly? We have seen two main classes of failure. First, C-Reduce can be annoyingly slow. This typically happens when the passes early in the phase ordering, which are intended to remove a lot of code quickly, fail to do this. Second, highly templated C++ sometimes forces C-Reduce to terminate with an unacceptably large (say, >1 KB) final result. Of course this is better than nothing, and subsequent manual reduction is usually not too difficult, but it is frustrating to have written 69 different Clang-based passes and to still find effectively irreducible elements in test cases. The only solution — as far as we know — is to strengthen our existing transformation passes and to create more such passes.

A small minority of compiler bugs appears to be impossible to trigger using a small test case. These bugs are exceptions to the small scope hypothesis. They typically stem from resource-full bugs in the compiler. For example, a bug in register spilling code requires the test case to use enough registers that spilling is triggered; a bug in logic for emitting long-offset jumps requires the test case to contain enough code that a long offset is required; etc. These test cases are just difficult to work with, and it is not clear to us that there’s anything we can do to make it easier to debug the issues that they trigger.

C-Reduce Design Principles

In summary, C-Reduce was designed and implemented according to the following principles:

  1. Be aggressive: make the final reduced test case as small as possible.
  2. Make the reducer fast, for example using parallelism, careful phase ordering of passes, and avoiding unnecessary I/O traffic, when this can be done without compromising the quality of the final output.
  3. Make it as easy as possible to implement new passes, so that many domain-specific passes can be created.
  4. Keep the C-Reduce core domain-independent.
  5. Focus only on reduction, delegating all other criteria to the user-supplied interestingness test.

Directions for Future Test-Case Reduction Research

Although perhaps a few dozen papers have been written about test-case reduction since Hildebrandt and Zeller’s initial paper 19 years ago, I believe that this area is under-studied relative to its practical importance. I’ll wrap up this article with a collection of research questions suggested by our experience in over a decade of work on creating a highly aggressive reducer for C and C++.

What is the role of domain-specificity in test case reduction? Researchers who cite C-Reduce appear to enjoy pointing out that it is very domain-specific (nobody seems to notice that the C-Reduce core is domain-independent, and that the pass schedule is easy to modify). The implication is that domain-specific hacks are undesirable and, of course, an argument against such hacks would be forceful if backed up by a test-case reducer that produced smaller final C and C++ code than C-Reduce does, without using domain knowledge. So far, such an argument has not been made.

The open research question is whether domain knowledge is actually necessary, or if a domain-independent test-case reducer can beat C-Reduce at its own game. The most impressive such effort that we are aware of is David MacIver’s structureshrink, which uses relatively expensive search techniques to infer structural elements of test cases that can be used to create variants. Anecdotally, we have seen structureshrink produce reduced versions of C++ files that are smaller than we would have guessed was possible without using domain knowledge.

What is the role of non-greedy search in test-case reduction? In many cases, the order in which C-Reduce runs its passes has little or no effect on the final test case. In other words, the search is often diamond-shaped, terminating at the same point regardless of the path taken through the search space. On the other hand, this is not always the case, and when the search is not diamond-shaped, a greedy algorithm like C-Reduce’s risks getting stopped at a local minimum that is worse than some other, reachable minimum. The research question is how to get the benefit of non-greedy search algorithms without making test-case reduction too much slower.

What other parallelization methods are there? C-Reduce’s parallelization strategy is pretty simple, gives a modest speedup in practice, and always returns the same result as sequential reduction. There must be other parallel test-case reduction strategies that hit other useful points in the design space. This is, of course, related to the previous research question. That is, if certain branchings in the search tree can be identified as being worth exploring in both directions, this could be done on separate cores or machines.

What is the role of canonicalization in test-case reduction? A perfectly canonicalizing reducer — that reduces every program triggering a given bug to the same reduced test case — would be extremely useful. Although it seems impossible to achieve this in all cases, there are many relatively simple strategies that can be employed to increase the degree of canonicalization, such as putting arithmetic expressions into a canonical form, assigning canonical names to identifiers, etc. C-Reduce has a number of transformations that are aimed at canonicalization rather than reduction. For example, the reduced test case at the top of part 1 of this piece has four variables a, b, c, and d, which appear in that order. I believe that more work in this direction would be useful.
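One canonicalization step, renaming identifiers to a, b, c, … in order of first appearance, is simple enough to sketch. C-Reduce does this kind of thing with Clang-based transformations; the regex version below is only a toy illustration and would mangle real C (strings, comments, macros).

```python
# A toy illustration of one canonicalization step: rename identifiers in
# order of first appearance so that two reduced test cases differing only
# in naming converge to identical text. Real C-Reduce uses Clang for this;
# this regex sketch ignores strings, comments, and most keywords.
import re
import string

C_KEYWORDS = {"int", "void", "for", "return", "if", "else", "while", "char"}

def canonicalize_names(src):
    mapping = {}
    def rename(m):
        name = m.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = string.ascii_lowercase[len(mapping)]
        return mapping[name]
    return re.sub(r"\b[A-Za-z_]\w*\b", rename, src)

print(canonicalize_names("int foo; void bar() { foo = 0; }"))
# prints: int a; void b() { a = 0; }
```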

Can we avoid bug hijacking? Test reduction sometimes goes awry when the bug targeted by the reduction is “hijacked” by a different bug. In other words, the reduced test case triggers a different bug than the one triggered by the original. During a compiler fuzzing campaign this doesn’t matter much since one bug is as good as another, but hijacking can be a problem when the original bug is, for example, blocking compilation of an application of interest. Hijacking is particularly common when the interestingness test is looking for a non-specific behavior such as a null pointer dereference. C-Reduce pushes the problem of avoiding hijacking onto the user who can, for example, add code to the interestingness test looking for specific elements in a stack trace. The research question here is whether there are better, more automated ways to prevent bug hijacking.

Obtaining C-Reduce

Binary C-Reduce packages are available as part of software distributions including Ubuntu, Fedora, FreeBSD, OpenBSD, MacPorts, and Homebrew. Source code can be found here.


C-Reduce was initially my own project, but by lines of code the largest contributor by a factor of two is my former student Yang Chen, now a software engineer at Microsoft. Yang wrote effectively all of the Clang-based source-to-source transformation code, more than 50,000 lines in total. Eric Eide, a research professor in computer science at the University of Utah, is the other major C-Reduce contributor (and gave useful feedback on a draft of this piece). Our colleagues Pascal Cuoq, Chucky Ellison, and Xuejun Yang also contributed to the project, and we have gratefully received patches from a number of external contributors. Someone created a fun visualization of the part of C-Reduce’s history that happened on Github.

Design and Evolution of C-Reduce (Part 1)

[This piece is posted in parallel on the IEEE Software blog. Karim Ali copyedited.]

Since 2008, my colleagues and I have developed and maintained C-Reduce, a tool for programmatically reducing the size of C and C++ files that trigger compiler bugs. C-Reduce also usually does a credible job reducing test cases in languages other than C and C++; we’ll return to that later.

Why Reduce Test Cases?

Here’s a typical C-Reduce output, as found in the LLVM bug database:

int a[1];
int b;
void c() {
  void *d = b = 0;
  for (;; b++)
    a[b] = (int) d++;
}

Compiling this code at the -O2 optimization level causes LLVM to crash. The bug report doesn’t contain the original, unreduced test case, but most likely it was larger.

A reduced test case is preferable because:

  • it usually gets the compiler to misbehave quickly, reducing the number of function calls, memory allocations, etc. that the compiler developer has to step through and reason about while debugging
  • the reduced file contains little code not directly related to triggering the bug, reducing the likelihood that compiler developers will be distracted by extraneous features of the test case
  • reduced test cases for the same bug often look similar to each other, whereas this is not normally true for unreduced files that trigger the same bug
  • there is often little or no discernible similarity between an unreduced source file and its reduced version, making it easier for compiler bugs triggered by proprietary code to be reported externally

The minimum-sized input triggering any particular compiler bug can be found using a trivial algorithm: exhaustive search of text strings of increasing size. This is, of course, almost always intractable. In practice, test case reduction proceeds in the other direction: starting with a large, failure-inducing test case, incrementally making it smaller until a local minimum is reached.

A Bit of Background

The history of automated test case reduction does not seem to be well documented, but several examples can be found in software testing papers from the 1990s, such as Differential Testing for Software and Massive Stochastic Testing of SQL. Test-case reduction was first studied in its own right in 2000 when Hildebrandt and Zeller introduced Delta Debugging: a general-purpose technique for test case reduction. Their algorithm uses a greedy search where a series of “variants” (our term, not theirs, for partially-reduced test case candidates) is produced by removing chunks of the test case. As reduction progresses, the chunk size is reduced, until it reaches some minimum-sized unit, such as a single line, token, or character. When no minimum-sized chunk can be removed from the test case without breaking the property that it triggers the bug, the Delta Debugger terminates. Almost all subsequent test-case reduction work, including ours, builds upon this work.
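The greedy chunk-removal idea can be captured in a short sketch. This is the reduced essence of the approach, not a faithful reimplementation of the published ddmin algorithm (which also tries complements and partitions differently): remove chunks wherever possible, and halve the chunk size whenever a full pass makes no progress.

```python
# A simplified sketch in the spirit of Hildebrandt and Zeller's delta
# debugging: greedily remove chunks of the test case, halving the chunk
# size whenever no chunk can be removed, down to single-element granularity.
def reduce(test_case, is_interesting):
    assert is_interesting(test_case)
    chunk = len(test_case) // 2
    while chunk >= 1:
        i, progress = 0, False
        while i < len(test_case):
            variant = test_case[:i] + test_case[i + chunk:]
            if variant and is_interesting(variant):
                test_case = variant     # keep the smaller variant
                progress = True
            else:
                i += chunk              # this chunk was needed; move on
        if not progress:
            chunk //= 2                 # refine granularity
    return test_case
```

For example, reducing the string "xxaxxbxx" while requiring that it contain both "a" and "b" converges to "ab".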

Towards C-Reduce

I became interested in test case reduction when my colleagues and I began to find a lot of bugs in C compilers using random testing. We found so many bugs that reporting them became bottlenecked on reducing the bug-triggering programs. Since I was the one reporting the bugs we found, I was the one who felt the pain of manual test-case reduction, and it quickly got old. I eventually reported around 500 compiler bugs and I could not have done this without first creating C-Reduce.

At the time, the best open-source implementation of Delta Debugging, from UC Berkeley, was line-based and contained a significant innovation over the original algorithm: it could reorganize a file in such a way that all nested curly braces deeper than a configurable level would appear on a single line. Thus, at level zero, entire functions would be placed on the same line, enabling the line-based reducer to remove an entire function at once. At higher nesting levels, functions would be split across lines, enabling finer-grained reduction. This worked well but the tool ended up being inadequate for my needs: it got stuck at local minima that were often orders of magnitude larger than what could be achieved when reducing by hand.

The limiting factor in this existing tool (“Delta” from now on) was obvious: it was not able to exploit enough of the structure of the file being reduced. For example, it could usually not do much to simplify arithmetic expressions. These sorts of simplifications tend to have a cascading effect: eliminating the last use of a variable allows its definition to be eliminated, etc. The obvious path forward was to write a new tool that solved a reduction problem that Delta could not solve, and then to alternate running this tool and Delta until a global fixpoint was reached. I did this, adding more and more reduction techniques over time. At some point I implemented a line-elimination pass in my new reducer, at which point Delta was subsumed and could be dropped.

We ended up keeping two elements of Delta’s design. First, the configurable hierarchical reformatting of a test case based on curly brace nesting. This technique, followed by removing contiguous chunks of code, is still one of C-Reduce’s most useful first lines of attack on a test case. Second, Delta’s mechanism for determining whether a given variant is “interesting.” An interesting variant is used as the basis for further reduction steps; an uninteresting variant is a dead end, and is discarded. Delta determined interestingness by invoking a user-supplied program — typically a shell script — whose process exit code determines the interestingness of the current variant. The flexibility afforded by this small element of user extensibility ends up being extremely useful. For example, the interestingness test can discard test cases that trigger certain compiler warnings, it can attempt to disambiguate different crash bugs, etc.
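A crash-bug interestingness test of this flavor might look like the following sketch, which follows the convention (inherited from Delta) that exit status 0 means “interesting.” The compiler path and crash signature are placeholders, not real values.

```python
# A sketch of a typical crash-bug interestingness test. Exit status 0
# means "interesting". The compiler path and crash signature below are
# placeholders for illustration.
import subprocess
import sys

COMPILER = "./bin/clang"                       # placeholder path
SIGNATURE = b"Assertion `isVirtualRegister"    # placeholder crash signature

def is_interesting(returncode, stderr):
    # Require both a non-zero exit AND a matching message, so that a
    # different crash (or an ordinary compile error) is not mistaken
    # for the bug being reduced.
    return returncode != 0 and SIGNATURE in stderr

def run_test():
    proc = subprocess.run(
        [COMPILER, "-O2", "-c", "variant.c", "-o", "/dev/null"],
        capture_output=True, timeout=60)
    sys.exit(0 if is_interesting(proc.returncode, proc.stderr) else 1)
```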

It is more challenging to reduce test cases that cause the compiler to emit incorrect object code than it is to reduce test cases that merely cause the compiler to crash. C-Reduce itself is agnostic about the character of the bug of interest: we push all of the difficulties in reducing miscompilation triggers into the interestingness test, which should try to answer questions such as:

  • is the variant well-defined by the C or C++ standard?
  • does the variant avoid depending on behaviors that are unspecified by the C or C++ standard?
  • does the buggy compiler turn the variant into an executable?
  • does this executable terminate within a specified time?
  • does the reference compiler (assumed to not contain the bug of interest) turn the variant into an executable?
  • does this executable also terminate within a specified time?
  • does the behavior of the two executables differ in a way that indicates that a miscompilation occurred?

The variant is interesting if the answer to all of these questions is “yes.”
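The question sequence above amounts to a short-circuiting chain. The sketch below shows that shape; the tool names are placeholders, and the `run` callable is injectable only so the logic can be exercised without real compilers.

```python
# A sketch of the miscompilation interestingness test as a short-circuit
# chain, ordered so cheap, likely-to-fail checks run first. Tool names
# are placeholders for illustration.
import subprocess
from subprocess import CompletedProcess

def real_run(cmd, timeout=30):
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # a hang counts as failure

def is_interesting(variant, run=real_run):
    for cmd in (
        ["./ub-checker", variant],                     # well-defined?
        ["./buggy-cc", "-O2", variant, "-o", "a.out"], # buggy compiler compiles?
        ["./ref-cc", "-O0", variant, "-o", "b.out"],   # reference compiles?
    ):
        r = run(cmd)
        if r is None or r.returncode != 0:
            return False
    a, b = run(["./a.out"]), run(["./b.out"])
    if a is None or b is None:
        return False                     # an executable hung
    return a.stdout != b.stdout          # observable difference => miscompilation
```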

The hardest part of reducing programs that trigger miscompilation bugs is ensuring that variants avoid undefined behaviors (such as invalid pointer operations) and do not rely on unspecified behaviors (such as the order of evaluation of function arguments). A test case doing one of these things is ill-formed and can accomplish nothing beyond annoying compiler developers. Empirically, if undefined behavior is not actively avoided during test-case reduction, C-Reduce will almost certainly introduce it. The practical solution is to employ suitable static and dynamic analysis tools, in order to avoid these ill-formed variants. Since no single tool detecting all undefined behaviors in C and C++ programs exists, a hybrid approach involving multiple tools is typically used in practice. This is not completely satisfying, but it works well enough that C-Reduce can reliably produce useful reduced test cases for miscompilation bugs in C and C++ compilers.

Writing good interestingness tests for miscompilations takes a bit of practice. More than one user has described C-Reduce as being something like the sorcerer’s apprentice: it does an excellent job reducing according to the criteria it is given, but if these criteria contain any kind of loophole, C-Reduce is likely to find it. For example it is easy to accidentally write a test which believes that the empty file is interesting. Additionally, a good interestingness test is written in such a way that its expected-case runtime is minimized by asking the quickest and most-likely-to-fail questions first.

From the start, C-Reduce was designed to produce a very small final reduced test case, even when this would make C-Reduce run for longer than we liked. This is based on the premise that we should burn cycles instead of human time, and that reporting a compiler bug is rarely on the critical path; we can often afford to wait for a better result. The consequences of this decision can be seen in Tables 1 and 2 of this paper that evaluates several test-case reduction methods: C-Reduce produces the smallest final output, but takes more time to do so.

A Modular, Domain-Independent Reducer Core

Although C-Reduce started out as a pet project solving a specific problem, it evolved into a research project involving a number of my colleagues, whose top-level goal was to produce an effective and usable reducer for C and C++ code as found in the wild. The first research contribution to come out of this effort was a way to achieve a clean mechanism/policy separation in a test case reducer. Previous reduction techniques had all baked specific transformations into the overall search strategy. This impedes extensibility, which we believe to be crucial. The structure that we ended up with is a small core that invokes a collection of pluggable transformation passes until a global fixpoint is reached.

The API for C-Reduce passes is simple but — like many simple things — required a lot of iterations before it felt finished. It is based on the ideas that transformation passes should be stateless and that every pass should implement a linear sequence of transformations, each of which results in a variant that may or may not be interesting. The interface is as follows:

state new(filename, option) : Return a fresh state object. Each pass uses this state to keep track of where it is in the sequence of transformations that it is capable of performing. These states may contain arbitrary data items; the C-Reduce core treats them as opaque. A typical pass stores some kind of cursor — a byte offset, token offset, line number, position in a tree traversal, etc. — in the state object.

The file referred to by filename is logically part of the state object even though it resides in the filesystem instead of memory. Of course it would not be difficult to pass the contents of the file around as a memory object but this would be slow when these objects are large (C-Reduce is frequently invoked on multi-megabyte preprocessed C++ files).

The “option” is used to select among different behaviors implemented by a composite pass.

state advance(filename, option, state) : Return a new state object referring to the next transformation opportunity following the one referenced by the state object passed as a parameter.

result transform(filename, option, state) : Modify the file in-place, selecting the transformation instance referred to by the state object. The result takes one of three values:

  • OK : the transformation succeeded
  • STOP : no more transformation instances remain for this pass
  • ERROR : something went wrong; for example, an external tool crashed, a working file or directory could not be created, etc.

(The API contains one additional method, which checks whether a pass’s external dependencies are satisfied, that doesn’t matter here.)

Our experience has been that every transformation pass that we wanted has been easy to implement behind this API.
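To make the API concrete, here is a sketch of a line-removal pass behind it, written in Python for illustration (C-Reduce’s own passes are Perl and C++). The state object is just a cursor; the file on disk is the rest of the state.

```python
# A sketch of a line-removal pass behind the C-Reduce pass API described
# above. The state object is a line index; "option" is unused here.
# Result strings stand in for the OK/STOP/ERROR values.

def new(filename, option):
    return 0                      # cursor: index of next line to try removing

def advance(filename, option, state):
    return state + 1

def transform(filename, option, state):
    with open(filename) as f:
        lines = f.readlines()
    if state >= len(lines):
        return "STOP"             # no more transformation instances remain
    del lines[state]              # the transformation: drop one line
    with open(filename, "w") as f:
        f.writelines(lines)
    return "OK"
```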

The C-Reduce core implements this algorithm:

current = original_test_case
do
  size_at_start = size(current)
  foreach (p, option) in pass_list
    state = p::new(current, option)
    do
      variant = current         // this is a file copy operation
      result = p::transform(variant, option, state)
      if result == ERROR
        report_problem_in_pass(p, option)
      if result == OK
        if is_interesting(variant)
          current = variant     // also a file copy
        else
          state = p::advance(current, option, state)
    while result == OK
while size(current) < size_at_start

The termination argument for C-Reduce is:

  1. Since the outermost loop requires the size of the test case to decrease monotonically, it can only execute as many times as the size (in bytes) of the unreduced test case. In practice it executes many fewer times than this.
  2. The loop over passes terminates because the pass list is immutable once C-Reduce is initialized.
  3. Each iteration of the innermost loop either advances the state object or else (by selecting an interesting variant) removes one transformation opportunity. Either way, the number of remaining transformations decreases by one.
  4. The interestingness test is, at worst, terminated (using OS support for killing all processes in a process group) after a configurable timeout.

In practice the weak link in this argument is case 3, which is vulnerable to bugs in passes. C-Reduce terminates robustly by abandoning passes when they appear to be behaving unreasonably.

The C-Reduce core does not insist that transformations make the test case smaller, and in fact quite a few of its passes potentially increase the size of the test case, with the goal of eliminating sources of coupling within the test case, unblocking progress in other passes.

The sequence of transformation passes is carefully orchestrated such that passes that are likely to give the biggest wins, such as those that remove entire functions, run first; otherwise the tool would end up spending days or weeks doing silly things such as trying to shrink numeric constants in a huge source file. Shrinking numbers is useful, and it should be done, but only after many other reduction mechanisms have run to completion.

C-Reduce's collection of cooperating passes, with heavy phase-ordering constraints, is highly reminiscent of how a modern optimizing compiler works. However, only a small proportion of the transformation passes is intended to be semantics-preserving in the sense that a compiler's optimization passes must be. In this domain, we only want to preserve enough semantics that we can probabilistically avoid breaking whatever property makes a test case interesting.

A consequence of writing a modular reducer is that once we came up with the right API for writing passes, we were free to write a lot of passes. My colleagues and I spent several years doing this and we ended up with:

  • 35 passes, implemented in Perl, that include heuristics such as removing lines, removing various kinds of matched delimiters (and perhaps also the text between them), and shrinking integer values
  • 6 passes that invoke external utilities such as unifdef, a partial evaluator for the C preprocessor language, a lexer for C and C++ that supports various token-level reduction transformations, and pretty-printing utilities that make the reduced test case more pleasant to look at
  • 69 passes, implemented in C++, that use LLVM's Clang front end as a library for source-to-source transformation of C and C++ code; these include function inlining, partial template instantiation, scalar replacement of aggregates, copy propagation, and eliminating levels of a class hierarchy.

The actual number of dynamic passes is larger than the total of these numbers since some passes can be invoked in different modes using the "option" parameter mentioned above.

In this piece, we looked at why we had to create C-Reduce and at the modular structure that was key to making it solve the problems that we wanted to solve. In Part 2, I'll describe how C-Reduce improves reduction times using multiple cores and why C-Reduce usually does a good job reducing test cases in languages other than C and C++; finally, I'll discuss a few open research problems in test case reduction.

It’s Time for a Modern Synthesis Kernel

Alexia Massalin’s 1992 PhD thesis has long been one of my favorites. It promotes the view that operating systems can be much more efficient than then-current operating systems via runtime code generation, lock-free synchronization, and fine-grained scheduling. In this piece we’ll only look at runtime code generation, which can be cleanly separated from the other aspects of this work.

Runtime Code Generation in Ring 0

The promise of kernel-mode runtime code generation is that we can have very fast, feature-rich operating systems by, for example, not including code implementing generic read() and write() system calls, but rather synthesizing code for these operations each time a file is opened. The idea is that at file open time, the OS has a lot of information that it can use to generate highly specialized code, eliding code paths that are provably not going to execute.
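The flavor of this can be illustrated with a sketch, using Python closures as a stand-in for generated machine code (Synthesis emitted real 68020 code, and a modern kernel would JIT native code). At “open” time the file’s properties are known, so the checks they control can be resolved once and elided from the per-call fast path. All names here are invented.

```python
# Illustration only: open-time specialization of a write path. Properties
# known at open time (writability, append mode) are decided once, and the
# returned function is the branch-free fast path for that descriptor.

def generic_write(file, buf):
    if not file["writable"]:
        return -1                  # stand-in for EBADF
    if file["append"]:
        file["pos"] = file["size"]
    file["data"][file["pos"]:file["pos"] + len(buf)] = buf
    file["pos"] += len(buf)
    file["size"] = max(file["size"], file["pos"])
    return len(buf)

def specialize_write(file):
    # "Code generation" at open time: pick a variant with the permission
    # and append checks already resolved.
    if not file["writable"]:
        return lambda buf: -1      # this descriptor can never be written
    if not file["append"]:
        def fast_write(buf):       # no checks left on the hot path
            pos = file["pos"]
            file["data"][pos:pos + len(buf)] = buf
            file["pos"] = pos + len(buf)
            file["size"] = max(file["size"], file["pos"])
            return len(buf)
        return fast_write
    return lambda buf: generic_write(file, buf)
```

A real kernel would cache the generated code with the file descriptor, which is exactly the self-invalidating cache discussed below for Linux.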

Runtime code generation was a well-known idea in 1992, but it wasn’t used nearly as widely as it is today. In 2019, of course, just-in-time compilers are ubiquitous. However, operating system kernels still do not use runtime code generation very much, with a few exceptions such as:

  • several OS kernels, including Linux, have a simple JIT compiler in their BPF implementation
  • VMware used to use dynamic code generation to performantly virtualize OS kernels on x86 chips that lacked hardware virtualization extensions; I doubt that this is commonly used any longer
  • pre-NT Windows kernels would dynamically generate bitblit code. I learned this in a talk by a VMware employee; this code generation was apparently a debugging issue for VMware since it would fight with their own runtime code generator. Some details can be found in this post. The old paper about the origins of this technique in the Xerox Alto is a classic.
  • TempleOS, as explained in this nice writeup, made heavy use of dynamic code generation

Anyway, back to Synthesis. The OS and code generators were all written, from scratch, in 68020 assembly language. How do we translate Massalin’s ideas to 2019? Most likely by reusing an existing code generator and OS. For most of this piece I’ll assume that that’s what we want to do, but we’ll also briefly touch on customized alternatives.

Code Generator Requirements

The particular technology that we use for runtime code generation isn’t that important, but for now let’s imagine using LLVM. This means that the pieces of the kernel that we wish to specialize will need to be shipped as bitcode, and then we’ll ask LLVM to turn it into object code as needed. LLVM has lots of great optimization passes, from which we could pick a useful subset, and it is not hard to use in JIT mode. On the other hand, LLVM isn’t as fast as we’d like and also it has a large footprint. In production we’d need to think carefully whether we wanted to include a big chunk of non-hardened code in the kernel.

What optimizations are we expecting the code generator to perform? Mostly just the basic ones: function inlining, constant propagation, and dead code elimination, followed by high-quality instruction selection and register allocation. The hard part, as we’re going to see, is convincing LLVM that it is OK to perform these optimizations as aggressively as we want. This is an issue that Massalin did not need to confront: her kernel was designed in such a way that she knew exactly what could be specialized and when. Linux, on the other hand, was obviously not created with staged compilation in mind, and we’re going to have to improvise somewhat if we want this to work well.

My guess is that while LLVM would be great for prototyping purposes, for deployment we’d probably end up either reusing a lighter-weight code generator or else creating a new one that is smaller, faster, and more suitable for inclusion in the OS. Performance of runtime code generation isn’t just a throughput issue, there’ll also be latency problems if we’re not careful. We need to think about the impact on security, too.

Example: Specializing write() in Linux

Let’s assume that we’ve created a version of Linux that is capable of generating a specialized version of the write() system call for a pipe. This OS needs — but we won’t discuss — a system call dispatch mechanism to rapidly call the specialized code when it is available. In Synthesis this was done by giving each process its own trap vector.

Before we dive into the code, let’s be clear about what we’re doing here: we are pretending to be the code generator that is invoked to create a specialized write() method. Probably this is done lazily at the time the system call is first invoked using the new file descriptor. The specialized code can be viewed as a cached computation, and as a bonus this cache is self-invalidating: it should be valid as long as the file descriptor itself is valid. (But later we’ll see that we can do a better job specializing the kernel if we support explicit invalidation of runtime-generated code.)

If you want to follow along at home, I’m running Linux 5.1.14 under QEMU, using these instructions to single-step through kernel code, and driving the pipe logic using this silly program.

Skipping over the trap handler and such, ksys_write() is where things start to happen for real:

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}
At this point the “fd” parameter can be treated as a compile-time constant, but of course “buf” and “count” cannot. If we turn “fd” into a constant, will LLVM be able to propagate it through the remaining code? It will, as long as:

  1. We inline all function calls.
  2. Nobody takes the address of “fd”.

It’s not that calls and pointers will always block the optimizer, but they complicate things by bringing interprocedural analysis and pointer analysis into the picture.

Our goal is going to be to see whether the code generator can infer the contents of the struct returned from fdget_pos(). (You might wonder why performance-sensitive code is returning a “struct fd” by value. Turns out this struct only has two members: a pointer and an integer.)

The call to fdget_pos() goes to this code:

static inline struct fd fdget_pos(int fd)
{
	return __to_fd(__fdget_pos(fd));
}

and then here:

unsigned long __fdget_pos(unsigned int fd)
{
	unsigned long v = __fdget(fd);
	struct file *file = (struct file *)(v & ~3);

	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
		if (file_count(file) > 1) {
			v |= FDPUT_POS_UNLOCK;
			mutex_lock(&file->f_pos_lock);
		}
	}
	return v;
}

and then (via a trivial helper that I’m not showing) here:

static unsigned long __fget_light(unsigned int fd, fmode_t mask)
{
	struct files_struct *files = current->files;
	struct file *file;

	if (atomic_read(&files->count) == 1) {
		file = __fcheck_files(files, fd);
		if (!file || unlikely(file->f_mode & mask))
			return 0;
		return (unsigned long)file;
	} else {
		file = __fget(fd, mask, 1);
		if (!file)
			return 0;
		return FDPUT_FPUT | (unsigned long)file;
	}
}

Keep in mind that up to here, we haven’t seen any optimization blockers. In __fget_light(), we run into our first interesting challenge: “current” is a macro that returns a pointer to the running process’s PCB (in Linux the PCB, or process control block, is a “task_struct” but I’ll continue using the generic term). The current macro ends up being a tiny bit magical, but its end result can be treated as a constant within the context of a given process. There is no way a code generator like LLVM will be able to reach this conclusion on its own, so we’ll need to give it some help, perhaps by annotating certain functions, macros, and struct fields as returning values that are constant over a given scope. This is displeasing, but it isn’t clear there’s any easier or better way to achieve our goal here. The best we can hope for is that the annotation burden is close to proportional to the number of data types in the kernel; if it ends up being proportional to the total amount of code, then our engineering effort goes way up.

Now, assuming that we can treat “current” as a compile-time constant, we’re immediately faced with a similar question: is the “files” field of the PCB constant? It is (once the process is initialized) but again there’s not going to be any easy way for our code generator to figure this out; we’ll need to rely on another annotation.

Continuing, the “count” field of files is definitely not a constant: this is a reference count on the process’s file descriptor table. A single-threaded Linux process will never see count > 1, but a multi-threaded process will. (Here we need to make the distinction between open file instances, which are shared following a fork, and the file descriptor table, which is not.) The fast path here is exploiting the insight that if our process is single-threaded we don’t need to worry about locking the file descriptor table, and moreover the process is not going to stop being single-threaded during the period where we rely on that invariant, because we trust the currently running code to not do the wrong thing.

Here our specializing compiler has a fun policy choice to make: should it specialize for the single threaded case? This will streamline the code a bit, but it requires the generated code to be invalidated later on if the process does end up becoming multithreaded — we’d need some collection of invalidation hooks to make that happen.

Anyhow, let’s continue into __fcheck_files():

static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
{
	struct fdtable *fdt = rcu_dereference_raw(files->fdt);

	if (fd < fdt->max_fds) {
		fd = array_index_nospec(fd, fdt->max_fds);
		return rcu_dereference_raw(fdt->fd[fd]);
	}
	return NULL;
}

At this point we’re in deep “I know what I’m doing” RCU territory and I’m going to just assume we can figure out a way for the code generator to do what we want, which is to infer that this function returns a compile-time-constant value. I think this’ll work out in practice, since even if the open file instance is shared across processes, the file cannot be truly closed until its reference count goes to zero. Anyway, let’s move forward.

Next, we’re back in __fget_light() and then __fdget_pos(): our code generator should be able to easily fold away the remaining branches in these functions. Finally, we return to line 4 of ksys_write() and we know what the struct fd contains, making it possible to continue specializing aggressively. I don’t think making this example any longer will be helpful; hopefully the character of the problems we’re trying to solve is now apparent.

In summary, we saw four kinds of variables in this exercise:

  1. Those such as the “fd” parameter to write() that the code generator can see are constant at code generation time.
  2. Those such as the “current” pointer that are constant, but where the code generator cannot see this fact for one reason or another. To specialize these, we’ll have to give the compiler extra information, for example using annotations.
  3. Those such as the “count” field of the “files_struct” that are not actually constant, but that seem likely enough to remain constant that we may want to create a specialized version treating them as constants, and then be ready to invalidate this code if the situation changes.
  4. Those that are almost certainly not worth trying to specialize. For example, the “count” parameter to write() is not likely to remain constant over a number of calls.

Writing one byte to a pipe from a single-threaded process executes about 3900 instructions on Linux 5.1.14 (this is just in ksys_write(), I didn’t measure the trapping and untrapping code). The Synthesis thesis promises an order of magnitude performance improvement. Can specialization reduce the fast path on this system call to 390 instructions? It would be fun to find out.

I’ll finish up this example by observing that even though I chose to present code from the filesystem, I think it’s network stack code that will benefit from specialization the most.


I have some experience with OS kernels other than Linux, and my belief is that attempting to dynamically specialize any mainstream, production-grade OS other than Linux would run into the same issues we just saw above. At the level the code generator cares about, there just isn’t much effective difference between these OSes: they’re all big giant blobs of C with plentiful indirection and domain-specific hacks.

If our goal is only to create a research-grade prototype, it would be better to start with something smaller than Linux/Windows/Darwin so that we can refactor specialization-unfriendly parts of the OS in a reasonable amount of time. xv6 is at the other extreme: it is super easy to hack on, but it is so incredibly over-simplified that it could not be used to test the hypothesis “a realistic OS can be made much faster using specialization.” Hilariously, an xv6+LLVM system would be about 0.15% OS code and 99.85% compiler code. Perhaps there’s a middle ground that would be a better choice, Minix or OpenBSD or whatever.

Given two developers, one who knows LLVM’s JIT interfaces and one who’s a good Linux kernel hacker, how long would it take to bring up a minimally ambitious dynamically specializing version of Linux? I would guess this could be done in a week or two; there’s not really anything too difficult about it (it’s easy to say this while blogging, of course). The problem is that this would not give good results: only the very easiest specialization opportunities will get spotted by the runtime code generator. But perhaps this would generate enough interest that people would keep building on it.

Do we want to do specialization work on C code? No, not really; it’s just that every one of our production-grade kernels is already written in it. A fun but engineering-intensive alternative would be to create a new, specialization-friendly kernel in whatever programming language looks most suitable. Functional languages should offer real advantages here, but of course there are issues in using these languages to create a performant OS kernel. Perhaps Mirage is a good starting point: it is already all about specialization — but at system build time, not at runtime.

An ideal programming environment for a modern Synthesis kernel would provide tool and/or language support for engineering specialization-friendly kernel code. For example, we would identify a potential specialization point and then the tools would use all of our old friends — static analysis, dynamic analysis, symbolic execution, etc. — to show us what data items fall into each of the four categories listed in the last section, and provide us with help in refactoring the system so that specialization can work better. A tricky thing here is taking into account the different kinds of concurrency and synchronization that happen in a sophisticated OS.

Some useful questions to ask (and of course we’re always asking these same things when doing OS and compiler research) are: How are we supposed to think about a dynamically specializing OS kernel? What are the new abstractions, if any? Specialization could really benefit from some sort of first-class “code region over which these values are effectively constant” and then also “but the constant-ness is invalidated by this set of events.”

Why Now?

The literature on dynamic specialization of OS code is interesting: it looks like there was a flurry of interest inspired by Synthesis in the mid/late 90s. Many of these papers had Calton Pu, Massalin’s thesis supervisor, on the author list. Not a whole lot has happened in this area since then, as far as I know. The only paper I can think of about optimistic OS specialization is this one; it’s a nice paper and I recommend it. Static OS specialization, on the other hand, is what unikernels are all about, so there’s been quite a bit of work done on this.

It seems like time to revive interest in dynamic OS specialization because:

  • Most of the processor speed wins lately are application-specific; the cores that execute OS code are not getting noticeably faster each year, nor do they seem likely to. In fact, way back in 1989 John Ousterhout argued that increases in processor speed weren’t benefiting OS code as much as other kinds of code.
  • OSes have slowed down recently to mitigate side channel attacks. Maybe we can get some of that speed back using dynamic specialization.
  • OSes are way bloatier than they were in the 90s, increasing the potential benefits due to specialization.
  • Compiler technology is far ahead of where it was in the 90s, with off-the-shelf toolkits like LLVM providing high-quality solutions to many of the problems we’d run into while prototyping this work.

I’d like to thank Perry Metzger who suggested this piece and also provided feedback on a draft of it. Perry worked with Alexia back in the day and hopefully he’ll also write about this topic.

Finally, I don’t want to give the impression that I’m summarizing a research proposal or an in-progress project. This is the kind of thing I love to think about, is all.