Responsible and Effective Bugfinding

NB: This piece is not about responsible disclosure of security issues.

For almost as long as people have written code, we have also worked to create methods for finding software defects. Much more recently, it has become common to treat “external bug finding” — looking for defects in other people’s software — as an activity worth pursuing on its own. Several factors have contributed:

  • The amount of software in use has grown massively, and presumably the number of latent bugs has grown with it.
  • Since software is increasingly used to monitor and control aspects of our lives, and because systems are often networked, bugs can have more impact than they previously did.
  • Cheap computer power for brute-force bug finding is ubiquitously available.

The OSS-Fuzz project is a good example of a successful external bug finding effort; its web page mentions that it “has found over 20,000 bugs in 300 open source projects.”

This article is about how to conduct an external bug finding effort in a way that is likely to maximize the overall benefit of the project. The high-level summary is that this kind of work has to be done carefully, thoughtfully, and in collaboration with the software developers. In contrast, simply throwing a huge pile of bugs over the wall to some project and walking away is unlikely to do much good. I’ve not seen a lot written up about this, but “A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World” is very good and touches on some of the same issues.

The problem is that there are usually mismatches in costs and incentives between bug finders and project developers. External bug finders would like to find and report as many bugs as possible, and if they have the right technology (like a shiny new fuzzer) it can end up being cheap to find hundreds or thousands of bugs in a large, soft target. Bug finders are motivated altruistically (they want to make the targeted software more reliable by getting lots of bugs fixed) but also selfishly (a bug finding tool or technique is demonstrably powerful if it can find a lot of previously unknown defects). On the other side, developers of the targeted project typically appreciate bugfinding efforts (of course, they too want a reliable system) but for them a huge bug count is undesirable. First, every bug report requires significant attention and effort, usually far more effort than was required to simply find the bug. Second, a large bug count potentially reflects negatively on the project and consequently on its developers.

In the rest of this post I’ll present some questions that people doing external bug finding work should ask themselves about each bug that they find. The answers, and the process of answering, will help them do their work more responsibly and effectively.

Is the system being tested an appropriate target for external bug finding?

This is obvious, but it still needs mentioning: not every code base wants or needs bug reports. There are plenty of throw-aways, research projects, and other code bases out there that have a very small number of users and at most a couple of developers. It is often inappropriate to use this sort of code as a target for external bugfinding work. In contrast, the more users a system has, and the more important its use cases are, the more appropriate a target it becomes. A reasonable bar for external bug finding might be: Does the package manager for your operating system include, by default, a recipe for installing the software you want to test? If not, then perhaps it has not risen to the level where it is a suitable target for external bugfinding.

Is the bug a security vulnerability?

This is a completely separate can of worms; if you have found a potential vulnerability, please seek advice elsewhere.

Is the bug already known?

A drawback of fuzzers is that they tend to keep rediscovering known bugs. Automated bug triaging techniques exist, but they are far from perfect, especially for bugs that don’t trigger crashes. Furthermore, there’s good reason to believe that perfect automated bug triaging is impossible, since the notion of “distinct bugs” is inherently a human construct. When there is a reasonable possibility that a bug you have discovered is already known and reported, it is best to just sit on your new bug report for a little while. Once the potential duplicates that are already in the bug tracker get fixed, it will be easy to see if your test case still triggers a failure. If so, you can report it. Otherwise, discard it.

Occasionally, your bug finding technique will come up with a trigger for a known bug that is considerably smaller or simpler than what is currently in the issue tracker. For example, if a web browser contains a race condition that people believe can only be triggered using five threads, and your test case makes it happen using just two threads, then this is an interesting new finding. In this case you should consider appending your test case to the existing issue.

Can you write a convincing bug report?

Writing good bug reports is a learned skill that requires significant care and patience. Here are a couple of good resources.

I only have a few points to add:

  • A bug report is not simply a collection of facts, it also forms an argument for why a developer — who surely already had plans for the day — should instead work on fixing the defect.
  • It is important to put yourself in the developers’ position. If you were reading this report, would you have enough information to act on it? Would you want to?
  • Avoid, at all costs, being rude or presumptuous.
  • Test cases found by fuzzers can be inherently less compelling to developers than test cases that come from real use cases. First, these test cases often just look funny. Second, developers often believe (no doubt correctly, sometimes) that fuzzer-discovered corner cases will not be triggered by human users. If you are submitting a bug report containing fuzzer-generated input, careful test case reduction must be performed, with an eye not only towards minimality but also towards readability and understandability.

A major advantage of external bug finding is that since the people performing the testing are presumably submitting a lot of bug reports, they can do a really good job at it. In contrast, regular end users may not submit bug reports very often, and consequently we would not expect them to be as skilled. Treat the ability to write high-quality bug reports as a superpower that enables you and the system’s developers to collaboratively increase the quality of the code in an efficient and pleasant fashion.

Is the bug important?

Developers are almost always willing to fix important bugs that might have serious consequences for their users. On the other hand, they are often happy to delay fixing bugs that they believe are unimportant so that they can work on more pressing things. As an external bug finder, it can be hard to tell which kind of bug you have discovered (and the numbers are not on your side: the large majority of bugs are not that important). One way to approach this problem is to look at bugs (both fixed and unfixed) that are already in the bug tracker: what common characteristics can you identify that caused developers to address defects rapidly? What common characteristics can you identify that caused developers to ignore bugs or mark them WONTFIX? You can also open a dialogue with developers: they may be willing to tell you the kinds of bugs they are, and are not, interested in. The developers of the CVC4 SMT solver recently added a public document to this effect. I hope that many other projects will follow their lead.

Common sense is also useful: if the bug requires an obscure or unlikely combination of command-line options to be triggered, then perhaps it doesn’t matter that much. For example, when fuzzing compilers using Csmith we rarely invoked GCC and LLVM with flags other than -O0, -O1, -O2, -Os, and -O3.

Have you tried to fix the bug?

In some cases, fixing a bug in software you’re unfamiliar with is impractical because it’s something like a deep race condition in an OS kernel or a subtle emission of incorrect object code by a compiler. Years of specific experience with the system might be required to locate the right place to fix the bug. On the other hand, a lot of bugs end up being easy to fix, including typos, copy-and-paste errors, or incorrect order of arguments to a library call. If you are able to take a few minutes to check if a bug you are reporting is easy to fix, and to suggest a patch if it is, then you can potentially save project developers considerable time and effort. They will appreciate this.

How many open issues do you have in the bug tracker?

If you have already reported a few bugs and they have not been fixed (or, worse, have not even been acknowledged), then something is not working. Perhaps the developers are otherwise occupied, or perhaps they don’t find these bugs to be actionable or important. Regardless, at this point it is not helpful to continue reporting bugs. If you haven’t done so already, you might open a dialogue with developers to see what the situation is, or you might simply move on and find bugs in a different code base.

Do the developers trust you?

If the developers of the system you’re reporting bugs in have never seen or heard of you, they are very likely to at least take a look at the first issue you submit. If they trust you, because you’ve submitted a number of high-quality bug reports before, they’ll look at it closely and are likely to take it seriously. However, if they actively don’t trust you, for example because you’ve flooded their bug tracker with corner-case issues, then they’re not likely to ever listen to you again, and you probably need to move on to do bug finding somewhere else. Do not let this happen: it makes developers salty, and that makes bug finding work more difficult for all of us.

Are your tools open source?

Internal bug finding — where project developers find bugs themselves — can be superior to external bug finding because developers are more in tune with their own project’s needs and can also schedule bug finding efforts so that they are most effective, for example after landing a major new feature. Thus, an alternative to external bug finding campaigns is to create good bug finding tools and make them available to developers, preferably as open source software. Drawbacks of this scheme include increasing the software engineering burden on bug-finding researchers (who must now create tools that other people can use, rather than creating tools solely for their own use) and decreasing the social and academic credit given to researchers, since some of the benefit of their work is now hidden rather than being quantifiable and out in the open.

Conclusions

Although software correctness is at some level a technical problem (either the code implements its specification or it doesn’t), improving the quality of software is an intensely human activity. Developers and external bug finders are on the same team here, but there’s plenty of potential for friction because their costs and incentives differ. This friction can be reduced if bug finders ask themselves the questions listed in this post before submitting bug reports.

Alive2 Part 3: Things You Can and Can’t Do with Undef in LLVM

[Also see Part 1 and Part 2 in this series.]

Let’s talk about these functions:

unsigned add(unsigned x) { return x + x; }
unsigned shift(unsigned x) { return x << 1; }

From the point of view of the C and C++ abstract machines, their behavior is equivalent: in a program you’re writing, you can always safely substitute add for shift, or shift for add.

If we compile them to LLVM IR with optimizations enabled, LLVM will canonicalize both functions to the version containing the shift operator. Although it might seem that this choice is an arbitrary one, it turns out there is a very real difference between these two functions:

define i32 @add(i32 %0) { 
  %2 = add i32 %0, %0 
  ret i32 %2
}
define i32 @shift(i32 %0) {
  %2 = shl i32 %0, 1
  ret i32 %2
}

At the LLVM level, add can always be replaced by shift, but it is not generally safe to replace shift with add. (For the Alive2 versions of these examples, I’ve renamed the functions to src and tgt so that Alive2 knows which way the transformation is supposed to go.)

To understand why add can be replaced by shift, but shift cannot be replaced by add, we need to know that these functions don’t have 2^32 possible input values, but rather 2^32+2. Analogously, an LLVM function taking an i1 (Boolean) argument has four possible input values, not the two you were probably expecting. These extra two values, undef and poison, cause a disproportionate amount of trouble because they follow different rules than regular values do. This post won’t discuss poison further. Undef sort of models uninitialized storage: it can resolve to any value of its type. Unless we can specifically prove that an LLVM value, such as the input to a function, cannot be undef, we must assume that it might be undef. Many compiler bugs have been introduced because compiler developers either forgot about this possibility, or else failed to reason about it correctly.

Let’s look at some things you’re allowed to do with undef:

  • You can leave it alone.
  • You can replace it with zero.
  • You can replace it with any number you choose.
  • You can replace it with the set of odd numbers.
  • You can take a function that returns [0..15] and replace it by a function that returns [3..10], as sketched below.
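
Here is roughly what that last transformation might look like in LLVM IR (a sketch; the particular function bodies, which use and/add to construct the ranges, are my own illustration rather than the original Alive2 examples):

define i8 @src() {
  ; undef masked to 4 bits: returns some value in [0..15]
  %r = and i8 undef, 15
  ret i8 %r
}
=>
define i8 @tgt() {
  ; undef masked to 3 bits gives [0..7]; adding 3 gives [3..10],
  ; which is a subset of [0..15], so @tgt refines @src
  %u = and i8 undef, 7
  %r = add i8 %u, 3
  ret i8 %r
}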

The common pattern here is that the new function returns a subset of the values returned by the old function. This is the essence of refinement, which is what every step of a correct compilation pipeline must do. (Though keep in mind that right now we’re looking at a particularly simple special case: a pure function that takes no arguments.) So what can’t you do with undef? This is easy: you can’t violate refinement. In other words, you can’t replace a (pure, zero-argument) function with a function that fails to return a subset of the values returned by the original one.

So, for example, you can’t take a function that returns [0..15] and replace it by a function that returns [9..16]:
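
Continuing the sketch from above (again with illustrative function bodies, not the original Alive2 example):

define i8 @src() {
  ; returns some value in [0..15]
  %r = and i8 undef, 15
  ret i8 %r
}
=>
define i8 @tgt() {
  ; returns some value in [9..16]; 16 is not a possible output of @src,
  ; so this is not a refinement
  %u = and i8 undef, 7
  %r = add i8 %u, 9
  ret i8 %r
}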

Alive2’s counterexample for this transformation correctly identifies 16 as the problematic output. Returning [9..15] would have been fine, as would any other subset of [0..15].

When a function takes inputs, the refinement rule is: for every valuation of the inputs, the new function must return a subset of the values returned by the old function. Refinement gets more complicated for functions that touch memory; my colleagues Juneyoung and Nuno will cover that in a different post.

Now let’s return to the question from the top of this piece: why can we turn add into shift, but not shift into add? This is because every time an undef is used, the use can result in a different value (here “use” has a precise meaning from SSA). If undef is passed to the shift function, an arbitrary value is left-shifted by one position, resulting in an arbitrary even value being returned. On the other hand, the add function uses the undef value twice. When two different arbitrary values are added together, the result is an arbitrary value. This is a clear failure of refinement because the set of all 32-bit values fails to be a subset of the set of the even numbers in 0..2^32-1.

Why was undef designed this way? It is to give the compiler maximum freedom when translating undef. If LLVM mandated that every use of an undef returned the same value, then it would sometimes be forced to remember that value, in order to follow this rule consistently. The resulting value would sometimes occupy a register, potentially preventing something more useful from being done with that register.

As of LLVM 10, this “remembering” behavior can be achieved using the freeze instruction. A frozen undef (or poison) value is arbitrary, but consistent. So if we want to transform the shift function into the add function in a way that respects refinement, we can do it like this:
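
Here is a sketch of that transformation (I’ve switched to named values for readability; freeze requires LLVM 10 or later):

define i32 @src(i32 %x) {
  %r = shl i32 %x, 1
  ret i32 %r
}
=>
define i32 @tgt(i32 %x) {
  ; freeze pins a possibly-undef %x down to a single concrete value,
  ; so both uses of %f agree and the result is still always even
  %f = freeze i32 %x
  %r = add i32 %f, %f
  ret i32 %r
}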

A general guideline is that an LLVM transformation can decrease the number of uses of a possibly-undef value, or leave the number of uses the same, but it cannot increase the number of uses unless the value is frozen first.

Instead of adding freeze instructions everywhere, one might ask whether it is possible to avoid undef-related problems by proving that a value cannot possibly be undef. This can be done, and Juneyoung Lee recently added a static analysis for this. Alas, it is necessarily pretty conservative: it only reaches the desired conclusion under restricted circumstances.

Before concluding, it’s worth recalling that undefined behavior in an IR is not necessarily closely related to the source-level UB. In other words, we can compile a safe language to LLVM IR, and we can also compile C and C++ to IRs lacking undefined behavior. In light of this, we might ask whether undef, and its friend, poison, are worthwhile concepts in a compiler IR at all, given the headaches they cause for compiler developers. On one hand, we can find IR-like languages such as Java bytecode that lack analogous abstractions. On the other hand, as far as actual IRs for heavily optimizing compilers go, I believe all of the major ones have something more or less resembling poison and/or undef. The fact that a number of different compiler teams invented similar concepts suggests that there might be something fundamental going on here. An open question is to what extent these abstractions are worthwhile in an IR for a heavily-optimizing compiler middle-end that is only targeted by safe languages.

My colleagues and I believe that having both undef and poison in LLVM is overkill, and that undef can and should be removed at some point. This was not true in the past, but we believe that the new freeze instruction, in combination with poison values, provides adequate support for undef’s use cases.

The Gods Pocket Peak Trail

Years ago my friend Derek heard that the Jarbidge Wilderness — a remote, mountainous area along the Idaho-Nevada border — is one of the least-visited wilderness areas in the USA, and we’ve wanted to visit since then. There’s little good information about this area online, but a book I had sitting around listed The Gods Pocket Peak Trail, and we hoped to connect up to some other routes in the area to make a loop. The trailhead for this route is above 9000 feet, but then the trail immediately drops a thousand feet, climbs a thousand feet, and sort of keeps on doing this. The views and microclimates are amazing but it’s tiring walking.

We left Salt Lake City in the morning, and about four hours later turned off of US 93 in Rogerson, NV. Our trailhead was near the (abandoned) Pole Creek Ranger Station:

We started walking by mid-afternoon with good weather and good views:

We camped about five miles down the trail on a ridge that had a few trees for shelter and some moderately flat spots. Here’s Derek’s tent as we’re getting ready to sleep:

Gods Pocket Peak in the morning light:

After coffee we started a long traverse of the eastern slopes of Gods Pocket Peak and Divide Peak, and then crossed the upper parts of several forks of Camp Creek. Views were outstanding but we kept losing and finding the trail, and progress was overall not super fast. We ended up beneath the imposing Marys River Peak, at a 9300′ pass overlooking the East Fork of the Jarbidge River, roughly 12 miles from the trailhead. From here we wanted to follow a trail I’d seen in some old online imagery, but the descent on the south side of the pass was very steep and loose, and we could see no signs at all of a trail heading in the direction we wanted to go. Since it was getting late and we didn’t feel like bushwhacking, we turned around and went back to an excellent campsite we had passed earlier in the afternoon. At this point we knew the rest of the trip was straightforward, so we put our whisky flasks into a convenient snowbank and relaxed.

Next day, we returned to the trailhead, another tiring day of uphill and downhill hiking. And, of course, when walking a faded trail in reverse, you lose the trail exactly the same number of times, but in totally different locations, so we had a decent amount of hunting around for where the trail picked up again. The final uphill slog to the trailhead hurt.

Obligatory selfie on reaching the truck:

We could have just headed out right then, but we had beer, a great car camp site, and we didn’t feel like a long drive in the dark, so we spent the night.

Setting up to eat dinner and watch the sunset:

It got cold pretty fast:

This is why we leave our houses.

Morning light on the main ridge of the Jarbidge Range was good.

Mid-July was a good time to be in Jarbidge: there was plenty of water around, but not so much that stream crossings were difficult. The first day and night were a bit warm, but after that temperatures were great. We saw oddly little wildlife: no snakes, no frogs, not many birds, and no mammals larger than a marmot. There was plenty of deer and elk scat around, however. Our first campsite had mosquitoes, though they were not particularly bothersome, and we didn’t have any bugs after that, except that I managed to find a patch of ticks while hiking off trail through some dense vegetation. Who expects ticks in Nevada?

The Gods Pocket Peak Trail gave the impression of having been well maintained up until a few decades ago. Major trail junctions were marked with signs, but most of the signs had faded to near illegibility and many had fallen over. The trail is excellent for long stretches, but then parts have eroded completely away or have been covered by plants to the point that it’s basically gone. We were happy to have a map I had traced from Google Earth imagery: each time we lost the trail, the GPS made it pretty easy to find again. We didn’t see any evidence of recent camping activity, such as fire rings or areas cleared of fallen wood. One of the reasons I’m posting this trip report is that I think this area could use some more foot traffic: this might keep the trail usable for a few more years.

You Might as Well Be a Great Copy Editor

An early draft of a paper, blog post, grant proposal, or other piece of technical writing typically has many problems. Some of these are high-level issues, such as weak motivation, sections in the wrong order, or a key description that is difficult to understand because it lacks an accompanying figure. These problems need to be identified and fixed, but this is just about impossible as long as the text is crappy and drafty. All of the little problems — run-on sentences, filler words, paragraphs that switch topics without warning, etc. — tend to gang up and make it hard to see the big problems. In contrast, when individual sentences and paragraphs are clear and polished, high-level problems often just jump out at the reader, making them far easier to fix.

The process of writing (or my process, at any rate) ends up something like this:

  1. Write a rough draft.
  2. Edit until a high-level problem becomes apparent.
  3. Fix the high-level problem.
  4. Go back to step 2 until there’s nothing left to fix or the deadline arrives.

This process works, and it is the only way I know to produce good writing. However, the virtuous cycle can’t even get properly started if the writer is bad at copy editing: the process of fixing low-level bugs in writing. This is an absolutely fundamental skill, and if I could wave a wand and make students better at just one part of being a writer, I would choose copy editing.

Alas, we have to get better at copy editing the hard way: by practicing. Start by acquiring a half-baked piece of writing from someone else and trying to improve it. Something short like a cover letter for a job application or even an email would be good to begin with, and you can work up to longer pieces. It’s best to use someone else’s text: since you won’t be attached to it, your editing can be merciless. Just go over and over it until there’s nothing left to change; it can be a pleasure to beat a piece of text into shape this way (years ago I collaborated with a somewhat wordy colleague on a grant proposal, and secretly had a hugely enjoyable time making his text clearer and about half as long). When you’re done, give the edited piece back to its author and see what they think of your new version. This part isn’t necessarily fun, but it’s important to get feedback. Do this again and again.

While practice is important, nobody should be expected to intuit the principles of copy editing: we require some outside help. There are numerous books; I’ll briefly describe two that I recommend.

The first book is Copy Editing for Professionals. I really love this book: it’s hilarious, informative, and distills decades of hard-won wisdom. On the other hand, there is material in there that, as a technical writer, you’ll need to ignore: it is mainly written by and for people who work at the copy editing desk of a newspaper or magazine. Even so, in my opinion this book is completely worthwhile for technical writers.

My second book recommendation is Line by Line. I’ve read a lot of books about writing and this one was easily the most influential; it opened my eyes to a rich collection of rules, guidelines, and conventions for structuring written English that I only dimly remembered from school, or was never taught at all. I suspect many others will have the same experience. Line by Line is not remotely an easy book, but it gives us what we need: clear, actionable advice about writing, accompanied by numerous good and bad examples. Technical people may appreciate Cook’s view that the work of a copy editor is largely mechanical: she is explicitly trying to teach us algorithms she developed over decades of experience as a professional copy editor.

Writing is a crucial skill, but nobody is going to fix your stuff: you’ll have to do it yourself. Making things worse, we do a very poor job teaching people to copy edit. The secret, insofar as there is one, is to seek out resources such as Line by Line and Copy Editing for Professionals, and to combine the kind of high-grade advice that they provide with a lot of practice, ideally augmented with feedback from someone who is already good at copy editing.

The Saturation Effect in Fuzzing

This piece is co-authored by Alex Groce and John Regehr.

Here’s something we’ve seen happen many times:

  • We apply a fuzzer to some non-trivial system under test (SUT), and initially it finds a lot of bugs.
  • As these bugs are fixed, the SUT sort of becomes immune to this fuzzer: the number of new bugs found by the fuzzer drops off, eventually approaching zero.
  • Subsequently, a different fuzzer, applied to the same system, finds a lot of bugs.

What is going on with the first fuzzer? Two things are happening. First, the frequency at which the bugs in a system are triggered by any given source of stochastic inputs follows a power law distribution. As the often-triggered bugs get discovered and fixed, we’re eventually left with only bugs that are triggered at such low probability that they are effectively not findable within a reasonable compute budget. Second, some bugs are out of reach for a given fuzzer even given an infinite amount of CPU time. For example, if we fuzz GCC using Csmith, we will never find a bug in GCC’s optimization of string compares because Csmith is incapable of generating code that compares strings. Similarly, a fuzzer for JPEG compressors that only constructs images that are 1000×1000 pixels, all the same random color, is unlikely to be capable of discovering most bugs. These examples are basic, but almost every fuzzer, no matter how sophisticated, has analogous limitations.

The second problem is usually the preferable one to have: we can fix it by adding new features to a generation-based fuzzer, or by finding a better set of seeds for a mutation-based fuzzer. Even if we don’t know anything about the as-yet-undiscovered bugs, we can often infer that they might exist by noticing a gap in coverage induced by the fuzzer.

The first problem is often much harder to work around because we typically do not have any real understanding of how the space of test cases that triggers a particular bug intersects with the probability distribution that is implicit in our fuzzing tool. We can tweak and tune the probabilities in the fuzzer in order to change the distribution, but without solid guidance this is often no better than guesswork.

Detecting Saturation

There isn’t much research on fuzzer saturation. In principle, coverage-based stopping criteria aim at this, but most of these simply set some moderately-difficult target and assume that if you reach it, you are done. Since most fuzzing efforts will never hit an arbitrary, high code coverage target for any complex SUT, this is not useful. The one effort we are aware of to propose something more relevant for typical saturation is by Tramontana et al. They propose splitting a testing effort into k subsets, each with its own coverage measure, and stopping when all k subsets have identical coverage. The larger k is, the longer it will take to detect saturation, but the more confident you can be that testing is “done.” The stopping point does not need to be in terms of coverage; it could instead use automated bug triage to see if all runs have found identical sets of bugs.

It is possible to have two different kinds of saturation interacting. The most important kind of saturation is saturation in terms of bug-finding, of course: a particular way of generating test cases has hit the point at which producing new bugs is highly unlikely. However, we don’t just look at bug-finding in testing; we also look at coverage, and use coverage as a signal of progress in testing. Critically, it’s not just humans who do this; fuzzers use novel coverage of some kind (CFG edges, paths, or even dataflow) to guide their behavior. AFL attempts to cover new paths as its primary goal, with new crashes as a kind of side-effect. This means, however, that there can be four cases for saturation, in practice:

  1. Coverage is not saturated, and bugs are not saturated. This is the case where you don’t need to think about saturation. Your testing process is generating new coverage and new bugs frequently; just continue fuzzing.
  2. Coverage is saturated, and bugs are saturated. The opposite case is also straightforward: if neither new coverage nor new bugs are being generated, it’s time for a change, such as improving the fuzzer or perhaps finding a different SUT to target.
  3. Coverage is saturated, but bugs are not saturated. This is an interesting case, where the coverage metric being used is insufficiently expressive. In other words, coverage stops increasing not because no further interesting exploration of the SUT is possible, but rather because coverage is poorly correlated with the kind of execution space exploration that leads to new bugs. It is pretty easy to get into this situation, for example if the notion of coverage is weak (statement coverage) and bugs tend to depend on hitting particular paths or states, rather than just hitting bad code locations. This can be a problem for automated termination of fuzzing efforts, either because the fuzzer appears to be “done” but is in fact still generating good results, or (more commonly) because AFL/libFuzzer is stuck due to the coverage signal not helping it reach another part of the input space (where coverage might no longer be saturated). The path-dependency of AFL/libFuzzer means that just starting again with the same corpus, or throwing in some new seeds from another fuzzer, might help escape this particular trap.
  4. Coverage is not saturated, but bugs are saturated. This is a real possibility and a real pain in coverage-driven fuzzing. When using AFL to fuzz a compiler, it is not uncommon to see a run get stuck generating huge numbers of uninteresting paths, presumably exploring things like “ways to have the parser fail,” that seldom produce interesting bugs. Recent research on specifying parts of code that aren’t to be counted for coverage can help with this problem. In our experience fuzzing real-world compilers with AFL, the “pending:” field often never goes to zero, even if nothing interesting is happening, and fuzzing has been running for days or weeks.

Saturation in Practice: Fuzzing a Smart Contract Compiler

Saturation isn’t a one-time thing in a real fuzzing effort, but an ebb-and-flow responding to either new code in the system under test or changes in the fuzzing regime. For example, here is a chart showing the number of GitHub issues (almost all actual bugs, with a small number of duplicates of already-known bugs) we submitted for the solc Solidity smart contract compiler, as part of an ongoing research effort to make fuzzing compilers with AFL easier. Here’s some more detail about this fuzzing effort.

We started fuzzing the compiler in early February, and immediately found a few bugs, using both our new technique and plain-old-AFL.  Then we continued running the fuzzer, but found nothing novel for a few days. On February 21, we added several new mutators to our fuzzing engine (see e.g. this commit), and “broke” the saturation.  Then there was a long gap, with a few more bugs finally found after running the fuzzer for over a month, performing well over 1 billion compilations, on a corpus consisting of every interesting path found by any previous run.

Just recently, however, we made another change.  Previously, we had based our corpus only on the Solidity compiler’s syntax tests; we switched to using all the Solidity files in the entire test subtree of the repository on May 14th. This turned up a number of new bugs, for example in the SMTChecker (which previously wasn’t being fuzzed at all). We’ll probably return to saturation soon, though! Note that during this time, no fuzzer instance (and we’ve run plenty of them) has ever saturated path coverage in AFL.

A just-accepted FSE paper by Bohme and Falk gives much more explicit empirical data on the nature of saturation barriers: finding linearly more instances of a given set of bugs requires linearly more computation power directed towards fuzzing a target; finding linearly more distinct bugs requires exponentially more fuzzing compute power. 

Mitigating Saturation

Beyond understanding and detecting saturation, of course we would like to be able to delay its onset. That is, we want to find more bugs before a given fuzzer saturates. This is a relatively natural and easy activity for mutation-based fuzzers because we can not only cast a wider net when finding seeds, but we can also add new mutation operators in a relatively painless way because mutations should be independent of each other.

Fuzzers that generate code from scratch are a trickier case. Perhaps the closest thing to a free lunch that we are aware of is swarm testing, which randomly reconfigures the fuzzer prior to generating a new test case. The goal is to avoid the problem where the products of a large number of random choices all end up looking more or less the same. (A low-dimensional example is a collection of images, each of which is created by randomly assigning each pixel to be black or white with 50% probability. These are not easy to tell apart.)

Our other options are more work:

  • Improve the stuck fuzzer, for example by teaching it to generate more features supported by the SUT.
  • Run a collection of fuzzers, if these are available.
  • Do a better job at integrating feedback from the SUT.

This last option is probably the most promising one: the state of the art in merging generative fuzzing with feedback-driven fuzzing is, at present, not very sophisticated. We believe there is room for major improvements here.

Alive2 Part 2: Tracking miscompilations in LLVM using its own unit tests

[This piece is co-authored by Nuno P. Lopes and John Regehr.]

Alive2 is a formal verification framework for LLVM optimizations. It comprises several tools, including a plugin for `opt' that verifies whether the optimizations just run are correct or not. We gave an introduction to Alive2 in a previous post.

A few years ago we used an old version of Alive to verify small, automatically generated test cases. Each test case was a function containing three instructions; these functions were piped through different optimizations and their outputs verified with Alive for correctness. This approach found several bugs in the then-current version of LLVM.

But we want more — we want to identify and hopefully help fix all miscompilations that can be triggered in practice. To this end, we created Alive2, which can reason about a wide variety of intraprocedural optimizations including those that work with control flow, memory, vectors, and floating point types. We then started running Alive2 regularly on LLVM’s own unit tests. This has already resulted in 39 bug reports.

Working on the unit tests makes a lot of sense:

  • The behaviors being tested are clearly important, or at least interesting, or else the LLVM developers wouldn’t have made them part of the unit test suite in the first place.
  • The tests are usually small and only involve one or a few optimizations, making it much easier to fix problems than it is when we observe miscompilations in the field.

In this post we describe the process of running Alive2 on LLVM’s unit tests, and give an overview of the remaining problems we’ve found in LLVM that still need to be fixed.

Running Alive2 on LLVM’s unit tests

LLVM’s unit tests are run by a tool called lit, which makes it easy to interpose on calls to opt, the standalone optimizer for LLVM IR. We simply run a subset of the unit tests like this:

$ ~/llvm/build/bin/llvm-lit -vv -Dopt=$HOME/alive2/scripts/opt-alive.sh ~/llvm/llvm/test/Transforms

Our script loads the Alive2 opt plugin and fails the test case if Alive2 can prove that it was optimized improperly. The script has a list of passes that are not supported by Alive2, so it can avoid false positives (tests that look wrong to Alive2, but are actually correct). We take this issue very seriously, to avoid wasting people’s time. The passes that aren’t currently supported are those that do interprocedural optimizations.

In addition to unsupported passes, Alive2 doesn’t support all of LLVM’s IR, including some function attributes and many intrinsics and library functions. If a test case uses an unsupported feature, Alive2 still tries to analyze it to find miscompilations involving only supported IR features, while conservatively approximating the remaining parts.

Running the above lit command takes around 15 hours of CPU time, or less than 2 hours on an 8-core machine.

Tracking bugs over time

We built a simple dashboard to track the evolution of bugs found by Alive2 in LLVM’s unit tests over time.

The plot shows two lines: the number of failures per day, and the number of failures there would be if `undef' values were to disappear from LLVM. At the time of writing, Alive2 reported a total of 95 bugs in the unit tests, 25 of which would be okay if undef were removed. Since `undef' values still exist, these optimizations are incorrect. We plot both lines as an incentive to get rid of `undef' values: over a quarter of the reports would disappear.

The page has further information for the last five runs, including which reports were fixed and which are new in each run. Clicking on a run gives the list of all failed tests, as well as a short, manually written description of the cause of each failure.

We use a timeout in Alive2 for the work done by the theorem prover underneath. Unfortunately, some reports are only triggered very close to the timeout and therefore sometimes there is churn, with a test failing only in some runs.

Current issues in LLVM

Many of the bugs we’ve found and reported have been fixed already. Most of these were InstCombine patterns that were incorrect for some inputs. Other issues are trickier to solve; for example, some bugs are hard to fix without performance regressions, as they may need further changes in other optimizations or even in SelectionDAG to recognize a different IR canonicalization. In some cases, there is not yet agreement on the right semantics for the IR: these can’t be fixed straight away without considering the pros and cons of different semantics and getting the community to reach a consensus about them. We now give a few examples of open issues in LLVM.

Introduction of branch on poison/undef

Branching on a poison or undef value is immediate UB, like dividing by zero. Optimizations that hoist branches and/or combine multiple branches can introduce branches on values that were not observed in the original program. This issue is prevalent across multiple optimizations like SimplifyCFG, loop unswitch, simple loop unswitch, and induction variable simplification. A simple example:

if (x) {
  if (y) {
    S1
  }
}
  =>
if (x & y) {
  S1
}

The original program would not branch on y if x was false. The optimized code, however, always branches on x and y. Therefore, when x=false and y=poison, the source program is well defined, but the optimized program is UB. This is a failure of refinement and the transformation is wrong.

One possible solution is to stop performing these transformations, but in general this is not considered acceptable because it loses performance. A different solution is to add freeze instructions, which we introduced in a paper and which have been supported in LLVM since version 10.0. Freezing a regular value has no effect, and freezing undef or poison results in an arbitrary value being chosen.
A correct version of the wrong optimization above is:

if (x & freeze(y)) {
  S1
}

We know where freeze instructions need to be introduced — this is not too difficult. However, adding too many freezes can lead (and has led) to performance regressions. These are usually not because freeze is inherently expensive, but rather because freezes can inhibit optimizations, particularly those involving loops. Fixing the regressions is a matter of putting in the engineering effort to do things like hoisting freeze out of loops, eliminating unnecessary freezes, and teaching optimizations not to be blocked by freezes. Work is underway, but several parts of the compiler need to be touched and it will take some time.

Introduction of undef uses

Generally speaking, it is not legal to add a new use of an undef value in LLVM. This is a problem in practice because it is easy for compiler developers to forget that almost any value is potentially undef. For example, shl %x, 1 and add %x, %x are not equivalent if %x might be undef. The former always returns an even number, while the latter could also yield an odd number if %x is undef. In other words, different uses of an undef value can resolve to different concrete values. Alive2 reports 25 test cases that are incorrect if we consider some of the variables to be undef, but are correct otherwise.

Some cases are easy to fix. For the example above, we can simply choose to canonicalize the IR to the shift instruction. We could also freeze %x, since all uses of the frozen value will resolve to the same concrete value.
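
That is, roughly (a sketch):

%f = freeze i32 %x
%r = add i32 %f, %f   ; both uses now see the same concrete value, so %r is even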

The following transformation, on the other hand, has no obvious fix other than using freeze or removing undef entirely from LLVM:

sext(x < 0 ? 0 : x)
  =>
zext(x < 0 ? 0 : x)

If x is undef, the comparison and the select arm may see different values of x, so the expression x < 0 ? 0 : x can yield a negative number. Therefore, replacing the sign-extension with a zero-extension is not correct. Freezing x or eliminating undef values would both work.
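
In LLVM IR, the problematic transformation looks roughly like this (a sketch using an i8-to-i32 extension; the actual pattern in LLVM may differ in its details):

define i32 @src(i8 %x) {
  %c = icmp slt i8 %x, 0
  %s = select i1 %c, i8 0, i8 %x
  %r = sext i8 %s to i32
  ret i32 %r
}
=>
define i32 @tgt(i8 %x) {
  %c = icmp slt i8 %x, 0
  %s = select i1 %c, i8 0, i8 %x
  ; wrong when %x is undef: %c can resolve using a non-negative value of %x
  ; while the select arm resolves to a negative one, and then sext and zext
  ; disagree on the high bits
  %r = zext i8 %s to i32
  ret i32 %r
}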

Pointer comparison semantics

LLVM pointers track object provenance. That is, a pointer derived from a malloc() call can only point within the buffer allocated by that call, even if the underlying machine value of the pointer happens to point to another buffer. For example:

int *p = malloc(4);
int *q = malloc(4);
p[x] = 2; // this implies x = 0; cannot write to q, for example

The reason for this semantics is that it enables cheap, local alias analysis. If provenance tracking didn't exist, alias analysis would often need to do whole program analysis where today it can get away with easy, local reasoning. This is nice; we all love fast compilation times and good object code!

The goal of provenance tracking is to prevent the program from learning the layout of objects in memory. If the program could learn that q is placed right after p in memory for the example above, the program could potentially change their contents without the compiler realizing it. A way the memory layout can leak is through pointer comparison. For example, what should p < q return? If it's a well defined operation, it leaks memory layout information. An alternative is to return a non-deterministic value to prevent layout leaks. There are pros and cons of both semantics for the comparison of pointers of different objects:

In the comparison below, "deterministic" refers to the semantics where comparing pointers to different objects yields a deterministic value, and "non-deterministic" to the semantics where the comparison yields a non-deterministic result.

  • p = obj1 + x; q = obj2 + y, with x and y unknown:
      - Deterministic: cannot simplify the comparison; it depends on the values of x and y.
      - Non-deterministic: can fold p < q to false.
  • Equivalence between pointer and integer comparison?
      - Deterministic: yes; can simplify (ptrtoint p) < (ptrtoint q) to p < q.
      - Non-deterministic: no.
  • Assume all pointers are > null?
      - Deterministic: yes; can simplify (p <= q) or (p == null) to (p <= q).
      - Non-deterministic: maybe (can special case?).
  • inttoptr simplification:
      - Deterministic: harder; pointer comparisons leak memory layout.
      - Non-deterministic: simpler.

Unfortunately LLVM currently wants the best of both worlds, with different optimizations assuming different semantics, which can lead to miscompilations. We need to settle on one semantics and stick to it throughout the compiler.

Load type punning

In order to respect pointer provenance semantics, LLVM's memory is typed. We need at least to be able to distinguish whether a byte in memory holds an integer/float or a pointer. However, LLVM sometimes violates this rule and loads a pointer with an integer load operation or vice-versa. This implies that any load instruction could potentially do an implicit inttoptr or ptrtoint operation, thus leaking information about the memory layout. This is not good: as we have seen before, it undermines alias analyses.

Introducing type punning instructions is common in LLVM. Many of these cases arise when converting small memcpy operations to load/store instructions. While memcpy copies memory as-is, load/store instructions are typed. Additionally, LLVM currently has no way of expressing C's char or C++'s std::byte. These types are allowed to store any type of data, including pointers and integers.
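
For example, a small memcpy might be turned into something like this (a sketch using typed pointers; %src and %dst are illustrative names):

; an 8-byte memcpy lowered to an integer load/store; if those bytes
; actually hold a pointer, the load performs an implicit ptrtoint and
; the store an implicit inttoptr
%s = bitcast i8* %src to i64*
%d = bitcast i8* %dst to i64*
%v = load i64, i64* %s
store i64 %v, i64* %d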

A suggestion to fix this issue is to introduce a `byte' type in LLVM IR, a sort of union of all LLVM types. This type couldn't be used in arithmetic operations, but only in memory operations and appropriate casts (yielding poison when casting byte to the wrong type). An alternative solution is to accept that load instructions may do implicit casts and fix alias analyses to take this fact into account (likely paying a performance price).

Select to Arithmetic

LLVM canonicalizes select instructions to arithmetic when it can do so. All of these transformations are wrong because select blocks a poison value from its not-chosen input, but arithmetic does not. The most obvious fix, changing the meaning of select so that it no longer blocks poison, is unacceptable because this renders unsound a different class of transformation that LLVM wants to do: canonicalizing simple conditional expressions to select. This kind of tricky semantic balancing act is found again and again in LLVM IR. Appropriate fixes would be to avoid these transformations (which requires strengthening support for select in backends; some work in this direction has already been done) or else to freeze values before performing arithmetic on them.
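
As an illustration of the pattern (a sketch, not a transformation taken verbatim from LLVM):

%r = select i1 %c, i32 %x, i32 0
  =>
%m = sext i1 %c to i32   ; all-ones when %c is true, zero when it is false
%r = and i32 %m, %x
; if %c is false and %x is poison, the select returns 0 but the and
; returns poison: a failure of refinement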

Other issues

This post is already long, so we list here a few further open issues and a short description:

  • FP bitcast semantics. What's the value of bitcasting an integer x to floating point and back to integer? Is it always x? The corner case is when x matches one of the NaN bit-patterns: can the CPU canonicalize the NaN bit-pattern, so that the bitcast back to integer yields a result different from x?
  • Load widening. LLVM combines multiple consecutive loads into a single load by widening the loaded type. This is not correct: if any of the loaded values is poison, all of them become poison. A solution is to use loads with vector types, as poison elements in a vector don't contaminate each other.
  • inttoptr(ptrtoint(x)) => x. This is not correct, as there can be two valid pointers with the same integer value (beginning of an object and one past the end of another). The solution is to enable this transformation only when a static analysis determines it's safe to do so. Currently, LLVM performs this optimization unconditionally.
  • Shufflevector semantics. When the mask is out of bounds, the current semantics are that the resulting element is undef. If shufflevector is used to remove an element from the input vector, then the instruction cannot simply be removed, as the input might have been poison. The solution is to switch the semantics to yield poison instead.

Future

Our short-term goal is to get LLVM to a point where Alive2 detects zero miscompilations in the unit test suite. Then we hope that an Alive2 buildbot can continuously monitor commits to help ensure that no new miscompilation bugs are committed to the code base. This is a powerful baseline, but we believe there are additional miscompilation errors lurking in LLVM that will only be exposed by more thorough testing. We will accomplish this by using Alive2 to monitor LLVM while it optimizes real applications and also randomly generated programs.

We thank everyone involved in fixing LLVM bugs as well as discussing semantics and fixing LangRef!

Sid’s Mountain Backpacking Loop

Last fall my friend Brian and I went on a short backpacking trip in the San Rafael Swell. We left SLC early, drove to Ferron, Utah, and then followed a high-clearance dirt road to the rim of North Salt Wash, a wide canyon that feeds the San Rafael River. We dropped into this open canyon using an old road and then a foot trail:

There’s an extensive area of rock art near where we reached the canyon floor, some of it fairly fresh-looking (so probably less than 1,000 years old):


We particularly enjoyed the turtle.

Other petroglyphs in the area are faded, meaning they are most likely much older:

We walked down North Salt Wash for about five miles until it ended at the San Rafael River:

We had lunch here and then walked downstream, sometimes on trails, other times just walking in the water. This hike would be difficult or impossible in spring when the river has a lot of flow, but in fall it was slow and shallow. About three miles later we reached Virgin Springs Canyon, which also has some nice rock art:

A short distance past the rock art, this canyon has a beautiful perennial pool, and we stopped for a while there to filter a lot of water — about nine liters each — since we weren’t going to run into reliable water for the next two days. October is an idyllic time in canyon country with moderate temperatures, stable weather, and no bugs, but it’s not generally a good time to plan on finding water in intermittent sources such as potholes and weak springs.

Above the pool, Virgin Springs Canyon is impassable to cattle, making it immeasurably more pleasant than if it was all cowed up. It has no trails of any sort and some of the walking was fast and easy, but other times we had to climb through boulder piles and shove our way through thick scrub. Our packs had started the day nice and light (25 or 30 pounds probably) but adding 20 pounds of water made them a lot less pleasant to carry.

Soon enough it was late afternoon and we found my favorite kind of desert campsite: an area of flat slickrock. These are perfect for low-impact camping and also for walking around barefoot without getting all dirty or stepping on a cactus. The weird white thing is my bivy bag; I don’t bother with a tent when the weather is good and the bugs aren’t bad.

In the morning we continued walking up Virgin Springs Canyon. First order of business was bypassing a big dryfall by working through a cliff band:

This left us with only the final headwall to surmount; there’s a slightly scrambly route out of the canyon in the shadowed area in the left-center of this picture:

Looking back at upper Virgin Springs Canyon from just below the exit:


Overall this canyon exit ended up being pretty easy and straightforward, as these things go.

At this point we were up on Sid’s Mountain, a remote plateau. We were only a short distance from Swazy, the local high point, so we climbed it and got good views in all directions.

Our next destination was an old cabin; I think the metal lids on top of the stilts are to keep mice out.

Next we had a longish walk up on the flats. Despite the good scenery, this part of the hike was not too inspiring: the afternoon was a bit too warm, plenty of cows had been in the area, and those yellow-green bushes Brian is standing next to are full of hellish little sharpies that try to break off and get into your socks. I was wearing running shoes and had to stop and pick those things out every few minutes.

Towards evening we reached the rim of Saddle Horse Canyon where we found another nice slickrock campsite. The upside of carrying water is you can camp anywhere! I didn’t take pictures but we had a spectacular sunset and again it was warm enough to walk around barefoot, a welcome change after being in dusty sneakers all day.

I got up early and walked around shivering, waiting to take a few pictures as the sun came up.

In a little while the Earth’s shadow started dropping towards the skyline.

First light on the Wasatch Plateau.

The rugged central San Rafael Swell to the south of us.

And finally full sun.

We only had about eight miles of walking to get back to my truck, so didn’t worry about getting an early start. The descent down into Saddle Horse Canyon was on an angle-of-repose rubble slope that looked grim, but on fresh legs it wasn’t bad. A big arch lives nearby:

Lower down in Saddle Horse Canyon there were areas of running water. Though there were a few sections of brush and soft sand, overall it was pretty easy walking.

Saddle Horse Canyon ends at North Salt Wash basically right at the bottom of the trail we had used to enter the canyon on the first day, so it was a quick walk out from there. Obligatory post-trip selfie:

Alive2 Part 1: Introduction

[This piece is co-authored by Nuno P. Lopes and John Regehr.]

Compiler bugs threaten the correctness of almost any computer system that uses compiled code. Translation validation is a path towards reliably correct compilation that works by checking that an individual execution of the compiler did the right thing. We created a tool, Alive2, that takes an unoptimized function in LLVM intermediate representation (IR) and either proves that it is refined by the optimized version of the function, or else shows a set of inputs that illustrate a failure of refinement.

We work in terms of refinement, rather than equivalence, because compilers often change the meaning of the code they are optimizing — but in allowed ways. For example, adding two variables of “int” type in a C or C++ program means something like “add the two quantities, returning the mathematical result as long as it is within an implementation-specified range that is no smaller than -32767..32767.” When integer addition is lowered from C or C++ to LLVM IR, this very restrictive range for validity will typically be refined to be -2147483648..2147483647. Then, when LLVM IR is lowered to a target platform such as ARM or x86-64, the behavior of additions whose results fall out of bounds is refined from “undefined” to “two’s complement wraparound.” Both of these refinements are perfectly legal, but would be incorrect if the transformation went in the other direction. Non-trivial refinement is ubiquitous during compilation and we have to take it into account in a translation validation tool.
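
At the IR level, that last refinement can be sketched like this (a minimal illustration of my own; nsw is the flag that makes the result of signed overflow undefined):

; with nsw, signed overflow produces an undefined result (poison)
%r = add nsw i32 %x, %y
  =>
; without nsw, signed overflow wraps around, as it does on the target
; machine; replacing the undefined result with the wrapped value is a
; legal refinement
%r = add i32 %x, %y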

Let’s look at an example of refinement and also learn how to invoke Alive2 and interpret its results. Here’s a pair of functions in LLVM IR:

define i16 @src(i16 %x, i16 %y) {
    %r = sdiv i16 %x, %y
    ret i16 %r
}

define i16 @tgt(i16 %x, i16 %y) {
    %z = icmp eq i16 %y, 0
    br i1 %z, label %zero, label %nonzero
zero:
    ret i16 8888
nonzero:
    %q = sdiv i16 %x, %y
    ret i16 %q
}

The first function, src, simply returns the quotient of its arguments. The second function, tgt, returns the quotient of its arguments unless the second argument is zero, in which case it returns 8888. It turns out that src is refined by tgt. This works because division by zero in src has undefined behavior: LLVM makes no promise whatsoever about the behavior of a program that divides a number by zero. tgt, on the other hand, has a specific behavior when its second argument is zero, and behaves the same as src otherwise. This is the essence of a refinement relationship. If we run the refinement check in the other direction, Alive2 will tell us that src does not refine tgt.

To see these results for yourself, you can build Alive2 (making sure to enable translation validation at CMake time) and then invoke our alive-tv tool like this:

alive-tv foo.ll

Alternatively, you can simply visit our Compiler Explorer (CE) instance. Let’s talk about this user interface for a second. When you load the page you should see something like this:

The large pane on the left is where you type or paste LLVM IR. The large pane on the right shows alive-tv’s output in the case where it is able to parse the LLVM code successfully. The “share” pulldown at the top provides a short link to what you are currently seeing: this is an excellent way to show other people your results. If your LLVM code contains an error, the right pane will say “Compilation failed” and the text at the bottom of the right pane will change from “Output (0/0)” to something like “Output (0/3)”. Click on this text and a third pane will open showing the error output. You should then see something like this:

After fixing the error you can close this new pane.

Returning to our example, here’s Compiler Explorer with the example already loaded. We’ve also given alive-tv the --bidirectional command line option, which causes it to run the refinement check in both directions. The output therefore has two parts; this is the first one:

----------------------------------------
define i16 @src(i16 %x, i16 %y) {
%0:
  %r = sdiv i16 %x, %y
  ret i16 %r
}
=>
define i16 @tgt(i16 %x, i16 %y) {
%0:
  %z = icmp eq i16 %y, 0
  br i1 %z, label %zero, label %nonzero

%nonzero:
  %q = sdiv i16 %x, %y
  ret i16 %q

%zero:
  ret i16 8888
}
Transformation seems to be correct!

“Transformation seems to be correct!” indicates that Alive2 thinks that src is refined by tgt. The arrow (=>) points in the direction of the refinement relation. The second part of the output is:

----------------------------------------
define i16 @tgt(i16 %x, i16 %y) {
%0:
  %z = icmp eq i16 %y, 0
  br i1 %z, label %zero, label %nonzero

%nonzero:
  %q = sdiv i16 %x, %y
  ret i16 %q

%zero:
  ret i16 8888
}
=>
define i16 @src(i16 %x, i16 %y) {
%0:
  %r = sdiv i16 %x, %y
  ret i16 %r
}
Reverse transformation doesn't verify!
ERROR: Source is more defined than target

Example:
i16 %x = undef
i16 %y = #x0000 (0)

Source:
i1 %z = #x1 (1)
i16 %q = #xffff (65535, -1)	[based on undef value]

Target:
i16 %r = #xffff (65535, -1)

Notice that now the arrow points from tgt to src. This time Alive2 says:

Reverse transformation doesn't verify!
ERROR: Source is more defined than target

(Here in the reverse direction, “source” unfortunately refers to “tgt” and “target” refers to “src”.)

Interpreting a counterexample from Alive2 isn’t always easy. Part of the blame falls on LLVM’s tricky undefined behavior semantics, but also there are some UX details that we hope to improve over time. The main thing to remember is that a counterexample is an example: it represents a single execution which illustrates the failure of refinement. The most important part of a counterexample is the set of input values to the IR being checked:

i16 %x = undef
i16 %y = #x0000 (0)

These are, of course, the arguments to the functions. Given a set of input values, you should ideally be able to execute both the unoptimized and optimized functions in your head or on paper, in order to see the difference. Alas, this depends on the functions being fairly small and also on you having a good understanding of LLVM’s undefined behavior model.

The next part of the output is designed to help you think through the execution of the code; it provides intermediate values produced as the source and target execute the counterexample:

Source:
i1 %z = #x1 (1)
i16 %q = #xffff (65535, -1)	[based on undef value]

Target:
i16 %r = #xffff (65535, -1)

Now that we’ve seen all of the parts of Alive2’s output, let’s work through this specific counterexample. The key is passing 0 as the argument to the %y parameter; the value of %x doesn’t matter at all. Then, tgt branches to %zero and returns 8888. src, on the other hand, divides by zero and becomes undefined. This breaks the rule that optimizations can make code more defined, but not less defined. We plan to augment Alive2’s counterexamples with some indication of where control flow went and at what point the execution became undefined. Also, we should suppress printing of values that occur in dead code. Alternatively, and perhaps preferably, we could feed Alive2’s counterexample to a separate UB-aware LLVM interpreter (a tool that doesn’t yet exist, but should).

In general, when refinement fails, it means that there exists at least one valuation of the inputs to the function such that one of these conditions holds:

  1. the source and target return different concrete answers
  2. the source function returns a concrete answer but the target function is undefined
  3. there is a failure of refinement in the memory locations touched by the function

We’ll explain memory refinement in a later post in this series.
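
To make these failure conditions concrete, here is a toy Python + Z3 sketch, written just for this post, that searches for a refinement counterexample for the sdiv pair from earlier. It is emphatically not how Alive2 works internally: the counterexample helper and the encoding are ours, it ignores undef, poison, and memory, and it only models sdiv’s undefined behavior (division by zero and overflow of INT_MIN / -1).

import z3

x, y = z3.BitVecs("x y", 16)
INT_MIN = z3.BitVecVal(-32768, 16)

# src: %r = sdiv i16 %x, %y
src_ub  = z3.Or(y == 0, z3.And(x == INT_MIN, y == -1))
src_ret = x / y                  # on bit-vectors, z3's / is signed division

# tgt: return 8888 when %y is zero, otherwise sdiv i16 %x, %y
tgt_ub  = z3.And(y != 0, x == INT_MIN, y == -1)
tgt_ret = z3.If(y == 0, z3.BitVecVal(8888, 16), x / y)

def counterexample(a_ub, a_ret, b_ub, b_ret):
    # Look for an input where b fails to refine a: a is defined there, yet
    # b is undefined or returns a different value.
    s = z3.Solver()
    s.add(z3.Not(a_ub), z3.Or(b_ub, a_ret != b_ret))
    return s.model() if s.check() == z3.sat else None

print(counterexample(src_ub, src_ret, tgt_ub, tgt_ret))  # None: tgt refines src
print(counterexample(tgt_ub, tgt_ret, src_ub, src_ret))  # prints a model in which y is 0

The first query has no solution, matching Alive2’s “Transformation seems to be correct!” above; the second finds an input where the divisor is zero, which is essentially the counterexample Alive2 printed, minus the undef machinery.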

Alive2’s predecessor, Alive, was also a refinement checker. It was based on the idea that a user-friendly domain-specific language could make it easier for compiler developers to try out new optimizations. This language supported omitting some details found in real LLVM IR, such as how wide, in bits, an operation being optimized is. Alive2 still supports this domain-specific language, but much of our recent work has focused on handling LLVM IR directly. This means that we are working at a slightly lower level of abstraction than before, but it allows us to run our tool on large amounts of code found in the wild.

One of our main targets for Alive2 has been LLVM’s unit test suite: a collection of about 22,000 files containing LLVM IR. Each file specifies that one or more LLVM optimization passes should be run, and then checks that the file gets optimized as expected. When we run Alive2 over the test suite, we ignore the expected result and instead simply check that the optimized code refines the original code. Perhaps surprisingly, this has caught quite a few bugs: 27 so far. Why would the test suite itself, when all tests are in a passing state, be capable of triggering bugs in LLVM? This can happen for several reasons, but the most common one is that LLVM’s test oracles — which decide whether the compiler behaved correctly or not — are generally coarse-grained. For example, a unit test about optimizing division by power-of-two into right shift might simply test for the presence of a shift instruction in the optimized code. Alive2, on the other hand, acts as a fine-grained oracle: if the optimized code does not preserve the meaning of the original code for even one choice of input values, this fact will be detected and reported. Here’s an example of a bug we found this way.

Alive2 handles a large subset of LLVM, but not all of it. Right now there are three main limitations. First, loops are only partially analyzed. Alive2 unrolls loops once, so any miscompilation that only manifests when the loop runs for more than one iteration cannot be detected. Second, interprocedural optimizations are not supported. Although Alive2 can handle function calls in the code being analyzed, it does not look across functions. So a function inlining transformation cannot be analyzed, nor can something like removing an unused global variable. Third, some kinds of code will reliably cause the underlying SMT solver, Z3, to take too long to return a result. Floating point operations, wide integer divisions, and complex memory operations are common culprits.

This post has introduced Alive2 and shown off some of its basic capabilities. Subsequent posts in this series will cover more detailed technical topics such as Alive2’s memory model, verifying vector and floating point optimizations, and finding bugs in LLVM backends and also binary lifters that turn object code into LLVM IR.

Precision Opportunities for Demanded Bits in LLVM

[Although this post was written to stand by itself, it builds on the previous one. It is authored by Jubi Taneja, Zhengyang Liu, and John Regehr.]

When designing computer systems, it can be useful to avoid specifying behaviors too tightly. For example, we might specify that a math library function only needs to return a result that is within some distance of the true answer, in order to avoid paying the cost of a fully precise answer when that precision is not needed. Similarly, when a processor boots we might specify that some of its registers or memory are in unknown configurations, to avoid burdening the processor designers with bringing up all hardware in known states. In hardware design we often refer to don’t-care terms in a functional specification. Undefined behavior in C and C++ has a similar rationale: by avoiding specification of an implementation’s behaviors in numerous inconvenient situations, more efficient implementations become possible.

A fun thing that optimizing compilers can do is to automatically infer when the full power of some operation is not needed, in which case it may be that the operation can be replaced by a cheaper one. This happens in several different ways; the method we care about today is driven by LLVM’s demanded bits static analysis, whose purpose is to prove that certain bits of an SSA value are irrelevant. For example, if a 32-bit value is truncated to 8 bits, and if that value has no other uses, then the 24 high bits of the original value clearly do not matter.

Here is a demanded-bits-based optimization that looks for a call to the bswap intrinsic where only a single byte of the result is demanded, and then simply shifts that one byte into the necessary position. This isn’t an important optimization for targets that have a fast bswap instruction, but it would be useful for targets that require bswap to be open-coded. In this example, a full-power 64-bit bswap for 32-bit ARM (foo1) requires more than a dozen instructions but a bswap where only the low byte is demanded (foo2) requires just two instructions.

In this post we’ll present a few examples where the precision of LLVM’s demanded bits analysis compares unfavorably with a highly precise demanded bits that we created using an SMT solver. Some of these situations may be worth fixing. The solver-based algorithm is pretty straightforward. For every bit that is an input to a computation, it asks two questions:

  1. If this bit is forced to zero, does the resulting computation have the same behavior as the original, for all valuations of the remaining input bits?
  2. If this bit is forced to one, does the resulting computation have the same behavior as the original, for all valuations of the remaining input bits?

If both answers are “yes” then that input bit is not demanded. Otherwise it is demanded.
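
Here is a rough Python + Z3 sketch of those two queries, for a value computed from a single input. The helper below is ours and is not Souper’s actual implementation; it is just meant to make the procedure concrete.

import z3

def solver_demanded_bits(expr, x, width):
    demanded = []
    for i in range(width):
        def unchanged_when_forced(bit):
            # Force bit i of x to the given value and ask whether the result
            # can ever differ from the original computation.
            forced = x & ~(1 << i) if bit == 0 else x | (1 << i)
            s = z3.Solver()
            s.add(z3.substitute(expr, (x, forced)) != expr)
            return s.check() == z3.unsat
        ok = unchanged_when_forced(0) and unchanged_when_forced(1)
        demanded.append("0" if ok else "1")
    return "".join(reversed(demanded))  # print the most significant bit first

# The truncation example from above: a 32-bit value truncated to 8 bits, with
# no other uses, demands only its low 8 bits.
x = z3.BitVec("x", 32)
print(solver_demanded_bits(z3.Extract(7, 0, x), x, 32))
# prints 24 zeros followed by 8 ones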

Side note: “demanded bits” is a slightly confusing name for this analysis. Since static analyses always approximate, we can say that they have a may direction and a must direction. If demanded bits says that a bit is demanded, this actually means “it may be demanded, or perhaps not: this analysis is incapable of saying for sure.” On the other hand, if it says that a bit is not-demanded, then this should be interpreted as “this bit’s value must not matter, and we’re confident that we could provide a formal proof of this.” The problem here is that it is more intuitive to name an analysis after its must direction, since this is the reliable information that we can directly use to drive optimizations. So, in other words, this analysis should have been called a don’t-care analysis, as is traditional. We are getting off of our soapbox now.

Ok, now on to the examples. The code will be in Souper syntax, but hopefully it is close enough to LLVM IR that nobody will have trouble seeing what’s going on. All of these program fragments are ones that we encounter while compiling SPEC CPU 2017. The LLVM that we’re comparing against is from January 20, 2020.

When a number is multiplied by a constant, some of its high bits may not affect the answer:

%0:i32 = var
%1:i32 = mul 40:i32, %0

demanded-bits from Souper:   for %0 : 00011111111111111111111111111111
demanded-bits from compiler: for %0 : 11111111111111111111111111111111

Let’s pause here for a second and assume that you don’t believe us. And why should you? This material can be counterintuitive and we make mistakes all the time, like anyone else. So here’s a simple program you can use to check this result using exhaustive testing instead of a solver. With a bit of effort you should be able to use it to double-check all of the results in this post.
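
If you would rather not set that program up, here is a small Python sketch of the same exhaustive check, written for this post. To keep brute force fast it works at 16 bits instead of 32, so the patterns it prints are the 16-bit analogues of the ones shown here.

# Bit i is not demanded iff forcing it to 0, and forcing it to 1, both leave
# f's result unchanged for every valuation of the remaining bits.
WIDTH = 16
MASK = (1 << WIDTH) - 1

def demanded_bits(f, width=WIDTH):
    demanded = 0
    for i in range(width):
        forced0_ok = all(f(v & ~(1 << i)) == f(v) for v in range(1 << width))
        forced1_ok = all(f(v | (1 << i)) == f(v) for v in range(1 << width))
        if not (forced0_ok and forced1_ok):
            demanded |= 1 << i
    return format(demanded, f"0{width}b")

# mul by 40: 40 = 8 * 5, so the top three bits of the input can't reach the
# truncated product.
print(demanded_bits(lambda v: (v * 40) & MASK))   # 0001111111111111

# udiv by 1000: 1000 = 8 * 125, so v // 1000 == (v >> 3) // 125 and the low
# three bits don't matter.
print(demanded_bits(lambda v: v // 1000))         # 1111111111111000

Each check enumerates all 65536 input values, which takes a few seconds in CPython; at 32 bits you really do want a solver instead.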

It is also the case that some low bits of a number may not matter, when it is going to be divided by a constant:

%0:i32 = var
%1:i32 = udiv %0, 1000:i32

demanded-bits from Souper:   for %0 : 11111111111111111111111111111000
demanded-bits from compiler: for %0 : 11111111111111111111111111111111

This one is cute:

%0:i32 = var
%1:i32 = srem %0, 2:i32

demanded-bits from Souper:   for %0 : 10000000000000000000000000000001
demanded-bits from compiler: for %0 : 11111111111111111111111111111111

And so is this one:

%0:i32 = var
%1:i32 = mul %0, %0

demanded-bits from Souper:   for %0 : 01111111111111111111111111111111
demanded-bits from compiler: for %0 : 11111111111111111111111111111111

Another flavor of lost precision comes from comparison instructions:

%0:i8 = var
%1:i1 = ult %0, 8:i8

demanded-bits from Souper:   for %0 : 11111000
demanded-bits from compiler: for %0 : 11111111

This one is fun, since only the sign bit matters:

%0:i8 = var
%1:i1 = slt %0, 0:i8

demanded-bits from Souper:   for %0 : 10000000
demanded-bits from compiler: for %0 : 11111111

The above represent some low-hanging fruit. We’re leaving out some UB-related imprecision since that stuff is fiddly to work with, and also a lot of imprecision that requires looking at multiple instructions together. (Unhappily, abstract transfer functions are non-compositional: even when a collection of maximally precise functions is applied to a collection of instructions, the result is often not maximally precise.) Here are two quick examples:

%0:i16 = var
%1:i16 = add 64:i16, %0
%2:i16 = and 448:i16, %1

demanded-bits from Souper:   for %0 : 0000000111000000
demanded-bits from compiler: for %0 : 0000000111111111

And second:

%0:i32 = var
%1:i1 = eq 0:i32, %0
%2:i32 = select %1, 1:i32, %0
%3:i1 = ne 0:i32, %2

demanded-bits from Souper:   for %0 : 00000000000000000000000000000000
demanded-bits from compiler: for %0 : 11111111111111111111111111111111


Here we can see that %2 can never be zero, since %2 takes the value of %0 when %0 is not zero, and takes the value 1 otherwise. This analysis result is particularly satisfying because the special case of no demanded bits means that some instructions become dead. In other words, as far as this little collection of instructions is concerned, there wasn’t any need to compute %0 in the first place, because every one of the 2^32 values that it could take results in %3 being true.

Which of these imprecisions are worth fixing? That is harder to answer. The potential benefit is obvious: it may allow additional optimizations to be performed. The costs of increasing precision include increased compile time, the overhead of maintaining the new code, and the risk of miscompilation if someone writes an unsound transfer function.

Testing Dataflow Analyses for Precision and Soundness

[This piece is co-authored by Jubi Taneja, Zhengyang Liu, and John Regehr; it summarizes some of the findings from a paper whose camera-ready copy we recently completed, and which will be published at CGO (Code Generation and Optimization) 2020.]

Update from Jan 12 2020: Looks like there’s a patch in the works fixing some imprecisions we found!

Optimizing compilers use dataflow analyses to compute properties that are true in every execution of the program being compiled. For example, if an array access is provably in-bounds in every execution, the bounds check can be elided. A dataflow analysis should be:

  • Sound: It must never erroneously claim that a property holds. Unsound dataflow analyses typically lead to miscompilations, for example by helping eliminate a bounds check that could fail.
  • Precise: When the property holds, the dataflow analysis should be able to see this. The halting problem tells us that this is not always possible, but we’d like the analysis to mostly work pretty well in practice.
  • Fast: We don’t want dataflow analysis to slow down the overall compilation very much.

Meeting these constraints isn’t easy, and moreover compiler developers generally attempt to do this in a seat-of-the-pants fashion, without any kind of help from formal-methods-based tools. Our paper is based on the premise that formal methods can help! Using an SMT solver, we can create dataflow analyses that are sound and maximally precise: they never erroneously claim that the property holds and they always conclude that the desired property holds when this can be done at all by the analysis in question. (But keep in mind these claims require an ideal SMT solver: solver bugs and solver timeouts respectively defeat them.)

Our work is only about soundness and precision: it does not help make dataflow analyses faster. Moreover, since it involves lots of solver calls, it is slow: we can use it offline to find places where the compiler might want to be improved, but it is not possible to just drop it into the compiler as a replacement for existing dataflow analyses. (We do have some preliminary work on synthesizing dataflow implementations, and will write about that some other time.)

The first dataflow analysis we’ll look at is one of LLVM’s workhorses: known bits. This analysis tries to prove that individual bits of SSA values are either 0 or 1 in all executions. Many cases handled by this analysis are easy to understand. For example, a value that has been zero-extended obviously contains zeroes in its high-order bits. Similarly, a value that has been orred with a bitmask will have some ones in it. Other cases are harder. As an exercise, you might try to work out the known bits in the result of a subtraction or a signed division.
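
If you want to check your answers to that exercise, a small brute-force script like the following (ours, written at a reduced width purely for illustration) enumerates every pair of operands consistent with some known bits and reports which bits of the difference are forced:

from itertools import product

def concretizations(pattern):
    # Expand a known-bits pattern like "??10" (most significant bit first)
    # into every concrete value consistent with it.
    for bits in product("01", repeat=pattern.count("?")):
        it = iter(bits)
        yield int("".join(next(it) if c == "?" else c for c in pattern), 2)

def known_bits_of_sub(a_pattern, b_pattern):
    width = len(a_pattern)
    mask = (1 << width) - 1
    results = {(a - b) & mask
               for a in concretizations(a_pattern)
               for b in concretizations(b_pattern)}
    out = []
    for i in reversed(range(width)):
        bits = {(r >> i) & 1 for r in results}
        out.append(str(bits.pop()) if len(bits) == 1 else "?")
    return "".join(out)

# Since the low two bits of both operands are known, the low two bits of the
# difference are known as well.
print(known_bits_of_sub("??10", "??01"))   # ??01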

Known bits has a variety of customers in LLVM middle-ends and backends. For example, the “simplify CFG” pass computes the known bits in the argument to a switch instruction in order to optimize away dead switch cases; a switch case is provably dead if at least one bit of its value conflicts with a known bit in the value being switched on.

Ok, it’s time for an example. Consider this LLVM IR which shifts the value four to the right by the number of bit positions specified in its argument value:

define i32 @foo(i32) {
  %a = lshr i32 4, %0
  ret i32 %a
}

As of January 2020, LLVM doesn’t return any known bits for %a. If you would like to see this for yourself, first build a fresh LLVM and then build this out-of-tree pass, which calls LLVM’s dataflow analyses and pretty-prints their results. Run it like so:

$ opt -load ./LLVMDataflowInfo.so -dataflow-info ../test/test1.ll -o /dev/null
function: foo
  %a = lshr i32 4, %0
    known: ????????????????????????????????
    isKnownToBePowerOfTwo = 0
    LVI: [0,5)
    SCEV: %a  full-set  full-set
    demanded: 11111111111111111111111111111111
$

It is easy to see that a better answer is possible: LLVM could have returned 00000000000000000000000000000???. In other words, the result will always have 29 zeroes in the high end. This is the best possible result: a sound implementation of known bits cannot prove anything about the last three bits. This is because, for each of these bits, there exists an execution in which the bit is zero and also an execution in which the bit is one.

Now let’s switch gears and look at a different dataflow analysis from LLVM: integer ranges. This analysis lives in the “Lazy Value Info” analysis pass and it computes one of four results for an SSA value; the one we care about right now is a “regular range” [a, b) where a is smaller (using unsigned comparison) than b. This result means that in every execution of the program, the SSA value will be at least a and also less than b, again using unsigned comparisons.

For example, consider this IR which zero-extends its 8-bit arguments to 64 bits, xors the extended values, and returns the result:

define i64 @foo(i8, i8) {
  %x = zext i8 %0 to i64
  %y = zext i8 %1 to i64
  %z = xor i64 %x, %y
  ret i64 %z
}

It is obvious that [0, 256) is the result we want for %z: no smaller range captures all of the possibilities, and there are no possible executions that require a larger range. LLVM, on the other hand, returns a special “full set” result for %z, meaning that it is unable to compute any useful information about it. This happens because the code for modeling the effect of binary operators on integer ranges does not handle xor at all.

In our paper we deal with some additional dataflow analyses, such as “demanded bits,” which is LLVM’s funny name for a don’t-care analysis that looks for bits whose value does not affect subsequent computations. Demanded bits is interesting because it propagates information backwards along SSA edges. We won’t cover demanded bits or any of the others in this post.

Ok, now let’s get to the good stuff: computing maximally precise dataflow results. We build on Souper, since it already knows how to ask a solver questions about LLVM code, but besides that we don’t make interesting use of Souper’s capabilities. The bad news is that we don’t know how to just plug in a formal definition of a dataflow analysis and get answers directly from that. Rather, we need to spend a bit of time studying the structure of each abstract domain that we target, in order to create a solver-based algorithm for computing maximally precise results reasonably efficiently.

For known bits, we can deal with each bit individually. For a given bit, we’ll first ask the solver whether there exists any valuation of the inputs that can force the bit to be zero. If this returns UNSAT, then we have just proven that this bit is one in every execution. If this query is SAT, we issue a second, analogous, query asking whether this bit can ever be one. If not, it is known to be zero. If neither query returns UNSAT, we know that this bit can be both 0 and 1, and we move on to the next bit. This procedure gives the most precise result using at most 2n solver queries (for a value n bits wide). The gist of the argument for maximal precision is that known bits are independent of each other. This is a nice result because a naive search of all elements of the known bits lattice would require 3^n solver calls and this is clearly infeasible.
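
Here is a rough Python + Z3 sketch of this procedure; it is not Souper’s code and it ignores undefined behavior (for example, oversized shifts), but it shows the two-queries-per-bit structure:

import z3

def precise_known_bits(expr, width):
    known = []
    for i in range(width):
        def satisfiable(bit):
            s = z3.Solver()
            s.add(z3.Extract(i, i, expr) == bit)
            return s.check() == z3.sat
        if not satisfiable(0):
            known.append("1")        # this bit is 1 in every execution
        elif not satisfiable(1):
            known.append("0")        # this bit is 0 in every execution
        else:
            known.append("?")
    return "".join(reversed(known))  # most significant bit first

# The lshr example from earlier in this post.
x = z3.BitVec("x", 32)
print(precise_known_bits(z3.LShR(z3.BitVecVal(4, 32), x), 32))
# prints 29 zeros followed by three unknown bits, matching the Souper output below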

Looking back at the first example above, we can issue these commands:

$ llvm-as test1.ll
$ souper $SOUPER_SOLVER --infer-known-bits test1.bc 
; Listing valid replacements.
; Using solver: Z3 + internal cache

; Function: foo
%0:i32 = var
%1:i32 = lshr 4:i32, %0
infer %1
; knownBits from souper: 00000000000000000000000000000xxx
$ 

Integer ranges, it turns out, are more difficult. First, due to a “wrapped range” feature that we’ll not discuss here (again, see our paper) this abstract domain doesn’t form a lattice, so the definition of “maximal precision” becomes problematic. We pave over this issue by relaxing the notion of maximal precision a bit. Instead of insisting on the unique maximally precise result that would have to exist if the domain had been a lattice (but does not, since it isn’t), we’ll accept any result as maximally precise if there does not exist an integer range that excludes a larger number of values than is excluded by the given range. (If you find static analysis using non-lattice abstract domains interesting, this very nice paper contains more details.) The second problem with integer ranges is that there’s no obvious bit-level divide and conquer strategy for finding the maximally precise result. We’ll need a different approach.

Ignoring a special “empty set” value which indicates dead code or unconditional undefined behavior, we’re looking for an integer range [x, x+c) such that c is minimized. For example, we can start with c=1, corresponding to the hypothesis that there is a valid dataflow result [x, x+1), which excludes all integer values but one. In other words, this hypothesis corresponds to the SSA value being a constant, x. But how can we find the concrete value of x in a large search space such as 2^64? We do this using solver-based constant synthesis. If synthesis succeeds, we have found the maximally precise result. If no x exists such that [x, x+1) is a valid dataflow result, we launch a binary search to find the smallest c for which a suitable x does exist. This range will satisfy the maximal precision condition from the previous paragraph. The details of synthesizing an integer value meeting constraints don’t matter here as long as we have a procedure that works.
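
Here is a rough Python + Z3 sketch of this search. To keep it short we cheat in two ways: instead of a synthesis procedure we use one exists/forall query per candidate c, and we scale the xor example down to 16 bits so the quantified queries stay quick. It is only meant to convey the shape of the algorithm, not Souper’s implementation.

import z3

def range_fits(expr, inputs, width, c):
    # Is there an x such that expr always lies in the range [x, x+c)?
    # (expr - x) mod 2^width < c encodes membership in that range.
    x = z3.BitVec("range_lo", width)
    s = z3.Solver()
    s.add(z3.ForAll(inputs, z3.ULT(expr - x, c)))
    if s.check() == z3.sat:
        return s.model().eval(x, model_completion=True).as_long()
    return None

def infer_range(expr, inputs, width):
    lo, hi = 1, 1 << width                  # a size of 2^width is the full set
    best = None
    while lo < hi:                          # binary search for the smallest c
        mid = (lo + hi) // 2
        x = range_fits(expr, inputs, width, mid)
        if x is not None:
            best, hi = (x, x + mid), mid
        else:
            lo = mid + 1
    return best                             # None means "full set"

# The xor example, with the zero extensions going to i16 instead of i64.
a, b = z3.BitVec("a", 8), z3.BitVec("b", 8)
print(infer_range(z3.ZeroExt(8, a) ^ z3.ZeroExt(8, b), [a, b], 16))  # (0, 256)

The pair (0, 256) stands for the range [0,256), the same maximally precise answer that Souper reports for the 64-bit version below.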

Going back to the second example:

$ llvm-as test2.ll
$ souper $SOUPER_SOLVER --infer-range test2.bc 

... a few irrelevant lines elided ...

; Function: foo
%0:i8 = var
%1:i64 = zext %0
%2:i8 = var
%3:i64 = zext %2
%4:i64 = xor %1, %3
infer %4
; range from souper: [0,256)
$ 

So in both of these cases we get the obvious maximally precise result. However, these examples are contrived. What code should we look at, when doing this work for real? We looked at all of the bitcode from compiling the C and C++ programs in SPEC CPU 2017, but really any large collection of LLVM IR should be fine. For each SSA value, we use our code and LLVM’s to compute dataflow results and then compare them. The possibilities are:

  • Our result is more precise than LLVM’s: lost precision, flag for human inspection.
  • LLVM’s result is more precise than ours: LLVM is unsound, flag for human inspection.
  • Results have the same precision: ignore.

Across the LLVM IR from compiling SPEC CPU 2017, we find many examples where LLVM’s dataflow analyses leave some precision on the floor. Is it worth making the analyses more precise? This is subjective, and can only be answered by compiler developers. Overall — despite the fact that we find a lot of imprecision — it looks to us like LLVM’s dataflow analyses are quite mature and provide a lot of precision where it matters. This would not have been the case around 2010, but these analyses have gotten a lot of tweaks since then. Even so, there is still plenty of room for improvement.

We thought that while doing this work we might find examples where LLVM’s dataflow result is “more precise” than our maximally precise one. This would indicate soundness errors in LLVM. We did not find any such examples, but we did re-introduce some old LLVM soundness bugs into the version we used for testing, in order to verify that we can detect them automatically. Details are in the paper.

Since our method requires concrete code fragments as the basis for comparing the precision of LLVM’s analyses and ours, we might ask: is this an advantage or a disadvantage? On the plus side, we’ll find imprecisions and soundness errors that occur when compiling real programs, and presumably these are the ones that should be a higher priority for fixing. On the minus side, we’ll miss cases not exercised by the test programs.

In summary, we have presented several solver-based algorithms for computing maximally precise dataflow results, with the goal of spotting soundness errors and lost precision in a real compiler, LLVM. This is part of a larger research program outlined here. The idea is to move compiler construction away from the current situation, where compilers are huge collections of bespoke code, and towards a situation where the mathematical foundations of compilation can be directly used to support their construction.