Taming Undefined Behavior in LLVM

Earlier I wrote that Undefined Behavior != Unsafe Programming, a piece intended to convince you that there’s nothing inherently wrong with undefined behavior as long as it isn’t in developer-facing parts of the system.

Today I want to talk about a new paper about undefined behavior in LLVM that’s going to be presented in June at PLDI 2017. I’m an author of this paper, but not the main one. This work isn’t about debating the merits of undefined behavior, its goal is to describe and try to fix some unintended consequences of the design of undefined behavior at the level of LLVM IR.

Undefined behavior in C and C++ is sort of like a bomb: either it explodes or it doesn’t. We never try to reason about undefined programs because a program becomes meaningless once it executes UB. LLVM IR contains this same kind of UB, which we’ll call “immediate UB.” It is triggered by bad operations such as an out-of-bounds store (which is likely to corrupt RAM) or a division by zero (which may cause the processor to trap).

Our problems start because LLVM also contains two kinds of “deferred UB” which don’t explode, but rather have a contained effect on the program. We need to reason about the meaning of these “slightly undefined” programs which can be challenging. There have been long threads on the LLVM developers’ mailing list going back and forth about this.

The first kind of deferred UB in LLVM is the undef value that acts like an uninitialized register: an undef evaluates to an arbitrary value of its type. Undef is useful because sometimes we want to say that a value doesn’t matter, for example because we know a location is going to be over-written later. If we didn’t have something like undef, we’d be forced to initialize locations like this to specific values, which costs space and time. So undef is basically a note to the compiler that it can choose whatever value it likes. During code generation, undef usually gets turned into “whatever was already in the register.”

Unfortunately, the semantics of undef don’t justify all of the optimizations that we’d like to perform on LLVM code. For example, consider this LLVM function:

define i1 @f(i32) {
  %2 = add nsw i32 %0, 1
  %3 = icmp sgt i32 %2, %0
  ret i1 %3
}

This is equivalent to “return x+1 > x;” in C and we’d like to be able to optimize it to “return true;”. In both languages the undefinedness of signed overflow needs to be recognized to make the optimization go. Let’s try to do that using undef. In this case the semantics of “add nsw” are to return undef if signed overflow occurs and to return the mathematical answer otherwise. So this example has two cases:

  1. The input is not INT_MAX, in which case the addition returns input + 1.
  2. The input is INT_MAX, in which case the addition returns undef.

In case 1 the comparison returns true. Can we make the comparison true for case 2, giving us the overall result that we want? Recall that undef resolves as an arbitrary value of its type. The compiler is allowed to choose this value. Alas, there’s no value of type i32 that is larger than INT_MAX, when we use a signed comparison. Thus, this optimization is not justified by the semantics of undef.

One choice we could make is to give up on performing this optimization (and others like it) at the LLVM level. The choice made by the LLVM developers, however, was to introduce a second, stronger, form of deferred UB called poison. Most instructions, taking a poison value on either input, evaluate to poison. If poison propagates to a program’s output, the result is immediate UB. Returning to the “x + 1 > x” example above, making “add nsw INT_MAX, 1” evaluate to poison allows the desired optimization: the resulting poison value makes the icmp also return poison. To justify the desired optimization we can observe that returning 1 is a refinement of returning poison. Another way to say the same thing is that we’re always allowed to make code more defined than it was, though of course we’re never allowed to make it less defined.

The most important optimizations enabled by deferred undefined behavior are those involving speculative execution such as hoisting loop-invariant code out of a loop. Since it is often difficult to prove that a loop executes at least once, loop-invariant code motion threatens to take a defined program where UB sits inside a loop that executes zero times and turn into into an undefined program. Deferred UB lets us go ahead and speculatively execute the code without triggering immediate UB. There’s no problem as long as the poisonous results don’t propagate somewhere that matters.

So far so good! Just to be clear: we can make the semantics of an IR anything we like. There will be no problem as long as:

  1. The front-ends correctly refine C, C++, etc. into IR.
  2. Every IR-level optimization implements a refinement.
  3. The backends correctly refine IR into machine code.

The problem is that #2 is hard. Over the years some very subtle mistakes have crept into the LLVM optimizer where different developers have made different assumptions about deferred UB, and these assumptions can work together to introduce bugs. Very few of these bugs can result in end-to-end miscompilation (where a well-formed source-level program is compiled to machine code that does the wrong thing) but even this can happen. We spent a lot of time trying to explain this clearly in the paper and I’m unlikely to do better here! But the details are all there in Section 3 of the paper. The point is that so far these bugs have resisted fixing: nobody has come up with a way to make everything consistent without giving up optimizations that the LLVM community is unwilling to give up.

The next part of the paper (Sections 4, 5, 6) introduces and evaluates our proposed fix, which is to remove undef, leaving only poison. To get undef-like semantics we introduce a new freeze instruction to LLVM. Freezing a normal value is a nop and freezing a poison value evaluates to an arbitrary value of the type. Every use of a given freeze instruction will produce the same value, but different freezes may give different values. The key is to put freezes in the right places. My colleagues have implemented a fork of LLVM 4.0 that uses freeze; we found that it more or less doesn’t affect compile times or the quality of the generated code.

We are in the process of trying to convince the LLVM community to adopt our proposed solution. The change is somewhat fundamental and so this is going to take some time. There are lots of details that need to be ironed out, and I think people are (rightfully) worried about subtle bugs being introduced during the transition. One secret weapon we have is Alive where Nuno has implemented the new semantics in the newsema branch and we can use this to test a large number of optimizations.

Finally, we noticed that there has been an interesting bit of convergent evolution in compiler IRs: basically all heavily optimizing AOT compilers (including GCC, MSVC, and Intel CC) have their own versions of deferred UB. The details differ from those described here, but the effect is the same: deferred UB gives the compiler freedom to perform useful transformations that would otherwise be illegal. The semantics of deferred UB in these compilers has not, as far as we know, been rigorously defined and so it is possible that they have issues analogous to those described here.

Do Expressive Programming Languages Always Have Undefined Behavior?

In the Hacker News comments on one of my previous posts about undefined behavior, someone said this:

AFAIK Gödel’s incompleteness theorems imply that _any_ language will have at least some undefined behaviour.

Let’s take a quick look at this statement, keeping in mind that incompleteness and undecidability can be remarkably tricky topics. Some years ago I read and enjoyed Gödel’s Theorem: An Incomplete Guide to Its Use and Abuse (and no doubt need to reread it).

First off, it is clear that there exist programming languages that are free of UB, such as one where the semantics of every program is to print “7”. Whatever it is that UB means (we have not formally defined it), it seems clear that the language that always prints “7” does not have it.

There are also useful languages that are obviously UB-free, such as an expression language that evaluates elementary functions over IEEE floats. These languages are particularly easy to reason about because they are not Turing-complete: all computations terminate and we simply have to ensure that they terminate with a defined result.

In contrast, the HN commenter may have intended to invoke Rice’s Theorem: “Any nontrivial property about the language recognized by a Turing machine is undecidable.” A consequence is that when f() is some arbitrary computation we cannot in general hope to decide whether a program like this invokes undefined behavior:

main() {
  f();
  return 1 / 0; // UB!
}

But this is a red herring. Rice’s Theorem only applies to non-trivial properties: “properties that apply neither to no programs nor all programs in the language.” To sidestep it, we only need to define a programming language where UB-freedom is a trivial property. This is done by ensuring that every operation that a program can perform is a total function: it is defined in all circumstances. Thus, programs in this language will either terminate in some defined state or else fail to terminate. This kind of extremely tight specification is not typically done for realistic programming languages because it is a lot of work, particularly if the language has open-ended interactions with other levels of the system, such as inline assembly. But the problem is only a practical one; there is no problem in principle. It is not too difficult to write an UB-free specification for a (Turing-complete) toy language or subset of a real language.

Now let’s return to the original HN comment: it was about the incompleteness theorems, not about the halting problem. I’m not sure what to do with that, as I don’t see that Gödel’s theorems have any direct bearing on undefined behavior in programming languages.

Undefined Behavior != Unsafe Programming

Undefined behavior (UB) in C and C++ is a clear and present danger to developers, especially when they are writing code that will execute near a trust boundary. A less well-known kind of undefined behavior exists in the intermediate representation (IR) for most optimizing, ahead-of-time compilers. For example, LLVM IR has undef and poison in addition to true explodes-in-your-face C-style UB. When people become aware of this, a typical reaction is: “Ugh, why? LLVM IR is just as bad as C!” This piece explains why that is not the correct reaction.

Undefined behavior is the result of a design decision: the refusal to systematically trap program errors at one particular level of a system. The responsibility for avoiding these errors is delegated to a higher level of abstraction. For example, it is obvious that a safe programming language can be compiled to machine code, and it is also obvious that the unsafety of machine code in no way compromises the high-level guarantees made by the language implementation. Swift and Rust are compiled to LLVM IR; some of their safety guarantees are enforced by dynamic checks in the emitted code, other guarantees are made through type checking and have no representation at the LLVM level. Either way, UB at the LLVM level is not a problem for, and cannot be detected by, code in the safe subsets of Swift and Rust. Even C can be used safely if some tool in the development environment ensures that it will not execute UB. The L4.verified project does exactly this.

The essence of undefined behavior is the freedom to avoid a forced coupling between error checks and unsafe operations. The checks, once decoupled, can be optimized, for example by being hoisted out of loops or eliminated outright. The remaining unsafe operations can be — in a well-designed IR — mapped onto basic processor operations with little or no overhead. As a concrete example, consider this Swift code:

func add(a : Int, b : Int)->Int {
  return (a & 0xffff) + (b & 0xffff);
}

Although a Swift implementation must trap on integer overflow, the compiler observes that overflow is impossible and emits this LLVM IR:

define i64 @add(i64 %a, i64 %b) {
  %0 = and i64 %a, 65535
  %1 = and i64 %b, 65535
  %2 = add nuw nsw i64 %0, %1
  ret i64 %2
}

Not only has the checked addition operation been lowered to an unchecked one, but also the add instruction has been marked with LLVM’s nsw and nuw attributes, indicating that both signed and unsigned overflow are undefined. In isolation these attributes provide no benefit, but they may enable additional optimizations after this function is inlined. When the Swift benchmark suite is compiled to LLVM, about one in eight addition instructions has an attribute indicating that integer overflow is undefined.

In this particular example the nsw and nuw attributes are redundant since an optimization pass could re-derive the fact that the add cannot overflow. However, in general these attributes and others like them add real value by avoiding the need for potentially expensive static analyses to rediscover known program facts. Also, some facts cannot be rediscovered later, even in principle, since information is lost at some compilation steps.

In summary, undefined behavior in programmer-visible abstractions represents an aggressive and dangerous tradeoff: it sacrifices program correctness in favor of performance and compiler simplicity. On the other hand, UB at lower levels of the system, such as machine code or a compiler IR, is an internal design choice that needn’t have any effect on the programmer-facing parts of the system. This kind of UB simply requires us to accept that safety checks can be usefully factored out of their corresponding unsafe operations to afford efficient execution.

Detecting Strict Aliasing Violations in the Wild

Type-based alias analysis, where pointers to different types are assumed to point to distinct objects, gives compilers a simple and effective way to disambiguate memory references in order to generate better code. Unfortunately, C and C++ make it easy for programmers to violate the assumptions upon which type-based alias analysis is built. “Strict aliasing” refers to a collection of rules in the C and C++ standards that restrict the ways in which you are allowed to modify and look at memory objects, in order to make type-based alias analysis work in these weakly-typed languages. The problem is that the strict aliasing rules contain tricky and confusing corner cases and also that they rule out many idioms that have historically worked, such as using a pointer type cast to view a float as an unsigned, in order to inspect its bits. Such tricks are undefined behavior. See the first part of this post for more of an introduction to these issues.

The purpose of this piece is to call your attention to a new paper, Detecting Strict Aliasing Violations in the Wild, by Pascal Cuoq and his colleagues. C and C++ programmers should read it. Sections 1 and 2 introduce strict aliasing, they’re quick and easy.

Section 3 shows what compilers think about strict aliasing problems by looking at how a number of C functions get translated to x86-64 assembly. This material requires perseverance but it is worth taking the time to understand the examples in detail, because compilers apply the same thinking to real programs that they apply to tiny litmus tests.

Section 4 is about a new tool, built as part of Trust-in-Soft’s static analyzer, that can diagnose violations of the strict aliasing rules in C code. As the paper says, “it works best when applied on definite inputs,” meaning that the tool should be used as an extended checker for tis-interpreter. Pascal tells me that a release containing the strict aliasing checker is planned, but the time frame is not definite. In any case, readers interested in strict aliasing, but not specifically in tools for dealing with it, can skip this section.

Section 5 applies the strict aliasing checker to open source software. This is good reading because it describes problems that are very common in the wild today. Finding a bug in zlib was a nice touch: zlib is small and has already been looked at closely. Some programs mitigate these bugs by asking the compiler to avoid enforcing the strict aliasing rules: LLVM, GCC, and Intel CC all take a -fno-strict-aliasing flag and MSVC doesn’t implement type-based alias analysis at all. Many other programs contain time bombs: latent UB bugs that don’t happen to be exploited now, that might be exploited later when the compiler becomes a bit brighter.

Also see libcrunch and its paper and a paper about SafeType.

Testing LLVM

[This piece is loosely a followup to this one.]

Background

Once a piece of software reaches a certain size, it is guaranteed to be loosely specified and not completely understood by any individual. It gets committed to many times per day by people who are only loosely aware of each others’ work. It has many dependencies including the compiler, operating system, and libraries, all of which are buggy in their own special ways, and all of which are updated from time to time. Moreover, it usually has to run atop several different platforms, each one individually quirky. Given the massive number of possibilities for flaky behavior, why should we expect our large piece of software to work as expected? One of the most important reasons is testing. That is, we routinely ensure that it works as intended in every important configuration and on every important platform, and when it doesn’t work we have smart people tracking down and fixing the issues.

Today we’re talking about testing LLVM. In some ways, a compiler makes a very friendly target for testing:

  • The input format (source code) and output format (assembly code) are well-understood and have independent specifications.
  • Many compilers have an intermediate representation (IR) that has its own documented semantics and can be dumped and parsed, making it easier (though not always easy) to test internals.
  • It is often the case that a compiler is one of several independent implementations of a given specification, such as the C++ standard, enabling differential testing. Even when multiple implementations are unavailable, we can often test a compiler against itself by comparing the output of different backends or different optimization modes.
  • Compilers are usually not networked, concurrent, or timing-dependent, and overall interact with the outside world only in very constrained ways. Moreover, compilers are generally intended to be deterministic.
  • Compilers usually don’t run for very long, so they don’t have to worry too much about resource leaks or recovering gracefully from error conditions.

But in other ways, compilers are not so easy to test:

  • Production compilers are supposed to be fast, so they are often written in an unsafe language and may skimp on assertions. They use caching and lazy evaluation when possible, adding complexity. Furthermore, splitting compiler functionality into lots of clean, independent little passes leads to slow compilers, so there tends to be some glomming together of unrelated or not-too-closely-related functionality, making it more difficult to understand, test, and maintain the resulting code.
  • The invariants on compiler-internal data structures can be hellish and are often not documented completely.
  • Some compiler algorithms are difficult, and it is almost never the case that a compiler implements a textbook algorithm exactly, but rather a close or distant relative of it.
  • Compiler optimizations interact in difficult ways.
  • Compilers for unsafe languages do not have lots of obligations when compiling undefined behaviors, placing the responsibility for avoiding UB outside of the compiler (and on the person creating test cases for the compiler). This complicates differential testing.
  • The standards for compiler correctness are high since miscompilations are tough to debug and also they can quietly introduce security vulnerabilities in any code that they compile.

So, with that background out of the way, how is LLVM tested?

Unit Tests and Regression Tests

LLVM’s first line of defense against bugs is a collection of tests that get run when a developer builds the check target. All of these tests should pass before a developer commits a patch to LLVM (and of course many patches should include some new tests). I have a fairly fast desktop machine that runs 19,267 tests in 96 seconds. The number of tests that run depends on what auxiliary LLVM projects you have downloaded (compiler-rt, libcxx, etc.) and, to a lesser extent, on what other software gets autodetected on your machine (e.g. the OCaml bindings don’t get tested unless you have OCaml installed). These tests need to be fast so developers can run them often, as mentioned here. Additional tests get run by some alternate build targets such as check-all and check-clang.

Some of the unit/regression tests are at the API level, these use Google Test, a lightweight framework that provides C++ macros for hooking into the test framework. Here’s a test:

TEST_F(MatchSelectPatternTest, FMinConstantZero) {
  parseAssembly(
      "define float @test(float %a) {\n"
      "  %1 = fcmp ole float %a, 0.0\n"
      "  %A = select i1 %1, float %a, float 0.0\n"
      "  ret float %A\n"
      "}\n");
  // This shouldn't be matched, as %a could be -0.0.
  expectPattern({SPF_UNKNOWN, SPNB_NA, false});
}

The first argument to the TEST_F macro indicates the name of the test case (a collection of tests) and the second names the actual test shown here. The parseAssembly() and expectPattern() methods respectively call into an LLVM API and then check that this had the expected result. This example is from ValueTrackingTest.cpp. Many tests can be put into a single file, keeping things fast by avoiding forks/execs.

The other infrastructure used by LLVM’s fast test suite is lit, the LLVM Integrated Tester. lit is shell-based: it executes commands found in a test case, and considers the test to have been successful if all of its sub-commands succeed.

Here’s a test case for lit (I grabbed the top of this file, which contains additional tests that don’t matter to us right now):

; RUN: opt < %s -instcombine -S | FileCheck %s

define i64 @test1(i64 %A, i32 %B) {
        %tmp12 = zext i32 %B to i64
        %tmp3 = shl i64 %tmp12, 32
        %tmp5 = add i64 %tmp3, %A
        %tmp6 = and i64 %tmp5, 123
        ret i64 %tmp6
; CHECK-LABEL: @test1(
; CHECK-NEXT: and i64 %A, 123
; CHECK-NEXT: ret i64
}

This test case is making sure that InstCombine, the LLVM-level peephole optimization pass, is able to notice some useless instructions: the zext, shl, and add are not needed here. The CHECK-LABEL line looks for the line of optimized code that begins the function, the first CHECK-NEXT makes sure that the and instruction is on the next line, and the second CHECK-NEXT makes sure the ret instruction is on the line following the and (thanks Michael Kuperstein for correcting an earlier explanation of this test).

To run this test case, the file is interpreted three times. First, lit scans it looking for lines containing RUN: and executes each associated command. Second, the file is interpreted by opt, the standalone optimizer for LLVM IR; this happens because lit replaces the %s variable with the name of the file being processed. Since comments in textual LLVM IR are preceded by a semicolon, the lit directives are ignored by opt. The output of opt is piped to the FileCheck utility which parses the file yet again, looking for commands such as CHECK and CHECK-NEXT; these tell it to look for strings in its stdin, and to return a non-zero status code if any of the specified strings isn't found. (CHECK-LABEL is used to divide up a file into a collection of logically separate tests.)

An important part of a long-term testing campaign is using coverage tools to find parts of the code base that aren't being tested. Here's a recent LLVM coverage report based on running the unit/regression tests. This data is pretty interesting to poke around in. Let's take a quick look at coverage of InstCombine, which is generally very good. An interesting project for someone wanting to get started with LLVM would be to write and submit test cases that cover untested parts of InstCombine. For example, here's the first uncovered code (colored red) in InstCombineAndOrXor.cpp:

The comment tells us what the transformation is looking for, it should be fairly easy to target this code with a test case. Code that can't be covered is dead; some dead code wants to be removed, other code such as this example (from the same file) is a bug if it isn't dead:

Trying to cover these lines is a good idea, but in that case you're trying to find bugs in LLVM, as opposed to trying to improve the test suite. It would probably be good to teach the coverage tool to not tell us about lines that are marked unreachable.

The LLVM Test Suite

In contrast with the regression/unit tests, which are part of the main LLVM repository and can be run quickly, the test suite is external and takes longer to run. It is not expected that developers will run these tests prior to committing; rather, these tests get run automatically and often, on the side, by LNT (see the next section). The LLVM test suite contains entire programs that are compiled and run; it isn't intended to look for specific optimizations, but rather to help ascertain the quality and correctness of the generated code overall.

For each benchmark, the test suite contains test inputs and their corresponding expected outputs. Some parts of the test suite are external, meaning that there is support for invoking the tests, but the tests themselves are not part of the test suite and must be downloaded separately, typically because the software being compiled is not free.

LNT

LNT (LLVM Nightly Test) doesn't contain any test cases; it is a tool for aggregating and analyzing test results, focusing on monitoring the quality of the compiler's generated code. It consists of local utilities for running tests and submitting results, and then there's a server side database and web frontend that makes it easy to look through results. The NTS (Nightly Test Suite) results are here.

BuildBot

The Linux/Windows BuiltBot and the Darwin one (I don't know why there are two) are used to make sure LLVM configures, builds, and passes its unit/regression tests on a wide variety of platforms and in a variety of configurations. The BuildBot has some blame support to help find problematic commits and will send mail to their authors.

Eclectic Testing Efforts

Some testing efforts originate outside of the core LLVM community and aren't as systematic in terms of which versions of LLVM get tested. These tests represent efforts by individuals who usually have some specific tool or technique to try out. For example, for a long time my group tested Clang+LLVM using Csmith and reported the resulting bugs. (See the high-level writeup.) Sam Liedes applied afl-fuzz to the Clang test suite. Zhendong Su and his group have been finding a very impressive number of bugs. Nuno Lopes has done some awesome formal-methods-based testing of optimization passes that he'll hopefully write about soon.

A testing effort that needs to be done is repeatedly generating a random (but valid) IR function, running a few randomly-chosen optimization passes on it, and then making sure the optimized function refines the original one (the desired relationship is refinement, rather than equivalence, because optimizations are free to make the domain of definedness of a function larger). This needs to be done in a way that is sensitive to LLVM-level undefined behavior. I've heard that something like this is being worked on, but don't have details.

Testing in the Wild

The final level of testing is, of course, carried out by LLVM's users, who occasionally run into crashes and miscompiles that have escaped other testing methods. I've often wanted to better understand the incidence of compiler bugs in the wild. For crashes this could be done by putting a bit of telemetry into the compiler, though few would use this if opt-in, and many would (legitimately) object if opt-out. Miscompiles in the wild are very hard to quantify. My hypothesis is that most miscompiles go unreported since reducing their triggers is so difficult. Rather, as people make pseudorandom code changes during debugging, they eventually work around the problem by luck and then promptly forget about it.

A big innovation would be to ship LLVM with a translation validation scheme that would optionally use an SMT solver to prove that the compiler's output refines its input. There are all sorts of challenges including undefined behavior and the fact that it's probably very difficult to scale translation validation up to the large functions that seem to be the ones that trigger miscompilations in practice.

Alternate Test Oracles

A "test oracle" is a way to decide whether a test passes or fails. Easy oracles include "compiler terminates with exit code 0" and "compiled benchmark produces the expected output." But these miss lots of interesting bugs, such as a use-after-free that doesn't happen to trigger a crash or an integer overflow (see page 7 of this paper for an example from GCC). Bug detectors like ASan, UBSan, and Valgrind can instrument a program with oracles derived from the C and C++ language standards, providing lots of useful bug-finding power. To run LLVM under Valgrind when executing it on its test suite, pass -DLLVM_LIT_ARGS="-v --vg" to CMake, but be warned that Valgrind will give some false positives that seem to be difficult to eliminate. To instrument LLVM using UBSan, pass -DLLVM_USE_SANITIZER=Undefined to CMake. This is all great but there's more work left to do since UBSan/ASan/MSan don't yet catch all undefined behaviors and also there are defined-but-buggy behaviors, such as the unsigned integer overflow in GCC mentioned above, that we'd like to flag when they are unintentional.

What Happens When a Test Fails?

A broken commit can cause test failure at any level. The offending commit is then either amended (if easy to fix) or backed out (if it turns out to be deeply flawed or otherwise undesirable in light of the new information supplied by failing tests). These things happen reasonably often, as they do in any project that is rapidly pushing changes into a big, complicated code base with many real-world users.

When a test fails in a way that is hard to fix right now, but that will get fixed eventually (for example when some new feature gets finished), the test can be marked XFAIL, or "expected failure." These are counted and reported separately by the testing tool and they do not count towards the test failures that must be fixed before a patch becomes acceptable.

Conclusions

Testing a large, portable, widely-used software system is hard; there are a lot of moving parts and a lot of ongoing work is needed if we want to prevent LLVM's users from being exposed to bugs. Of course there are other super-important things that have to happen to maintain high-quality code: good design, code reviews, tight semantics on the internal representation, static analysis, and periodic reworking of problematic areas.

A Tourist’s Guide to the LLVM Source Code

In my Advanced Compilers course last fall we spent some time poking around in the LLVM source tree. A million lines of C++ is pretty daunting but I found this to be an interesting exercise and at least some of the students agreed, so I thought I’d try to write up something similar. We’ll be using LLVM 3.9, but the layout isn’t that different for previous (and probably subsequent) releases.

I don’t want to spend too much time on LLVM background but here are a few things to keep in mind:

  • The LLVM core doesn’t contain frontends, only the “middle end” optimizers, a pile of backends, documentation, and a lot of auxiliary code. Frontends such as Clang live in separate projects.
  • The core LLVM representation lives in RAM and is manipulated using a large C++ API. This representation can be dumped to readable text and parsed back into memory, but this is only a convenience for debugging: during a normal compilation using LLVM, textual IR is never generated. Typically, a frontend builds IR by calling into the LLVM APIs, then it runs some optimization passes, and finally it invokes a backend to generate assembly or machine code. When LLVM code is stored on disk (which doesn’t even happen during a normal compilation of a C or C++ project using Clang) it is stored as “bitcode,” a compact binary representation.
  • The main LLVM API documentation is generated by doxygen and can be found here. This information is very difficult to make use of unless you already have an idea of what you’re doing and what you’re looking for. The tutorials (linked below) are the place to start learning the LLVM APIs.

So now on to the code. Here’s the root directory, it contains:

  • bindings that permit LLVM APIs to be used from programming languages other than C++. There exist more bindings than this, including C (which we’ll get to a bit later) and Haskell (out of tree).
  • cmake: LLVM uses CMake rather than autoconf now. Just be glad someone besides you works on this.
  • docs in ReStructuredText. See for example the Language Reference Manual that defines the meaning of each LLVM instruction (GitHub renders .rst files to HTML by default; you can look at the raw file here.) The material in the tutorial subdirectory is particularly interesting, but don’t look at it there, rather go here. This is the best way to learn LLVM!
  • examples: This is the source code that goes along with the tutorials. As an LLVM hacker you should grab code, CMakeLists.txt, etc. from here whenever possible.
  • include: The first subdirectory, llvm-c, contains the C bindings, which I haven’t used but look pretty reasonable. Importantly, the LLVM folks try to keep these bindings stable, whereas the C++ APIs are prone to change across releases, though the pace of change seems to have slowed down in the last few years. The second subdirectory, llvm, is a biggie: it contains 878 header files that define all of the LLVM APIs. In general it’s easier to use the doxygen versions of these files rather than reading them directly, but I often end up grepping these files to find some piece of functionality.
  • lib contains the real goodies, we’ll look at it separately below.
  • projects doesn’t contain anything by default but it’s where you checkout LLVM components such as compiler-rt (runtime library for things like sanitizers), OpenMP support, and the LLVM C++ library that live in separate repos.
  • resources: something for Visual C++ that you and I don’t care about (but see here).
  • runtimes: another placeholder for external projects, added only last summer, I don’t know what actually goes here.
  • test: this is a biggie, it contains many thousands of unit tests for LLVM, they get run when you build the check target. Most of these are .ll files containing the textual version of LLVM IR. They test things like an optimization pass having the expected result. I’ll be covering LLVM’s tests in detail in an upcoming blog post.
  • tools: LLVM itself is just a collection of libraries, there isn’t any particular main function. Most of the subdirectories of the tools directory contain an executable tool that links against the LLVM libraries. For example, llvm-dis is a disassembler from bitcode to the textual assembly format.
  • unittests: More unit tests, also run by the check build target. These are C++ files that use the Google Test framework to invoke APIs directly, as opposed to the contents of the “test” directory, which indirectly invoke LLVM functionality by running things like the assembler, disassembler, or optimizer.
  • utils: emacs and vim modes for enforcing LLVM coding conventions; a Valgrind suppression file to eliminate false positives when running make check in such a way that all sub-processes are monitored by Valgrind; the lit and FileCheck tools that support unit testing; and, plenty of other random stuff. You probably don’t care about most of this.

Ok, that was pretty easy! The only thing we skipped over is the “lib” directory, which contains basically everything important. Let’s look its subdirectories now:

  • Analysis contains a lot of static analyses that you would read about in a compiler textbook, such as alias analysis and global value numbering. Some analyses are structured as LLVM passes that must be run by the pass manager; others are structured as libraries that can be called directly. An odd member of the analysis family is InstructionSimplify.cpp, which is a transformation, not an analysis; I’m sure someone can leave a comment explaining what it is doing here (see this comment). I’ll do a deep dive into this directory in a followup post.
  • AsmParser: parse textual IR into memory
  • Bitcode: serialize IR into the compact format and read it back into RAM
  • CodeGen: the LLVM target-independent code generator, basically a framework that LLVM backends fit into and also a bunch of library functions that backends can use. There’s a lot going on here (>100 KLOC) and unfortunately I don’t know very much about it.
  • DebugInfo is a library for maintaining mappings between LLVM instructions and source code locations. There’s a lot of good info in these slides from a talk at the 2014 LLVM Developers’ Meeting.
  • ExecutionEngine: Although LLVM is usually translated into assembly code or machine code, it can be directly executed using an interpreter. The non-jitting interpreter wasn’t quite working the last time I tried to use it, but anyhow it’s a lot slower than running jitted code. The latest JIT API, Orc, is in here.
  • Fuzzer: this is libFuzzer, a coverage-guided fuzzer similar to AFL. It doesn’t fuzz LLVM components, but rather uses LLVM functionality in order to perform fuzzing of programs that are compiled using LLVM.
  • IR: sort of a grab-bag of IR-related code, with no other obvious unifying theme. There’s code for dumping IR to the textual format, for upgrading bitcode files created by earlier versions of LLVM, for folding constants as IR nodes are created, etc.
  • IRReader, LibDriver, LineEditor: almost nobody will care about these and they contain hardly any code anyway.
  • Linker: An LLVM module, like a compilation unit in C or C++, contains functions and variables. The LLVM linker combines multiple modules into a single, larger module.
  • LTO: Link-time optimization, the subject of many blog posts and PhD theses, permits compiler optimizations to see through boundaries created by separate compilation. LLVM can do link-time optimization “for free” by using its linker to create a large module and then optimize this using the regular optimization passes. This used to be the preferred approach, but it doesn’t scale to huge projects. The current approach is ThinLTO, which gets most of the benefit at a small fraction of the cost.
  • MC: compilers usually emit assembly code and let an assembler deal with creating machine code. The MC subsystem in LLVM cuts out the middleman and generates machine code directly. This speeds up compiles and is especially useful when LLVM is used as a JIT compiler.
  • Object: Deals with details of object file formats such as ELF.
  • ObjectYAML seems to support encoding object files as YAML. I do not know why this is desirable.
  • Option: Command line parsing
  • Passes: part of the pass manager, which schedules and sequences LLVM passes, taking their dependencies and invalidations into account.
  • ProfileData: Read and write profile data to support profile-guided optimizations
  • Support: Miscellaneous support code including APInts (arbitrary-precision integers that are used pervasively in LLVM) and much else.
  • TableGen: A wacky Swiss-army knife of a tool that inputs .td files (of which there are more than 200 in LLVM) containing structured data and uses a domain-specific backend to emit C++ code that gets compiled into LLVM. TableGen is used, for example, to take some of the tedium out of implementing assemblers and disassemblers.
  • Target: the processor-specific parts of the backends live here. There are lots of TableGen files. As far as I can tell, you create a new LLVM backend by cloning the one for the architecture that looks the most like yours and then beating on it for a couple of years.
  • Transforms: this is my favorite directory, it’s where the middle-end optimizations live. IPO contains interprocedural optimizations that work across function boundaries, they are typically not too aggressive since they have to look at a lot of code. InstCombine is LLVM’s beast of a peephole optimizer. Instrumentation supports sanitizers. ObjCARC supports this. Scalar contains a pile of compiler-textbooky kinds of optimizers, I’ll try to write a more detailed post about the contents of this directory at some point. Utils are helper code. Vectorize is LLVM’s auto-vectorizer, the subject of much work in recent years.

And that’s all for the high-level tour, hope it was useful and as always let me know what I’ve got wrong or left out.

Principles for Undefined Behavior in Programming Language Design

I’ve had a post with this title on the back burner for years but I was never quite convinced that it would say anything I haven’t said before. Last night I watched Chandler Carruth’s talk about undefined behavior at CppCon 2016 and it is good material and he says it better than I think I would have, and I wanted to chat about it a bit.

First off, this is a compiler implementor’s point of view. Other smart people, such as Dan Bernstein, have a very different point of view (but also keep in mind that Dan doesn’t believe compiler optimization is useful).

Chandler is not a fan of the term nasal demons, which he says is misleadingly hyperbolic, since the compiler isn’t going to maliciously turn undefined behavior (UB) into code for erasing your files or whatever. This is true, but Chandler leaves out the fact that our 28-year-long computer security train wreck (the Morris Worm seems like as good a starting point as any) has been fueled to a large extent by undefined behavior in C and (later) C++ code. In other words, while the compiler won’t emit system calls for erasing your files, a memory-related UB in your program will permit a random person on the Internet to insert instructions into your process that issue system calls doing precisely that. From this slightly broader point of view, nasal demons are less of a caricature.

The first main idea in Chandler’s talk is that we should view UB at the PL level as being analogous to narrow contracts on APIs. Let’s look at this in more detail. An API with a wide contract is one where you can issue calls in any order, and you can pass any arguments to API calls, and expect predictable behavior. One simple way that an API can have a wider contract is by quietly initializing library state upon the first call into the library, as opposed to requiring an explicit call to an init() function. Some libraries do this, but many libraries don’t. For example, an OpenSSL man page says “SSL_library_init() must be called before any other action takes place.” This kind of wording indicates that a severe obligation is being placed on users of the OpenSSL API, and failing to respect it would generally be expected to result in unpredictable behavior. Chandler’s goal in this first part of the talk is to establish the analogy between UB and narrow API contracts and convince us that not all APIs want to be maximally wide. In other words, narrow APIs may be acceptable when their risks are offset by, for example, performance advantages.

Coming back to programming languages (PL), we can look at something like the signed left shift operator as exposing an API. The signed left shift API in C and C++ is particularly narrow and while many people have by now internalized that it can trigger UB based on the shift exponent (e.g., 1 << -1 is undefined), fewer developers have come to terms with restrictions on the left hand argument (e.g., 0 << 31 is defined but 1 << 31 is not). Can we design a wide API for signed left shift? Of course! We might specify, for example, that the result is zero when the shift exponent is too large or is negative, and that otherwise the result is the same as if the signed left-hand argument was interpreted as unsigned, shifted in the obvious way, and then reinterpreted as signed.

At this point in the talk, we should understand that “UB is bad” is an oversimplification, that there is a large design space relating to narrow vs. wide APIs for libraries and programming language features, and that finding the best point in this design space is not straightforward since it depends on performance requirements, on the target platform, on developers’ expectations, and more. C and C++, as low-level, performance-oriented languages, are famously narrow in their choice of contracts for core language features such as pointer and integer operations. The particular choices made by these languages have caused enormous problems and reevaluation is necessary and ongoing. The next part of Chandler’s talk provides a framework for deciding whether a particular narrow contract is a good idea or not.

Chandler provides these four principles for narrow language contracts:

  1. Checkable (probabilistically) at runtime
  2. Provide significant value: bug finding, simplification, and/or optimization
  3. Easily explained and taught to programmers
  4. Not widely violated by existing code that works correctly and as intended

The first criterion, runtime checkability, is crucial and unarguable: without it, we get latent errors of the kind that continue to contribute to insecurity and that have been subject to creeping exploitation by compiler optimizations. Checking tools such as ASan, UBSan, and tis-interpreter reduce the problem of finding these errors to the problem of software testing, which is very difficult, but which we need to deal with anyhow since there’s more to programming than eliminating undefined behaviors. Of course, any property that can be checked at runtime can also be checked without running the code. Sound static analysis avoids the need for test inputs but is otherwise much more difficult to usefully implement than runtime checking.

Principle 2 tends to cause energetic discussions, with (typically) compiler developers strongly arguing that UB is crucial for high-quality code generation and compiler users equally strongly arguing for defined semantics. I find the bug-finding arguments to be the most interesting ones: do we prefer Java-style two’s complement integers or would we rather retain maximum performance as in C and C++ or mandatory traps as in Swift or a hybrid model as in Rust? Discussions of this principle tend to center around examples, which is mostly good, but is bad in that any particular example excludes a lot of other use cases and other compilers and other targets that are also important.

Principle 3 is an important one that tends to get neglected in discussions of UB. The intersection of HCI and PL is not incredibly crowded with results, as far as I know, though many of us have some informal experience with this topic because we teach people to program. Chandler’s talk contains a section on explaining signed left shift that’s quite nice.

Finally, Principle 4 seems pretty obvious.

One small problem you might have noticed is that there are undefined behaviors that fail one or more of Chandler’s criteria, that many C and C++ compiler developers will defend to their dying breath. I’m talking about things like strict aliasing and termination of infinite loops that violate (at least) principles 1 and 3.

In summary, the list of principles proposed by Chandler is excellent and, looking forward, it would be great to use it as a standard set of questions to ask about any narrow contract, preferably before deploying it. Even if we disagree about the details, framing the discussion is super helpful.