Testing LLVM

[This piece is loosely a followup to this one.]

Background

Once a piece of software reaches a certain size, it is guaranteed to be loosely specified and not completely understood by any individual. It gets committed to many times per day by people who are only loosely aware of each others’ work. It has many dependencies including the compiler, operating system, and libraries, all of which are buggy in their own special ways, and all of which are updated from time to time. Moreover, it usually has to run atop several different platforms, each one individually quirky. Given the massive number of possibilities for flaky behavior, why should we expect our large piece of software to work as expected? One of the most important reasons is testing. That is, we routinely ensure that it works as intended in every important configuration and on every important platform, and when it doesn’t work we have smart people tracking down and fixing the issues.

Today we’re talking about testing LLVM. In some ways, a compiler makes a very friendly target for testing:

  • The input format (source code) and output format (assembly code) are well-understood and have independent specifications.
  • Many compilers have an intermediate representation (IR) that has its own documented semantics and can be dumped and parsed, making it easier (though not always easy) to test internals.
  • It is often the case that a compiler is one of several independent implementations of a given specification, such as the C++ standard, enabling differential testing. Even when multiple implementations are unavailable, we can often test a compiler against itself by comparing the output of different backends or different optimization modes.
  • Compilers are usually not networked, concurrent, or timing-dependent, and overall interact with the outside world only in very constrained ways. Moreover, compilers are generally intended to be deterministic.
  • Compilers usually don’t run for very long, so they don’t have to worry too much about resource leaks or recovering gracefully from error conditions.

But in other ways, compilers are not so easy to test:

  • Production compilers are supposed to be fast, so they are often written in an unsafe language and may skimp on assertions. They use caching and lazy evaluation when possible, adding complexity. Furthermore, splitting compiler functionality into lots of clean, independent little passes leads to slow compilers, so there tends to be some glomming together of unrelated or not-too-closely-related functionality, making it more difficult to understand, test, and maintain the resulting code.
  • The invariants on compiler-internal data structures can be hellish and are often not documented completely.
  • Some compiler algorithms are difficult, and it is almost never the case that a compiler implements a textbook algorithm exactly, but rather a close or distant relative of it.
  • Compiler optimizations interact in difficult ways.
  • Compilers for unsafe languages do not have lots of obligations when compiling undefined behaviors, placing the responsibility for avoiding UB outside of the compiler (and on the person creating test cases for the compiler). This complicates differential testing.
  • The standards for compiler correctness are high since miscompilations are tough to debug and also they can quietly introduce security vulnerabilities in any code that they compile.

So, with that background out of the way, how is LLVM tested?

Unit Tests and Regression Tests

LLVM’s first line of defense against bugs is a collection of tests that get run when a developer builds the check target. All of these tests should pass before a developer commits a patch to LLVM (and of course many patches should include some new tests). I have a fairly fast desktop machine that runs 19,267 tests in 96 seconds. The number of tests that run depends on what auxiliary LLVM projects you have downloaded (compiler-rt, libcxx, etc.) and, to a lesser extent, on what other software gets autodetected on your machine (e.g. the OCaml bindings don’t get tested unless you have OCaml installed). These tests need to be fast so developers can run them often, as mentioned here. Additional tests get run by some alternate build targets such as check-all and check-clang.

Some of the unit/regression tests are at the API level, these use Google Test, a lightweight framework that provides C++ macros for hooking into the test framework. Here’s a test:

TEST_F(MatchSelectPatternTest, FMinConstantZero) {
  parseAssembly(
      "define float @test(float %a) {\n"
      "  %1 = fcmp ole float %a, 0.0\n"
      "  %A = select i1 %1, float %a, float 0.0\n"
      "  ret float %A\n"
      "}\n");
  // This shouldn't be matched, as %a could be -0.0.
  expectPattern({SPF_UNKNOWN, SPNB_NA, false});
}

The first argument to the TEST_F macro indicates the name of the test case (a collection of tests) and the second names the actual test shown here. The parseAssembly() and expectPattern() methods respectively call into an LLVM API and then check that this had the expected result. This example is from ValueTrackingTest.cpp. Many tests can be put into a single file, keeping things fast by avoiding forks/execs.

The other infrastructure used by LLVM’s fast test suite is lit, the LLVM Integrated Tester. lit is shell-based: it executes commands found in a test case, and considers the test to have been successful if all of its sub-commands succeed.

Here’s a test case for lit (I grabbed the top of this file, which contains additional tests that don’t matter to us right now):

; RUN: opt < %s -instcombine -S | FileCheck %s

define i64 @test1(i64 %A, i32 %B) {
        %tmp12 = zext i32 %B to i64
        %tmp3 = shl i64 %tmp12, 32
        %tmp5 = add i64 %tmp3, %A
        %tmp6 = and i64 %tmp5, 123
        ret i64 %tmp6
; CHECK-LABEL: @test1(
; CHECK-NEXT: and i64 %A, 123
; CHECK-NEXT: ret i64
}

This test case is making sure that InstCombine, the LLVM-level peephole optimization pass, is able to notice some useless instructions: the zext, shl, and add are not needed here. The CHECK-LABEL line looks for the line of optimized code that begins the function, the first CHECK-NEXT makes sure that the and instruction is on the next line, and the second CHECK-NEXT makes sure the ret instruction is on the line following the and (thanks Michael Kuperstein for correcting an earlier explanation of this test).

To run this test case, the file is interpreted three times. First, lit scans it looking for lines containing RUN: and executes each associated command. Second, the file is interpreted by opt, the standalone optimizer for LLVM IR; this happens because lit replaces the %s variable with the name of the file being processed. Since comments in textual LLVM IR are preceded by a semicolon, the lit directives are ignored by opt. The output of opt is piped to the FileCheck utility which parses the file yet again, looking for commands such as CHECK and CHECK-NEXT; these tell it to look for strings in its stdin, and to return a non-zero status code if any of the specified strings isn't found. (CHECK-LABEL is used to divide up a file into a collection of logically separate tests.)

An important part of a long-term testing campaign is using coverage tools to find parts of the code base that aren't being tested. Here's a recent LLVM coverage report based on running the unit/regression tests. This data is pretty interesting to poke around in. Let's take a quick look at coverage of InstCombine, which is generally very good. An interesting project for someone wanting to get started with LLVM would be to write and submit test cases that cover untested parts of InstCombine. For example, here's the first uncovered code (colored red) in InstCombineAndOrXor.cpp:

The comment tells us what the transformation is looking for, it should be fairly easy to target this code with a test case. Code that can't be covered is dead; some dead code wants to be removed, other code such as this example (from the same file) is a bug if it isn't dead:

Trying to cover these lines is a good idea, but in that case you're trying to find bugs in LLVM, as opposed to trying to improve the test suite. It would probably be good to teach the coverage tool to not tell us about lines that are marked unreachable.

The LLVM Test Suite

In contrast with the regression/unit tests, which are part of the main LLVM repository and can be run quickly, the test suite is external and takes longer to run. It is not expected that developers will run these tests prior to committing; rather, these tests get run automatically and often, on the side, by LNT (see the next section). The LLVM test suite contains entire programs that are compiled and run; it isn't intended to look for specific optimizations, but rather to help ascertain the quality and correctness of the generated code overall.

For each benchmark, the test suite contains test inputs and their corresponding expected outputs. Some parts of the test suite are external, meaning that there is support for invoking the tests, but the tests themselves are not part of the test suite and must be downloaded separately, typically because the software being compiled is not free.

LNT

LNT (LLVM Nightly Test) doesn't contain any test cases; it is a tool for aggregating and analyzing test results, focusing on monitoring the quality of the compiler's generated code. It consists of local utilities for running tests and submitting results, and then there's a server side database and web frontend that makes it easy to look through results. The NTS (Nightly Test Suite) results are here.

BuildBot

The Linux/Windows BuiltBot and the Darwin one (I don't know why there are two) are used to make sure LLVM configures, builds, and passes its unit/regression tests on a wide variety of platforms and in a variety of configurations. The BuildBot has some blame support to help find problematic commits and will send mail to their authors.

Eclectic Testing Efforts

Some testing efforts originate outside of the core LLVM community and aren't as systematic in terms of which versions of LLVM get tested. These tests represent efforts by individuals who usually have some specific tool or technique to try out. For example, for a long time my group tested Clang+LLVM using Csmith and reported the resulting bugs. (See the high-level writeup.) Sam Liedes applied afl-fuzz to the Clang test suite. Zhendong Su and his group have been finding a very impressive number of bugs. Nuno Lopes has done some awesome formal-methods-based testing of optimization passes that he'll hopefully write about soon.

A testing effort that needs to be done is repeatedly generating a random (but valid) IR function, running a few randomly-chosen optimization passes on it, and then making sure the optimized function refines the original one (the desired relationship is refinement, rather than equivalence, because optimizations are free to make the domain of definedness of a function larger). This needs to be done in a way that is sensitive to LLVM-level undefined behavior. I've heard that something like this is being worked on, but don't have details.

Testing in the Wild

The final level of testing is, of course, carried out by LLVM's users, who occasionally run into crashes and miscompiles that have escaped other testing methods. I've often wanted to better understand the incidence of compiler bugs in the wild. For crashes this could be done by putting a bit of telemetry into the compiler, though few would use this if opt-in, and many would (legitimately) object if opt-out. Miscompiles in the wild are very hard to quantify. My hypothesis is that most miscompiles go unreported since reducing their triggers is so difficult. Rather, as people make pseudorandom code changes during debugging, they eventually work around the problem by luck and then promptly forget about it.

A big innovation would be to ship LLVM with a translation validation scheme that would optionally use an SMT solver to prove that the compiler's output refines its input. There are all sorts of challenges including undefined behavior and the fact that it's probably very difficult to scale translation validation up to the large functions that seem to be the ones that trigger miscompilations in practice.

Alternate Test Oracles

A "test oracle" is a way to decide whether a test passes or fails. Easy oracles include "compiler terminates with exit code 0" and "compiled benchmark produces the expected output." But these miss lots of interesting bugs, such as a use-after-free that doesn't happen to trigger a crash or an integer overflow (see page 7 of this paper for an example from GCC). Bug detectors like ASan, UBSan, and Valgrind can instrument a program with oracles derived from the C and C++ language standards, providing lots of useful bug-finding power. To run LLVM under Valgrind when executing it on its test suite, pass -DLLVM_LIT_ARGS="-v --vg" to CMake, but be warned that Valgrind will give some false positives that seem to be difficult to eliminate. To instrument LLVM using UBSan, pass -DLLVM_USE_SANITIZER=Undefined to CMake. This is all great but there's more work left to do since UBSan/ASan/MSan don't yet catch all undefined behaviors and also there are defined-but-buggy behaviors, such as the unsigned integer overflow in GCC mentioned above, that we'd like to flag when they are unintentional.

What Happens When a Test Fails?

A broken commit can cause test failure at any level. The offending commit is then either amended (if easy to fix) or backed out (if it turns out to be deeply flawed or otherwise undesirable in light of the new information supplied by failing tests). These things happen reasonably often, as they do in any project that is rapidly pushing changes into a big, complicated code base with many real-world users.

When a test fails in a way that is hard to fix right now, but that will get fixed eventually (for example when some new feature gets finished), the test can be marked XFAIL, or "expected failure." These are counted and reported separately by the testing tool and they do not count towards the test failures that must be fixed before a patch becomes acceptable.

Conclusions

Testing a large, portable, widely-used software system is hard; there are a lot of moving parts and a lot of ongoing work is needed if we want to prevent LLVM's users from being exposed to bugs. Of course there are other super-important things that have to happen to maintain high-quality code: good design, code reviews, tight semantics on the internal representation, static analysis, and periodic reworking of problematic areas.

A Tourist’s Guide to the LLVM Source Code

In my Advanced Compilers course last fall we spent some time poking around in the LLVM source tree. A million lines of C++ is pretty daunting but I found this to be an interesting exercise and at least some of the students agreed, so I thought I’d try to write up something similar. We’ll be using LLVM 3.9, but the layout isn’t that different for previous (and probably subsequent) releases.

I don’t want to spend too much time on LLVM background but here are a few things to keep in mind:

  • The LLVM core doesn’t contain frontends, only the “middle end” optimizers, a pile of backends, documentation, and a lot of auxiliary code. Frontends such as Clang live in separate projects.
  • The core LLVM representation lives in RAM and is manipulated using a large C++ API. This representation can be dumped to readable text and parsed back into memory, but this is only a convenience for debugging: during a normal compilation using LLVM, textual IR is never generated. Typically, a frontend builds IR by calling into the LLVM APIs, then it runs some optimization passes, and finally it invokes a backend to generate assembly or machine code. When LLVM code is stored on disk (which doesn’t even happen during a normal compilation of a C or C++ project using Clang) it is stored as “bitcode,” a compact binary representation.
  • The main LLVM API documentation is generated by doxygen and can be found here. This information is very difficult to make use of unless you already have an idea of what you’re doing and what you’re looking for. The tutorials (linked below) are the place to start learning the LLVM APIs.

So now on to the code. Here’s the root directory, it contains:

  • bindings that permit LLVM APIs to be used from programming languages other than C++. There exist more bindings than this, including C (which we’ll get to a bit later) and Haskell (out of tree).
  • cmake: LLVM uses CMake rather than autoconf now. Just be glad someone besides you works on this.
  • docs in ReStructuredText. See for example the Language Reference Manual that defines the meaning of each LLVM instruction (GitHub renders .rst files to HTML by default; you can look at the raw file here.) The material in the tutorial subdirectory is particularly interesting, but don’t look at it there, rather go here. This is the best way to learn LLVM!
  • examples: This is the source code that goes along with the tutorials. As an LLVM hacker you should grab code, CMakeLists.txt, etc. from here whenever possible.
  • include: The first subdirectory, llvm-c, contains the C bindings, which I haven’t used but look pretty reasonable. Importantly, the LLVM folks try to keep these bindings stable, whereas the C++ APIs are prone to change across releases, though the pace of change seems to have slowed down in the last few years. The second subdirectory, llvm, is a biggie: it contains 878 header files that define all of the LLVM APIs. In general it’s easier to use the doxygen versions of these files rather than reading them directly, but I often end up grepping these files to find some piece of functionality.
  • lib contains the real goodies, we’ll look at it separately below.
  • projects doesn’t contain anything by default but it’s where you checkout LLVM components such as compiler-rt (runtime library for things like sanitizers), OpenMP support, and the LLVM C++ library that live in separate repos.
  • resources: something for Visual C++ that you and I don’t care about (but see here).
  • runtimes: another placeholder for external projects, added only last summer, I don’t know what actually goes here.
  • test: this is a biggie, it contains many thousands of unit tests for LLVM, they get run when you build the check target. Most of these are .ll files containing the textual version of LLVM IR. They test things like an optimization pass having the expected result. I’ll be covering LLVM’s tests in detail in an upcoming blog post.
  • tools: LLVM itself is just a collection of libraries, there isn’t any particular main function. Most of the subdirectories of the tools directory contain an executable tool that links against the LLVM libraries. For example, llvm-dis is a disassembler from bitcode to the textual assembly format.
  • unittests: More unit tests, also run by the check build target. These are C++ files that use the Google Test framework to invoke APIs directly, as opposed to the contents of the “test” directory, which indirectly invoke LLVM functionality by running things like the assembler, disassembler, or optimizer.
  • utils: emacs and vim modes for enforcing LLVM coding conventions; a Valgrind suppression file to eliminate false positives when running make check in such a way that all sub-processes are monitored by Valgrind; the lit and FileCheck tools that support unit testing; and, plenty of other random stuff. You probably don’t care about most of this.

Ok, that was pretty easy! The only thing we skipped over is the “lib” directory, which contains basically everything important. Let’s look its subdirectories now:

  • Analysis contains a lot of static analyses that you would read about in a compiler textbook, such as alias analysis and global value numbering. Some analyses are structured as LLVM passes that must be run by the pass manager; others are structured as libraries that can be called directly. An odd member of the analysis family is InstructionSimplify.cpp, which is a transformation, not an analysis; I’m sure someone can leave a comment explaining what it is doing here (see this comment). I’ll do a deep dive into this directory in a followup post.
  • AsmParser: parse textual IR into memory
  • Bitcode: serialize IR into the compact format and read it back into RAM
  • CodeGen: the LLVM target-independent code generator, basically a framework that LLVM backends fit into and also a bunch of library functions that backends can use. There’s a lot going on here (>100 KLOC) and unfortunately I don’t know very much about it.
  • DebugInfo is a library for maintaining mappings between LLVM instructions and source code locations. There’s a lot of good info in these slides from a talk at the 2014 LLVM Developers’ Meeting.
  • ExecutionEngine: Although LLVM is usually translated into assembly code or machine code, it can be directly executed using an interpreter. The non-jitting interpreter wasn’t quite working the last time I tried to use it, but anyhow it’s a lot slower than running jitted code. The latest JIT API, Orc, is in here.
  • Fuzzer: this is libFuzzer, a coverage-guided fuzzer similar to AFL. It doesn’t fuzz LLVM components, but rather uses LLVM functionality in order to perform fuzzing of programs that are compiled using LLVM.
  • IR: sort of a grab-bag of IR-related code, with no other obvious unifying theme. There’s code for dumping IR to the textual format, for upgrading bitcode files created by earlier versions of LLVM, for folding constants as IR nodes are created, etc.
  • IRReader, LibDriver, LineEditor: almost nobody will care about these and they contain hardly any code anyway.
  • Linker: An LLVM module, like a compilation unit in C or C++, contains functions and variables. The LLVM linker combines multiple modules into a single, larger module.
  • LTO: Link-time optimization, the subject of many blog posts and PhD theses, permits compiler optimizations to see through boundaries created by separate compilation. LLVM can do link-time optimization “for free” by using its linker to create a large module and then optimize this using the regular optimization passes. This used to be the preferred approach, but it doesn’t scale to huge projects. The current approach is ThinLTO, which gets most of the benefit at a small fraction of the cost.
  • MC: compilers usually emit assembly code and let an assembler deal with creating machine code. The MC subsystem in LLVM cuts out the middleman and generates machine code directly. This speeds up compiles and is especially useful when LLVM is used as a JIT compiler.
  • Object: Deals with details of object file formats such as ELF.
  • ObjectYAML seems to support encoding object files as YAML. I do not know why this is desirable.
  • Option: Command line parsing
  • Passes: part of the pass manager, which schedules and sequences LLVM passes, taking their dependencies and invalidations into account.
  • ProfileData: Read and write profile data to support profile-guided optimizations
  • Support: Miscellaneous support code including APInts (arbitrary-precision integers that are used pervasively in LLVM) and much else.
  • TableGen: A wacky Swiss-army knife of a tool that inputs .td files (of which there are more than 200 in LLVM) containing structured data and uses a domain-specific backend to emit C++ code that gets compiled into LLVM. TableGen is used, for example, to take some of the tedium out of implementing assemblers and disassemblers.
  • Target: the processor-specific parts of the backends live here. There are lots of TableGen files. As far as I can tell, you create a new LLVM backend by cloning the one for the architecture that looks the most like yours and then beating on it for a couple of years.
  • Transforms: this is my favorite directory, it’s where the middle-end optimizations live. IPO contains interprocedural optimizations that work across function boundaries, they are typically not too aggressive since they have to look at a lot of code. InstCombine is LLVM’s beast of a peephole optimizer. Instrumentation supports sanitizers. ObjCARC supports this. Scalar contains a pile of compiler-textbooky kinds of optimizers, I’ll try to write a more detailed post about the contents of this directory at some point. Utils are helper code. Vectorize is LLVM’s auto-vectorizer, the subject of much work in recent years.

And that’s all for the high-level tour, hope it was useful and as always let me know what I’ve got wrong or left out.

Principles for Undefined Behavior in Programming Language Design

I’ve had a post with this title on the back burner for years but I was never quite convinced that it would say anything I haven’t said before. Last night I watched Chandler Carruth’s talk about undefined behavior at CppCon 2016 and it is good material and he says it better than I think I would have, and I wanted to chat about it a bit.

First off, this is a compiler implementor’s point of view. Other smart people, such as Dan Bernstein, have a very different point of view (but also keep in mind that Dan doesn’t believe compiler optimization is useful).

Chandler is not a fan of the term nasal demons, which he says is misleadingly hyperbolic, since the compiler isn’t going to maliciously turn undefined behavior (UB) into code for erasing your files or whatever. This is true, but Chandler leaves out the fact that our 28-year-long computer security train wreck (the Morris Worm seems like as good a starting point as any) has been fueled to a large extent by undefined behavior in C and (later) C++ code. In other words, while the compiler won’t emit system calls for erasing your files, a memory-related UB in your program will permit a random person on the Internet to insert instructions into your process that issue system calls doing precisely that. From this slightly broader point of view, nasal demons are less of a caricature.

The first main idea in Chandler’s talk is that we should view UB at the PL level as being analogous to narrow contracts on APIs. Let’s look at this in more detail. An API with a wide contract is one where you can issue calls in any order, and you can pass any arguments to API calls, and expect predictable behavior. One simple way that an API can have a wider contract is by quietly initializing library state upon the first call into the library, as opposed to requiring an explicit call to an init() function. Some libraries do this, but many libraries don’t. For example, an OpenSSL man page says “SSL_library_init() must be called before any other action takes place.” This kind of wording indicates that a severe obligation is being placed on users of the OpenSSL API, and failing to respect it would generally be expected to result in unpredictable behavior. Chandler’s goal in this first part of the talk is to establish the analogy between UB and narrow API contracts and convince us that not all APIs want to be maximally wide. In other words, narrow APIs may be acceptable when their risks are offset by, for example, performance advantages.

Coming back to programming languages (PL), we can look at something like the signed left shift operator as exposing an API. The signed left shift API in C and C++ is particularly narrow and while many people have by now internalized that it can trigger UB based on the shift exponent (e.g., 1 << -1 is undefined), fewer developers have come to terms with restrictions on the left hand argument (e.g., 0 << 31 is defined but 1 << 31 is not). Can we design a wide API for signed left shift? Of course! We might specify, for example, that the result is zero when the shift exponent is too large or is negative, and that otherwise the result is the same as if the signed left-hand argument was interpreted as unsigned, shifted in the obvious way, and then reinterpreted as signed.

At this point in the talk, we should understand that “UB is bad” is an oversimplification, that there is a large design space relating to narrow vs. wide APIs for libraries and programming language features, and that finding the best point in this design space is not straightforward since it depends on performance requirements, on the target platform, on developers’ expectations, and more. C and C++, as low-level, performance-oriented languages, are famously narrow in their choice of contracts for core language features such as pointer and integer operations. The particular choices made by these languages have caused enormous problems and reevaluation is necessary and ongoing. The next part of Chandler’s talk provides a framework for deciding whether a particular narrow contract is a good idea or not.

Chandler provides these four principles for narrow language contracts:

  1. Checkable (probabilistically) at runtime
  2. Provide significant value: bug finding, simplification, and/or optimization
  3. Easily explained and taught to programmers
  4. Not widely violated by existing code that works correctly and as intended

The first criterion, runtime checkability, is crucial and unarguable: without it, we get latent errors of the kind that continue to contribute to insecurity and that have been subject to creeping exploitation by compiler optimizations. Checking tools such as ASan, UBSan, and tis-interpreter reduce the problem of finding these errors to the problem of software testing, which is very difficult, but which we need to deal with anyhow since there’s more to programming than eliminating undefined behaviors. Of course, any property that can be checked at runtime can also be checked without running the code. Sound static analysis avoids the need for test inputs but is otherwise much more difficult to usefully implement than runtime checking.

Principle 2 tends to cause energetic discussions, with (typically) compiler developers strongly arguing that UB is crucial for high-quality code generation and compiler users equally strongly arguing for defined semantics. I find the bug-finding arguments to be the most interesting ones: do we prefer Java-style two’s complement integers or would we rather retain maximum performance as in C and C++ or mandatory traps as in Swift or a hybrid model as in Rust? Discussions of this principle tend to center around examples, which is mostly good, but is bad in that any particular example excludes a lot of other use cases and other compilers and other targets that are also important.

Principle 3 is an important one that tends to get neglected in discussions of UB. The intersection of HCI and PL is not incredibly crowded with results, as far as I know, though many of us have some informal experience with this topic because we teach people to program. Chandler’s talk contains a section on explaining signed left shift that’s quite nice.

Finally, Principle 4 seems pretty obvious.

One small problem you might have noticed is that there are undefined behaviors that fail one or more of Chandler’s criteria, that many C and C++ compiler developers will defend to their dying breath. I’m talking about things like strict aliasing and termination of infinite loops that violate (at least) principles 1 and 3.

In summary, the list of principles proposed by Chandler is excellent and, looking forward, it would be great to use it as a standard set of questions to ask about any narrow contract, preferably before deploying it. Even if we disagree about the details, framing the discussion is super helpful.

Advanced Compilers Weeks 3-5

This continues a previous post.

We went through the lattice theory and introduction to dataflow analysis parts of SPA. I consider this extremely good and important material, but I’m afraid that the students looked pretty bored. It may be the case that this material is best approached by first looking at practical aspects and only later going into the theory.

One part of SPA that I’m not super happy with is the material about combining lattices (section 4.3). This is a useful and practical topic but the use cases aren’t really discussed. In class we went through some examples, for example this function that cannot be optimized by either constant propagation or dead code elimination alone, but can be optimized by their reduced product: conditional constant propagation. Which, as you can see, is implemented by both LLVM and GCC. Also, this example cannot be optimized by either sign analysis or parity analysis, but can be optimized using their reduced product.

We didn’t go into them, but I pointed the class to the foundational papers for dataflow analysis and abstract interpretation.

I gave an assignment to implement subtract and bitwise-and transfer functions for the interval abstract domain for signed 5-bit integers. The bitwidth is small so I can rapidly do exhausive testing of students’ code. Their subtract had to be correct and maximally precise — about half of the class accomplished this. Their bitwise-and had to be correct and more precise than always returning top, and about half of the class accomplished this as well (a maximally precise bitwise-and operator for intervals is not at all easy — try it!). Since not everyone got the code right, I had them fix bugs (if any) and resubmit their code for this week. I hope everyone will get it right this time! Also I will give prizes to students whose bitwise-and operator is on the Pareto frontier (out of all submitted solutions) for throughput vs precision and code size vs precision. Here are the results with the Pareto frontier in blue and the minimum and maximum precision in red (narrower intervals are better).

Impressively, student k implemented an optimally precise bitwise-and transfer function! Student c’s transfer function returned an answer other than top only for intervals of width 1. Mine (labeled JOHN) looked at the number of leading zeroes in both operands.

We looked at the LLVM implementation of the bitwise domain (“known bits”, they call it) which lives in ValueTracking.cpp. This analysis doesn’t have a real fixpoint computation, it rather simply walks up the dataflow graph in a recursive fashion, which is a bit confusing since it is a forward dataflow analysis that looks at nodes in the backward direction. The traversal stops at depth 6, and isn’t cached, so the code is really very easy to understand.

We started to look at how LLVM works, I went partway through some lecture notes by David Chisnall. We didn’t focus on the LLVM implementation yet, but rather looked at the design, with a bit of focus on SSA, which is worth spending some time on since it forms the foundation for most modern compilers. I had the students read the first couple of chapters of this drafty SSA book.

Something I’d appreciate feedback on is what (besides SSA) have been the major developments in ahead-of-time compiler technology over the last 25 years or so. Loop optimizations and vectorization have seen major advances of course, as have verified compilers. In this class I want to steer clear of PL-level innovations.

Finally, former Utah undergrad and current Googler Chad Brubaker visited the class and gave a guest lecture on UBSan in production Android: very cool stuff! Hopefully this motivated the class to care about using static analysis to remove integer overflow checks, since they will be doing assignments on that very topic in the future.

Advanced Compilers Weeks 1 and 2

This post will be of somewhat narrow interest; it’s a quick attempt to take my lecture notes for the first weeks of an advanced compilers course and turn them into something a bit more readable. I’m not using slides for this class.

Motivation

The great thing about an advanced course (on any topic) is that we have a lot of freedom in choosing the direction that the class takes. My class this fall is mainly about static program analysis: predicting the behavior of programs without running them. This is a broadly useful technology, it is the foundation for type checking, program verification, compiler optimization, and static bugfinding.

We can start off with a couple of observations about the role of compilers. First, hardware is getting weirder rather than getting clocked faster: almost all processors are multicores and it looks like there is increasing asymmetry in resources across cores. Processors come with vector units, crypto accelerators, bit twiddling instructions, and lots of features to make virtualization and concurrency work. We have DSPs, GPUs, big.little, and Xeon Phi. This is only scratching the surface. Second, we’re getting tired of low-level languages and their associated security disasters, we want to write new code, to whatever extent possible, in safer, higher-level languages. Compilers are caught right in the middle of these opposing trends: one of their main jobs is to help bridge the large and growing gap between increasingly high-level languages and increasingly wacky platforms. It’s effectively a perpetual employment act for solid compiler hackers.

The sufficiently smart compiler never seems to arrive. I told the class a story that I never tire of re-telling. My understanding is that while the death of the Cell processor was complicated (yields were bad, GPUs were on the rise, etc.) the lack of good tooling certainly didn’t help. Perhaps later on we’ll read this paper.

Semantics

One of the big ideas that enables static program analysis is that programs mean something, mathematically speaking. Of course this was understood very early by the people who created computer science, but in the early history of compilers people would get tripped up by the fact that they didn’t necessarily have a good idea what the programs being compiled actually meant. A new optimization would break programs and it wasn’t possible to assign blame cleanly: was the program within its rights to expect a certain behavior or not? This kind of question can only be answered by assigning meaning to programs. Alas, it is still common for a program to mean “whatever the (single) language implementation does with the program.” I’ve heard stories from Matlab users that the providers of the Matlab implementation have introduced subtle changes to the semantics over time, probably both intentionally and unintentionally. The alternative to defining the semantics using an implementation is to define the semantics of a language some other way, either in a standards document or in math. Then, both programs and implementations can be judged to be either in conformance, or not, with the standard. Obviously this is no panacea, as long experience with C and C++ has shown — but it’s better than nothing.

There are a lot of ways to write down the semantics of a programming language but an even more important issue is creating an appropriate semantics. For example, a language designed for implementing constant-time cryptography might include execution time in the semantics. A language for embedded systems might include memory allocation (or at least guarantees about the lack of implicit allocations) in the semantics. Even the simple parts of a language, such as arithmetic, contain many subtle corners. Here’s an example. We can also look at the behavior of shift operators when the shift exponent is at least as large as the width of the shifted value. Java and x86 reduce the shift amount modulo 32. ARM reduces the shift amount modulo 256 and then saturates (shift by 257 is equivalent to shift by 1 but shift by 100 is equivalent to clearing the register). C and C++ have (of course!) undefined semantics for shift by 100 or 257. Constraining the semantics is nice but too many constraints make efficient code generation difficult. The WebAsm people were discussing these issues not too long ago. I’ve always wanted shift left by -3 to be a shift right by 3, but nobody else has ever thought this was a good idea, as far as I know.

The recent DAO debacle provided an absolutely wonderful demonstration of why it might be risky to define the semantics of a language using a reference implementation. They put a lot of money on the line there, the hubris was impressive. One hopes that lessons were learned.

The overall point of this discussion is (1) we can’t do static program analysis unless we know what the programming language means and (2) designing meanings for programs is an interesting and difficult topic in itself.

Missed Optimizations

I asked the students to use the Compiler Explorer to demonstrate a case in which each of GCC and LLVM miss an optimization, and to provide the assembly code that the compiler should have generated. We went over a handful of submissions, discussing the issues: Was the proposed optimization correct? Would it be a good idea to implement it now? What kind of static analysis would be needed to make the optimization go?

As I had hoped, the codes written by the students exposed many interesting issues. One example that came up was similar to this one where LLVM cleverly realizes that the loop is squaring the function but then (apparently) fails to remove the subsequent conditional move. But really, since the loop fails to execute when the argument is negative, some sort of conditional really is needed. We also saw some good examples where potential aliasing was blocking optimizations. Playing with optimizations in compiler explorer is really a pleasure.

Intro to Static Analysis

Although there are a lot of slide decks that do a good job explaining static analysis, there’s only one book-length treatment of the subject that I like, which I’ll call SPA. SPA is clearly written, it avoids unnecessary notation, and it keeps the material grounded in practical use cases. It’s great!

I started out using everyone’s favorite tutorial abstract domains: parity (are values even or odd?) and signs (are values negative, zero, or positive?). I introduced what I consider to be the first key idea behind static analysis, which is that abstract values (odd, positive, etc.) are simply stand-ins for sets of concrete values. This is such a simple idea and yet it can get lost if the material is presented wrong. We discussed some transfer functions such as addition for the even/odd domain and multiplication for the signedness domain (as seen on p. 28 of SPA). Here the key idea is that we can always verify a transfer function by concretizing the abstract arguments, applying the concrete operation pairwise to the sets of concrete values (assuming a binary operator), and then lifting the result set back into the abstract domain. This now sets the stage for introducing the abstraction and concretization functions and then we’re ready for the Galois connection (which I showed the components of but did not explicitly name). David Schmidt’s slides on this material are awesome.

The thing that we’re working up to here is digging into some of the numerous static analyses that are part of LLVM. I’m trying to introduce the theory, which is very beautiful, while also warming the students up to the idea that it all sort of goes out the window when you’re confronted with the piles of C++ that actually make these analyses happen in practice.

Types

Everyone read Chapter 3 of SPA as well as the first section of Type Systems, another piece of writing that I like very much because it keeps the topics connected to the reasons why they are useful. I didn’t want to get into type systems too deeply (and in fact types are something of a non-speciality of mine) but did want to students to come away with the idea that type checking is an important use case of static program analysis.

The point of static typechecking is that “well typed programs can’t go wrong” but as Cardelli points out in some detail, we need to be pretty careful when saying what “go wrong” means. He includes some nice discussion of the standard static/dynamic and safe/unsafe language categorizations.

Solutions to Integer Overflow

Humans are typically not very good at reasoning about integers with limited range, whereas computers fundamentally work with limited-range numbers. This impedance mismatch has been the source of a lot of bugs over the last 50 years. The solution comes in multiple parts.

In most programming languages, the default integer type should be a bignum: an arbitrary-precision integer that allocates more space when needed. Efficient bignum libraries exist and most integers never end up needing more than one machine word anyway, except in domains like crypto. As far as I’m concerned, for ~95% of programming tasks integer overflow is a solved problem: it should never happen. The solution isn’t yet implemented widely enough, but happily there are plenty of languages such as Python that give bignums by default.

When performance and/or predictability is a major consideration, bignums won’t work and we’re stuck with fixed-width integers that wrap, trap, saturate, or trigger undefined behavior upon overflow. Saturation is a niche solution that we won’t discuss further. Undefined behavior is bad but at least it enables a few loop optimizations and also permits trapping implementations. Although wrapping is an extremely poor default, there are a few good things to say about it: wrapping is efficient, people have come to expect it, and it is a good match for a handful of application domains.

Swift is a modern programming language that traps instead of providing bignums, this is also a generally sensible behavior. Why not bignums? The About Swift web page says that Swift gives “the developer the control needed in a true systems programming language,” so perhaps the designers were worried about unpredictable allocations. I’d love to see a study of the performance of best-of-breed trapping and bignum implementations on important modern applications.

The Rust developers have adopted a hybrid solution where integer overflows trap in debug builds and wrap in optimized builds. This is pragmatic, especially since integer overflows do not compromise Rust’s memory safety guarantees. On the other hand, perhaps as MIR matures, Rust will gravitate towards checking in optimized builds.

For safety-critical programs, the solution to integer overflow is to prove that it cannot happen using some combination of manual reasoning, testing, and formal verification. SPARK Ada and the TrustInSoft analyzer are suitable for proving that integer overflows won’t occur. More work is needed to make this sort of verification scalable and less expert-intensive.

Systems programming tasks, such as building operating systems, language runtimes, and web browsers, are caught in the middle. Wrapping sucks, bignums and trapping are slow or at least perceived as slow (and you do not want to trap or allocate while handling a hardware interrupt anyway), and the codes are too large for formal verification and thorough testing. One answer is to work hard on making trapping fast. For example, Swift has a high-level optimization pass specifically for removing integer overflow checks, and then the LLVM optimization passes do more of this, and then the LLVM backends can lower checked math operations to efficient condition code checks, and then modern Intel processors fuse the resulting branch-on-overflow instructions away.

In summary, bignums should be the default whenever this is feasible, and trapping on overflow should be the backup default behavior. Continued work on the compilers and processors will ensure that the overhead of trapping overflow checks is down in the noise. Java-style wrapping integers should never be the default, this is arguably even worse than C and C++’s UB-on-overflow which at least permits an implementation to trap. In domains where wrapping, trapping, and allocation are all unacceptable, we need to be able to prove that overflow does not occur.

I’ll end up with a few random observations:

  • Dan Luu wrote a piece on the overhead of overflow checking.
  • Arbitrary (fixed) width bitvectors are a handy datatype and I wish more languages supported them. These can overflow but it’s not as big of a deal since we choose the number of bits.
  • Explicitly ranged integers as seen in Ada are also very nice, there’s no reason that traps should only occur at the 32-bit or 64-bit boundaries.
  • The formal verification community ignored integer overflow for far too long, there’s a long history of assuming that program integers behave like mathematical integers. Things are finally better though.

UPDATE: I didn’t want this piece to be about C and C++ but I should have clarified that it is only signed overflow in these languages that is undefined behavior; unsigned overflow is defined to be two’s complement wraparound. While it is possible to trap on unsigned overflow — UBSan has a flag that turns on these traps — this behavior does not conform to the standards. Even so, trapping unsigned wraparounds can — in some circumstances — be useful for finding software defects. The question is whether the wraparound was intentional or not.

Compilation and Hyperthreading

Hyperthreading (HT) may or may not be a performance win, depending on the workload. I had poor luck with HT in the Pentium 4 era and ever since then have just disabled it in the BIOS on the idea that the kind of software that I typically wait around for—compilers and SMT solvers—is going to get hurt if its L1 and L2 cache resources are halved. This post contains some data about that. I’ll just start off by saying that for at least one combination of CPU and workload, I was wrong.

The benchmark is compilation of LLVM, Clang, and compiler-rt r279412 using Ninja on an Intel i7-5820K, a reasonably modern but by no means new Haswell-E processor with six real cores. The compiler doing the compilation is a Clang 3.8.1 binary from the LLVM web site. The machine is running Ubuntu 14.04 in 64-bit mode.

Full details about the machine are here. As an inexpensive CPU workhorse I think it stands the test of time, though if you were building one today you would double (or more) the RAM and SSD sizes and of course choose newer versions of everything. I’m particularly proud of the crappy fanless video cards I found for these machines.

This is the build configuration command:

cmake -G Ninja -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_ASSERTIONS=1 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..

Then, on an otherwise idle machine, I built LLVM five times for each degree of parallelism up to 16, both with and without hyperthreading. Here are the results. Since the variation between runs was very low—a few seconds at worst—I’m not worrying about statistics.

What can we take away from this graph? The main conclusion is that hyperthreading wins handily, reducing the best-case build time from 11.75 minutes to 10.04 minutes: an improvement of 1 minute and 42 seconds. Also, I had been worried that simply enabling HT would be detrimental since Linux would sometimes schedule two threads on the same real core when a different core was idle. The graph shows that either this happens only rarely or else it doesn’t hurt much when it happens. Overloading the system (forking more compilers than there are processors) hurts performance by just a very small amount. Of course, at some point the extra processes would use all RAM and performance would suffer significantly. Finally, the speedup is impressively close to linear until we start running more than one thread per core:

I don’t know how much of the nonlinearity comes from resource contention and how much comes from lack of available parallelism.

Here are the first and second graphs as PDF.

Looking at the bigger picture, a huge amount of variation is possible in the compiler, the software being compiled, and the hardware platform. I’d be interested to hear about more data points if people have them.

Isolating a Free-Range Miscompilation

If we say that a compiler is buggy, we need to be able to back up that claim with reproducible, compelling, and understandable evidence. Usually, this evidence centers on a test case that triggers the buggy behavior; we’ll say something like “given this test case, compiler A produces an executable that prints 0 whereas compiler B produces an executable that prints 1, therefore there’s a bug.” (In practice compilers A and B are usually different versions or different optimization levels of the same compiler — this doesn’t matter.)

The problem is that when compilers A and B emit code that behaves differently, the divergence can happen for reasons other than compiler bugs:

  • the program might rely on undefined behavior (UB) or unspecified behavior,
  • the program might read non-constant data from its environment, for example by looking at the clock or at its process id,
  • the program might be concurrent and its output influenced by scheduling decisions.

A big part of convincing someone that the compiler is buggy is convincing them that the test program is free of problems such as these. Since concurrency and interactions with the environment are easy to spot, the problem is often undefined behavior. You can find a few examples in my previous post about invalid GCC bug reports.

This post is about partially automating the process of coming up with a small test case for a miscompilation bug. The bug that we’ll be studying shows up when you compile the latest version of GMP, 6.1.1, using Clang+LLVM revision 135022, from about five years ago. The bug itself is irrelevant and long fixed, I’m just using it to illustrate the process.

As a slight digression, I’ll add that out of the 90 versions of LLVM, spanning revisions 90,000-250,000, that I used to compile GMP (in its C-only mode, disabling use of assembly), only those in the 135,000-139,000 range miscompiled GMP 6.1.1, according to its test suite. The GMP web page says:

Most problems with GMP these days are due to problems not in GMP, but with the compiler used for compiling the GMP sources. This is a major concern to the GMP project, since an incorrect computation is an incorrect computation, whether caused by a GMP bug or a compiler bug. We fight this by making the GMP testsuite have great coverage, so that it should catch every possible miscompilation.

This is all well and good but GMP used to contain some integer undefined behaviors and my guess is that some of the apparent miscompilations were actually due to compiler exploitation of these UBs. Finding some evidence one way or the other about this might be a fun project for a student.

Anyway, back to our compiler bug. The process for isolating a test case is:

  1. Grab a failing unit test.
  2. Use tis-interpreter to verify that it is free of undefined behavior and anything else that might trigger non-deterministic execution.
  3. Use C-Reduce to create a minimal program triggering the bug.

This sounds really easy — and it would be if we were working with a tidy little test case generated by Csmith — but real software is messy and there are plenty of complications. Let’s get started. The unit test we’re using is called mpz_lucnum_ui. Here are the 128 C files that are required to run it.

A First Try

We have to choose what exactly to reduce. Usually we want to reduce preprocessed code, but given the painfully slow interestingness test that we have here (almost 40 seconds, argh) this is going to make very slow progress since C-Reduce would need to eliminate a lot of redundant included junk from each of the 128 files. So let’s rather reduce the files without preprocessing and see what happens.

I had to make a few easy changes to get the GMP files to go through tis-interpreter. In an ideal future, the GMP project (and everyone else) will ensure that their unit tests are clean with respect to tis-interpreter.

Finally we’re ready to run the reduction:

$ creduce ./test1.sh *.c

After a few days the reduction finishes, see the results here. 119 of the 128 C files have become empty and the remaining 9 files contain a total of about 5 KB of code. A bit more than 99% of the code has been eliminated. Perhaps surprisingly, C-Reduce has managed to eliminate all conditional compilation directives, but some #defines remain as do a few #includes.

We’ve succeeded in creating a modestly small test case, but it isn’t yet standalone (due to the includes) and really we would like everything to live in a single file. We can kill two birds with one stone by using CIL or Frama-C to merge the C files into a single compilation unit.

$ frama-c -print *.c > merged.c

The merged file isn’t quite buildable — there’s some junk at the top and there’s a minor problem handling a builtin — but this is easy to fix. You can find the result here. Unfortunately, the merged file doesn’t trigger the bug any longer. Either the testcase relied on separate compilation or else Frama-C’s rewriting of the source code perturbed things badly. That’s sort of a bummer. I could easily have omitted this part of the post, but I wanted to show the whole process here, including missteps.

A Second Try

This time we’ll preprocess and merge the 128 files first, and reduce second. Again, some manual patching was necessary since the merger doesn’t quite work. The result is an 800 KB compilation unit.

Again, the reduction goes slowly:

The final result is a bit less than 4 KB. It’s still too big to be easily understood. Since we’ve run up against the limits of our tooling, further reduction will have to be by hand. Since I’m not actually going to report this long-ago-fixed bug, I didn’t bother with this step. (Back before C-Reduce existed I did a lot of manual testcase reduction and while it was a somewhat satisfying and mindless activity, I ran out of patience for it.) But in any case, 4 KB of self-contained C code is quite manageable.

Is Undefined Behavior Checking Necessary?

Is it really necessary to worry about undefined behavior? In my experience, reducing a miscompilation while disregarding UB is roughly 100% likely to result in a testcase that misbehaves due to UB instead of triggering a compiler bug. Here you can see our interestingness test w/o the UB checking. If we run a reduction using it, the resulting 1 KB testcase (hacked a bit by hand so that we can use tis-interpreter to discover its fatal flaw) misbehaves via an out-of-bounds store:

$ tis-interpreter.sh merged.c
merged.c:53:[kernel] warning: Calling undeclared function cu. Old style K&R code?
[value] Analyzing a complete application starting at main
[value] Computing initial state
merged.c:11:[value] warning: during initialization of variable 'cm', size of type 'struct a []' cannot be
                 computed (Size of array without number of elements.)
merged.c:11:[kernel] imprecise size for variable cm
[value] Initial state computed
merged.c:51:[kernel] warning: out of bounds write. assert \valid(&cp->b);
                  stack: main
[value] Stopping at nth alarm
[value] user error: Degeneration occurred:
                    results are not correct for lines of code that can be reached from the degeneration point.

So we definitely need some sort of UB detection. But do we need something as heavyweight as tis-interpreter? I ran a few reductions using UBSan + ASan (see one of them here) and didn’t have good luck in getting a defined final testcase. The reduction kept introducing issues such as uninitialized storage and function declarators with empty parentheses and incompatible uses of function pointers, all of which can lead to real trouble. Most likely there is a combination of compilers and flags that would have let me run this reduction successfully but I ran out of energy to find it. UBSan and ASan and MSan are excellent but fundamentally they do not add up to a completely reliable UB detector.

Recommendations

More developers should:

  • Make sure unit tests go through tis-interpreter. Though not always practical (tis-interpreter doesn’t understand assembly or C++) this has many benefits since tis-interpreter implements very thorough checking. Also, when a change in compiler or compiler options breaks a test case that is clean with respect to tis-interpreter, the compiler can be blamed with high reliability.
  • Instead of working around compiler bugs, reduce and report them. This can be a lot of work but tools like tis-interpreter and C-Reduce make it easier and when these bugs get fixed life is better for everyone.

A Month of Invalid GCC Bug Reports, and How to Eliminate Some of Them

During July 2016 the GCC developers marked 38 bug reports as INVALID. Here’s the full list. They fall into these (subjective) categories:

  • 8 bug reports stemmed from undefined behavior in the test case (71753, 71780, 71803, 71813, 71885, 71955, 71957, 71746)
  • 1 bug report was complaining about UB exploitation in general (71892)
  • 15 bug reports came from a misunderstanding (or disagreement) about the non-UB semantics of a programming language, usually C++ but also C and Fortran (71786, 71788, 71794, 71804, 71809, 71890, 71914, 71939, 71963, 71967, 72580, 72750, 72761, 71844, 71772)
  • 4 bug reports stemmed from a misunderstanding of something besides the language semantics, such as command line flags (72736, 71729, 71995, 71777)
  • 5 bug reports were caused by an unrelated problem on the reporter’s system such as an out-of-memory error, a borked Cygwin installation, out-of-date files in a build tree, etc (71735, 71770, 71903, 71918, 71978)
  • 1 bug report was about a bug that the devs didn’t want to fix since it was in an inactive branch and had been fixed in all active branches (72051)
    4 bug reports didn’t end up demonstrating any reproducible problem (71940, 71944, 71986, 72076)

I’ve often thought that it would be nice for a compiler bug reporting system to be active instead of passively serving up files and discussions. An active bug reporting system would be able to run:

  • a wide variety of compiler versions,
  • the compiler’s output, and
  • tools for finding undefined behaviors.

Of course not all bug reports would be able to make use of these features. However, one can imagine that there is a significant subset of compiler bug reports where the reporter, cooperating with the system, would be able to conclusively demonstrate that the compiler crashes or generates wrong code. In cases where this cannot be demonstrated, the process of interacting with an active bug reporting system will help the reporter understand what the actual issue is without wasting a compiler developer’s time.

An active bug reporting system can run lots of experiments to determine how many compiler versions, how many target platforms, and how many optimization levels are affected by the bug. It can also determine which revision introduced the problem and who committed the breaking change — suggesting an initial owner for the bug. It can run a testcase reducer to make the program triggering the bug smaller. All of these things will help compiler developers prioritize among reported bugs. The system should also be able to automatically flag duplicate bug reports. When a bug stops reproducing, the bug reporting system will notice this and flag the PR as being ready to close, and could also help out by packaging up an addition to the regression test suite.

Update:

A few additional details. During July a total of 328 bugs were reported, ignoring those marked as spam. 143 of these were resolved: 22 as duplicates, 81 as fixed, 38 as invalid, 1 as wontfix, and 1 as worksforme. Out of the remaining 185 unresolved bugs, 15 are assigned, 86 are new, 1 is reopened, 74 are unconfirmed, and 9 are waiting.

I believe that an active bug reporting system will make many of these 290 non-invalid bug reports easier to deal with, as opposed to only helping with the invalid ones!

Note: In the initial version of this post I mentioned 36 invalid bugs, not 38, because I was only searching for bugs that were marked as resolved. Also searching for closed bugs brings the total to 38.

Pointer Overflow Checking

Most programming languages have a lot of restrictions on the kinds of pointers that programs can create. C and C++ are unusually permissive in this respect: pointers to arbitrary objects and subobjects, usually all the way down to bytes, can be constructed. Consequently, most address computations can be expressed either in terms of integer arithmetic or pointer arithmetic. For example, a function based on array lookup:


[c]
void *memcpy(void *dst, const void *src, size_t n) {
const char *s = src;
char *d = dst;
for (int i = 0; i < n; ++i)
d[i] = s[i];
return dst;
}
[/c]

can just as easily be expressed in terms of pointers:


[c]
void *memcpy(void *dst, const void *src, size_t n) {
const char *s = src;
char *d = dst;
while (n–)
*d++ = *s++;
return dst;
}
[/c]

Idiomatic C tends to favor pointer-based code. For one thing, pointers are more expressive: a pointer can name any memory location while an integer index only makes sense when combined with a base address. Also, developers have a sense that the lower-level code will execute faster since it is closer to how the machine thinks. This may or may not be true: the tradeoffs are complex due to details of the semantics of pointers and integers, and also because different compiler optimizations will tend to fire for pointer code and integer code. Modern compilers can be pretty bright, at least for very simple codes: the version of GCC that I happen to be using for testing (GCC 5.3.0 at -O2) turns both functions above into exactly the same object code.

It is undefined behavior to perform pointer arithmetic where the result is outside of an object, with the exception that it is permissible to point one element past the end of an array:


[c]
int a[10];
int *p1 = a – 1; // UB
int *p2 = a; // ok
int *p3 = a + 9; // ok
int *p4 = a + 10; // ok, but can’t be dereferenced
int *p5 = a + 11; // UB
[/c]

Valgrind and ASan are intended to catch dereferences of invalid pointers, but not their creation or use in comparisons. UBSan catches the creation of invalid pointers in simple cases where bounds information is available, but not in the general case. tis-interpreter can reliably detect creation of invalid pointers.

A lot of C programs (I think it’s safe to say almost all non-trivial ones) create and use invalid pointers, and often they get away with it in the sense that C compilers usually give sensible results when illegal pointers are compared (but not, of course, dereferenced). On the other hand, when pointer arithmetic overflows, the resulting pointers can break assumptions being made in the code.

For the next part of this piece I’ll borrow some examples from a LWN article from 2008. We’ll start with a buffer length check implemented like this:


[c]
void awesome_length_check(unsigned len) {
char *buffer_end = buffer + BUFLEN;
if (buffer + len >= buffer_end)
die_a_gory_death();
}
[/c]

Here the arithmetic for computing buffer_end is totally fine (assuming the buffer actually contains BUFLEN elements) but the subsequent buffer + len risks UB. Let’s look at the generated code for a 32-bit platform:

awesome_length_check:
        cmpl    $100, 4(%esp)
        jl      .LBB0_1
        jmp     die_a_gory_death
.LBB0_1:
        retl

In general, pointer arithmetic risks overflowing when either the base address lies near the end of the address space or when the offset is really big. Here the compiler has factored the pointer out of the computation, making overflow more difficult, but let’s assume that the offset is controlled by an attacker. We’ll need a bit of a driver to see what happens:


[c]
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define BUFLEN 100
char buffer[BUFLEN];

void die_a_gory_death(void) { abort(); }

void awesome_length_check(unsigned len) {
char *buffer_end = buffer + BUFLEN;
if (buffer + len >= buffer_end)
die_a_gory_death();
}

int main(void) {
// volatile to suppress constant propagation
volatile unsigned len = UINT_MAX;
awesome_length_check(len);
printf("length check went well\n");
return 0;
}
[/c]

And then:

$ clang -O -m32 -Wall ptr-overflow5.c
$ ./a.out 
length check went well
$ gcc-5 -O -m32 -Wall ptr-overflow5.c
$ ./a.out 
length check went well

The problem is that once the length check succeeds, subsequent code is going to feel free to process up to UINT_MAX bytes of untrusted input data, potentially causing massive buffer overruns.

One thing we can do is explicitly check for a wrapped pointer:


[c]
void awesome_length_check(unsigned len) {
char *buffer_end = buffer + BUFLEN;
if (buffer + len >= buffer_end ||
buffer + len < buffer)
die_a_gory_death();
}
[/c]

But this is just adding further reliance on undefined behavior and the LWN article mentions that compilers have been observed to eliminate the second part of the check. As the article points out, a better answer is to just avoid pointer arithmetic and do the length check on unsigned integers:


[c]
void awesome_length_check(unsigned len) {
if (len >= BUFLEN)
die_a_gory_death();
}
[/c]

The problem is that we can’t very well go and retrofit all the C code to use integer checks instead of pointer checks. We can, on the other hand, use compiler support to catch pointer overflows as they happen: they are always UB and never a good idea.

Will Dietz, one of my collaborators on the integer overflow checking work we did a few years ago, extended UBSan to catch pointer overflows and wrote a great blog post showing some bugs that it caught. Unfortunately, for whatever reason, these patches didn’t make it into Clang. The other day Will reminded me that they exist; I dusted them off and submitted them for review — hopefully they will get in this time around.

Recently I’ve been thinking about using UBSan for hardening instead of just bug finding. Android is doing this, for example. Should we use pointer overflow checking in production? I believe that after the checker has been thoroughly tested, this makes sense. Consider the trapping mode of this sanitizer:

clang -O3 -fsanitize=pointer-overflow -fsanitize-trap=pointer-overflow

The runtime overhead on SPEC CINT 2006 is about 5%, so probably acceptable for code that is security-critical but not performance-critical. I’m sure we can reduce this overhead with some tuning of the optimizers. The 400.perlbench component of SPEC 2006 contained two pointer overflows that I had to fix.

Pointer overflow isn’t one of the UBs that we can finesse away by adjusting the standard: it’s a real machine-level phenomenon that is hard to prevent without runtime checks.

There’s plenty more work we could do in this sanitizer, such as catching arithmetic involving null pointers.

Update: I built the latest releases of PHP and FFmpeg using the pointer overflow sanitizer and both of them execute pointer overflows while running their own test suites.