Undefined Behavior: Not Just for Programming Languages

This is an oldie but goodie. Start with this premise:
a = b
Multiply both sides by a:
a2 = ab
Subtract b2 from both sides:
a2 – b2 = ab – b2
Factor the left side:
(a + b)(a – b) = ab – b2
Factor the right side:
(a + b)(a – b) = b(a – b)
Divide both sides by (a – b) and cancel:
a + b = b
Substitute b for a:
b + b = b
Finally, let b = 1 and simplify:
2 = 1

I ran into this derivation when I was nine or ten years old and it made me deeply uneasy. The explanation, that you’re not allowed to divide by (a – b) because this term is equal to zero, seemed to raise more questions than it answered. How are we supposed to keep track of which terms are equal to zero? What if something is equal to zero but we don’t know it yet? What other little traps are lying out there, waiting to invalidate a derivation? This was one of many times where I noticed that in school they seemed willing to teach the easy version, and that the real world was never so nice, even in a subject like math where — you would think — everything is clean and precise.

Anyway, the point is that undefined behavior has been confusing people for well over a thousand years — we shouldn’t feel too bad that we haven’t gotten it right in programming languages yet.

Principles for Undefined Behavior in Programming Language Design

I’ve had a post with this title on the back burner for years but I was never quite convinced that it would say anything I haven’t said before. Last night I watched Chandler Carruth’s talk about undefined behavior at CppCon 2016 and it is good material and he says it better than I think I would have, and I wanted to chat about it a bit.

First off, this is a compiler implementor’s point of view. Other smart people, such as Dan Bernstein, have a very different point of view (but also keep in mind that Dan doesn’t believe compiler optimization is useful).

Chandler is not a fan of the term nasal demons, which he says is misleadingly hyperbolic, since the compiler isn’t going to maliciously turn undefined behavior (UB) into code for erasing your files or whatever. This is true, but Chandler leaves out the fact that our 28-year-long computer security train wreck (the Morris Worm seems like as good a starting point as any) has been fueled to a large extent by undefined behavior in C and (later) C++ code. In other words, while the compiler won’t emit system calls for erasing your files, a memory-related UB in your program will permit a random person on the Internet to insert instructions into your process that issue system calls doing precisely that. From this slightly broader point of view, nasal demons are less of a caricature.

The first main idea in Chandler’s talk is that we should view UB at the PL level as being analogous to narrow contracts on APIs. Let’s look at this in more detail. An API with a wide contract is one where you can issue calls in any order, and you can pass any arguments to API calls, and expect predictable behavior. One simple way that an API can have a wider contract is by quietly initializing library state upon the first call into the library, as opposed to requiring an explicit call to an init() function. Some libraries do this, but many libraries don’t. For example, an OpenSSL man page says “SSL_library_init() must be called before any other action takes place.” This kind of wording indicates that a severe obligation is being placed on users of the OpenSSL API, and failing to respect it would generally be expected to result in unpredictable behavior. Chandler’s goal in this first part of the talk is to establish the analogy between UB and narrow API contracts and convince us that not all APIs want to be maximally wide. In other words, narrow APIs may be acceptable when their risks are offset by, for example, performance advantages.

Coming back to programming languages (PL), we can look at something like the signed left shift operator as exposing an API. The signed left shift API in C and C++ is particularly narrow and while many people have by now internalized that it can trigger UB based on the shift exponent (e.g., 1 << -1 is undefined), fewer developers have come to terms with restrictions on the left hand argument (e.g., 0 << 31 is defined but 1 << 31 is not). Can we design a wide API for signed left shift? Of course! We might specify, for example, that the result is zero when the shift exponent is too large or is negative, and that otherwise the result is the same as if the signed left-hand argument was interpreted as unsigned, shifted in the obvious way, and then reinterpreted as signed.

At this point in the talk, we should understand that “UB is bad” is an oversimplification, that there is a large design space relating to narrow vs. wide APIs for libraries and programming language features, and that finding the best point in this design space is not straightforward since it depends on performance requirements, on the target platform, on developers’ expectations, and more. C and C++, as low-level, performance-oriented languages, are famously narrow in their choice of contracts for core language features such as pointer and integer operations. The particular choices made by these languages have caused enormous problems and reevaluation is necessary and ongoing. The next part of Chandler’s talk provides a framework for deciding whether a particular narrow contract is a good idea or not.

Chandler provides these four principles for narrow language contracts:

  1. Checkable (probabilistically) at runtime
  2. Provide significant value: bug finding, simplification, and/or optimization
  3. Easily explained and taught to programmers
  4. Not widely violated by existing code that works correctly and as intended

The first criterion, runtime checkability, is crucial and unarguable: without it, we get latent errors of the kind that continue to contribute to insecurity and that have been subject to creeping exploitation by compiler optimizations. Checking tools such as ASan, UBSan, and tis-interpreter reduce the problem of finding these errors to the problem of software testing, which is very difficult, but which we need to deal with anyhow since there’s more to programming than eliminating undefined behaviors. Of course, any property that can be checked at runtime can also be checked without running the code. Sound static analysis avoids the need for test inputs but is otherwise much more difficult to usefully implement than runtime checking.

Principle 2 tends to cause energetic discussions, with (typically) compiler developers strongly arguing that UB is crucial for high-quality code generation and compiler users equally strongly arguing for defined semantics. I find the bug-finding arguments to be the most interesting ones: do we prefer Java-style two’s complement integers or would we rather retain maximum performance as in C and C++ or mandatory traps as in Swift or a hybrid model as in Rust? Discussions of this principle tend to center around examples, which is mostly good, but is bad in that any particular example excludes a lot of other use cases and other compilers and other targets that are also important.

Principle 3 is an important one that tends to get neglected in discussions of UB. The intersection of HCI and PL is not incredibly crowded with results, as far as I know, though many of us have some informal experience with this topic because we teach people to program. Chandler’s talk contains a section on explaining signed left shift that’s quite nice.

Finally, Principle 4 seems pretty obvious.

One small problem you might have noticed is that there are undefined behaviors that fail one or more of Chandler’s criteria, that many C and C++ compiler developers will defend to their dying breath. I’m talking about things like strict aliasing and termination of infinite loops that violate (at least) principles 1 and 3.

In summary, the list of principles proposed by Chandler is excellent and, looking forward, it would be great to use it as a standard set of questions to ask about any narrow contract, preferably before deploying it. Even if we disagree about the details, framing the discussion is super helpful.

Solutions to Integer Overflow

Humans are typically not very good at reasoning about integers with limited range, whereas computers fundamentally work with limited-range numbers. This impedance mismatch has been the source of a lot of bugs over the last 50 years. The solution comes in multiple parts.

In most programming languages, the default integer type should be a bignum: an arbitrary-precision integer that allocates more space when needed. Efficient bignum libraries exist and most integers never end up needing more than one machine word anyway, except in domains like crypto. As far as I’m concerned, for ~95% of programming tasks integer overflow is a solved problem: it should never happen. The solution isn’t yet implemented widely enough, but happily there are plenty of languages such as Python that give bignums by default.

When performance and/or predictability is a major consideration, bignums won’t work and we’re stuck with fixed-width integers that wrap, trap, saturate, or trigger undefined behavior upon overflow. Saturation is a niche solution that we won’t discuss further. Undefined behavior is bad but at least it enables a few loop optimizations and also permits trapping implementations. Although wrapping is an extremely poor default, there are a few good things to say about it: wrapping is efficient, people have come to expect it, and it is a good match for a handful of application domains.

Swift is a modern programming language that traps instead of providing bignums, this is also a generally sensible behavior. Why not bignums? The About Swift web page says that Swift gives “the developer the control needed in a true systems programming language,” so perhaps the designers were worried about unpredictable allocations. I’d love to see a study of the performance of best-of-breed trapping and bignum implementations on important modern applications.

The Rust developers have adopted a hybrid solution where integer overflows trap in debug builds and wrap in optimized builds. This is pragmatic, especially since integer overflows do not compromise Rust’s memory safety guarantees. On the other hand, perhaps as MIR matures, Rust will gravitate towards checking in optimized builds.

For safety-critical programs, the solution to integer overflow is to prove that it cannot happen using some combination of manual reasoning, testing, and formal verification. SPARK Ada and the TrustInSoft analyzer are suitable for proving that integer overflows won’t occur. More work is needed to make this sort of verification scalable and less expert-intensive.

Systems programming tasks, such as building operating systems, language runtimes, and web browsers, are caught in the middle. Wrapping sucks, bignums and trapping are slow or at least perceived as slow (and you do not want to trap or allocate while handling a hardware interrupt anyway), and the codes are too large for formal verification and thorough testing. One answer is to work hard on making trapping fast. For example, Swift has a high-level optimization pass specifically for removing integer overflow checks, and then the LLVM optimization passes do more of this, and then the LLVM backends can lower checked math operations to efficient condition code checks, and then modern Intel processors fuse the resulting branch-on-overflow instructions away.

In summary, bignums should be the default whenever this is feasible, and trapping on overflow should be the backup default behavior. Continued work on the compilers and processors will ensure that the overhead of trapping overflow checks is down in the noise. Java-style wrapping integers should never be the default, this is arguably even worse than C and C++’s UB-on-overflow which at least permits an implementation to trap. In domains where wrapping, trapping, and allocation are all unacceptable, we need to be able to prove that overflow does not occur.

I’ll end up with a few random observations:

  • Dan Luu wrote a piece on the overhead of overflow checking.
  • Arbitrary (fixed) width bitvectors are a handy datatype and I wish more languages supported them. These can overflow but it’s not as big of a deal since we choose the number of bits.
  • Explicitly ranged integers as seen in Ada are also very nice, there’s no reason that traps should only occur at the 32-bit or 64-bit boundaries.
  • The formal verification community ignored integer overflow for far too long, there’s a long history of assuming that program integers behave like mathematical integers. Things are finally better though.

UPDATE: I didn’t want this piece to be about C and C++ but I should have clarified that it is only signed overflow in these languages that is undefined behavior; unsigned overflow is defined to be two’s complement wraparound. While it is possible to trap on unsigned overflow — UBSan has a flag that turns on these traps — this behavior does not conform to the standards. Even so, trapping unsigned wraparounds can — in some circumstances — be useful for finding software defects. The question is whether the wraparound was intentional or not.

Isolating a Free-Range Miscompilation

If we say that a compiler is buggy, we need to be able to back up that claim with reproducible, compelling, and understandable evidence. Usually, this evidence centers on a test case that triggers the buggy behavior; we’ll say something like “given this test case, compiler A produces an executable that prints 0 whereas compiler B produces an executable that prints 1, therefore there’s a bug.” (In practice compilers A and B are usually different versions or different optimization levels of the same compiler — this doesn’t matter.)

The problem is that when compilers A and B emit code that behaves differently, the divergence can happen for reasons other than compiler bugs:

  • the program might rely on undefined behavior (UB) or unspecified behavior,
  • the program might read non-constant data from its environment, for example by looking at the clock or at its process id,
  • the program might be concurrent and its output influenced by scheduling decisions.

A big part of convincing someone that the compiler is buggy is convincing them that the test program is free of problems such as these. Since concurrency and interactions with the environment are easy to spot, the problem is often undefined behavior. You can find a few examples in my previous post about invalid GCC bug reports.

This post is about partially automating the process of coming up with a small test case for a miscompilation bug. The bug that we’ll be studying shows up when you compile the latest version of GMP, 6.1.1, using Clang+LLVM revision 135022, from about five years ago. The bug itself is irrelevant and long fixed, I’m just using it to illustrate the process.

As a slight digression, I’ll add that out of the 90 versions of LLVM, spanning revisions 90,000-250,000, that I used to compile GMP (in its C-only mode, disabling use of assembly), only those in the 135,000-139,000 range miscompiled GMP 6.1.1, according to its test suite. The GMP web page says:

Most problems with GMP these days are due to problems not in GMP, but with the compiler used for compiling the GMP sources. This is a major concern to the GMP project, since an incorrect computation is an incorrect computation, whether caused by a GMP bug or a compiler bug. We fight this by making the GMP testsuite have great coverage, so that it should catch every possible miscompilation.

This is all well and good but GMP used to contain some integer undefined behaviors and my guess is that some of the apparent miscompilations were actually due to compiler exploitation of these UBs. Finding some evidence one way or the other about this might be a fun project for a student.

Anyway, back to our compiler bug. The process for isolating a test case is:

  1. Grab a failing unit test.
  2. Use tis-interpreter to verify that it is free of undefined behavior and anything else that might trigger non-deterministic execution.
  3. Use C-Reduce to create a minimal program triggering the bug.

This sounds really easy — and it would be if we were working with a tidy little test case generated by Csmith — but real software is messy and there are plenty of complications. Let’s get started. The unit test we’re using is called mpz_lucnum_ui. Here are the 128 C files that are required to run it.

A First Try

We have to choose what exactly to reduce. Usually we want to reduce preprocessed code, but given the painfully slow interestingness test that we have here (almost 40 seconds, argh) this is going to make very slow progress since C-Reduce would need to eliminate a lot of redundant included junk from each of the 128 files. So let’s rather reduce the files without preprocessing and see what happens.

I had to make a few easy changes to get the GMP files to go through tis-interpreter. In an ideal future, the GMP project (and everyone else) will ensure that their unit tests are clean with respect to tis-interpreter.

Finally we’re ready to run the reduction:

$ creduce ./test1.sh *.c

After a few days the reduction finishes, see the results here. 119 of the 128 C files have become empty and the remaining 9 files contain a total of about 5 KB of code. A bit more than 99% of the code has been eliminated. Perhaps surprisingly, C-Reduce has managed to eliminate all conditional compilation directives, but some #defines remain as do a few #includes.

We’ve succeeded in creating a modestly small test case, but it isn’t yet standalone (due to the includes) and really we would like everything to live in a single file. We can kill two birds with one stone by using CIL or Frama-C to merge the C files into a single compilation unit.

$ frama-c -print *.c > merged.c

The merged file isn’t quite buildable — there’s some junk at the top and there’s a minor problem handling a builtin — but this is easy to fix. You can find the result here. Unfortunately, the merged file doesn’t trigger the bug any longer. Either the testcase relied on separate compilation or else Frama-C’s rewriting of the source code perturbed things badly. That’s sort of a bummer. I could easily have omitted this part of the post, but I wanted to show the whole process here, including missteps.

A Second Try

This time we’ll preprocess and merge the 128 files first, and reduce second. Again, some manual patching was necessary since the merger doesn’t quite work. The result is an 800 KB compilation unit.

Again, the reduction goes slowly:

The final result is a bit less than 4 KB. It’s still too big to be easily understood. Since we’ve run up against the limits of our tooling, further reduction will have to be by hand. Since I’m not actually going to report this long-ago-fixed bug, I didn’t bother with this step. (Back before C-Reduce existed I did a lot of manual testcase reduction and while it was a somewhat satisfying and mindless activity, I ran out of patience for it.) But in any case, 4 KB of self-contained C code is quite manageable.

Is Undefined Behavior Checking Necessary?

Is it really necessary to worry about undefined behavior? In my experience, reducing a miscompilation while disregarding UB is roughly 100% likely to result in a testcase that misbehaves due to UB instead of triggering a compiler bug. Here you can see our interestingness test w/o the UB checking. If we run a reduction using it, the resulting 1 KB testcase (hacked a bit by hand so that we can use tis-interpreter to discover its fatal flaw) misbehaves via an out-of-bounds store:

$ tis-interpreter.sh merged.c
merged.c:53:[kernel] warning: Calling undeclared function cu. Old style K&R code?
[value] Analyzing a complete application starting at main
[value] Computing initial state
merged.c:11:[value] warning: during initialization of variable 'cm', size of type 'struct a []' cannot be
                 computed (Size of array without number of elements.)
merged.c:11:[kernel] imprecise size for variable cm
[value] Initial state computed
merged.c:51:[kernel] warning: out of bounds write. assert \valid(&cp->b);
                  stack: main
[value] Stopping at nth alarm
[value] user error: Degeneration occurred:
                    results are not correct for lines of code that can be reached from the degeneration point.

So we definitely need some sort of UB detection. But do we need something as heavyweight as tis-interpreter? I ran a few reductions using UBSan + ASan (see one of them here) and didn’t have good luck in getting a defined final testcase. The reduction kept introducing issues such as uninitialized storage and function declarators with empty parentheses and incompatible uses of function pointers, all of which can lead to real trouble. Most likely there is a combination of compilers and flags that would have let me run this reduction successfully but I ran out of energy to find it. UBSan and ASan and MSan are excellent but fundamentally they do not add up to a completely reliable UB detector.


More developers should:

  • Make sure unit tests go through tis-interpreter. Though not always practical (tis-interpreter doesn’t understand assembly or C++) this has many benefits since tis-interpreter implements very thorough checking. Also, when a change in compiler or compiler options breaks a test case that is clean with respect to tis-interpreter, the compiler can be blamed with high reliability.
  • Instead of working around compiler bugs, reduce and report them. This can be a lot of work but tools like tis-interpreter and C-Reduce make it easier and when these bugs get fixed life is better for everyone.

A Month of Invalid GCC Bug Reports, and How to Eliminate Some of Them

During July 2016 the GCC developers marked 38 bug reports as INVALID. Here’s the full list. They fall into these (subjective) categories:

  • 8 bug reports stemmed from undefined behavior in the test case (71753, 71780, 71803, 71813, 71885, 71955, 71957, 71746)
  • 1 bug report was complaining about UB exploitation in general (71892)
  • 15 bug reports came from a misunderstanding (or disagreement) about the non-UB semantics of a programming language, usually C++ but also C and Fortran (71786, 71788, 71794, 71804, 71809, 71890, 71914, 71939, 71963, 71967, 72580, 72750, 72761, 71844, 71772)
  • 4 bug reports stemmed from a misunderstanding of something besides the language semantics, such as command line flags (72736, 71729, 71995, 71777)
  • 5 bug reports were caused by an unrelated problem on the reporter’s system such as an out-of-memory error, a borked Cygwin installation, out-of-date files in a build tree, etc (71735, 71770, 71903, 71918, 71978)
  • 1 bug report was about a bug that the devs didn’t want to fix since it was in an inactive branch and had been fixed in all active branches (72051)
    4 bug reports didn’t end up demonstrating any reproducible problem (71940, 71944, 71986, 72076)

I’ve often thought that it would be nice for a compiler bug reporting system to be active instead of passively serving up files and discussions. An active bug reporting system would be able to run:

  • a wide variety of compiler versions,
  • the compiler’s output, and
  • tools for finding undefined behaviors.

Of course not all bug reports would be able to make use of these features. However, one can imagine that there is a significant subset of compiler bug reports where the reporter, cooperating with the system, would be able to conclusively demonstrate that the compiler crashes or generates wrong code. In cases where this cannot be demonstrated, the process of interacting with an active bug reporting system will help the reporter understand what the actual issue is without wasting a compiler developer’s time.

An active bug reporting system can run lots of experiments to determine how many compiler versions, how many target platforms, and how many optimization levels are affected by the bug. It can also determine which revision introduced the problem and who committed the breaking change — suggesting an initial owner for the bug. It can run a testcase reducer to make the program triggering the bug smaller. All of these things will help compiler developers prioritize among reported bugs. The system should also be able to automatically flag duplicate bug reports. When a bug stops reproducing, the bug reporting system will notice this and flag the PR as being ready to close, and could also help out by packaging up an addition to the regression test suite.


A few additional details. During July a total of 328 bugs were reported, ignoring those marked as spam. 143 of these were resolved: 22 as duplicates, 81 as fixed, 38 as invalid, 1 as wontfix, and 1 as worksforme. Out of the remaining 185 unresolved bugs, 15 are assigned, 86 are new, 1 is reopened, 74 are unconfirmed, and 9 are waiting.

I believe that an active bug reporting system will make many of these 290 non-invalid bug reports easier to deal with, as opposed to only helping with the invalid ones!

Note: In the initial version of this post I mentioned 36 invalid bugs, not 38, because I was only searching for bugs that were marked as resolved. Also searching for closed bugs brings the total to 38.

C-Reduce 2.5

In May we released C-Reduce 2.5 which builds against LLVM/Clang 3.8. New in this release:

  • Improved reduction of non-preprocessed C/C++ code. C-Reduce now includes #included files that are below a certain size and also uses unifdef to remove #ifdef/#endif pairs. Specialization of #define directives is not implemented yet.
  • Support for reducing multiple files at once. This is useful for reducing inputs that trigger LTO bugs or that are not preprocessed.
  • Support for reducing OpenCL files, contributed by the authors of this paper.
  • Improved output for creduce --help.
  • Lots of cleanups and bug fixes including a rewrite of the pass that removes groups of lines from a file.

Many thanks to those who contributed to this release!

Pointer Overflow Checking

Most programming languages have a lot of restrictions on the kinds of pointers that programs can create. C and C++ are unusually permissive in this respect: pointers to arbitrary objects and subobjects, usually all the way down to bytes, can be constructed. Consequently, most address computations can be expressed either in terms of integer arithmetic or pointer arithmetic. For example, a function based on array lookup:

void *memcpy(void *dst, const void *src, size_t n) {
  const char *s = src;
  char *d = dst;
  for (int i = 0; i < n; ++i)
    d[i] = s[i];
  return dst;

can just as easily be expressed in terms of pointers:

void *memcpy(void *dst, const void *src, size_t n) {
  const char *s = src;
  char *d = dst;
  while (n--)
    *d++ = *s++;
  return dst;

Idiomatic C tends to favor pointer-based code. For one thing, pointers are more expressive: a pointer can name any memory location while an integer index only makes sense when combined with a base address. Also, developers have a sense that the lower-level code will execute faster since it is closer to how the machine thinks. This may or may not be true: the tradeoffs are complex due to details of the semantics of pointers and integers, and also because different compiler optimizations will tend to fire for pointer code and integer code. Modern compilers can be pretty bright, at least for very simple codes: the version of GCC that I happen to be using for testing (GCC 5.3.0 at -O2) turns both functions above into exactly the same object code.

It is undefined behavior to perform pointer arithmetic where the result is outside of an object, with the exception that it is permissible to point one element past the end of an array:

int a[10];
int *p1 = a - 1; // UB
int *p2 = a; // ok
int *p3 = a + 9; // ok
int *p4 = a + 10; // ok, but can't be dereferenced
int *p5 = a + 11; // UB

Valgrind and ASan are intended to catch dereferences of invalid pointers, but not their creation or use in comparisons. UBSan catches the creation of invalid pointers in simple cases where bounds information is available, but not in the general case. tis-interpreter can reliably detect creation of invalid pointers.

A lot of C programs (I think it’s safe to say almost all non-trivial ones) create and use invalid pointers, and often they get away with it in the sense that C compilers usually give sensible results when illegal pointers are compared (but not, of course, dereferenced). On the other hand, when pointer arithmetic overflows, the resulting pointers can break assumptions being made in the code.

For the next part of this piece I’ll borrow some examples from a LWN article from 2008. We’ll start with a buffer length check implemented like this:

void awesome_length_check(unsigned len) {
  char *buffer_end = buffer + BUFLEN;
  if (buffer + len >= buffer_end)

Here the arithmetic for computing buffer_end is totally fine (assuming the buffer actually contains BUFLEN elements) but the subsequent buffer + len risks UB. Let’s look at the generated code for a 32-bit platform:

        cmpl    $100, 4(%esp)
        jl      .LBB0_1
        jmp     die_a_gory_death

In general, pointer arithmetic risks overflowing when either the base address lies near the end of the address space or when the offset is really big. Here the compiler has factored the pointer out of the computation, making overflow more difficult, but let’s assume that the offset is controlled by an attacker. We’ll need a bit of a driver to see what happens:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define BUFLEN 100
char buffer[BUFLEN];

void die_a_gory_death(void) { abort(); }

void awesome_length_check(unsigned len) {
  char *buffer_end = buffer + BUFLEN;
  if (buffer + len >= buffer_end)

int main(void) {
  // volatile to suppress constant propagation
  volatile unsigned len = UINT_MAX;
  printf("length check went well\n");
  return 0;

And then:

$ clang -O -m32 -Wall ptr-overflow5.c
$ ./a.out 
length check went well
$ gcc-5 -O -m32 -Wall ptr-overflow5.c
$ ./a.out 
length check went well

The problem is that once the length check succeeds, subsequent code is going to feel free to process up to UINT_MAX bytes of untrusted input data, potentially causing massive buffer overruns.

One thing we can do is explicitly check for a wrapped pointer:

void awesome_length_check(unsigned len) {
  char *buffer_end = buffer + BUFLEN;
  if (buffer + len >= buffer_end ||
      buffer + len < buffer)

But this is just adding further reliance on undefined behavior and the LWN article mentions that compilers have been observed to eliminate the second part of the check. As the article points out, a better answer is to just avoid pointer arithmetic and do the length check on unsigned integers:

void awesome_length_check(unsigned len) {
  if (len >= BUFLEN)

The problem is that we can’t very well go and retrofit all the C code to use integer checks instead of pointer checks. We can, on the other hand, use compiler support to catch pointer overflows as they happen: they are always UB and never a good idea.

Will Dietz, one of my collaborators on the integer overflow checking work we did a few years ago, extended UBSan to catch pointer overflows and wrote a great blog post showing some bugs that it caught. Unfortunately, for whatever reason, these patches didn’t make it into Clang. The other day Will reminded me that they exist; I dusted them off and submitted them for review — hopefully they will get in this time around.

Recently I’ve been thinking about using UBSan for hardening instead of just bug finding. Android is doing this, for example. Should we use pointer overflow checking in production? I believe that after the checker has been thoroughly tested, this makes sense. Consider the trapping mode of this sanitizer:

clang -O3 -fsanitize=pointer-overflow -fsanitize-trap=pointer-overflow

The runtime overhead on SPEC CINT 2006 is about 5%, so probably acceptable for code that is security-critical but not performance-critical. I’m sure we can reduce this overhead with some tuning of the optimizers. The 400.perlbench component of SPEC 2006 contained two pointer overflows that I had to fix.

Pointer overflow isn’t one of the UBs that we can finesse away by adjusting the standard: it’s a real machine-level phenomenon that is hard to prevent without runtime checks.

There’s plenty more work we could do in this sanitizer, such as catching arithmetic involving null pointers.

Update: I built the latest releases of PHP and FFmpeg using the pointer overflow sanitizer and both of them execute pointer overflows while running their own test suites.

Teaching C

The other day Neel Krishnaswami mentioned that he’s going to be teaching the C class at Cambridge in the fall, and asked if I had any advice about that topic. Of course I do! In fact the response got so long that it ended up being a blog post.

My main idea is that we need to teach C in a way that helps students understand why a very large fraction of critical software infrastructure, including almost all operating systems and embedded systems, is written in C, while also acknowledging the disastrously central role that it has played in our ongoing computer security nightmare.

There’s a lot of reading material out there. For the basics, I still recommend that students purchase K&R. People say good things about C Programming: A Modern Approach; I’ve only skimmed it. For advanced C I’ve not read a better book than Expert C Programming, though like K&R it is fairly old. The Practice of Programming is a really great book though it’s not completely specific to C. I haven’t read all of it but from what I’ve seen Modern C is a very good resource, with AFAIK the best treatment of undefined behavior of any C book. The C FAQ contains lots of good material.

For supplemental reading, of course the students need to look at all three parts of Chris Lattner’s writeup about undefined behavior, and mine as well.

What version of C should we teach? Probably a common subset of C99 and C11. In a first C class there’s no need to go into advanced C11 features such as the concurrent memory model.

We’d like students to be able to answer the question: Is C an appropriate choice for solving this problem? We’ll want some lecture material about C’s place in the modern world and we also need to spend time reading some high-quality C code, perhaps starting with Redis, Musl, or Xv6. Musl, in particular, is a good match for teaching since it contains lots of cute little functions that can be understood in isolation. From any such function we can launch a discussion about tradeoffs between portability, efficiency, maintainability, testability, etc. If Rich Felker (the Musl author) did something a certain way, there’s probably a good reason for it and we should be able to puzzle it out. We can also use Matt Godbolt’s super awesome compiler explorer to look at the code generated by various compilers. C’s lightweight-to-nonexistent runtime support is one of its key advantages for real-world system building, and it also means that generated code can be understood without thinking about something like a garbage collector.

We probably also need to spend a bit of time looking at bad old C, the kind that makes the world work even though we’re not proud of it. We can find files in OpenSSL and in the PHP interpreter that would singe your brain despite getting run billions of times a day, or we can always pick on an old standby like glibc — worth looking at just for the preprocessor abuse. But perhaps I am being uncharitable: Pascal Cuoq (reading a draft of this piece) correctly points out that “even what seems like plain stupidity often stems from engineering trade-offs. Does the project try to remain compilable under MS-DOS with DJGPP, with C90 compilers, under VMS, or all three at the same time?” And it is true that we would do well to help students understand that real-world engineering constraints do not often resemble the circumstances that we lead them to expect in school.

The second big thing each student should learn is: How can I avoid being burned by C’s numerous and severe shortcomings? This is a development environment where only the paranoid can survive; we want to emphasize a modern C programming style and heavy reliance on the (thankfully excellent) collection of tools that is available for helping us develop good C.

Static analysis is the first line of defense; the students need to use a good selection of -W flags and then get used to making things compile without warnings. A stronger tool such as the Clang static analyzer should also be used. On the dynamic side, all code handed in by students must be clean as far as ASan, UBSan, and MSan are concerned. tis-interpreter holds code to an even higher standard; I haven’t had students use this tool yet but I think it’s a great thing to try. Since dynamic testing is limited by the quality of the test cases, the students need to get used to using the output of a code coverage tool to find gaps in test coverage. Lots of coverage tools for C are available but I usually just use gcov since it is ubiquitous and hassle-free.

Teaching undefined behavior using sanitizers is a piece of cake: the tool gives students exactly the feedback that they need. The other way of teaching undefined behavior, by looking at its consequences, is something that we should spend a bit of time on, but it requires a different kind of thinking and we probably won’t expect the majority of students to pick up on all the subtleties — even seasoned professional C programmers are often unaware of these.

Detecting errors and doing something about them is a really important part of programming that we typically don’t teach much about in school. Since C is designed to avoid sweeping these problems under the rug, a C class is a great place to get students started on the right track. They should have to implement a goto chain.

Something I’m leaving out of this post is the content of the assignments that we give the students — this mostly depends on the specific goals of the course and how it fits into the broader curriculum (In what year are students expected to take the class? What kind of background do they have in math and science? What languages do they already know?). I’ve always taught C as a side effect of teaching operating systems, embedded systems, or something along those lines. In a course where the primary goal is C we have more freedom, and could look at more domains. Image processing and cryptographic algorithms would be really fun, for example, and even the old standby, data structures, can be used to good effect in class.

I’m also leaving out build systems and version control. They should use these.

In some courses I will give students access to the test infrastructure that will be used to grade their code. This makes assignments a lot more fun, and makes students a lot more happy. Other times I will give them a few test cases and keep the good tests (and the fuzzers) for myself. The idea is to make the assignment not only about implementation but also about testing. This stresses students out but it’s far more realistic.

Pascal remarks that “C is mostly taught very badly, and a student who aims at becoming good at maintaining C code will need to unlearn much that they have (typically) been told in class.” This is regrettably true — a lot of instructors learned C in previous decades and then they teach an outdated language, for example failing to discourage preprocessor abuse. The most serious common failing is to leave students unaware of their side of the bargain when the deal with a C compiler. I am talking of course about undefined behavior (and, to a lesser extent, unspecified and implementation-defined behavior). As a concrete example, I have taught numerous classes based on Computer Systems: A Programmer’s Perspective. In most respects this is an excellent book, but (even in the 3rd edition) it not only ignores undefined behavior but, worse, explicitly teaches students that signed integers in C have two’s complement behavior on overflow:

This claim that positive signed overflow wraps around is neither correct by the C standard nor consistent with the observed behavior of either GCC or LLVM. This isn’t an acceptable claim to make in a popular C-based textbook published in 2015. While I can patch problems in the book during lecture, that isn’t very satisfying, and not all instructors have the time and expertise.

One might argue that we shouldn’t be teaching C any longer, and I would certainly agree that C is probably a poor first or second language. On the other hand, even if we were in a position where no new projects should be written in C (that day is coming, but slowly — probably at least a decade off), we’re still going to be stuck maintaining C for many decades. A random CS graduate has pretty good odds of running into C during her career. But beyond that, even after we replace C, the systems programming niche will remain. A lot of what we learn when we think we’re learning C is low-level programming and that stuff is important.

Thanks to Pascal Cuoq and Robby Findler for commenting on drafts of this piece.

Checking Up on Dataflow Analyses

An important tool for reasoning about programs without running them is dataflow analysis, which can be used to compute facts such as:

  • an integer-valued variable is non-negative at a particular program point
  • a conditional branch always goes one way
  • an indirect write cannot modify a particular array

Dataflow facts drive optimizations such as constant propagation and dead code elimination. Outside of compilers, dataflow analyses are important in tools that prove safety properties, such as absence of undefined behaviors in C or C++ programs. Dataflow analysis is backed by a rich academic literature and tool writers have decades of experience implementing it. This set of lecture notes is my favorite introduction to dataflow analysis.

Despite all of the nice theorems, dataflow analyses are hard to get right, in part because people are just not very good at reasoning about properties that hold over sets of program executions. As a simple exercise in this kind of thinking, try this: you have a variable A whose value is known to be in the range [0..1000] and a variable B that is known to be in the range [10000..10050]. The program computes the bitwise xor of A and B and assigns the result into C. Provide tight bounds on the values that C can take. If you find this to be easy, that’s great — but consider that you then have to make this work for all intervals and after that there are about 9,999 similarly fiddly things left to implement and any mistakes you make are likely to result in miscompilations.

So now let’s finally get to the point of this post. The point is that since dataflow is hard, if you implement a dataflow analysis, you should also implement a dynamic checker that looks for cases where the analysis has come to the wrong conclusion. To keep going with the running example, the program being analyzed contains this line of code:

// A in [0..1000] and B in [10000..10050]
C = A ^ B;

If our interval analysis says that C has to be in the range [9225..12255], we would rewrite the code to add an assertion:

// A in [0..1000] and B in [10000..10050]
C = A ^ B;
assert(C >= 9225 && C <= 12255);

Now we can run the instrumented program and, if we’re very lucky, it will execute this code with values such as A = 830 and B = 10041, making C = 9223, triggering an assertion violation telling us that our analysis is unsound and giving us something to debug. The other way to debug this sort of problem is to backtrack from a miscompilation — an experience enjoyed by basically nobody.

Every dataflow analysis that I’ve implemented or supervised the implementation of has gotten a dynamic checker. This is a great testing technique that has numerous advantages. It finds bugs in practice. It can be automated, either using a program generator like Csmith or else using a regular test suite. The annotated programs serve as a useful way to look at the precision of an analysis: if not very many assertions show up, or if the bounds are very loose, then we probably need to do some tuning for precision. Finally, if we don’t find any bugs, then the dataflow analysis is probably (please excuse me, formal methods people) correct enough for practical purposes.

Of course, not everything that can be computed by a dataflow analysis can be expressed in code. For example, in C and C++ there’s no lightweight way to assert that a pointer refers to valid storage or that a variable is uninitialized. E-ACSL is a neat piece of related work in this general area.

Just recently I’ve been looking at the efficiency of integer overflow checking using LLVM and since then I wrote some code that uses integer range dataflow facts to remove overflow checks that can be proved to not fire. Here’s the under-review patch. The ranges are computed by an LLVM pass called LazyValueInfo (LVI), which better be correct if we’re going to rely on it, so I wrote a little LLVM pass that does the necessary checking. It processes each function in between the optimization and code generation phases by:

(If you build this pass, you’ll need a slightly hacked LLVM, see the README.) Although it might seem tempting to run the optimizers again on the instrumented code in order to clean it up a bit, this would be a very bad idea: the very dataflow analyses that we’re trying to check up on would be used to drive optimizations, which could easily end up incorrectly deleting our checks.

So far, some testing using Csmith hasn’t turned up any bugs in LVI, which is great. Less great is the fact that it drops precision all over the place: a lot of tuning up is needed before it is the basis for a really strong redundant overflow check remover.

The technique I’m describing is not as widely known or as widely used as it should be, considering how easy and effective it is. The best answer is that C = [9216..10239].

UPDATE: The latest version of my LLVM dataflow checker also validates the results of computeKnownBits(), which tries to figure out which bits of a value must be either zero or one.

Here’s a bit of further reading about how these techniques (and others) were applied to the Frama-C static analyzer.

Efficient Integer Overflow Checking in LLVM

(Here’s some optional background reading material.)

We want fast integer overflow checking. Why? First, if the undefined behavior sanitizers go faster then testing goes faster. Second, when overhead drops below a certain point people will become willing to use UBSan to harden production code against integer overflows. This is already being done in parts of Android. It isn’t the kind of thing to do lightly: some of LLVM’s sanitizers, such as ASan, will increase an application’s attack surface. Even UBSan in trapping mode — which does not increase attack surface that I know of — could easily enable remote DoS attacks. But this is beside the point.

Let’s start with this function:

int foo(int x, int y) {
  return x + y; 

When compiled with trapping integer overflow checks (as opposed to checks that provide diagnostics and optionally continue executing) Clang 3.8 at -O2 gives:

        addl    %esi, %edi
        jo      .LBB0_1
        movl    %edi, %eax

This code is efficient; here Chandler Carruth explains why. On the other hand, signed integer overflow checking slows down SPEC CINT 2006 by 11.8% overall, with slowdown ranging from negligible (GCC, Perl, OMNeT++) to about 20% (Sjeng, H264Ref) to about 40% (HMMER).

Why do fast overflow checks add overhead? The increase in object code size due to overflow checking is less than 1% so there’s not going to be much trouble in the icache. Looking at HMMER, for example, we see that it spends >95% of its execution time in a function called P7Viterbi(). This function can be partially vectorized, but the version with integer overflow checks doesn’t get vectorized at all. In other words, most of the slowdown comes from integer overflow checks interfering with loop optimizations. In contrast, GCC and Perl probably don’t get much benefit from advanced loop optimizations in the first place, hence the lack of slowdown there.

Here I’ll take a second to mention that I had to hack SPEC slightly so that signed integer overflows wouldn’t derail my experiments. The changes appear to be performance-neutral. Only GCC, Perl, and H254Ref have signed overflows. Here’s the patch for SPEC CPU 2006 V1.2. All performance numbers in this post were taken on an i7-5820K (6-core Haswell-E at 3.3 GHz), running Ubuntu 14.04 in 64-bit mode, with frequency scaling disabled.

Now for the fun part: making code faster by removing overflow checks that provably don’t fire. At -O0 the SPEC INT 2006 benchmarks have 67,678 integer overflow checks, whereas at -O3 there are 38,527. So that’s nice: LLVM 3.8 can already get rid of 43% of naively inserted checks.

Let’s look at some details. First, the good news: all of the overflow checks in these functions are optimized away by LLVM:

int foo2(int x, int y) {
  return (short)x + (short)y;

int foo3(int x, int y) {
  return (long)x + (long)y;

int foo4(int x, int y) {
  return (x >> 1) + (y >> 1);

int32_t foo5(int32_t x, int32_t y) {
  const int32_t mask = ~(3U << 30);
  return (x & mask) + (y & mask);

int32_t foo6(int32_t x, int32_t y) {
  const int32_t mask = 3U << 30;
  return (x | mask) + (y | mask);

int32_t foo7(int32_t x, int32_t y) {
  const int32_t mask = 1U << 31;
  return (x | mask) + (y & ~mask);

See for yourself here. The common theme across these functions is that each overflow check can be seen to be unnecessary by looking at the high-order bits of its inputs. The code for these optimizations is here.

Now on the other hand, LLVM is unable to see that these functions don’t require overflow checks:

int foo8(int x) {
  return x < INT_MAX ? x + 1 : INT_MAX;

int foo9(int x, int y) {
  if (x < 10 && x > -10 && y < 10 && y > -10)
    return x + y;
  return 0;

Here’s another one where the check doesn’t get eliminated:

void foo10(char *a) {
  for (int i = 0; i < 10; i++)
    a[i] = 0;

There’s good news: Sanjoy Das has a couple of patches that, together, solve this problem. Their overall effect on SPEC is to reduce the overhead of signed integer overflow checking to 8.7%.

Finally, this function should get one overflow check rather than three:

unsigned foo11(unsigned *a, int i) {
  return a[i + 3] + a[i + 5] + a[i + 2];

This example (or something like it) is from Nadav Rotem on twitter; his redundant overflow check removal pass for Swift eliminates this sort of thing at the SIL level. I’ve done a bit of work on bringing these ideas to LLVM, and will hopefully have more to write about that later on.

In summary, signed integer overflow checks in LLVM are fast but they get in the way of the optimizers, which haven’t yet been systematically taught to see through them or eliminate them. There’s plenty of low-hanging fruit, and hopefully we can get the overhead down to the point where people can turn on overflow checking in production codes without thinking too hard about the tradeoffs.


I haven’t forgotten about Souper! We’ve taught it to eliminate integer overflow checks. I’ll write about that later too.

See Dan Luu’s article on this topic.