This Is Not a Defect

In several previous blog entries I’ve mentioned that in some recent versions of C and C++, left-shifting a 1 bit into the high-order bit of a signed integer is an undefined behavior. In other words, if you have code that computes INT_MIN by evaluating 1<<31 (or 1<<(sizeof(int)*CHAR_BIT-1) if you want to be that way) or code that does a byte swap using a subexpression like c<<24, then your program most likely has no meaning according to the standards. And in fact, Clang's integer sanitizer confirms that most non-trivial codes, including several crypto libraries, are undefined according to this rule.
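
To make the rule concrete, here is a minimal sketch, assuming a 32-bit int:

#include <limits.h>

void shift_example(void) {
  int a = 1 << 31;         /* undefined behavior: shifts a 1 into the sign bit */
  int b = (int)(1U << 31); /* the unsigned shift is well-defined; converting the
                              out-of-range result back to int is merely
                              implementation-defined */
  int c = INT_MIN;         /* the portable spelling of the same value */
  (void)a; (void)b; (void)c;
}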

An obvious fix is to tighten up the standard a bit and specify that shifting a 1 into the sign bit has the expected effect, which is what all compilers that I am aware of already do. This is what the C++ standards committee is doing, though as far as I can tell the fix doesn't officially take effect until a technical corrigendum (TC) is issued. Even a TC doesn't make the change final, but it seems near enough in practice.

Anyhow, today Nevin Liber pointed out some news: last month the C standards committee decided that this same issue is not a defect in C, and that they'll reconsider it later on. I suppose this is fine, since compiler implementers are already ignoring this particular undefined behavior, but it seems like a missed opportunity to (1) make the language slightly saner and (2) bring it into line with existing practice. You might also consider perusing the full set of defect reports if you want to be thankful that you did something other than attend a standards meeting last month.

Cedar Mesa

For years I’d heard people talk about Cedar Mesa, a remote part of southern Utah containing so many Anasazi ruins that it’s basically a huge outdoor museum. Recently my family spent a few days exploring this area. Despite the fact that Cedar Mesa is well-known — it was popularized, in large part, by a book by David Roberts in the 1990s — as far as we could tell nobody was camped within several miles of our campsite off of a high-clearance track near the head of Lime Canyon, seen here in the evening light:

April is a great time to be in the desert, but this area is at pretty high elevation (6400 feet, or almost 2000 m) and it was well below freezing on our first night out. Here the sun is finally starting to warm us up the next morning:

Yep, the kids are wearing their snow pants. Later that morning we visited the Moon House, one of the larger ruins in the area. Although the hike to it is short, the route is circuitous: it first drops over a small pouroff, follows a ledge around a corner, descends a talus slope to the bottom of the canyon, passes between some huge boulders, climbs most of the way up the other side, and finally follows another ledge behind a big pinnacle. There are good views along the way:

The ruins are impressive:

Life in the desert, though a bit sparse, is often pretty:

The next day we hiked in Natural Bridges National Monument; as you might expect it contains some big natural bridges:

And there were other things to see as well:

Although metates (stone mortars) are a common sight on the Colorado Plateau, manos (grinding stones) are something I haven’t seen elsewhere, since they are so easy to pick up and carry away:

We had great weather in the early part of the trip but got chased home a day early by a rainy night with a forecast for more rain: the roads in this part of the world tend to turn into grease that is impassable even with 4WD when they get wet enough, and we really did not want to get stuck while pulling our tent trailer:

All in all a nice short vacation.

I’m slowly ratcheting down the number of personal blog posts but I will continue to throw in this sort of thing every now and then.

Too Much Milk: The Secret Origin Story

When I first taught operating systems 12 years ago, I based my teaching materials on a set of slides inherited from John Carter, the previous instructor at Utah. I was generally happy with these slides, and I’ve continued to evolve them since then, but one thing I was always curious about was the “too much milk” example that is used to introduce concurrency problems. It goes something like: “You see that you’re out of milk and leave to go buy some. A few minutes later your roommate comes home and sees that there’s no milk. Later: OH NO THERE’S TOO MUCH MILK because of a synchronization error between humans.” This always struck me as a bit of an idiosyncratic way to go about introducing synchronization but I didn’t have any reason to replace it and didn’t think about it much until one time while procrastinating on lecture prep I Googled “too much milk” and learned the shocking truth: some large fraction of operating systems courses at research universities use this same example. See this for yourself.

The intriguing possibility is that too much milk could be used as a sort of mitochondrial DNA to trace the evolution of OS course notes. I started to envision creating a tool for analyzing the lineage of PowerPoint files, but then totally forgot about it until a few weeks ago. While chatting with Ryan Stutsman about teaching OS, for some reason I mentioned the Too Much Milk phenomenon. He thought that this had probably originated with his advisor John Ousterhout. John then explained:

I believe that the “too much milk” example owes its popularity to me, in that I have been using it ever since the early 1980s in my courses. For many years in the 1980s and 1990s, whenever a PhD student graduated in the systems area, Dave Patterson gave them a complete set of lecture videos for a couple of systems classes, including my CS 162 lectures. As a result, a lot of the material from my CS 162 lectures has spread all across the country through Berkeley graduates, including the “too much milk” example.

However, I did not invent this example. Before I taught my first operating systems course at Berkeley, Mike Powell gave me a copy of his notes and the “too much milk” example was there. So, it goes back at least to Mike Powell. I don’t know if he invented it, or if he got it from someone else.

So there you have it.

This post doesn’t have too much of a point, but I thought this was a nice story and also it provides a bit of insight into how teaching actually works: figuring out how to teach course material is hard and in practice, most of the time we borrow a framework and adapt it to fit our needs. Every now and then I’ll discuss this in class and invariably one or two undergraduates are so surprised that I’ll get dinged on the course evaluations with comments like “Lazy professor doesn’t even develop his own course materials.” However, I’ve done three or four courses the other way — starting from scratch — and have found that it takes multiple iterations before the material converges, so it’s not really clear to me that it’s better to start from scratch in cases where a good set of existing material can be found.

Hints for Computer System Design

On the last day of my advanced OS course this spring we discussed one of my all-time favorite computer science papers: Butler Lampson’s Hints for Computer System Design. Why is it so good?

  • It’s hard-won advice. Designing systems is not easy — a person can spend a lifetime learning to do it well — and any head start we can get is useful.

  • The advice is broadly applicable and is basically correct; there aren’t really any places where I need to tell students “Yeah… ignore Section 3.5.”

  • There are many, many examples illustrating the hints. Some of them require a bit of historical knowledge (the paper is 30 years old) but by and large the examples stand the test of time.

  • It seems to me that a lot of Lampson’s hints were routinely ignored by the developers of large codes that we use today. I think the reason is pretty obvious: the massive increases in throughput and storage capacity over the last 30 years have permitted a great deal of sloppy code to be created. It’s nice to read a collection of clear thinking about how things could be done better.

Something that bums me out is that it’s now impossible to publish a paper like this at a top conference such as SOSP.

How Should You Write a Fast Integer Overflow Check?

Detecting integer overflow in languages that have wraparound semantics (or, worse, undefined behavior on overflow, as in C/C++) is a pain. The issue is that programming languages do not provide access to the hardware overflow flag that is set as a side effect of most ALU instructions. With the condition codes hidden, overflow checks become not only convoluted but also error-prone. An obvious solution — writing the overflow checks in assembly — is undesirable for portability reasons, and it also performs worse than we might hope: assembly has opaque semantics to the higher-level language, so the overflow checks cannot be removed in cases where the overflow can be shown to not happen.

In this post we’ll look at various implementations for this addition function:

/*
 * perform two's complement addition on the first two arguments;
 * put the result in the location pointed to by the third argument;
 * return 1 if overflow occurred, 0 otherwise
 */
int checked_add(int_t, int_t, int_t *);

This operation closely mimics a hardware add instruction. Here’s an implementation that is optimal (at least in instruction count) for the 64-bit case on x86-64, using the GCC/Linux calling convention:

checked_add:
        xor     %eax, %eax
        addq    %rsi, %rdi
        movq    %rdi, (%rdx)
        seto    %al
        ret

The problem is that compilers aren’t very good at taking an implementation of checked_add() in C/C++ and turning it into this code.

This is perhaps the simplest and most intuitive way to check for overflow:

int checked_add_1(int_t a, int_t b, int_t *rp) {
  wideint_t lr = (wideint_t)a + (wideint_t)b;
  *rp = lr;
  return lr > MAX || lr < MIN;
}

The only problem here is that we require a wider integer type than the one being added. When targeting x86-64, both GCC and Clang support 128-bit integers:

typedef int64_t int_t;
typedef __int128 wideint_t;
#define MAX INT64_MAX
#define MIN INT64_MIN

On the other hand, neither compiler supports 128-bit integers when generating code for a 32-bit target.

If we can’t or don’t want to use the wider type, we can let a narrow addition wrap around and then compare the signs of the operands and result:

int checked_add_2(int_t a, int_t b, int_t *rp) {
  int_t r = (uint_t)a + (uint_t)b;
  *rp = r;
  return 
    (a < 0 && b < 0 && r >= 0) ||
    (a >= 0 && b >= 0 && r < 0);
}

Here is the equivalent bitwise version:

#define BITS (8*sizeof(int_t))

int checked_add_3(int_t a, int_t b, int_t *rp) {
  uint_t ur = (uint_t)a + (uint_t)b;
  uint_t sr = ur >> (BITS-1);
  uint_t sa = (uint_t)a >> (BITS-1);
  uint_t sb = (uint_t)b >> (BITS-1);
  *rp = ur;
  return 
    (sa && sb && !sr) ||
    (!sa && !sb && sr);
}

What if we don’t have a wider type and also we don’t want to use unsigned math? Then we call this function:

int checked_add_4(int_t a, int_t b, int_t *rp) {
  if (b > 0 && a > MAX - b) {
    *rp =  (a + MIN) + (b + MIN);
    return 1;
  }
  if (b < 0 && a < MIN - b) {
    *rp =  (a - MIN) + (b - MIN);
    return 1;
  }
  *rp = a + b;
  return 0;
}

It looks odd but I believe it to be correct. You can find all of these functions (plus a crappy test harness) in this file and x86-64 assembly code for all widths in this file. Let me know if you have a checked add idiom that’s different from the ones here.

As a crude measure of code quality, let’s look at the number of instructions that each compiler emits, taking the smallest number of instructions across -O0, -O1, -O2, -Os, and -O3. Here I’m targeting x86-64 on Linux and using the latest releases of the compilers: Clang 3.4 and GCC 4.9.0. The results from compilers built from SVN the other day are not noticeably different.

                        8 bits  16 bits  32 bits  64 bits
GCC checked_add_1()          9        9       11       18
GCC checked_add_2()         14       21       21       21
GCC checked_add_3()         18       18       18       18
GCC checked_add_4()         12       12       16       23
Clang checked_add_1()        5        5        5       14
Clang checked_add_2()       16       16       14       14
Clang checked_add_3()       20       23       14       14
Clang checked_add_4()       18       18       18       22
optimal                      5        5        5        5

The only real success story here is Clang and checked_add_1() for 8, 16, and 32-bit adds. The optimization responsible for this win can be found in a function called ProcessUGT_ADDCST_ADD() at line 1901 of this file from the instruction combiner pass. Its job — as you might guess — is to pattern-match on code equivalent to (but not identical to — some other passes have already transformed the code somewhat) checked_add_1(), replacing these operations with an intrinsic function that does a narrower add and separately returns the overflow bit. This intrinsic maps straightforwardly onto a machine add and an access to the hardware overflow flag, permitting optimal code generation.

Across both compilers, checked_add_1() gives the best results. Since it is also perhaps the most difficult check to mess up, that’s the one I recommend using; it also has fairly straightforward equivalents for subtract, multiply, and divide. The only issue is that a double-width operation cannot be used to perform safe addition of 64-bit quantities on a 32-bit platform, since neither Clang nor GCC supports 128-bit math on that platform.
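
For example, here is a minimal sketch of the analogous subtraction check, written against the same int_t/wideint_t typedefs used above:

int checked_sub_1(int_t a, int_t b, int_t *rp) {
  /* the difference of two int_t values always fits in the wider type */
  wideint_t lr = (wideint_t)a - (wideint_t)b;
  *rp = lr;
  return lr > MAX || lr < MIN;
}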

Ideally, authors of safe math libraries would spend some time creating code that can be optimized effectively by both GCC and LLVM, and then we could all just use that library. So far there’s one such example: Xi Wang’s libo, which avoids leaning so hard on the optimizer by directly invoking the llvm.x.with.overflow intrinsics. Unfortunately there’s no equivalent mechanism in GCC (that I know of, at least) for guaranteeing fast overflow checks.
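
For what it’s worth, recent Clang releases also expose checked-arithmetic builtins that lower directly to these intrinsics; here is a minimal sketch for the 64-bit case (illustrative, not a drop-in library):

#include <stdint.h>

int checked_add_builtin(int64_t a, int64_t b, int64_t *rp) {
  long long r;
  /* lowers to llvm.sadd.with.overflow.i64 */
  int overflow = __builtin_saddll_overflow(a, b, &r);
  *rp = r;
  return overflow;
}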

Testing with Small Capacities

The goal of testing is to expose bugs. To find a bug we must drive the software under test into a part of its state space where the bug becomes evident, for example by causing a crash. Many bugs lurk in dark corners of the state space that are ordinarily difficult to reach. This piece is about a nice technique for making those dark corners easier to reach using either random or non-random testing. Although the method is one that I have used before, I hadn’t really thought about it explicitly until a conversation with Colin Runciman the other day brought it out into the open.

The issue is the small scope hypothesis, which states that most bugs can be triggered using small test cases. Here we’re talking about a class of bug that falls outside of this hypothesis: these bugs are triggered when some implementation limit is exceeded. For example, if we have a bounded buffer with length 10,000, and this buffer contains a bug in its handling of the buffer-full case, then it is clear that no test case can trigger the bug unless it results in more than 10,000 insertions into the buffer. Similarly, a bug in an operating system’s swapping code will not be hit until we put memory pressure on the machine — this will be difficult if we’re testing on a machine with lots of RAM. A bug in a compiler’s register spill logic will not be hit until we run out of registers, which may happen only rarely on a RISC with a large register file.

The testing technique, then, is simply to recognize capacities in the system under test and to artificially reduce them in order to make capacity bugs much easier to trigger. I think that probably every accomplished programmer has used this technique, perhaps without even thinking about it: a simple example is testing the code for an auto-resizing hash table by making its initial capacity very small. This is just good sense.
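
As a concrete sketch (the module and names here are hypothetical), the capacity can be exposed as a compile-time knob that test builds shrink, e.g. by compiling with -DVEC_INITIAL_CAPACITY=1 so that the resize path runs on nearly every append instead of almost never:

#include <stdlib.h>

#ifndef VEC_INITIAL_CAPACITY
#define VEC_INITIAL_CAPACITY 1024 /* production default */
#endif

struct vec { int *data; size_t len, cap; };

int vec_push(struct vec *v, int x) {
  if (v->len == v->cap) {
    /* grow path: rarely exercised at the production capacity */
    size_t ncap = v->cap ? 2 * v->cap : VEC_INITIAL_CAPACITY;
    int *nd = realloc(v->data, ncap * sizeof *nd);
    if (!nd)
      return -1; /* out of memory; caller must handle */
    v->data = nd;
    v->cap = ncap;
  }
  v->data[v->len++] = x;
  return 0;
}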

Of course, while some capacities appear as explicit constants in the code that can be easily changed, others, such as the number of registers available to a compiler, will tend to be encoded in a more implicit fashion, and will therefore be more difficult to find and change. Also, the fact that a constant is hard-coded into a program doesn’t mean this constant can be freely changed: in some cases the constant is fake (how many programs that reference the CHAR_BIT constant can really tolerate values other than 8?) and in others the program is only designed to operate for a certain range of choices. To continue with the compiler example, the calling convention (and other factors) will almost certainly cause failures when the number of registers falls below a certain number. Similarly, a pipe or a thread pool can easily cause the system to deadlock when it is instantiated with a capacity that is too small. These things are design constraints, not the implementation errors that we are looking for.

In summary, if you have a system that offers a choice of capacity for an internal resource, it is often very useful to reduce this capacity for testing purposes, even if this results in poor throughput. This effect is particularly pronounced during fuzzing, where the random walks through the state space that are induced by random test cases will tend to hover around mean values with high probability.

Find the Integer Bug

Not all of the functions below are correct. The first person to leave a comment containing a minimal fix to a legitimate bug will get a small prize. I’m posting this not because the bug is particularly subtle or interesting but rather because I wrote this code for a piece about integer overflow and thought — wrongly, as usual — that I could get this stuff right the first time.

By “legitimate bug” I mean a bug that would cause the function to execute undefined behavior or return an incorrect result using GCC or Clang on Linux on x86 or x64 — I’m not interested in unexpected implementation-defined integer truncation behaviors, patches for failures under K&R C on a VAX-11, or style problems. For convenience, here are GCC’s implementation-defined behaviors for integers. I don’t know that Clang has a section in the manual about this but in general we expect its integers to behave like GCC’s.

#include <limits.h>
#include <stdint.h>

/*
 * specification:
 *
 * perform 32-bit two's complement addition on arguments a and b
 * store the result into *rp
 * return 1 if overflow occurred, 0 otherwise
 */

int32_t checked_add_1(int32_t a, int32_t b, int32_t *rp) {
  int64_t lr = (int64_t)a + (int64_t)b;
  *rp = lr;
  return lr > INT32_MAX || lr < INT32_MIN;
}

int32_t checked_add_2(int32_t a, int32_t b, int32_t *rp) {
  int32_t r = (uint32_t)a + (uint32_t)b;
  *rp = r;
  return 
    (a < 0 && b < 0 && r > 0) ||
    (a > 0 && b > 0 && r < 0);
}

int32_t checked_add_3(int32_t a, int32_t b, int32_t *rp) {
  if (b > 0 && a > INT32_MAX - b) {
    *rp =  (a + INT32_MIN) + (b + INT32_MIN);
    return 1;
  }
  if (b < 0 && a < INT32_MIN - b) {
    *rp =  (a - INT32_MIN) + (b - INT32_MIN);
    return 1;
  }
  *rp = a + b;
  return 0;
}

UPDATE: The prize has been awarded; see comment 5. The prize is an Amazon gift certificate for $27.49 — the cost of the Kindle edition of Hacker’s Delight.

Research Advice from Alan Adler

Although I am a happy French press user, I enjoyed reading an article about Alan Adler and the AeroPress that showed up recently on Hacker News. In particular, I love Adler’s advice to inventors:

  1. Learn all you can about the science behind your invention.
  2. Scrupulously study the existing state of your idea by looking at current products and patents.
  3. Be willing to try things even if you aren’t too confident they’ll work. Sometimes you’ll get lucky.
  4. Try to be objective about the value of your invention. People get carried away with the thrill of inventing and waste good money pursuing something that doesn’t work any better than what’s already out there.
  5. You don’t need a patent in order to sell an invention. A patent is not a business license; it’s permission to be the sole maker of a product (and even this is limited to 20 years).

Now notice that (disregarding the last suggestion) we can simply replace “invention” with “research project” and Adler’s suggestions become a great set of principles for doing research. I think #4 is particularly important: lacking the feedback that people in the private sector get from product sales (or lack thereof), we academics are particularly susceptible to falling in love with pretty ideas that don’t improve anything.

A New Development for Coverity and Heartbleed

As a consequence of my last post, I spent some time on the phone Friday with Andy Chou, Coverity’s CTO and a former member of Dawson Engler’s research group at Stanford, from which Coverity spun off over a decade ago. Andy confirmed that Coverity does not spot the heartbleed flaw, and said that the defect remained stubbornly hidden even when they tweaked various analysis settings. Basically, the control flow and data flow between the socket read() from which the bad data originates and the eventual bad memcpy() is just too complicated.

Let’s be clear: it is trivial to create a static analyzer that runs fast and flags heartbleed. I can accomplish this, for example, by flagging a taint error in every line of code that is analyzed. The task that is truly difficult is to create a static analysis tool that is performant and that has a high signal to noise ratio for a broad range of analyzed programs. This is the design point that Coverity is aiming for, and while it is an excellent tool there is obviously no general-purpose silver bullet: halting problem arguments guarantee the non-existence of static analysis tools that can reliably and automatically detect even simple kinds of bugs such as divide by zero. In practice, it’s not halting problem stuff that stops analyzers but rather code that has a lot of indirection and a lot of data-dependent control flow. If you want to make a program that is robustly resistant to static analysis, implement some kind of interpreter.

Anyhow, Andy did not call me to admit defeat. Rather, he wanted to show off a new analysis that he and his engineers prototyped over the last couple of days. Their insight is that we might want to consider byte-swap operations to be sources of tainted data. Why? Because there is usually no good reason to call ntohs() or a similar function on a data item unless it came from the network or from some other outside source. This is one of those ideas that is simple, sensible, and obvious — in retrospect. Analyzing a vulnerable OpenSSL using this new technique gives this result for the heartbleed code:

Nice! As you might guess, additional locations in OpenSSL are also flagged by this analysis, but it isn’t my place to share those here.
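
To make the heuristic concrete, here is a hypothetical sketch (invented code, not OpenSSL’s) of the pattern such an analysis would flag:

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

void handle_record(const unsigned char *pkt, unsigned char *out) {
  uint16_t len;
  memcpy(&len, pkt, sizeof len);
  len = ntohs(len);          /* byte swap: treat the result as tainted */
  memcpy(out, pkt + 2, len); /* tainted value reaches a memcpy() length
                                without a bounds check: report a defect */
}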

What issues are left? Well, for one thing it may not be trivial to reliably recognize byte-swapping code; for example, some people may use operations on a char array instead of shift-and-mask. A trickier issue is reliably determining when untrusted data items should be untainted: do they need to be tested against both a lower and upper bound, or just an upper bound? What if they are tested against a really large bound? Does every bit-masking operation untaint? Etc. The Coverity people will need to work out these and other issues, such as the effect of this analysis on the overall false alarm rate. In case this hasn’t been clear, the analysis that Andy showed me is a proof-of-concept prototype and he doesn’t know when it will make it into the production system.

UPDATE: Also see Andy’s post at the Coverity blog.

Heartbleed and Static Analysis

Today in my Writing Solid Code class we went through some of the 151 defects that Coverity Scan reports for OpenSSL. I can’t link to these results but take my word for it that they are a pleasure to read — the interface clearly explains each flaw and the reasoning that leads up to it, even across multiple function calls. Some of the problems were slightly alarming but we didn’t see anything that looks like the next heartbleed.

Unfortunately, Coverity did not find the heartbleed bug itself. (UPDATE: See this followup post.) This is puzzling; here’s a bit of speculation about what might be going on. There are basically two ways to find heartbleed using static analysis (here I’ll assume you’re familiar with the bug; if not, this post is useful).

First, a taint analysis should be able to observe that two bytes from an untrusted source find their way into the length argument of a memcpy() call. This is clearly undesirable. The Coverity documentation indicates that it taints the buffer stored to by a read() system call (unfortunately you will need to log in to Coverity Scan before you can see this). So why don’t we get a defect report? One guess is that since the data buffer is behind two pointer dereferences (the access is via s->s3->rrec.data), the scanner’s alias analysis loses track of what is going on. Another guess is that two bytes of tainted data are not enough to trigger an alarm. Only someone familiar with the Coverity implementation can say for sure what is going on — the tool is highly sophisticated not just in its static analysis but also in its defect ranking system.

The other kind of static analysis that would find heartbleed is one that insists that the length argument to any memcpy() call does not exceed the size of either the source or destination buffer. Frama-C in value analysis mode is a tool that can do this. It is sound, meaning that it will not stop complaining until it can prove that certain defect classes are not present, and as such it requires far more handholding than does Coverity, which is designed to unsoundly analyze huge quantities of code. To use Frama-C, we would make sure that its own header files are included instead of the system headers. In one of those files we would find a model for memcpy():

/*@ requires \valid(((char*)dest)+(0..n - 1));
  @ requires \valid_read(((char*)src)+(0..n - 1));
  @ requires \separated(((char *)dest)+(0..n-1),((char *)src)+(0..n-1));
  @ assigns ((char*)dest)[0..n - 1] \from ((char*)src)[0..n-1];
  @ assigns \result \from dest;
  @ ensures memcmp((char*)dest,(char*)src,n) == 0;
  @ ensures \result == dest;
  @*/
extern void *memcpy(void *restrict dest,
                    const void *restrict src, 
                    size_t n);

The comments are consumed by Frama-C. Basically they say that src and dest are pointers to valid storage of at least the required size, that the buffers do not overlap (recall that memcpy() has undefined behavior when called with overlapping regions), that it moves data in the proper direction, that the return value is dest, and that a subsequent memcmp() of the two regions will return zero.

The Frama-C value analyzer tracks an integer variable using an interval: a representation of the smallest and largest value that the integer could contain at some program point. Upon reaching the problematic memcpy() call in t1_lib.c, the value of payload is in the interval [0..65535]. This interval comes from the n2s() macro which turns two arbitrary-valued bytes from the client into an unsigned int:

#define n2s(c,s)        ((s=(((unsigned int)(c[0]))<< 8)| \
                            (((unsigned int)(c[1]))    )),c+=2)

The dest argument of this memcpy() turns out to be large enough. However, the source buffer is way too small. Frama-C would gripe about this and it would not shut up until the bug was really fixed.
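
For reference, here is a sketch (paraphrased, not OpenSSL’s actual code) of the kind of bounds check that fixes the bug; with the check in place, the value analysis can narrow the interval for payload enough to prove the memcpy() safe:

#include <stddef.h>
#include <string.h>

int process_heartbeat(const unsigned char *rec, size_t rec_len,
                      unsigned char *out) {
  if (rec_len < 1 + 2 + 16)
    return 0; /* too short for type, length, and padding: discard */
  unsigned int payload = ((unsigned int)rec[1] << 8) | rec[2]; /* n2s() */
  if (1 + 2 + payload + 16 > rec_len)
    return 0; /* claimed payload does not fit in the record: discard */
  /* here payload <= rec_len - 19, so reading payload bytes starting
     at rec + 3 stays inside the record */
  memcpy(out, rec + 3, payload);
  return 1;
}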

How much effort would be required to push OpenSSL through Frama-C? I don’t know, but I wouldn’t plan on getting this done in a week or three. Interestingly, a company spun off by Frama-C developers has recently used the tool to create a version of PolarSSL that they promise is immune to CWEs 119, 120, 121, 122, 123, 124, 125, 126, 127, 369, 415, 416, 457, 562, and 690. I think it would be reasonable for the open source security community to start thinking more seriously about what this kind of tool can do for us.

UPDATES:

  • In a comment below, Masklinn states that OpenSSL’s custom allocators would defeat the detection of the too-large argument to memcpy(). This is indeed a danger. To avoid it, as part of applying Frama-C to OpenSSL, the custom malloc/free functions would be marked as being malloc-like using the “allocates” and “frees” keywords supported by ACSL 1.8. Coverity lets you do the same thing, and so would any competent analyzer for C. Custom allocators are regrettably common in oldish C code.
  • I’m interested to see what other tools have to say about heartbleed. If you have a link to results, please put it in a comment.
  • I re-ran Coverity after disabling OpenSSL’s custom freelist and also hacking CRYPTO_malloc() and friends to just directly call the obvious function from the malloc family. This caused Coverity to report 173 new defects: mostly use-after-free and resource leaks. Heartbleed wasn’t in the list, however, so I stand by my guess (above) that perhaps something related to indirection caused this defect to not be ranked highly enough to be reported.
  • HN has some discussion of PolarSSL and of this blog post. Also Reddit.