Overflows in SafeInt

Update from Friday 9/23: The SafeInt developers have already uploaded a new version that fixes the problems described in this post. Nice!

I have a minor obsession with undefined behaviors in C and C++. Lately I’ve been tracking down some integer overflows in Firefox (of which there are quite a few) and some of them seemed to originate in the well-known SafeInt library that it uses to avoid performing unsafe integer operations. So I poked around in the latest version of SafeInt and found that executing its (quite good) test suite performs undefined integer operations at 43 different places in the code. A substantial number of them stem from code like this:

bool negative = false;
if (a<0) {
  a = -a;
  negative = true;
}

... assume a is positive ...

if (negative) {
  return -result;
} else {
  return result;
}

To see a real example of this, take a look at the code starting at line 2100 of SafeInt3.hpp. The problem here, of course, is that for the most common choices of implementation-defined behaviors in C++, negating INT_MIN is an undefined behavior.

Now we have to ask a couple of questions. First, does this code operate properly if the compiler happens to implement -INT_MIN using 2’s complement arithmetic? I’m not sure about all of the overflows in SafeInt, but I believe this function actually does work in that case. The second question is, do compilers (despite not being required to do so) give 2’s complement semantics to -INT_MIN? Every compiler I tried has this behavior when optimizations are disabled. On the other hand, not all compilers have this behavior when optimizations are turned on. A simple test you can do is to compile this function with maximum optimization:

void bar (void);

void foo (int x) {
  if (x<0) x = -x;
  if (x<0) bar();
}

If the resulting assembly code contains no call to bar(), then the compiler has (correctly) observed that every path through this function either does not call bar() or else relies on undefined behavior. Once the compiler sees this, it is free to eliminate the call to bar() as dead code. Most of the compilers I tried — even ones that are known to exploit other kinds of integer undefined behavior — don’t perform this optimization. However, here’s what I get from a recent GCC snapshot:

[regehr@gamow ~]$ current-gcc -c -O2 overflow2.c        
[regehr@gamow ~]$ objdump -d overflow2.o
0000000000000000 <foo>:
 0:    f3 c3                    repz retq 

Now, does this same optimization ever fire when compiling SafeInt code, causing it to return a wrong result? Here’s a bit of test code:

#include <cstdio>
#include <climits>
#include "SafeInt3.hpp"

void test (__int32 a, __int64 b) {
  __int32 ret;
  bool res = SafeMultiply (a, b, ret);
  if (res) {
    printf ("%d * %lld = %d\n", a, b, ret);
  } else {
    printf ("%d * %lld = INVALID\n", a, b);
  }
}

int main (void) {
  test (INT_MIN, -2);
  test (INT_MIN, -1);
  test (INT_MIN, 0);
  test (INT_MIN, 1);
  test (INT_MIN, 2);  
  return 0;
}

Next we compile it with the same recent g++ snapshot at a couple of different optimization levels and run the resulting executables:

[regehr@gamow safeint]$ current-g++ -O1 -Wall safeint_test.cpp
[regehr@gamow safeint]$ ./a.out
-2147483648 * -2 = INVALID
-2147483648 * -1 = INVALID
-2147483648 * 0 = 0
-2147483648 * 1 = -2147483648
-2147483648 * 2 = INVALID
[regehr@gamow safeint]$ current-g++ -O2 -Wall safeint_test.cpp
[regehr@gamow safeint]$ ./a.out
-2147483648 * -2 = INVALID
-2147483648 * -1 = INVALID
-2147483648 * 0 = 0
-2147483648 * 1 = INVALID
-2147483648 * 2 = INVALID

The first set of results is correct. The second set is wrong for the INT_MIN * 1 case, which should not overflow. Basically, at -O2 and higher gcc and g++ turn on optimization passes that try to generate better code by taking integer undefined behaviors into account. Let’s be clear: this is not a compiler bug; g++ is simply exploiting a bit of leeway given to it by the C++ standards.

What can we take away from this example?

  • It’s a little ironic that the SafeInt library (a widely used piece of software, and not a new one) is itself performing operations with undefined behavior. This is not the only safe math library I’ve seen that does this — it is simply very hard to avoid running afoul of C/C++’s integer rules, particularly without good tool support.
  • It’s impressive that G++ was able to exploit this undefined behavior. GCC is getting to be a very strongly optimizing compiler, and SafeInt was carefully designed not to get in the way of compiler optimizations.

If you have security-critical code that uses SafeInt to manipulate untrusted data, should you be worried? Hard to say. I was able to get SafeInt to malfunction, but only in a conservative direction (rejecting a valid operation as invalid, instead of the reverse) and only using a recent G++ snapshot (note that I didn’t try a lot of compilers; there could easily be others that do the same thing). Also, there are 42 more integer overflow sites that I didn’t look at in detail. One way to be safe would be to use GCC’s -fwrapv option, which forces 2’s complement semantics for signed integer overflows. Clang also supports this option, but most other compilers don’t have an equivalent one. Unfortunately, SafeInt is structured as a header file rather than a separately compiled library, so it isn’t possible to just rebuild one file with -fwrapv; the flag would have to be applied to every translation unit that includes SafeInt.
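For completeness, here is a minimal sketch of how the negate-and-restore pattern can be written without relying on -fwrapv, by doing the magnitude computation in unsigned arithmetic, where wraparound is well-defined. This is just an illustration; it is not code from SafeInt and is not necessarily how the SafeInt developers fixed the problem.

#include <climits>
#include <cstdio>

// Sketch only: compute |a| without ever negating a signed int, so that
// a == INT_MIN does not trigger undefined behavior. Unsigned subtraction
// wraps, so 0u - (unsigned int)a is always well-defined.
static unsigned int magnitude (int a) {
  return (a < 0) ? 0u - (unsigned int)a : (unsigned int)a;
}

int main (void) {
  printf ("%u\n", magnitude (INT_MIN));  // prints 2147483648 with 32-bit int
  printf ("%u\n", magnitude (-5));       // prints 5
  return 0;
}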

This issue has been reported here.

Update: I checked CERT’s IntegerLib (download here) and it is seriously broken. Its own test suite changes behavior across -O0 … -O3 for each of Clang, GCC, and Intel CC. The undefined behaviors are:

UNDEFINED at <add.c, (24:12)> : Op: +, Reason : Signed Addition Overflow
UNDEFINED at <add.c, (24:28)> : Op: +, Reason : Signed Addition Overflow
UNDEFINED at <add.c, (29:12)> : Op: +, Reason : Signed Addition Overflow
UNDEFINED at <add.c, (43:14)> : Op: +, Reason : Signed Addition Overflow
UNDEFINED at <add.c, (43:30)> : Op: +, Reason : Signed Addition Overflow
UNDEFINED at <add.c, (47:14)> : Op: +, Reason : Signed Addition Overflow
UNDEFINED at <add.c, (61:12)> : Op: +, Reason : Signed Addition Overflow
UNDEFINED at <add.c, (61:28)> : Op: +, Reason : Signed Addition Overflow
UNDEFINED at <add.c, (65:15)> : Op: +, Reason : Signed Addition Overflow
UNDEFINED at <mult.c, (52:13)> : Op: *, Reason : Signed Multiplication Overflow
UNDEFINED at <mult.c, (69:47)> : Op: *, Reason : Signed Multiplication Overflow
UNDEFINED at <sub.c, (24:23)> : Op: -, Reason : Signed Subtraction Overflow
UNDEFINED at <sub.c, (28:12)> : Op: -, Reason : Signed Subtraction Overflow
UNDEFINED at <sub.c, (42:23)> : Op: -, Reason : Signed Subtraction Overflow
UNDEFINED at <sub.c, (46:12)> : Op: -, Reason : Signed Subtraction Overflow
UNDEFINED at <sub.c, (60:23)> : Op: -, Reason : Signed Subtraction Overflow
UNDEFINED at <sub.c, (64:12)> : Op: -, Reason : Signed Subtraction Overflow
UNDEFINED at <unary.c, (28:9)> : Op: -, Reason : Signed Subtraction Overflow
UNDEFINED at <unary.c, (37:9)> : Op: -, Reason : Signed Subtraction Overflow
UNDEFINED at <unary.c, (46:9)> : Op: -, Reason : Signed Subtraction Overflow

I cannot imagine this library is suitable for any purpose if you are using an optimizing compiler.

Update from Friday 9/23: I looked into the problems in IntegerLib in more detail. The different output across different compilers and optimization levels isn’t even due to the integer problems — it’s due to the fact that a couple of functions have no “return” statement, and therefore return garbage. Do not use IntegerLib! Use SafeInt.

Better Random Testing by Leaving Features Out

[I wrote this post, but it describes joint work, principally with Alex Groce at Oregon State.]

This piece is about a research result that I think is genuinely surprising — a rare thing. The motivating problem is the difficulty of tuning a fuzz tester, or random test case generator. People like to talk trash about random testing but what they usually mean is that random testing doesn’t work very well when it is done poorly. Applied thoughtfully and tuned well, random testers are often extremely powerful system-breaking devices.

Previous to our current work, my approach to tuning a random tester was basically:

  1. Aim for maximum expressiveness in test cases. In other words, maximize the features and combinations of features seen in test cases.
  2. Repeatedly inspect the test cases and their effect on the system under test. Use the results to tune the test case generator.

I even wrote a blog post about this last spring. Alex had a different idea which is basically:

For each test case, randomly omit a subset of the features that might appear.

The funny thing about this idea is that it doesn’t give us any test cases that we couldn’t generate before. All it does is change the distribution of test cases. For example, if we’re generating test cases for a stack ADT, pure random testing will eventually generate a test case with all “push” operations. But this may be very, very rare. On the other hand, swarm testing (what we are calling Alex’s idea) will generate a test case containing only push operations about once every 2^N tests, where N is the number of features. A probability of 1/2^N may not seem very high, but keep in mind that the number of features is generally not going to be very big compared to the number of choices made during generation of an individual test case.

The first nice thing about swarm testing is that it’s easy. Many random testers (including my most recent one, Csmith) already support omission of many features, and if not, it’s simple to add to any test case generator I can think of.
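To make this concrete, here is a minimal sketch of what per-test-case feature omission might look like for an imaginary stack-ADT fuzzer. None of this is code from Csmith or from our test scripts; the feature set and structure are made up for illustration.

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>

enum Feature { PUSH, POP, CLEAR, PEEK, NUM_FEATURES };
static const char *feature_name[NUM_FEATURES] = { "push", "pop", "clear", "peek" };

int main (void) {
  srand ((unsigned)time (0));
  const int NUM_TESTS = 5, OPS_PER_TEST = 10;
  for (int t = 0; t < NUM_TESTS; t++) {
    // swarm step: pick this test case's feature set up front, including
    // each feature with probability 1/2
    std::vector<Feature> enabled;
    for (int f = 0; f < NUM_FEATURES; f++)
      if (rand () % 2)
        enabled.push_back ((Feature)f);
    if (enabled.empty ())
      continue;  // empty feature set, skip this test case
    // emit the test case using only the enabled features
    printf ("test %d:", t);
    for (int i = 0; i < OPS_PER_TEST; i++)
      printf (" %s", feature_name[enabled[rand () % enabled.size ()]]);
    printf ("\n");
  }
  return 0;
}

With only four features, roughly one test case in sixteen is push-only, whereas picking each operation uniformly at random produces an all-push test of length k with probability (1/4)^k, which becomes vanishingly small for realistic test lengths.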

The second nice thing about swarm testing is that it works. We took a reasonably fast quad-core machine and let it spend a week trying to crash a collection of 17 C compilers for x86-64 (several versions of GCC, several versions of LLVM, and one version each of Intel CC, Sun CC, and Open64) using the latest and greatest Csmith. It found 73 distinct ways to crash these compilers. A “distinct way” means the compiler, while dying, emitted some sort of message that we can use to distinguish this kind of crash from other crashes. For example, here are two distinct ways that GCC 4.6.0 can be crashed:

  • internal compiler error: in vect_enhance_data_refs_alignment, at tree-vect-data-refs.c:1550
  • internal compiler error: in vect_create_epilog_for_reduction, at tree-vect-loop.c:3725

Sometimes the compiler isn’t so friendly as to emit a nice ICE message; rather, it prints something like “Segmentation fault.” In those cases we’re not able to distinguish different ways of producing segfaults, so we just count it once.

After using Csmith to find 73 distinct crash bugs, we ran another week-long test using swarm. This time, we told Csmith to randomly omit:

  • declaration of main() with argc and argv
  • the comma operator, as in x = (y, 1);
  • compound assignment operators, as in x += y;
  • embedded assignments, as in x = (y = 1);
  • the auto-increment and auto-decrement operators ++ and --
  • goto and labels
  • integer division
  • integer multiplication
  • long long integers
  • 64-bit math operations
  • structs
  • bitfields
  • packed structs
  • unions
  • arrays
  • pointers
  • const-qualified objects
  • volatile-qualified objects
  • volatile-qualified pointers

For each test case, each feature had a 50% chance of being included. In the swarm test, 104 distinct compiler crash bugs were found: an improvement of more than 40%. Moreover (I’ll spare you the details), the improvement is statistically significant at a very high confidence level. This is a really nice improvement, and we intend not only to bake swarm into Csmith’s default test scripts, but also to add support to Csmith for omitting a lot more kinds of features; the list above is pretty limited.

Why does swarm testing work? Our hypothesis is that there are two things going on. First, sometimes a test case just contains too many different features to be good at triggering bugs. For example, if I have a compiler bug that can only be triggered by highly complex mathematical expressions, then the presence of pointers, arrays, structs, etc. just gets in the way. The stuff clutters up the test case, making it hard to trigger the bug. The second thing that can happen is that a feature actively prevents a bug from being triggered. Consider the example outlined above where a stack ADT contains a bug that only manifests when the stack is full. If my test cases contain lots of pushes and pops, we’re going to have a hard time randomly walking all the way to a full stack. On the other hand, swarm testing will frequently create tests that only contain push operations. We’ve looked at a lot of swarm testing outcomes and found examples of both of these phenomena. I’ll post some more detailed results if I get a chance.
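As a concrete (and entirely made-up) illustration of the second effect, consider a bounded stack whose capacity check is off by one, as in the sketch below. A test case that mixes pushes and pops behaves like a random walk and rarely climbs all the way to the capacity, while a push-only swarm test reaches the buggy state after a fixed, small number of operations.

// hypothetical buggy bounded stack: the out-of-bounds write happens only
// when the stack is completely full, i.e. after CAP successful pushes
struct BoundedStack {
  static const int CAP = 64;
  int data[CAP];
  int top;
  BoundedStack () : top (0) {}
  void push (int v) {
    if (top > CAP)     // BUG: should be top >= CAP
      return;
    data[top++] = v;   // writes data[CAP] when top == CAP
  }
  void pop (void) {
    if (top > 0)
      top--;
  }
};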

The overall outcome of this line of research, Alex and I hope, is to take some of the pain out of developing a highly tuned fuzz tester. This post has focused on compilers because that’s the domain I know. We have another case study in the works, but Alex is managing that effort and I’m not going to publish his results. Rather, we’re working on a paper about this topic that we hope to submit in the near future.

Google House

Although I use Google Earth fairly often, I generally leave “3D buildings” turned off since my machines tend to have the crappiest possible graphics cards. But the other day I randomly turned it on and was surprised to find that Salt Lake City is now heavily populated with building models, even including some residential neighborhoods. The house above is the one that Sarah and I lived in between 2001 and 2008. At some level this model is poorly done: the front and side porches are rendered as blocks and the textures mostly just show the trees that prevented the Street View van from seeing the house. On the other hand, the model is impressive in that it got the sizes of the dormers right, both chimneys are there, and the swamp cooler (the block on the roof behind the dormers) is just about right. The rest of the houses in the neighborhood are similarly rendered.

At this point I started to wonder: Who is it that put a model of my old house into Google Earth? The model was uploaded by “buildsaltlake2,” an entity that lacks a Google profile but has uploaded more than 12,000 models. Clearly someone is systematically machine-generating and uploading models. Is this going on everywhere? Does anyone know the technology involved? It has to be based on pictures because any data on this 110-year-old house in the city offices would be sorely out of date, and wouldn’t know about the swamp cooler in any case. Also I assume there’s a profit motive, but don’t know what it is. Here’s a link to the view above.

Fuzzing Linux Kernel Modules?

I’ve been thinking about what would be the best way to fuzz-test a Linux kernel module, for example a filesystem. Of course this can be done in the context of a live kernel, but for a variety of reasons I’d prefer to run the LKM in user space. At the source level, the interface to an LKM seems a little hairy, but at the object level it is really simple. So, a reasonable approach would seem to be to write a user-space loader for compiled LKMs and then just call the object code directly. At that point it would become necessary to write a set of shims to support each class of device driver, and then fuzzing could start.

Anyway, I’m curious to see what people think about this idea before I go off and hack. I did some random web searching and didn’t turn up an existing implementation of this idea, though of course there are plenty of resources on testing various parts of Linux.

Online University

Yesterday someone in my department’s main office got a request from a student to receive credit for taking the now-infamous free online AI course from Stanford. It is routine for a university to award transfer credit for a course taken at a different school, but this case is trickier since a student taking the AI course isn’t enrolled at Stanford and doesn’t get credit there. This post — which will be disorganized because my thinking on this subject is not yet organized — looks at what the Stanford course, the Khan academy, MIT’s Open Courseware initiative, and related efforts might mean for the future of college education.

Will there be a single, best course filling every niche, and everyone just takes that course? The analogy I’d like to make is between online courses and textbooks. Most subject areas are not dominated by a single textbook, but there is usually a small collection of textbooks that, together, are used as a basis for probably 80% of the courses. Personally I’d much rather learn from a textbook than from an online course — listening to someone talk is exceptionally inefficient. Why didn’t mass-market textbooks wipe out universities sometime during the 20th century? Because, of course, taking a class adds value beyond what can be found in the book. This value takes many forms:

  • A course comes as part of a broader “college experience” that many people want.
  • A course is part of an eventual degree that serves as a kind of certification.
  • Instructors are available to provide additional help.
  • Putting classmates in close proximity creates a sense of community and at least in some cases promotes learning.
  • A course is often part of a curriculum that has been designed in an integrated way.
  • A course serves as a forcing function, making it more difficult to put off learning the material.

I think we have to accept the conclusion that universities as we understand them today will be destroyed more or less to the extent that these sources of value can be provided by online education.

Let’s look at a couple of extremes. First, a course like Calculus I — a big lecture course at most universities. The experience of trying to learn integration while sitting in a big, crowded lecture is so bad that watching the lecture online almost seems attractive. It’s not hard to imagine these courses going away over the next 20 years. There seem to be various possibilities for how this will play out. First, a big university could offer an online version of Calc I, but this is very inefficient because only a few hundred or thousand people take it each year. Rather, courses like this will be handled by a few large organizations (companies or forward-thinking universities) and most institutions will simply contract out Calc I by giving some fraction of the tuition to the course provider. Course providers will make money by scaling — first through outsourcing and increasingly through AI-based techniques for assisting and assessing students. My fear is that these online courses will suck very badly, in the same way that so many web applications suck today. However, realistically, the not-online Calc I course I took 20 years ago sucked too: lectures were boring and recitation was at 7:30am with a TA I truly could not understand.

At the other extreme from big service courses, we have things like the “Senior Software Project” course offered by many departments or the Android Projects class that I’m teaching now. These have a large amount of instructor involvement, a large amount of in-person interaction between students, grading cannot easily be automated, etc. I don’t want to say that online versions of these classes are impossible, but certainly they would have a very different character than the current versions. These courses represent the part of a college education that directly descends from the old apprenticeship system and it would — in the long run — be a big problem if this part of the college experience went away. Of course the most serious students would still apprentice themselves through hackerspaces, internships, and such — but people in for example the 50th through 80th percentiles would likely be left poorly prepared for their future careers by an online-only education.

The picture I am painting is that at least in the near term, universities and traditional college degrees survive, but some of their bread and butter gets eaten by highly scalable providers of low-level courses. There will be fierce competition among providers — similar to the current competition between textbook providers, but the stakes will be higher. As we move forward, some fraction of students — in particular, non-traditional students and those who otherwise don’t want the traditional college experience — will move towards online-only degree programs. At first these will provide an inferior education and therefore they will be sought out by students who just cannot make regular classes work, or who are primarily interested in a degree for its own sake. Perhaps, as time passes, telepresence and related technologies will manage to become solid enough that a real education can be attained online.

Mt Nebo

Over Labor Day weekend I climbed Mount Nebo, the highest point in the Wasatch Range, with Dave Hanscom and Bill Stenquist. Dave has run me into the ground before and Bill came close to winning the Wasatch 100 a couple of times — so I should have known something was up when we met in Mona, UT to set up a car shuttle, instead of meeting nearer to one of Nebo’s established trailheads. It turns out we were starting and ending the hike in Mona, leading to a climb with 6000 feet of net elevation gain and probably well over 7000′ total gain.

We started at the mouth of Pole Canyon and walked up a jeep road for a couple of miles, gaining about 2000 feet. Then, we bushwhacked another 3000 vertical feet until we ran across a good trail on the shoulder of North Peak. From there we climbed to North Nebo and then started the exposed, mile-long class 2/3 traverse to South Nebo. After that it was a pleasant but surprisingly long walk out Willow Creek to the other car. This was an awesome hike, but a strenuous one, and four liters of water was barely adequate.