Memory Safe C/C++: Time to Flip the Switch

For a number of years I’ve been asking:

If the cost of memory safety bugs in C/C++ codes is significant, and if solutions are available, why aren’t we using them in production systems?

Here’s a previous blog post on the subject and a quick summary of the possible answers to my question:

  • The cost of enforcement-related slowdowns is greater than the cost of vulnerabilities.
  • The cost due to slowdown is not greater than the cost of vulnerabilities, but people act like it is because the performance costs are up-front whereas security costs are down the road.
  • Memory safety tools are not ready for prime time for other reasons, like maybe they crash a lot or raise false alarms.
  • Plain old inertia: unsafety was good enough 40 years ago and it’s good enough now.

I’m returning to this topic for two reasons. First, there’s a new paper SoK: Eternal War in Memory that provides a useful survey and analysis of current methods for avoiding memory safety bugs in legacy C/C++ code. (I’m probably being dense but can someone explain what “SoK” in the title refers to? In any case I like the core war allusion.)

When I say “memory safety” I’m referring to relatively comprehensive strategies for trapping the subset of undefined behaviors in C/C++ that are violations of the memory model and that frequently lead to RAM corruption (I say “relatively comprehensive” since even the strongest enforcement has holes, for example due to inline assembly or libraries that can’t be recompiled). The paper, on the other hand, is about a broader collection of solutions to memory safety problems including weak ones like ASLR, stack canaries, and NX bits that catch small but useful subsets of memory safety errors with very low overhead.
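To make the distinction concrete, here is a minimal C example (hypothetical, not taken from the paper) of the kind of undefined behavior that comprehensive memory safety enforcement is meant to trap at the point of the violation, rather than merely making later exploitation harder:

    #include <stdlib.h>
    #include <string.h>

    /* A heap buffer overflow: undefined behavior that violates the memory
       model.  Comprehensive enforcement traps the out-of-bounds write itself;
       weak mitigations like ASLR, stack canaries, and NX bits only make the
       resulting corruption harder to exploit. */
    int main(void) {
      char *buf = malloc(16);
      if (!buf)
        return 1;
      strcpy(buf, "this string does not fit in a sixteen-byte buffer"); /* OOB write */
      free(buf);
      return 0;
    }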

The SoK paper does two things. First, it analyzes the different pathways that begin with an untrapped undefined behavior and end with an exploit. This analysis is useful because it helps us understand the situations in which each kind of protection is helpful. Second, the paper evaluates a collection of modern protection schemes along the following axes:

  • protection: what policy is enforced, and how effective is it at stopping memory-based attacks?
  • cost: what is the resource cost in terms of slowdown and memory usage?
  • compatibility: does the source code need to be changed? does it need to be recompiled? can protected and unprotected code interact freely?

As we might expect, stronger protection generally entails higher overhead and more severe compatibility problems.

The second reason for this post is that I’ve reached the conclusion that 30 years of research on memory safe C/C++ should be enough. It’s time to suck it up, take the best available memory safety solution, and just turn it on by default for a major open-source OS distribution such as Ubuntu. For those of us whose single-user machines are quad-core with 16 GB of RAM, the added resource usage is not going to make a difference. I promise to be an early adopter. People running servers might want to turn off safety for the more performance-critical parts of their workloads (though of course these might be where safety is most important). Netbook and Raspberry Pi users probably need to opt out of safety for now.

If the safe-by-default experiment succeeded, we would have (for the first time) a substantial user base for memory-safe C/C++. There would then be an excellent secondary payoff in research aimed at reducing the cost of safety, increasing the strength of the safety guarantees, and dealing with safety exceptions in interesting ways. My guess is that progress would be rapid. If the experiment failed, the new OS would fail to gain users and the vendor would have to back off to the unsafe baseline.

Please nobody leave a comment suggesting that it would be better to just stop using C/C++ instead of making them safe.

Reading Code

Reading code is an important skill that doesn’t get enough emphasis in CS programs. There are three main aspects:

  • the external view of the code: documentation, comments, APIs, white papers, information from developers, etc.
  • the static view: reading the code like a book
  • the dynamic view: reading the code as it executes, probably with help from a debugging tool

Of course these aren’t totally exclusive. For example, reading code like a book has a dynamic aspect because our brains serve as crude interpreters.

External View

Ideally we’ll be able to focus on a module that is small enough to understand all at once and that has relatively clean interfaces to the rest of the system. If the code seems too big or complicated to understand as a whole, perhaps it can be broken up. If the modularity that we want just isn’t there, we may be stuck trying to understand a piece of functionality that is buried in an ocean of complexity—not very fun.

We should be explicit about our goals. Are we reading the code for general enlightenment? In order to decide whether to hire the person who wrote the code? In order to begin refactoring or adding functionality? To look for bugs? Keep in mind that if we’re looking for bugs, code reviews are a whole separate thing.

Before reading any code we’ll certainly want to look over any documentation and also skim the comments in the code: sometimes there’s halfway-decent documentation hiding in the middle of a big source file. If the code is very well-known, like the Linux kernel, there’ll be plenty of books and web pages for us to look at. If not, perhaps there’s a specification, a white paper, a README, or similar. Often, even if there’s no documentation, the code will be implementing known algorithms that we can brush up on. Some kinds of domain-specific codes (signal processing, feedback control, storage management) will be very hard to understand if we lack basic domain knowledge, so again we may need to hit the books for a little while before getting back to the code.

Often it’s a good idea to build and run the code before starting to read it, or at least be sure that someone else has done it. This is because we often run across attractive-looking open source code on the web that isn’t worth reading because it’s not even finished—we might as well discover that fact as early as possible. Just the other day I got suckered by a Github project that promised to do exactly what I wanted, but that was badly broken.

Static View

Reading code is far easier if we come into it with an understanding of the patterns that are being used. Therefore, we can expect that in any given domain the first few pieces of code we read are going to require a lot of work and after that things should be easier. The tricky thing about patterns is that sometimes they are sitting there on the surface (like gotos in kernel code or explicit reference counts) but other times they are buried deeply enough that a lot of digging is needed in order to uncover them. Making matters worse, it’s not uncommon to see badly implemented patterns that the author of the code didn’t even realize they were trying to implement. In systems code we sometimes see code that is half-assedly transactional. Also, see Greenspun’s 10th Rule of Programming.
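As a small illustration of a surface-level pattern, here is the kernel-style goto-for-cleanup idiom mentioned above; the function and resources are made up, but the shape is what you learn to recognize:

    #include <stdio.h>
    #include <stdlib.h>

    /* Kernel-style error handling: each failure jumps to a label that
       unwinds exactly the resources acquired so far. */
    static int do_work(const char *path) {
      int ret = -1;
      FILE *f;
      char *buf;

      f = fopen(path, "r");
      if (!f)
        goto out;
      buf = malloc(4096);
      if (!buf)
        goto out_close;

      /* ... the real work using f and buf would go here ... */
      ret = 0;

      free(buf);
    out_close:
      fclose(f);
    out:
      return ret;
    }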

It is important to learn to recognize code that contains little information. Java seems to be particularly prone to this (for example I assume this class is an obscure joke, but perhaps not…) but all languages have it. A pernicious kind of low-information code occurs in C where the verbose code is actually easy to get wrong, as in the well-known bug struct foo *x = (struct foo *) malloc (sizeof (struct foo *)).
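To spell out that bug: the verbose allocation idiom makes it easy to ask for the size of the pointer type instead of the pointed-to type, and the cast adds nothing. A sketch of the broken and the idiomatic versions:

    #include <stdlib.h>

    struct foo { int a, b, c; };

    void example(void) {
      /* Buggy: allocates room for a pointer (typically 8 bytes),
         not for a struct foo. */
      struct foo *x = (struct foo *) malloc(sizeof(struct foo *));

      /* Idiomatic: no cast, and the size is derived from the object being
         allocated, so the two cannot drift out of sync. */
      struct foo *y = malloc(sizeof *y);

      free(x);
      free(y);
    }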

One way to start reading a piece of code is to create an annotated call graph. If tool support is available then great, but if not this isn’t too much trouble to do by hand. Also, a by-hand callgraph is a good way to start getting a general feel for the code. Annotations on the call graph might include:

  • potential trouble spots: extra-snarly code, inline assembly, code containing comments such as “you are not expected to understand this”
  • entry points to the module we’re reading, and exit points from it
  • resource allocations and deallocations
  • error-handling paths
  • accesses to important mutable global state

Putting together a good static callgraph may not be so easy if the code is functional, object-oriented, or event-driven, or if it uses function pointers. In this case building the callgraph may become a dynamic problem.

With callgraph in hand (or without it) we should try to get a sense of the code’s control flow structure. Is it a library? An event loop? One out of a stack of layers? Does it use threads, and how? What sort of error handling does it use?

What are the main data structures used by the code we’re looking at? Which of these are shared with callers or callees? Where are they allocated, freed, and modified? What are the crucial data invariants? Are the data structures (and their algorithms) basically the textbook versions or are there interesting quirks in the implementations?

Dynamic View

Strictly speaking, we don’t need to run code in order to understand it. In practice, being able to run code is a lifesaver for several reasons. First, an actual execution follows a single path through the code, permitting us to ignore code not touched on that path. Second, if the computer is executing the code then the interpreter in our brains can take a rest and we can focus on other things. Third, if we have formed a bad hypothesis about the code, running the code is a good way to conclusively refute that hypothesis.

What is the dynamic view in practice? We can use a debugger to single-step through code, we can set breakpoints and watchpoints, we can add debugging printouts, we can add assertions corresponding to conjectured invariants, and we can write unit tests for the code to make sure we really understand it.
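For example, suppose we conjecture that a list keeps a cached length in sync with its contents; a throwaway assertion (everything in this sketch is hypothetical) lets the program confirm or refute the conjecture for us as it runs:

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical list node, used only to illustrate asserting a
       conjectured invariant while reading unfamiliar code. */
    struct node {
      struct node *next;
      size_t len; /* conjecture: number of nodes from here to the end, inclusive */
    };

    static void check_len_invariant(const struct node *head) {
      size_t n = 0;
      for (const struct node *p = head; p; p = p->next)
        n++;
      /* If the conjecture is wrong, running the code refutes it immediately. */
      assert(head == NULL || head->len == n);
    }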

Finally, it’s often a good idea to change the code: add a bit of functionality, fix a bug, do some refactoring, etc. If we’ve successfully understood the code, we’ll be able to do this without too much trouble.

Conclusion

This has been a bit of a brain dump, not a checklist as much as a collection of things to keep in mind when starting out on a code-reading project. I’d be happy to get suggestions for improvement.

Labyrinth Rims

The Green River’s Labyrinth Canyon begins south of the town of Green River UT; the Labyrinth Rims refers to the area of BLM land on either side of this canyon. We spent four days in this somewhat isolated area of the San Rafael Desert without seeing any other people except for a group in the now-popular Bluejohn Canyon. Although this area is only about 30 miles from Moab, it’s at least 1.5 hours away by car.

On the first day of this trip we explored upper Keg Spring Canyon, which was pretty but a bit nondescript as canyons go, which perhaps explains why we saw no evidence of recent human traffic. We camped for three nights off a 4wd track near the head of Keg Spring Canyon.

Generally this campsite was perfect, with a bit of shelter from the wind, soft dirt to sleep on, and a large area of flat slickrock for parking and cooking. However, on our last night some cows moved through and one of them woke me up around 4am by licking my tent. In a befuddled state I tried to scare it off without scaring it so much that it trampled me and the tent.

On the second day of this trip we drove down to Bluejohn Canyon. Although this is most often done as a technical canyoneering route, much of the canyon can be hiked. The hiking route, however, crosses some confusing terrain before dropping into the canyon; map and compass are definitely needed here.

The main fork of Bluejohn is maybe the best slot canyon I’ve seen so far. Here, a chockstone some 20′ above the canyon bottom has a bunch of tree limbs and other debris jammed around it due to flash floods.

This is another of Bill’s pictures; he’s a better photographer than I am and shooting in slot canyons is not so easy due to the enormous dynamic range. A lot of slot canyons have a sense of intimacy; Bluejohn is more about grandeur and this photo captures that aspect nicely. For reference, I believe the chockstone is the same 20′ one from the previous picture.

There were plenty of minor obstacles like this before we finally got stopped by technical climbing.

A little bit of snow and a little bit of sun.

Here’s John Veranth in the “cathedral” section of Bluejohn where it’s maybe 150′ deep and only 1-2′ wide at the top. For a very short time, sun makes it to the bottom.

Anyway, Bluejohn was fantastic. The next day we explored the area around Bowknot Bend, a huge meander in the Green River. Like many things in this part of the world, it was named by John Wesley Powell or another member of his expedition. One interesting feature in this area is a large arch with five openings.

This is Labyrinth Canyon with the La Sal mountains beyond. We didn’t see any boaters on the Green, perhaps March is too early.

Finally we are overlooking Bowknot Bend. If you’re floating the river, it’s 7 miles from the left side of this photo to the right side. Although it’s only ~4 miles hiking each way to this overlook from the end of the 4wd track, we made a full day out of it by taking a couple of detours and a couple of wrong turns.

On our last day we explored Moonshine Wash, another slot canyon. This one is supposed to be nontechnical but after downclimbing a couple of short drops like the one shown here, we got stopped by a 20′ drop that did not look super easy to reverse.

Moonshine is really pretty.

A sheep bridge crosses Moonshine Wash near its deepest, narrowest point. Given my general dislike of heights (probably more than 100′ to the bottom here) this is as close as I was able to get to the bridge, which is generally falling apart and unsafe.

The Labyrinth Rims area is relatively easy to get to given how remote it feels. Much of this area could be explored with a 2wd vehicle, although in that case you’d end up doing more walking.

Fuzzers Need Taming

[This post explains a paper that we recently made available; it’s going to be presented at PLDI 2013.]

Random testing tools, or fuzzers, are excellent at finding bugs that human testers miss. A particularly important use case for fuzzing is finding exploitable bugs, and companies such as Google use clusters to do high-throughput fuzzing. Whether you have a cluster or just a laptop, the usual workflow for fuzzing is something like this:

  1. Build the latest version of the code.
  2. Start the fuzzer and then go home for the night or the weekend.
  3. Later on, report a bug for each important new issue discovered by the fuzzer.

The problem is that step 3 can be a lot of work if, for example, the fuzzer discovers a few thousand test cases that cause failures. The situation is a bit complicated and next I’ll try to walk through some of the issues.

First, it is typical for some bugs to be triggered thousands of times more often than other bugs. Therefore, if you randomly choose a handful of failure-triggering test cases to look at, they are likely to all trigger the same bug or the same small set of bugs. A corollary is that in a fuzzing run of any length, there will be bugs that are triggered only once. (Clearly this cannot be true in the limit since there can only be a finite number of bugs in any code base. But it holds true for surprisingly long runs.) This graph shows how many times each of a collection of bugs was triggered, for three kinds of bugs, due to fuzzing GCC 4.3.0 and SpiderMonkey 1.6 for a few days:

[Figure: number of times each bug was triggered, showing a power-law distribution]

Second, the job of triaging failure-triggering test cases can be made much easier if test case reduction is used. For example, here’s a fairly generic test-case reduction tool and here’s my group’s domain-specific tool that does a much better job for test cases that trigger bugs in C/C++ compilers. It is possible (though this wasn’t our experience) that a sufficiently good test case reduction tool makes it easy to examine and categorize every failure-triggering test case discovered by a fuzzer.

Third, if we can always fix bugs rapidly, then we can simply fix the first bug found by the fuzzer and then re-run the fuzzer. This is easy to do for small projects, but completely impractical for large projects that often have hundreds or thousands of open bugs at any given time, many of which are non-critical and therefore aren’t going to get fixed in the short term.

Fourth, it is sometimes the case that each bug has an unambiguous symptom. For example, if we have a bug that breaks a data structure’s invariants that is closely followed by an assertion that catches the bug, then we’ll reliably get a message such as “assertion violation at line 712 of foo.c” when this bug executes. If this is the case for all of the bugs in our code base, then we can rapidly find the true set of bugs by doing something simple like “grep Assert fuzzer-output/* | sort | uniq”. On the other hand, in practice there are often many bugs that result in ambiguous symptoms such as crashes or arbitrary incorrect output.

To summarize the issues so far: the problem of triaging a collection of test cases that trigger bugs may be easy or difficult depending on a number of factors. If this is easy, then great, you don’t need our work. If it is hard, then this can make fuzzing a lot less attractive. We have noticed this ourselves and also we heard from Csmith users who decided to stop using the tool not because it stopped finding bugs, but rather because it kept finding bugs that were already known and weren’t going to get fixed in the short term. So we decided to look for an automated solution to this problem:

Fuzzer taming problem: Given a potentially large collection of test cases, each of which triggers a bug, rank them in such a way that test cases triggering distinct bugs are early in the list.

Sub-problem: If there are test cases that trigger bugs previously flagged as undesirable, place them late in the list.

The obvious approach is to use clustering to divide the test cases into groups and then to hand the user one test case from each cluster (or, to solve the sub-problem, one bug from each cluster not containing a test case that triggers an undesirable bug). We tried this and it worked ok, but the approach was slow, it seemed overly complicated, and the clustering approach has no built-in way to perform a ranking among clusters. Moreover, we got the sense (but never really proved) that the clustering software we were using was probably trying to deal with outliers in some graceful way. On the other hand, the power-law distribution of bugs triggered by random test cases means that the outliers are likely to be interesting.

Suresh pointed us to Clustering to Minimize the Maximum Intercluster Distance, which presents the technique we ended up using: rank test cases so that each test case in the list is the one that maximizes the distance to the closest previously listed test case. In other words, each test case is the one that is least similar to any test case listed so far. This is called the furthest point first (FPF) ordering, and it can be used to create a clustering if you run FPF for a while and then form clusters by adding each unlisted point to the cluster defined by the nearest listed point. For our purposes clustering is uninteresting: we just want a ranking. Therefore, we run FPF all the way to the end, giving a list that contains diverse test cases at the head and mostly very similar test cases at the end.
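Here is a minimal sketch of FPF ranking in C, assuming some distance function dist() over test cases has already been chosen; the types and names are made up and the paper’s implementation surely differs in detail:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    typedef struct testcase testcase;        /* opaque test case, for illustration */
    double dist(const testcase *a, const testcase *b); /* assumed distance function */

    /* Fill rank[0..n-1] with indices into tc[], most diverse test cases first. */
    void fpf_rank(const testcase *tc[], size_t n, size_t rank[]) {
      if (n == 0)
        return;
      bool *listed = calloc(n, sizeof *listed);
      double *d = malloc(n * sizeof *d);     /* distance to nearest listed test case */

      rank[0] = 0;                           /* arbitrary seed */
      listed[0] = true;
      for (size_t i = 0; i < n; i++)
        d[i] = dist(tc[i], tc[0]);

      for (size_t k = 1; k < n; k++) {
        size_t best = 0;
        double bestd = -1.0;
        for (size_t i = 0; i < n; i++)       /* pick the point furthest from the list */
          if (!listed[i] && d[i] > bestd) { bestd = d[i]; best = i; }
        rank[k] = best;
        listed[best] = true;
        for (size_t i = 0; i < n; i++) {     /* update nearest-listed distances */
          double di = dist(tc[i], tc[best]);
          if (di < d[i]) d[i] = di;
        }
      }
      free(d);
      free(listed);
    }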

So far I have left out something important: how to compute the distance between two test cases. This is the input to a clustering algorithm or to the FPF computation. The short answer is that we tried a variety of distance functions ranging from Levenshtein distance (or edit distance: the number of character additions, deletions, and alterations that suffices to change one string into another) over test case text to Euclidean distance over vectors of code coverage data. Why try multiple distance functions? This is because bugs are diverse and we don’t have any reason to think that we can come up with the one true distance function that accurately captures test case similarity for all bugs. The paper contains more details.
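For reference, here is the textbook dynamic-programming computation of Levenshtein distance over test case text (nothing in this sketch is specific to our tool):

    #include <stdlib.h>
    #include <string.h>

    /* Edit distance between two test cases viewed as strings,
       using a single-row DP table: O(mn) time, O(n) space. */
    size_t levenshtein(const char *a, const char *b) {
      size_t m = strlen(a), n = strlen(b);
      size_t *row = malloc((n + 1) * sizeof *row);
      for (size_t j = 0; j <= n; j++)
        row[j] = j;
      for (size_t i = 1; i <= m; i++) {
        size_t prev = row[0];            /* D[i-1][j-1] */
        row[0] = i;
        for (size_t j = 1; j <= n; j++) {
          size_t cur = row[j];           /* D[i-1][j] */
          size_t subst = prev + (a[i - 1] != b[j - 1]);
          size_t del = row[j] + 1;       /* delete a[i-1] */
          size_t ins = row[j - 1] + 1;   /* insert b[j-1] */
          size_t best = subst < del ? subst : del;
          row[j] = best < ins ? best : ins;
          prev = cur;
        }
      }
      size_t result = row[n];
      free(row);
      return result;
    }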

The way to evaluate something like a fuzzer tamer is with a discovery curve: a graph that shows how quickly a particular ranking, when its items are inspected in order, exposes one member of each class of objects being examined. Here are some discovery curves for SpiderMonkey bugs:

[Figure: discovery curves for SpiderMonkey bugs]

And here are the results for some bugs that cause GCC 4.3.0 to generate incorrect object code:

[Figure: discovery curves for GCC 4.3.0 wrong-code bugs]

The upshot is that the baseline—looking at deduplicated reduced test cases in random order—is not very attractive and that you can discover all bugs much faster using fuzzer taming. The most interesting part of each graph is at the left edge because we would expect a human to inspect maybe 10 or 15 test cases, not 500 or 700. In that region, the tamed results will show you about twice as many bugs as you’d see by examining test cases in random order (some more detailed graphs are in the paper). The curves for clustering in these graphs represent our best attempt to get a ranking out of x-means, a modern clustering algorithm. Of course it is possible that some other clustering technique would do better.
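For concreteness, once ground-truth bug labels are available for the ranked test cases (however those labels were obtained), computing a discovery curve is straightforward; this sketch just counts distinct labels seen so far:

    #include <stdbool.h>
    #include <stddef.h>

    /* bug_label[k] is the bug triggered by the k-th test case in ranked order.
       After the call, curve[k] holds the number of distinct bugs discovered
       by inspecting the first k+1 test cases. */
    void discovery_curve(const int bug_label[], size_t n, size_t curve[]) {
      size_t distinct = 0;
      for (size_t k = 0; k < n; k++) {
        bool seen = false;
        for (size_t j = 0; j < k; j++)
          if (bug_label[j] == bug_label[k]) { seen = true; break; }
        if (!seen)
          distinct++;
        curve[k] = distinct;
      }
    }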

Our ranking approach was inspired by similar approaches that are used to rank static analysis results. Notice that they also show bug discovery curves.

So what do you do with a fuzzer taming tool? My idea is to bolt together Csmith, C-Reduce, and the fuzzer tamer into a nice integrated package. Someone who wants to fuzz a compiler should be able to build the package, point it at a compiler (or better yet, the compiler’s repo) and then every day or so, glance at a web page that shows them interesting, newly discovered defects in the compiler. The toolchain will look something like this:

[Figure: the proposed fuzzing toolchain]

My guess is that other fuzzing workflows can also benefit from this sort of integration. We’d appreciate feedback on this material. The work described in this post was done by my group in conjunction with Alex Groce and others at Oregon State University.

Stochastic Superoptimization

“Stochastic Superoptimization” is a fancy way to say “randomized search for fast machine code.” It is also the title of a nice paper that was presented recently at ASPLOS. Before getting into the details, let’s look at some background. At first glance the term “superoptimization” sounds like nonsense because the optimum point is already the best one. Compiler optimizers, however, don’t tend to create optimal code; rather, they make code better than it was. A superoptimizer, then, is either an optimizer that may be better than regular optimizers or else a code generator with optimality guarantees.

How can a superoptimizer (or a skilled human, for that matter) create assembly code that runs several times faster than the output of a state of the art optimizing compiler? This is not accomplished by beating the compiler at its own game, which is applying a long sequence of simple transformations to the code, each of which improves it a little. Rather, a superoptimizer takes a specification of the desired functionality and then uses some kind of systematic search to find a good assembly-level implementation of that functionality. Importantly, the output of a superoptimizer may implement a different algorithm than the one used in the specification. (Of course, a regular optimizer could also change the algorithm, but it would tend to do so via canned special cases, not through systematic search.)

Over the last 25 years a thin stream of research papers on superoptimization has appeared:

  • The original superoptimizer paper from 1987 describes a brute-force search for short sequences of instructions. Since candidate solutions were verified by testing, this superoptimizer was perfectly capable of emitting incorrect code. Additionally, only very short instruction sequences could be generated.
  • The Denali superoptimizer works on an entirely different principle: the specification of the desired functionality, and the specification of the available instructions, are encoded as a satisfiability problem. Frustratingly, both the original paper and the followup paper on Denali-2 are short on evidence that this approach actually works. As idea papers, these are great. As practical compiler papers, I think they need to be considered to be negative results.
  • A more practical superoptimizer was described in 2006. This one is modeled after the “peephole” passes found in all optimizing compilers, which perform low-level (and often architecture-specific) rewrites to improve the quality of short sequences of instructions or IR nodes. Although using exhaustive search makes it expensive to discover individual rewrites, the cost is amortized by storing rewrites in a fast lookup structure. Another cool feature of this work was to use testing to rapidly eliminate unsuitable code sequences, but then to verify the correctness of candidate solutions using a SAT solver. Verifying equivalence of already-discovered code is much more practical than asking the SAT solver to find a good code sequence on its own. The punchline of this paper was that the peephole superoptimizer could sometimes take unoptimized code emitted by GCC and turn it into code of about -O2 quality. Also, without any special help, it found interesting ways to use SSE instructions.

Ok, enough background. There’s more but this is the core.

The new paper on stochastic superoptimization, which is from the same research group that produced the peephole superoptimizer, throws out the database of stored optimizations. However, that aspect could easily be added back in. The idea behind STOKE (the stochastic superoptimizer tool)—that it’s better to sparsely search a larger region of the search space than to densely search a smaller region—is the same one that has been responsible for a revolution in computer Go over the last 20 years. The punchline of the new paper is that STOKE is sometimes able to discover code sequences as good as, or a bit better than, the best known ones generated by human assembly language programmers. Failing that, it produces code as good as GCC or Intel CC at a high optimization level. These results were obtained on a collection of small bit manipulation functions. STOKE optimizes x86-64 code, and it is only able to deal with loop-free instruction sequences.

At a high level, using randomized search to find good code sequences is simple. However, the devil is in the details and I’d imagine that getting interesting results out of this required a lot of elbow grease and good taste. Solving this kind of optimization problem requires (1) choosing a mutation operator that isn’t too stupid, (2) choosing a fitness function that is pretty good, and (3) making the main search loop fast. Other aspects of randomized search, such as avoiding local maxima, can be handled in fairly standard ways.

STOKE’s mutation operator is the obvious one: it randomly adds, removes, modifies, or reorders instructions. Part of the fitness function is also obvious; it is based on an estimate of performance supplied by the x86-64 interpreter. The less obvious part of STOKE’s fitness function is that it is willing to tolerate incorrect output: a piece of wrong code is penalized by the number of bits by which its output differs from the expected output, and also by erroneous conditions like segfaults. To make each iteration fast, STOKE does not try to ensure that the code sequence is totally correct, but rather (just like the original superoptimizer) runs some test cases through it. However, unlike the original superoptimizer, STOKE will never hand the user a piece of wrong code because a symbolic verification method is applied at the end to ensure that the optimized and original codes are equivalent. This equivalence check is made much easier by STOKE’s insistence on loop-freedom. STOKE’s speed is aided by the absence of loops and also by a highly parallelized implementation of the randomized search.
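To make the shape of the fitness function concrete, here is a highly simplified sketch in C (not STOKE’s actual code; run_candidate() and latency_estimate() are stand-ins for its x86-64 interpreter and performance model):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct candidate candidate;  /* a candidate code sequence, for illustration */
    uint64_t run_candidate(const candidate *c, uint64_t input, int *fault);
    double latency_estimate(const candidate *c);

    static int popcount64(uint64_t x) {
      int n = 0;
      while (x) { n += (int)(x & 1); x >>= 1; }
      return n;
    }

    /* Lower is better.  Wrong outputs are penalized by the number of bits that
       differ from the expected output; faults (e.g., a segfault detected by the
       interpreter) get the maximum per-test penalty.  This lets "almost correct"
       candidates score better than wildly wrong or crashing ones, which is what
       allows the random walk to make progress. */
    double cost(const candidate *c, const uint64_t inputs[],
                const uint64_t expected[], size_t ntests, double w_correct) {
      double penalty = 0.0;
      for (size_t i = 0; i < ntests; i++) {
        int fault = 0;
        uint64_t out = run_candidate(c, inputs[i], &fault);
        penalty += fault ? 64.0 : (double)popcount64(out ^ expected[i]);
      }
      return w_correct * penalty + latency_estimate(c);
    }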

The STOKE authors ran it in two modes. First, it was used to improve the quality of code generated by Clang at -O0. This worked, but the authors found that it was unable to find better, algorithmically distinct versions of the code that were sometimes known to exist. To fix this, they additionally seeded STOKE with completely random code, which they tried to evolve into correct code in the absence of a performance constraint. Once the code became correct, it was optimized as before. The fact that this worked—that is, it discovered faster, algorithmically distinct code—for 7 out of 28 benchmarks, is cool and surprising and I’d imagine this is why the paper got accepted to ASPLOS.

Are superoptimizers ever going to be a practical technology? Perhaps. One can imagine them being used in a few ways. First, a system such as STOKE (with modest improvements such as handling loops) could be used by library developers to avoid writing hand-tuned assembly for math kernels, media CODECs, compression algorithms, and other small, performance-critical codes that are amenable to tricky optimizations. The advantage is that it is easy to tell STOKE about a new instruction or a new chip (a new target chip only requires a new cost function; a new target architecture is much more difficult). This goal could be achieved in a short time frame, though I suspect that developers enjoy writing assembly language enough that it may be difficult to persuade them to give it up.

The second, and more radical way that a superoptimizer could be used is to reduce the size of a production-grade optimizing compiler by removing most of its optimization passes. What we would end up with is a very simple code generator (TCC would be a good starting point) and a database-backed superoptimizer. Actually, I’m guessing that two superoptimizers would be needed to achieve this goal: one for the high-level IR and one for the assembly language. So perhaps we would use Clang to create LLVM code, superoptimize that code, use an LLVM backend to create assembly, and then finally superoptimize that assembly. If the optimization database lived in the cloud, it seems likely that a cluster of modest size could handle the problem of superoptimizing the hottest collections of IR and assembly code that everyone submits. Drawbacks of this approach include:

  • Novel code will either be optimized very poorly or very slowly.
  • Compilation predictability suffers since the cloud-based peephole database will constantly be improving.
  • You can’t compile without a network connection.

It remains to be seen whether we want to do this sort of thing in practice. But if I were a compiler developer, I’d jump at a chance to substantially reduce the complexity, maintenance requirements, and porting costs of my toolchain.

Update from 3/9/13: A strength of this kind of superoptimizer that I forgot to mention is that it can be applied to code compiled from any language, not just C/C++. So let’s say that I’ve written a stupid little x64 generator for my domain-specific language. Now if I have a database-backed superoptimizer I should be able to get roughly -O2 quality binaries out of it. Whether this kind of approach is better or worse than just compiling to LLVM in the first place, I don’t know.