Skip to content

Foothill Sunset

I went for a hike last night to celebrate being out from under whatever virus made me more or less sick for most of the last month.

The foothill wildflowers are more subdued than the ones that will cover the big mountains in July and August.

Four mountain ranges and the Great Salt Lake.

Looking down at the university.

I stopped a little below 7000′ and sat on a rock for 45 minutes waiting for sunset.

It was full night by the time I got home; should have brought a headlamp!

Procedural Decomposition

While teaching a CS class I spend quite a bit of time looking over the shoulders of students whose code doesn’t work. Sometimes they have a simple mistake and I’ll either point it out or ask a question that will lead them to the problem. However, other times the code is just generally not very good and I’ve always sort of struggled to find the right things to say; it’s hard because it’s usually a matter of judgement and taste.

Recently I’ve started to notice a pattern where students who are having trouble getting their code to work often have not done enough procedural decomposition. That is, they have failed to appropriately break up a large, difficult programming problem into a collection of smaller ones. You are probably thinking to yourself that this is pretty basic stuff, and perhaps it is when we’re solving easy problems. On the other hand, I’ve come to believe that procedural decomposition for hard problems is probably one of those skills that we can keep getting better at for as long as we care to keep writing code.

An Example

In my operating systems class this spring, I asked the students to do a bit of hacking on Linux’s driver for the Minix filesystem. The Minix FS is a reasonable starting point because it’s about as simple as realistic filesystems get. The assignment was to change the code so that instead of representing a file as a tree of disk block numbers, a file is represented as a list of extents. An extent is a collection of contiguous disk blocks that can be represented as a (base, length) pair. For purposes of this assignment, we make the unrealistic assumption that indirect extents are not needed.

The crux of this assignment was rewriting the Minix FS driver’s get_block() function, which finds a block of a file on the disk and then maps it into the block cache; this is nontrivial because the requested block must be created if it doesn’t already exist (and a “create” flag is set). My first attempt at get_block() — as well as many students’ first attempts — was a monolithic function. However, the logic turns out to be just hairy enough that this is difficult to get right. On the other hand, the job can be broken down into several pieces:

  • tiny utility functions for packing and unpacking the (base, length) pairs that are used to represent extents on disk
  • a lookup function that finds a block of a file in the extent list if it exists, and fails otherwise
  • an “extend file” function that makes a file bigger either by one block or by a specified number of blocks

Once these are available, get_block() needs only about a dozen lines of code. The tricky thing isn’t figuring out this particular decomposition — which is kind of obvious — but rather recognizing that it’s time to think about breaking up the problem rather than attacking it all at once. The temptation to keep banging on a function that seems to be almost correct is strong.

What’s Going On?

I asked some colleagues: Are we teaching procedural decomposition? And are we doing a good job? Well, of course we are teaching this and in fact one of the people who does that here, Matthew Flatt, is an author of How To Design Programs, one of the best books I know of for teaching abstraction and decomposition (and it’s free). But even so, the fact remains that in upper-division classes I’m getting a good number of students who are not performing procedural decomposition even in a situation where it is clearly necessary.

Since I’ve started thinking about procedural decomposition more, I’ve noticed that I don’t do enough of it either. When I catch myself writing poorly abstracted code, I try to analyze what I’m thinking and usually it boils down to a mixture of laziness (“if I can just get this stupid function to work, I won’t have to figure out how to break it up into pieces”) and overconfidence in my ability to deal with a complex case analysis all at once. I’m guessing that similar things are going on in students’ heads.

The purpose of the rest of this piece is to take a closer look at procedural decomposition and how we might be able to do a better job teaching it. But first, a quick note about terminology: the thing I’m calling procedural decomposition is also sometimes referred to as procedural abstraction, functional abstraction, or functional decomposition.

Reuse Isn’t the Main Thing

I think a lot of us were taught that the main reason we break code into functions is to facilitate reuse. In this view, we primarily write quicksort as a separate function so that we can call it from multiple locations in our program. It’s usually not too hard to notice when this kind of decomposition is necessary: we just get tired of writing quicksort from scratch after the second or third time. In contrast, procedural decomposition of a single problem is more difficult because it’s not as obvious where to introduce abstraction.

Decomposition and Problem Difficulty

I sometimes notice myself doing a much nicer job of procedural decomposition when solving simple problems than when solving complex ones. Of course if I were programming rationally it would be the other way around. Perhaps this effect can be understood by saying that I have a fixed amount of brain power to devote to a programming problem. In this case, when I’m totally overwhelmed by complexity I have no resources left for the added burden of figuring out how to best perform procedural decomposition. Of course, once I step back, delete some code, and find a good decomposition, I can surmount some sort of complexity barrier and probably find a nice landscape of simple code on the other side.

Years ago I heard Randy Pausch gave a talk describing how people who tried to fly a virtual magic carpet were overwhelmed by the many degrees of freedom offered by a prototype version of the controls. In order to simplify the situation, they left the carpet’s throttle at the maximum setting. This reduced the degrees of freedom by one, but with the unfortunate side effect of causing them to careen stupidly through the game running into everything. It seems like this might be a good analogy for how many of us approach a difficult programming problem.

Decomposition and Language Choice

There’s clearly a link between procedural decomposition and programming language choice. For example, I learned to program in early microcomputer BASIC dialects (TI BASIC, GW-BASIC, and Applesoft BASIC) where the support for procedural abstraction was not even as good as it is in assembly language. I sometimes fear that spending a few formative years writing BASIC may have stunted my capacity for abstraction.

One of the things that made the old BASIC implementations such turds is that they had no support for creating multiple namespaces. C is better, but the lack of nested functions and multiple return values can make it hard to perform elegant procedural decomposition there. C++ and Java don’t support nested functions either, but the object model seems to approximate them well enough for practical purposes. Modern languages with tuples, nested functions, and first-class functions seem to offer the best support for procedural decomposition.

Historically, we were discouraged from performing too much procedural decomposition because function calls added overhead. C/C++ programmers have been known to write horrific macros in order to get decomposition without increasing runtime. The modern wisdom — regardless of programming language — is that any performance-oriented compiler will do a pretty good job inlining functions that are small and that don’t have a ton of call sites. When the compiler has good dynamic information, such as a JIT compiler or an AOT compiler with access to profile data, it can do a considerably better job. However, even an AOT compiler without profile data can often do pretty well given good heuristics and perhaps some hints from developers. In any case, the days of allowing performance considerations to influence our procedural decomposition are mostly over.

Teaching Procedural Decomposition

I think that too often, our programming assignments are so canned that a solid grasp of procedural abstraction isn’t necessary to solve them. Moreover, we tend to not do a great job of grading on code quality — and of course the students are allowed to throw away their code after handing it in, so they have little incentive to do real software engineering. Of course there are plenty of students who write very good code because they can, and out of a general sense of pride in their work. But they’re not the ones I’m talking about trying to help here.

Maybe something we need to do is help students learn to recognize the danger signs where the complexity of a piece of code is starting to get out of control. Personally, I get sort of a panicky feeling where I become more and more certain that each fix that I add is breaking something else. I hate this so much that it’s making my skin prickle just writing about it. Usually, the best thing to do when this happens is step back and rethink the structure of the code, and probably also delete a bunch of what’s already written.

Finally, we probably need to do more code reading in classes. This is difficult or impossible for a large class, but probably can be done well with up to 10 or 15 students. As a random example, the Minix FS module for Linux kernel contains a rather nice decomposition of the functionality for managing the block tree for each file.

Memory Safe C/C++: Time to Flip the Switch

For a number of years I’ve been asking:

If the cost of memory safety bugs in C/C++ codes is significant, and if solutions are available, why aren’t we using them in production systems?

Here’s a previous blog post on the subject and a quick summary of the possible answers to my question:

  • The cost of enforcement-related slowdowns is greater than the cost of vulnerabilities.
  • The cost due to slowdown is not greater than the cost of vulnerabilities, but people act like it is because the performance costs are up-front whereas security costs are down the road.
  • Memory safety tools are not ready for prime time for other reasons, like maybe they crash a lot or raise false alarms.
  • Plain old inertia: unsafety was good enough 40 years ago and it’s good enough now.

I’m returning to this topic for two reasons. First, there’s a new paper SoK: Eternal War in Memory that provides a useful survey and analysis of current methods for avoiding memory safety bugs in legacy C/C++ code. (I’m probably being dense but can someone explain what “SoK” in the title refers to? In any case I like the core war allusion.)

When I say “memory safety” I’m referring to relatively comprehensive strategies for trapping the subset of undefined behaviors in C/C++ that are violations of the memory model and that frequently lead to RAM corruption (I say “relatively comprehensive” since even the strongest enforcement has holes, for example due to inline assembly or libraries that can’t be recompiled). The paper, on the other hand, is about a broader collection of solutions to memory safety problems including weak ones like ASLR, stack canaries, and NX bits that catch small but useful subsets of memory safety errors with very low overhead.

The SoK paper does two things. First, it analyzes the different pathways that begin with an untrapped undefined behavior and end with an exploit. This analysis is useful because it helps us understand the situations in which each kind of protection is helpful. Second, the paper evaluates a collection of modern protection schemes along the following axes:

  • protection: what policy is enforced, and how effective is it at stopping memory-based attacks?
  • cost: what is the resource cost in terms of slowdown and memory usage?
  • compatibility: does the source code need to be changed? does it need to be recompiled? can protected and unprotected code interact freely?

As we might expect, stronger protection generally entails higher overhead and more severe compatibility problems.

The second reason for this post is that I’ve reached the conclusion that 30 years of research on memory safe C/C++ should be enough. It’s time to suck it up, take the best available memory safety solution, and just turn it on by default for a major open-source OS distribution such as Ubuntu. For those of us whose single-user machines are quad-core with 16 GB of RAM, the added resource usage is not going to make a difference. I promise to be an early adopter. People running servers might want to turn off safety for the more performance-critical parts of their workloads (though of course these might be where safety is most important). Netbook and Raspberry Pi users probably need to opt out of safety for now.

If the safe-by-default experiment succeeded, we would have (for the first time) a substantial user base for memory-safe C/C++. There would then be an excellent secondary payoff in research aimed at reducing the cost of safety, increasing the strength of the safety guarantees, and dealing with safety exceptions in interesting ways. My guess is that progress would be rapid. If the experiment failed, the new OS would fail to gain users and the vendor would have to back off to the unsafe baseline.

Please nobody leave a comment suggesting that it would be better to just stop using C/C++ instead of making them safe.

Reading Code

Reading code is an important skill that doesn’t get enough emphasis in CS programs. There are three main aspects:

  • the external view of the code: documentation, comments, APIs, white papers, information from developers, etc.
  • the static view: reading the code like a book
  • the dynamic view: reading the code as it executes, probably with help from a debugging tool

Of course these aren’t totally exclusive. For example, reading code like a book has a dynamic aspect because our brains serve as crude interpreters.

External View

Ideally we’ll be able to focus on a module that is small enough to understand all at once and that has relatively clean interfaces to the rest of the system. If the code seems too big or complicated to understand as a whole, perhaps it can be broken up. If the modularity that we want just isn’t there, we may be stuck trying to understand a piece of functionality that is buried is an ocean of complexity—not very fun.

We should be explicit about our goals. Are we reading the code for general enlightenment? In order to decide whether to hire the person who wrote the code? In order to begin refactoring or adding functionality? To look for bugs? Keep in mind that if we’re looking for bugs, code reviews are a whole separate thing.

Before reading any code we’ll certainly want to look over any documentation and also skim the comments in the code: sometimes there’s halfway-decent documentation hiding in the middle of a big source file. If the code is very well-known, like the Linux kernel, there’ll be plenty of books and web pages for us to look at. If not, perhaps there’s a specification, a white paper, a README, or similar. Often, even if there’s no documentation, the code will be implementing known algorithms that we can brush up on. Some kinds of domain-specific codes (signal processing, feedback control, storage management) will be very hard to understand if we lack basic domain knowledge, so again we may need to hit the books for a little while before getting back to the code.

Often it’s a good idea to build and run the code before starting to read it, or at least be sure that someone else has done it. This is because we often run across attractive-looking open source code on the web that isn’t worth reading because it’s not even finished—we might as well discover that fact as early as possible. Just the other day I got suckered by a Github project that promised to do exactly what I wanted, but that was badly broken.

Static View

Reading code is far easier if we come into it with an understanding of the patterns that are being used. Therefore, we can expect that in any given domain the first few pieces of code we read are going to require a lot of work and after that things should be easier. The tricky thing about patterns is that sometimes they are sitting there on the surface (like gotos in kernel code or explicit reference counts) but other times they are buried deeply enough that a lot of digging is needed in order to uncover them. Making matters worse, it’s not uncommon to see badly implemented patterns the author of the code didn’t even understand they were trying to implement. In systems code we sometimes see code that is half-assedly transactional. Also, see Greenspun’s 10th Rule of Programming.

It is important to learn to recognize code that contains little information. Java seems to be particularly prone to this (for example I assume this class is an obscure joke, but perhaps not…) but all languages have it. A pernicious kind of low-information code occurs in C where the verbose code is actually easy to get wrong, as in the well-known bug struct foo *x = (struct foo *) malloc (sizeof (struct foo *)).

One way to start reading a piece of code is to create an annotated call graph. If tool support is available then great, but if not this isn’t too much trouble to do by hand. Also, a by-hand callgraph is a good way to start getting a general feel for the code. Annotations on the call graph might include:

  • potential trouble spots: extra-snarly code, inline assembly, code containing comments such as “you are not expected to understand this
  • entry points to the module we’re reading, and exit points from it
  • resource allocations and deallocations
  • error-handling paths
  • accesses to important mutable global state

Putting together a good static callgraph may not be so easy if the code is functional, OO, event driven, or uses function pointers. In this case building the callgraph may become a dynamic problem.

With callgraph in hand (or without it) we should try to get a sense of the code’s control flow structure. Is it a library? An event loop? One out of a stack of layers? Does it use threads, and how? What sort of error handling does it use?

What are the main data structures used by the code we’re looking at? Which of these are shared with callers or callees? Where are they allocated, freed, and modified? What are the crucial data invariants? Are the data structures (and their algorithms) basically the textbook versions or are there interesting quirks in the implementations?

Dynamic View

Strictly speaking, we don’t need to run code in order to understand it. In practice, being able to run code is a lifesaver for several reasons. First, an actual execution follows a single path through the code, permitting us to ignore code not touched on that path. Second, if the computer is executing the code then the interpreter in our brains can take a rest and we can focus on other things. Third, if we have formed a bad hypothesis about the code, running the code is a good way to conclusively refute that hypothesis.

What is the dynamic view in practice? We can use a debugger to single-step through code, we can set breakpoints and watchpoints, we can add debugging printouts, we can add assertions corresponding to conjectured invariants, and we can write unit tests for the code to make sure we really understand it.

Finally, it’s often a good idea to change the code: add a bit of functionality, fix a bug, do some refactoring, etc. If we’ve successfully understood the code, we’ll be able to do this without too much trouble.

Conclusion

This has been a bit of a brain dump, not a checklist as much as a collection of things to keep in mind when starting out on a code-reading project. I’d be happy to get suggestions for improvement.

Labyrinth Rims

The Green River’s Labyrinth Canyon begins south of the town of Green River UT; the Labyrinth Rims refers to the area of BLM land on either side of this canyon. We spent four days in this somewhat isolated area of the San Rafael Desert without seeing any other people except for a group in the now-popular Bluejohn Canyon. Although this area is only about 30 miles from Moab, it’s at least 1.5 hours away by car.

On the first day of this trip we explored upper Keg Spring Canyon, which was pretty but a bit nondescript as canyons go, which perhaps explains why we saw no evidence of recent human traffic. We camped for three nights off a 4wd track near the head of Keg Spring Canyon.

Generally this campsite was perfect, with a bit of shelter from the wind, soft dirt to sleep on, and a large area of flat slickrock for parking and cooking. However, on our last night some cows moved through and one of them woke me up around 4am by licking my tent. In a befuddled state I tried to scare it off without scaring it so much that it trampled me and the tent.

On the second day of this trip we drove down to Bluejohn Canyon. Although this is most often done as a technical canyoneering route, much of the canyon can be hiked. The hiking route, however, crosses some confusing terrain before dropping into the canyon; map and compass are definitely needed here.

The main fork of Bluejohn is maybe the best slot canyon I’ve seen so far. Here, a chockstone some 20′ above the canyon bottom has a bunch of tree limbs and other debris jammed around it due to flash floods.

This is another of Bill’s pictures; he’s a better photographer than I am and shooting in slot canyons is not so easy due to the enormous dynamic range. A lot of slot canyons have a sense of intimacy; Bluejohn is more about grandeur and this photo captures that aspect nicely. For reference, I believe the chockstone is the same 20′ one from the previous picture.

There were plenty of minor obstacles like this before we finally got stopped by technical climbing.

A little bit of snow and a little bit of sun.

Here’s John Veranth in the “cathedral” section of Bluejohn where it’s maybe 150′ deep and only 1-2′ wide at the top. For a very short time, sun makes it to the bottom.

Anyway, Bluejohn was fantastic. The next day we explored the area around Bowknot Bend, a huge meander in the Green River. Like many things in this part of the world, it was named by John Wesley Powell or another member of his expedition. One interesting feature in this area is a large arch with five openings.

This is Labyrinth Canyon with the La Sal mountains beyond. We didn’t see any boaters on the Green, perhaps March is too early.

Finally we are overlooking Bowknot Bend. If you’re floating the river, it’s 7 miles from the left side of this photo to the right side. Although it’s only ~4 miles hiking each way to this overlook from the end of the 4wd track, we made a full day out of it by taking a couple of detours and a couple of wrong turns.

On our last day we explored Moonshine Wash, another slot canyon. This one is supposed to be nontechnical but after downclimbing a couple of short drops like the one shown here, we got stopped by a 20′ drop that did not look super easy to reverse.

Moonshine is really pretty.

A sheep bridge crosses Moonshine Wash near its deepest, narrowest point. Given my general dislike of heights (probably more than 100′ to the bottom here) this is as close as I was able to get to the bridge, which is generally falling apart and unsafe.

The Labyrinth Rims area is relatively easy to get to given how remote it feels. Much of this area could be explored with a 2wd vehicle, although in that case you’d end up doing more walking.

Fuzzers Need Taming

[This post explains a paper that we recently made available; it's going to be presented at PLDI 2013.]

Random testing tools, or fuzzers, are excellent at finding bugs that human testers miss. A particularly important use case for fuzzing is finding exploitable bugs, and companies such as Google use clusters to do high-throughput fuzzing. Whether you have a cluster or just a laptop, the usual workflow for fuzzing is something like this:

  1. Build the latest version of the code.
  2. Start the fuzzer and then go home for the night or the weekend.
  3. Later on, report a bug for each important new issue discovered by the fuzzer.

The problem is that step 3 can be a lot of work if, for example, the fuzzer discovers a few thousand test cases that cause failures. The situation is a bit complicated and next I’ll try to walk through some of the issues.

First, it is typical for some bugs to be triggered thousands of times more often than other bugs. Therefore, if you randomly choose a handful of failure-triggering test cases to look at, they are likely to all trigger the same bug or the same small set of bugs. A corollary is that in a fuzzing run of any length, there will be bugs that are triggered only once. (Clearly this cannot be true in the limit since there can only be a finite number of bugs in any code base. But it holds true for surprisingly long runs.) This graph shows how many times each of a collection of bugs was triggered, for three kinds of bugs, due to fuzzing GCC 4.3.0 and SpiderMonkey 1.6 for a few days:

powerlaw

Second, the job of triaging failure-triggering test cases can be made much easier if test case reduction is used. For example, here’s a fairly generic test-case reduction tool and here’s my group’s domain-specific tool that does a much better job for test cases that trigger bugs in C/C++ compilers. It is possible (though this wasn’t our experience) that a sufficiently good test case reduction tool makes it easy to examine and categorize every failure-triggering test case discovered by a fuzzer.

Third, if we can always fix bugs rapidly, then we can simply fix the first bug found by the fuzzer and then re-run the fuzzer. This is easy to do for small projects, but completely impractical for large projects that often have hundreds or thousands of open bugs at any given time, many of which are non-critical and therefore aren’t going to get fixed in the short term.

Fourth, it is sometimes the case that each bug has an unambiguous symptom. For example, if we have a bug that breaks a data structure’s invariants that is closely followed by an assertion that catches the bug, then we’ll reliably get a message such as “assertion violation at line 712 of foo.c” when this bug executes. If this is the case for all of the bugs in our code base, then we can rapidly find the true set of bugs by doing something simple like “grep Assert fuzzer-output/* | sort | uniq”. On the other hand, in practice there are often many bugs that result in ambiguous symptoms such as crashes or arbitrary incorrect output.

To summarize the issues so far: the problem of triaging a collection of test cases that trigger bugs may be easy or difficult depending on a number of factors. If this is easy, then great, you don’t need our work. If it is hard, then this can make fuzzing a lot less attractive. We have noticed this ourselves and also we heard from Csmith users who decided to stop using the tool not because it stopped finding bugs, but rather because it kept finding bugs that were already known and weren’t going to get fixed in the short term. So we decided to look for an automated solution to this problem:

Fuzzer taming problem: Given a potentially large collection of test cases, each of which triggers a bug, rank them in such a way that test cases triggering distinct bugs are early in the list.

Sub-problem: If there are test cases that trigger bugs previously flagged as undesirable, place them late in the list.

The obvious approach is to use clustering to divide the test cases into groups and then to hand the user one test case from each cluster (or, to solve the sub-problem, one bug from each cluster not containing a test case that triggers an undesirable bug). We tried this and it worked ok, but the approach was slow, it seemed overly complicated, and the clustering approach has no built-in way to perform a ranking among clusters. Moreover, we got the sense (but never really proved) that the clustering software we were using was probably trying to deal with outliers in some graceful way. On the other hand, the power-law distribution of bugs triggered by random test cases means that the outliers are likely to be interesting.

Suresh pointed us to Clustering to Minimize the Maximum Intercluster Distance, which presents the technique that we ended up using, which is to rank test cases in such a way that each test case in the list is the one that maximizes the distance to the closest previously listed test case. In other words, each test case is the one that is least similar to any test case listed so far. This is called the furthest point first (FPF) ordering and it can be used to create a clustering if you run FPF for a while and then form clusters by adding each unlisted point to the cluster defined by the nearest listed point. For our purposes clustering is uninteresting: we just want a ranking. Therefore, we run FPF all the way to the end, giving a list that contains diverse test cases at the head and mostly very similar test cases at the end.

So far I have left out something important: how to compute the distance between two test cases. This is the input to a clustering algorithm or to the FPF computation. The short answer is that we tried a variety of distance functions ranging from Levenshtein distance (or edit distance: the number of character additions, deletions, and alterations that suffices to change one string into another) over test case text to euclidean distance over vectors of code coverage data. Why try multiple distance functions? This is because bugs are diverse and we don’t have any reason to think that we can come up with the one true distance function that accurately captures test case similarity for all bugs. The paper contains more details.

The way to evaluate something like a fuzzer tamer is with a discovery curve: a graph that shows how quickly a particular ranking shows a person inspecting the items in ranked order one member of each class of objects being examined. Here are some discovery curves for SpiderMonkey bugs:

js

And here are the results for some bugs that cause GCC 4.3.0 to generate incorrect object code:

gcc

The upshot is that the baseline—looking at deduplicated reduced test cases in random order—is not very attractive and that you can discover all bugs much faster using fuzzer taming. The most interesting part of each graph is at the left edge because we would expect a human to inspect maybe 10 or 15 test cases, not 500 or 700. In that region, the tamed results will show you about twice as many bugs as you’d see by examining test cases in random order (some more detailed graphs are in the paper). The curves for clustering in these graphs represent our best attempt to get a ranking out of x-means, a modern clustering algorithm. Of course it is possible that some other clustering technique would do better.

Our ranking approach was inspired by similar approaches that are used to rank static analysis results. Notice that they also show bug discovery curves.

So what do you do with a fuzzer taming tool? My idea is to bolt together Csmith, C-Reduce, and the fuzzer tamer into a  nice integrated package. Someone who wants to fuzz a compiler should be able to build it, point it at a compiler (or better yet, the compiler’s repo) and then every day or so, glance at a web page that shows them interesting, newly discovered defects in the compiler. The toolchain will look something like this:

workflow

My guess is that other fuzzing workflows can also benefit from this sort of integration. We’d appreciate feedback on this material. The work described in this post was done by my group in conjunction with Alex Groce and others at Oregon State University.

Stochastic Superoptimization

“Stochastic Superoptimization” is a fancy way to say “randomized search for fast machine code.” It is also the title of a nice paper that was presented recently at ASPLOS. Before getting into the details, let’s look at some background. At first glance the term “superoptimization” sounds like nonsense because the optimum point is already the best one. Compiler optimizers, however, don’t tend to create optimal code; rather, they make code better than it was. A superoptimizer, then, is either an optimizer that may be better than regular optimizers or else a code generator with optimality guarantees.

How can a superoptimizer (or a skilled human, for that matter) create assembly code that runs several times faster than the output of a state of the art optimizing compiler? This is not accomplished by beating the compiler at its own game, which is applying a long sequence of simple transformations to the code, each of which improves it a little. Rather, a superoptimizer takes a specification of the desired functionality and then uses some kind of systematic search to find a good assembly-level implementation of that functionality. Importantly, the output of a superoptimizer may implement a different algorithm than the one used in the specification. (Of course, a regular optimizer could also change the algorithm, but it would tend to do so via canned special cases, not through systematic search.)

Over the last 25 years a thin stream of research papers on superoptimization has appeared:

  • The original superoptimizer paper from 1987 describes a brute-force search for short sequences of instructions. Since candidate solutions were verified by testing, this superoptimizer was perfectly capable of emitting incorrect code. Additionally, only very short instruction sequences could be generated.
  • The Denali superoptimizer works on an entirely different principle: the specification of the desired functionality, and the specification of the available instructions, are encoded as a satisfiability problem. Frustratingly, both the original paper and the followup paper on Denali-2 are short on evidence that this approach actually works. As idea papers, these are great. As practical compiler papers, I think they need to be considered to be negative results.
  • A more practical superoptimizer was described in 2006. This one is modeled after the “peephole” passes found in all optimizing compilers, which perform low-level (and often architecture-specific) rewrites to improve the quality of short sequences of instructions or IR nodes. Although using exhaustive search makes it expensive to discover individual rewrites, the cost is amortized by storing rewrites in a fast lookup structure. Another cool feature of this work was to use testing to rapidly eliminate unsuitable code sequences, but then to verify the correctness of candidate solutions using a SAT solver. Verifying equivalence of already-discovered code is much more practical than asking the SAT solver to find a good code sequence on its own. The punchline of this paper was that the peephole superoptimizer could sometimes take unoptimized code emitted by GCC and turn it into code of about -O2 quality. Also, without any special help, it found interesting ways to use SSE instructions.

Ok, enough background. There’s more but this is the core.

The new paper on stochastic superoptimization, which is from the same research group that produced the peephole superoptimizer, throws out the database of stored optimizations. However, that aspect could easily be added back in. The idea behind STOKE (the stochastic superoptimizer tool)—that it’s better to sparsely search a larger region of the search space than to densely search a smaller region—is the same one that has been responsible for a revolution in computer Go over the last 20 years. The punchline of the new paper is that STOKE is sometimes able to discover code sequences as good as, or a bit better than, the best known ones generated by human assembly language programmers. Failing that, it produces code as good as GCC or Intel CC at a high optimization level. These results were obtained on a collection of small bit manipulation functions. STOKE optimizes x86-64 code, and it is only able to deal with loop-free instruction sequences.

At a high level, using randomized search to find good code sequences is simple. However, the devil is in the details and I’d imagine that getting interesting results out of this required a lot of elbow grease and good taste. Solving this kind of optimization problem requires (1) choosing a mutation operator that isn’t too stupid, (2) choosing a fitness function that is pretty good, and (3) making the main search loop fast. Other aspects of randomized search, such as avoiding local maxima, can be handled in fairly standard ways.

STOKE’s mutation operator is the obvious one: it randomly adds, removes, modifies, or reorders instructions. Part of the fitness function is also obvious; it is based on an estimate of performance supplied by the x86-64 interpreter. The less obvious part of STOKE’s fitness function is that it is willing to tolerate incorrect output: a piece of wrong code is penalized by the number of bits by which its output differs from the expected output, and also by erroneous conditions like segfaults. To make each iteration fast, STOKE does not try to ensure that the code sequence is totally correct, but rather (just like the original superoptimizer) runs some test cases through it. However, unlike the original superoptimizer, STOKE will never hand the user a piece of wrong code becuase a symbolic verification method is applied at the end to ensure that the optimized and original codes are equivalent. This equivalence check is made much easier by STOKE’s insistence on loop-freedom. STOKE’s speed is aided by the absence of loops and also by a highly parallelized implementation of the randomized search.

The STOKE authors ran it in two modes. First, it was used to improve the quality of code generated by Clang at -O0. This worked, but the authors found that it was unable to find better, algorithmically distinct versions of the code that were sometimes known to exist. To fix this, they additionally seeded STOKE with completely random code, which they tried to evolve into correct code in the absence of a performance constraint. Once the code became correct, it was optimized as before. The fact that this worked—that is, it discovered faster, algorithmically distinct code—for 7 out of 28 benchmarks, is cool and surprising and I’d imagine this is why the paper got accepted to ASPLOS.

Are superoptimizers ever going to be a practical technology? Perhaps. One can imagine them being used in a few ways. First, a system such as STOKE (with modest improvements such as handling loops) could be used by library developers to avoid writing hand-tuned assembly for math kernels, media CODECs, compression algorithms, and other small, performance-critical codes that are amenable to tricky optimizations. The advantage is that it is easy to tell STOKE about a new instruction or a new chip (a new target chip only requires a new cost function; a new target architecture is much more difficult). This goal could be achieved in a short time frame, though I suspect that developers enjoy writing assembly language enough that it may be difficult to persuade them to give it up.

The second, and more radical way that a superoptimizer could be used is to reduce the size of a production-grade optimizing compiler by removing most of its optimization passes. What we would end up with is a very simple code generator (TCC would be a good starting point) and a database-backed superoptimizer. Actually, I’m guessing that two superoptimizers would be needed to achieve this goal: one for the high-level IR and one for the assembly language. So perhaps we would use Clang to create LLVM code, superoptimize that code, use an LLVM backend to create assembly, and then finally superoptimize that assembly. If the optimization database lived in the cloud, it seems likely that a cluster of modest size could handle the problem of superoptimizing the hottest collections of IR and assembly code that everyone submits. Drawbacks of this approach include:

  • Novel code will either be optimized very poorly or very slowly.
  • Compilation predictability suffers since the cloud-based peephole database will constantly be improving.
  • You can’t compile without a network connection.

It remains to be seen whether we want to do this sort of thing in practice. But if I were a compiler developer, I’d jump at a chance to substantially reduce the complexity, maintenance requirements, and porting costs of my toolchain.

Update from 3/9/13: A strength of this kind of superoptimizer that I forgot to mention is that it can be applied to code compiled from any language, not just C/C++. So let’s say that I’ve written a stupid little x64 generator for my domain-specific language. Now if I have a database-backed superoptimizer I should be able to get roughly -O2 quality binaries out of it. Whether this kind of approach is better or worse than just compiling to LLVM in the first place, I don’t know.

Exhaustive Testing is Not a Proof of Correctness

It is often said that exhaustively testing a piece of software is equivalent to performing a proof of correctness. Although this idea is intuitively appealing—and I’ve said it myself a few times—it is incorrect in a technical sense and also in practice.

What’s wrong with exhaustive testing in practice? The problem comes from the question: What is it that we are testing? One possible answer would be “We’re testing the program we wrote.” For example, I might write this program containing an exhaustive test harness and an ostensible identity function which is defined to return the value of its argument for any value in the range -5..5:

#include <stdio.h>

int identity (int x)
{
}

int main (void)
{
  int i;
  for (i=-5; i<=5; i++) {
    printf ("identity(%d) = %d\n", i, identity(i));
  }
  return 0;
}

Let’s run the exhaustive tester:

$ gcc ident_undef.c ; ./a.out
identity(-5) = -5
identity(-4) = -4
identity(-3) = -3
identity(-2) = -2
identity(-1) = -1
identity(0) = 0
identity(1) = 1
identity(2) = 2
identity(3) = 3
identity(4) = 4
identity(5) = 5

Nice! We have used exhaustive testing to show that our program is correct. But actually not:

$ gcc -O ident_undef.c ; ./a.out
identity(-5) = 1322972520
identity(-4) = 26
identity(-3) = 18
identity(-2) = 18
identity(-1) = 18
identity(0) = 18
identity(1) = 17
identity(2) = 17
identity(3) = 17
identity(4) = 17
identity(5) = 17

The identity function is, of course, woefully wrong even if it happened to work at -O0. So clearly we did not prove the C program to be correct. On the other hand, perhaps we proved the compiled program to be correct? In this case (I looked at the object code) the compiled program is correct.

The thing is, it is not difficult to create a different compiled program that passes exhaustive testing, while being demonstrably incorrect. For example, a program that relies upon buggy operating system functionality or buggy hardware (such as a Pentium with the fdiv bug) could pass exhaustive testing while being totally wrong.

So maybe the thing that we’re proving correct isn’t just the compiled program. Maybe it’s the entire computer system: the hardware platform, the operating system, and also our compiled program. Now, a program that gives the correct result by relying on the fdiv bug is perfectly fine.

There are two problems with saying that we have proved our computer system correct (with respect to its ability to execute a program we have chosen). First, the computer system as a whole is of course incorrect: any real hardware and operating system will have some known bugs and we have failed to argue that none of these bugs can affect the execution of the identity function (to do that, we would have to test every input to our program in every possible state that the hardware and OS could be in—a tough job). Second, the computer is a real physical object and it follows the laws of physics. If we wait long enough, a disturbance will occur (it could be external, like some alpha particles, or internal, like a quantum fluctuation) that causes the computer to fail to give the right result for our program.

The fundamental problem here is that our “proof” was not in terms of an abstraction, it was in terms of a physical object. Physical objects live in the real world where all guarantees are probabilistic.

The second problem with saying that exhaustive testing constitues a proof (actually, the second aspect of the only problem) is that a proof of correctness is a mathematical proof, whereas a collection of successful test cases is not a mathematical proof. A collection of successful test cases, even if it is exhaustive, may form a very compelling argument, but that doesn’t make it a proof.

Although I do not know of anyone who has done this, it should be perfectly feasible to automatically turn an exhaustive collection of test cases into a proof of correctness. To do this you would first need to borrow (or create) a formal semantics for the computing platform. This could be a semantics for C, it could be a semantics for x86-64, or it could be something different. The formal semantics can be used to evaluate the behavior of the computer program for every input; if the behavior is correct for all inputs, then we can finally construct a proof of correctness. The point of using a formal semantics is that it provides a mathematical interpretation, as opposed to a physical one, for the computing platform.

GCC pre-4.8 Breaks Broken SPEC 2006 Benchmarks

UPDATE: Ok, I did something stupid. I wrote most of this post while using a pre-4.8 snapshot that I had sitting around and then I built the actual 4.8.0 release to verify the behavior but I screwed up a path or something. The 4.8.0 release does not have the behavior described in this post. It was patched a few days before the release, see comments 9 and 10 below. I apologize for the sloppiness. Of course, SPEC is still broken and there are still plenty of undefined programs not in SPEC that GCC will break for you…

Near the top of the GCC 4.8.0 release notes we find this:

GCC now uses a more aggressive analysis to derive an upper bound for the number of iterations of loops using constraints imposed by language standards. This may cause non-conforming programs to no longer work as expected, such as SPEC CPU 2006 464.h264ref and 416.gamess. A new option, -fno-aggressive-loop-optimizations, was added to disable this aggressive analysis.

The thing that is happening here—a compiler breaking a previously working program by exploiting undefined behavior—is a topic I’ve written plenty about, but breaking the current version of SPEC seems like a big enough deal to make it worth looking into. If you want the goriest possible details, check out some discussion at the GCC bugzilla and the response from SPEC. Here I’ll just present the essentials.

The problem with h264ref is a function that contains code like this:

int d[16];

int SATD (void)
{
  int satd = 0, dd, k;
  for (dd=d[k=0]; k<16; dd=d[++k]) {
    satd += (dd < 0 ? -dd : dd);
  }
  return satd;
}

Using GCC 4.8.0 on Linux on x86-64, we get this (I’ve cleaned it up a bit):

$ gcc -S -o - -O2 undef_gcc48.c
SATD:
.L2:
	jmp	.L2

Of course this is an infinite loop. Since SATD() unconditionally executes undefined behavior (it’s a type 3 function), any translation (or none at all) is perfectly acceptable behavior for a correct C compiler. The undefined behavior is accessing d[16] just before exiting the loop. In C99 it is legal to create a pointer to an element one position past the end of the array, but that pointer must not be dereferenced. Similarly, the array cell one element past the end of the array must not be accessed.

One thing we might ask is: What is the compiler thinking here? Why not just give a nice warning or error message if the program is going to unconditionally execute undefined behavior? The basic answer is that the compiler isn’t somehow maliciously saying to itself: “Aha– undefined behavior. I’ll screw up the object code.” Rather, the compiler is simply assuming that undefined behavior never occurs. How does that come into play in SATD()?

In detail, here is what’s going on. A C compiler, upon seeing d[++k], is permitted to assume that the incremented value of k is within the array bounds, since otherwise undefined behavior occurs. For the code here, GCC can infer that k is in the range 0..15. A bit later, when GCC sees k<16, it says to itself: “Aha– that expression is always true, so we have an infinite loop.” The situation here, where the compiler uses the assumption of well-definedness to infer a useful dataflow fact, is precisely analogous to the null-pointer check optimization which got some Linux kernels into trouble.

My view is that exploiting undefined behavior can sometimes be OK if a good debugging tool is available to find the undefined behavior. In this case, a memory safety error, we’ll turn to Valgrind and the address sanitizer (an awesome new feature in GCC 4.8.0, ported over from LLVM). Here goes:

$ gcc undef_gcc48.c
$ valgrind -q ./a.out
$ gcc -fsanitize=address undef_gcc48.c
$ ./a.out 
$ clang -fsanitize=address undef_gcc48.c
$ ./a.out 
$

No dice, bummer. Valgrind cannot change the layout of objects so I would not have expected it to notice this, but it seems like the address sanitizer should be able to catch the problem. On the other hand, where dynamic analysis fails us, static analysis succeeds:

$ gcc -O3 undef_gcc48.c -Wall 
undef_gcc48.c: In function ‘SATD’:
undef_gcc48.c:6:29: warning: array subscript is above array bounds [-Warray-bounds]
   for (dd=d[k=0]; k<16; dd=d[++k]) {
                             ^
undef_gcc48.c: In function ‘main’:
undef_gcc48.c:6:29: warning: array subscript is above array bounds [-Warray-bounds]
   for (dd=d[k=0]; k<16; dd=d[++k]) {
                             ^

For whatever reason, this warning shows up at -O3 but not -O2. We should, however, not pat ourselves on the back quite yet—this kind of static analysis is generally somewhat brittle and we cannot expect to always get such a nice warning.

It would be reasonable to ask the larger question of whether a compiler should be breaking code like this, but it’s late to be opening that can of worms, so I’ll leave this discussion for another day (or for the comments). I have not looked into 416.gamess but I suspect it’s more of the same.

Proofs from Tests

An idea that I’ve been interested in for a while is that a good test suite should be able to directly support formal verification. How would this work, given that testing usually misses bugs? The basic insight is that a test case is usually telling us about more than just one execution: it’s telling us about a class of similar executions that will all give the correct result if the test case succeeds. Partition testing is based on this same idea.

To get started, let’s do a little thought experiment where our test suite is exhaustive: it tests every possible input value. Although this is obviously very restrictive, exhaustive testing is sometimes feasible for non-trivial examples such as a function that decodes a binary ARM instruction.

What does it take to make a proof out of an exhaustive collection of test cases? We need some sort of a mathematical model for how the system works—an executable formal semantics—and also a mathematical model for how the system is intended to work: a formal specification. Then, we basically unroll the execution of the system on each test case and show that the result is equivalent to the specified result. The conjunction of this collection of statements about the system is our proof. The problem is that it’s going to be a very large, unsatisfying proof.

So the next problem is turning the cumbersome proof into a prettier one. The basic strategy is to look for a collection of “similar” conjuncts: these should all have the same control flow and, for all operations performed (addition, subtraction, etc.) the inputs and outputs should live in the same continuous region. Two’s complement addition, for example, has a discontinuity where it wraps around; the 8-bit signed addition operations 50+50 and 100+100 therefore do not live in the same continuous region. I’m guessing that for a suitable definition of “continuous” (which may not be totally straightforward for bitwise operations) it should be possible to automatically collapse a collection of cases into a single case that covers all of them. In terms of partition testing, our final proof will have one conjunct for each partition.

But we still haven’t accomplished anything particularly useful, since we can’t exhaustively test most of the codes that we care about. What I hope is that if we actually went through this exercise, we’d learn something about how to attack the more general case where we have a non-exhaustive collection of test cases and we want to automatically expand each of them to reason about an entire partition of the input space. If we could do that, a collection of test cases achieving 100% partition coverage would lead to a proof of correctness.

Consider a 2-bit saturating adder where the result of the addition sticks at the minimum or maximum representable value instead of overflowing. This function takes a pair of values in the range -2..1 and returns a value in the same range. A C implementation would be:

int twobit_sat_add (int a, int b)
{
  int result = a+b;
  if (result>1) return 1;
  if (result<-2) return -2;
  return result;
}

Here is an exhaustive collection of test cases with results:

twobit_sat_add (-2, -2) = -2
twobit_sat_add (-2, -1) = -2
twobit_sat_add (-2,  0) = -2
twobit_sat_add (-2,  1) = -1
twobit_sat_add (-1, -2) = -2
twobit_sat_add (-1, -1) = -2
twobit_sat_add (-1,  0) = -1
twobit_sat_add (-1,  1) =  0
twobit_sat_add ( 0, -2) = -2
twobit_sat_add ( 0, -1) = -1
twobit_sat_add ( 0,  0) =  0
twobit_sat_add ( 0,  1) =  1
twobit_sat_add ( 1, -2) = -1
twobit_sat_add ( 1, -1) =  0
twobit_sat_add ( 1,  0) =  1
twobit_sat_add ( 1,  1) =  1

The proof of correctness requires an executable semantics for C; these are available. The resulting proof will be large, even for this simple example, so let’s look at the partitions.

Only one test case takes the first return path:

twobit_sat_add (1, 1) = 1

Three test cases take the second return path:

twobit_sat_add (-2, -2) = -2
twobit_sat_add (-2, -1) = -2
twobit_sat_add (-1, -2) = -2

And these test cases take the third return path:

twobit_sat_add (-2,  0) = -2
twobit_sat_add (-2,  1) = -1
twobit_sat_add (-1, -1) = -2
twobit_sat_add (-1,  0) = -1
twobit_sat_add (-1,  1) =  0
twobit_sat_add ( 0, -2) = -2
twobit_sat_add ( 0, -1) = -1
twobit_sat_add ( 0,  0) =  0
twobit_sat_add ( 0,  1) =  1
twobit_sat_add ( 1, -2) = -1
twobit_sat_add ( 1, -1) =  0
twobit_sat_add ( 1,  0) =  1

Within each collection of test cases, no discontinuities in C’s addition operator are triggered, so our final proof should contain three top-level conjuncts. Is it possible to reconstruct this proof given only, for example, these three representative test cases?

twobit_sat_add ( 1,  1) =  1
twobit_sat_add (-2, -1) = -2
twobit_sat_add ( 0, -1) = -1

I don’t know, but I hope so. Perhaps techniques such as abstract interpretation can be used to reason effectively about the behavior of entire partitions. If so, then this proof based on three test cases would also work for saturating additions of any width (assuming that the implementation had available to it an addition operation at least one bit wider—if not, a bit more work is required).

To summarize, here’s the work to be done:

  • For some simple programming language where an executable formal semantics already exists, show that an exhaustive collection of test cases for a function can be automatically turned into a proof of correctness.
  • Define partition coverage in an appropriate fashion.
  • Show that an exhaustive proof of correctness can be turned into a proof by partitions.
  • Show that a single test case per partition is sufficient to generate an overall proof of correctness.
  • For extra credit, automatically detect the case where the test suite does not achieve 100% partition coverage.

I suspect that a lot of this hinges upon finding an appropriate way to define partitions. A conservative definition for what constitutes a partition will make the research easier (because there will be less going on inside each partition) but also less useful because more test cases will be required to achieve 100% partition coverage. A liberal definition of partition makes the testing part easier and the proofs harder. Clearly, in practice, it is not OK to insist (as I did above) that every test case in a partition has identical control flow: we’ll need to be able to abstract over loop counts and recursion depth.

The basic insight here is that some of the work of formal verification can be borrowed from the testing phase of software development. We can’t avoid testing even when we’re doing formal verification, so we might as well try to get more benefit out of the sunk costs. My hope is that for a smallish language subset, this project is within the scope of a PhD thesis. If anyone knows of work that already accomplishes some or all of the goals I’ve outlined here, please let me know.