We Need Hardware Traps for Integer Overflow

Processors should support integer math instructions that optionally trap on overflow. Because popular architectures lack this feature, otherwise excellent modern systems programming languages, such as Rust, Go, and D, have default integer types that wrap. This is bad because unexpected wrapping causes programs to produce incorrect results, although of course integer overflow in a safe language is not the utter disaster that it is in C/C++, where integer overflow and type unsafety cooperate to create the worst kinds of bugs. The reason that Rust provides wrapping integers is that so far, the costs of a better semantics — both in terms of runtime overhead and in terms of implementation effort for the Rust team — exceed the perceived benefits. My belief is that hardware traps could change this inequation in a favorable way.

The lack of trapping math instructions doesn’t only hurt low-level languages. In JavaScript, where all numbers are semantically double-precision floats, replacing doubles with integer types in the runtime requires expensive software overflow checks. I’ve heard that this results in a 5-10% slowdown for basically all JavaScript code. In languages such as Python and Racket the default integer type overflows into bignums instead of doubles; Matthew Flatt tells me that Racket’s performance penalty due to software overflow checks probably exceeds 100% for jitted tight loops that do lots of integer math.

You might be saying to yourself: But I require wrapping integers to implement crypto codes and PRNGs and hash functions and stuff like that. The answer is easy: we provide wrapping operators that can be used in these specialized situations. One choice would be +., -., etc. In a unicode language we might use ⊞, ⊟, etc. (Jesse Ruderman suggested this, and I like the idea).

Architectures such as MIPS and Alpha support integer overflow traps. However, to a good approximation, the only architectures that matter right now are ARM’s and Intel’s. There are two issues in adding integer overflow traps to these ISAs. First, where do we get opcode space for the new trapping instructions? For x86 and x86-64, which support an elaborate system of instruction prefixes, it may make sense to use that mechanism, although this comes with a code size penalty. ARM has fixed-size instructions so finding space may be trickier there. A mail to a friend at ARM on this topic has so far gone unanswered, but I am guessing that this could be shoehorned into ARMv9 somehow. The second issue is the effect of integer overflow traps on chip area and critical path length. An experienced architect who I talked to doesn’t think these are serious problems, and in any case the complexity is dramatically lower than the complexity of implementing something like hardware support for array bounds checking.

This post isn’t as much of an opinion piece as a plea to the folks at ARM, Intel, and AMD: Please provide this feature. It is needed in order to make high-level languages faster and low-level languages saner.

UPDATE: Some data about the overhead of software integer overflow checking in C/C++ codes can be found in section IIID of this paper. Please keep in mind, however, that this implementation is tuned for debugging not performance. A highly tuned software integer undefined behavior checker for C/C++ could probably have overhead in the 5% range.

A Guide to Better Scripty Code for Academics

[Suresh suggested that I write a piece about unit testing for scripty academic software, but the focus changed somewhat while I was writing it.]

Several kinds of software are produced at universities. At one extreme we have systems like Racket and ACL2 and HotCRP that are higher quality than most commercial software. Also see the ACM Software System Award winners (though not all of them came from academia). I wrote an earlier post about how hard it is to produce that kind of code.

This piece is about a different kind of code: the scripty stuff that supports research projects by running experiments, computing statistics, drawing graphs, and that sort of thing. Here are some common characteristics of this kind of code:

  • It is often written in several different programming languages; for example R for statistics, Matplotlib for pretty pictures, Perl for file processing, C/C++ for high performance, and plenty of shell scripts and makefiles to tie it all together. Code in different languages may interact through the filesystem and also it may interact directly.
  • It seldom has users outside of the research group that produced it, and consequently it usually embeds assumptions about its operating environment: OS and OS version, installed packages, directory structure, GPU model, cluster machine names, etc.
  • It is not usually explicitly tested, but rather it is tested through use.

The problem is that when there aren’t any obvious errors in the output, we tend to believe that this kind of code is correct. This isn’t good, and it causes many of us to have some legitimate anxiety about publishing incorrect results. In fact, I believe that incorrect results are published frequently (though many of the errors are harmless). So what can we do? Here’s a non-orthogonal list.

Never Ignore Problems

Few things in research are worse than discovering a major error way too late and then finding out that someone else had noticed the problem months earlier but didn’t say anything. For example we’ll be tracking down an issue and will find a comment in the code like this:

  # dude why do I have to mask off the high bits or else this segfaults???

Or, worse, there’s no comment and we have to discover the offending commit the hard way — by understanding it. In any case, at this point we pull out our hair and grind our teeth because if the bug had been tracked down instead of hacked around, there would have been huge savings in terms of time, energy, and maybe face. As a result of this kind of problem, most of us have trained ourselves to be hyper-sensitive to little signs that the code is horked. But this only works if all members of the group are onboard.

Go Out of Your Way to Find Problems

Failing to ignore problems is a very low bar. We also have to actively look for bugs in the code. The problem is that because human beings don’t like being bothered with little details such as code that does not work, our computing environments tend to hide problems by default. It is not uncommon for dynamically and weakly typed programming languages to (effectively) just make up crap when you do something wrong, and of course these languages are the glue that makes everything work. To some extent this can be worked around by turning on flags such as -Wall in gcc and use warningsuse strict; in Perl. Bugs that occur when crossing layers of the system, such as calling into a different language or invoking a subprocess, can be particularly tricky. My bash scripts became a lot less buggy once I discovered the -e option. Many languages have a lint-like tool and C/C++ have Valgrind and UBSan.

One really nice thing about scripty research code is that there’s usually no reason to recover from errors. Rather, all dials can be set to “fail early, fail fast” and then we interactively fix any problems that pop up.

The basic rule is that if your programming environment supports optional warnings and errors, turn them all on (and then maybe turn off the most annoying ones). This tends to have a gigantic payoff in terms of code quality relative to effort. Also, internal sanity checks and assertions are worth their weight in gold.

Fight Confirmation Bias

When doing science, we formulate and test hypotheses. Although we are supposed to be objective, objectivity is difficult, and there’s even a term for this. According to Wikipedia:

Confirmation bias is the tendency of people to favor information that confirms their beliefs or hypotheses.

Why is this such a serious problem? For one thing, academia attracts very smart people who are accustomed to being correct. Academia also attracts people who prefer to work in an environment where bad ideas do not lead to negative economic consequences, if you see what I mean. Also, our careers depend on having good ideas that get good results. So we need our ideas to be good ones — the incentives point to confirmation bias.

How can we fight confirmation bias? Well, most of us who have been working in the field for more than a few years can easily bring to mind a few examples where we felt like fools after making a basic mistake. This is helpful in maintaining a sense of humility and mild research paranoia. Another useful technique is to assume that the people doing previous work were intelligent, reasonable people: if implementing their ideas does not give good results, then maybe we should figure out what we did wrong. In contrast, it is easy to get into the mindset that the previous work wasn’t very good. Evidence of this kind of thinking can be seen in the dismissive related work sections that one often sees.

Write Unit Tests

Modern programming languages come with good unit testing frameworks and I’ve noticed that the better students tend to instinctively write unit tests when they can. In contrast, us old fogies grew up as programmers long before the current testing culture developed and we have a harder time getting ourselves to do this.

But does unit testing even make sense for scripty code? In many cases it clearly doesn’t. On the other hand, Suresh gives the example where they are comparing various versions of an algorithm; in such a situation we might be able to run various data sets through all versions of the algorithm and make sure their results are consistent with each other. In other situations we’re forced to re-implement a statistical test or some other piece of fairly standard code; these can often be unit tested using easy cases. Mathematical functions often have properties that support straightforward smoke tests. For example, a function that computes the mean or median of a list should compute the same value when fed the same list twice.

Write Random Testers

It is often the case that an API that can be unit tested can also be fuzzed. Two things are required: a test-case generator and an oracle. The test-case generator can do something easy like randomly shuffling or subsetting existing data sets or it can make up new data sets from scratch. The oracle decides whether the code being tested is behaving correctly. Oracles can be weak (looking for crashes) or strong (looking for correct behavior). Many modern programming languages have a QuickCheck-like tool which can make it easier to create a fuzzer. This blog post and this one talk about random testing (as do plenty of others, this being one of my favorite subjects).

Clean Up and Document After the Deadline

As the deadline gets closer, the code gets crappier, including the 12 special cases that are necessary to produce those weird graphs that reviewer 2 wants. Cleaning this up and also documenting how the graphs for the paper were produced is surely one of the best investments we could make with our time.

Better Tooling

Let’s take it as a given that we’re doing code reviews, using modern revision control, unit testing frameworks, static and dynamic analysis tools, etc. What other tool improvements do we want to see? Phil Guo’s thesis has several examples showing how research programming could be improved by tools support. There’s a lot of potential for additional good work here.

Summary

There are plenty of easy ways to make scripty research code better. The important thing is that the people who are building the code — usually students — are actually doing this stuff and that they are receiving proper supervision and encouragement from their supervisors.

Early Superoptimizer Results

[Here’s a shortcut to the results. But it would be best to read the post first.]

Following my previous superoptimizer post, my student Jubi and I were getting up to speed on the prerequisites — SMT solvers, LLVM internals, etc. — when Googler Peter Collingbourne contacted me saying that he had recently gotten a superoptimizer up and running and might I be interested in working with him on it? I read his code and found it to be charmingly clear and simple. Also, one of my basic principles in doing research is to avoid competing, since competition wastes resources and burns students because the resulting race to publication effectively has an arbitrary winner. So I immediately started feeding bug reports to Peter.

The new superoptimizer, Souper, makes a few simplifying assumptions:

  • The only optimization candidates that it considers are the true and false values. Therefore, at present Souper only harvests expressions that compute an i1: a one-bit integer, which is how Booleans are represented in LLVM. Thus, the result of a Souper run is a collection of expressions that LLVM could have — but did not — evaluate to either true or false.
  • It doesn’t yet have models for all instructions or for all undefined behaviors for the instructions it does support.

These assumptions need to be relaxed. One generalization that should be pretty easy is to harvest expressions that end up as integers of arbitrary width. The interesting thing about this is that we cannot take time to check if every harvested expression evaluates to, for example, every possible value that an i32 can take. What we will do instead is to ask the SMT solver to synthesize the equivalent constant. The problem is that by default, when we make an equivalence query to an SMT solver, it is an unsat result that signals equivalence, and unsat doesn’t come with a model — it indicates failure to find a model. It turns out there’s a cute trick (which I learned from Nuno Lopes) involving a quantifier which flips a query around such that an equivalence results in sat, and therefore a model, from which we can pluck the synthesized constant. Consider this Z3/Python code where we’re asking, for a variety of constants c, how to express i*c (where i is an integer variable) in the form i<<x + i<<y + i<<z:

from z3 import *

s = Solver()

def checkit (c):
    s.push()
    i, x, y, z = BitVecs('i x y z',32)
    q = ForAll (i, i*c == ((i<<x) + (i<<y) + (i<<z)))
    s.add(q)
    s.add(x>=0, x<32)
    s.add(y>=0, y<32)
    s.add(z>=0, z<32)
    if s.check() == sat:
        m = s.model()
        print ("i * " + str(c) + 
               " == i<<" + str(m.evaluate(x)) + 
                " + i<<" + str(m.evaluate(y)) +
                " + i<<" + str(m.evaluate(z)))
    else:
        print "i * " + str(c) + " has no model"
    s.pop()

for m in range(100):
    checkit(m)

This is just an example but it’s the kind of thing that might make sense on a small embedded processor where the integer multiply instruction is expensive or doesn’t exist. The results include:

i * 28 == i<<4 + i<<3 + i<<2
i * 29 has no model
i * 30 has no model
i * 31 has no model
i * 32 == i<<4 + i<<3 + i<<3
i * 33 == i<<4 + i<<4 + i<<0

The full set of results is here. I particularly enjoyed the solver's solutions for the first three cases. So we know that the synthesis part of a superoptimizer is possible and in fact probably not all that difficult. But that's a digression that we'll return to in a later post; let's get back to the main topic.

Now I'll show you how to read Souper's output. You may find it useful to keep the LLVM instruction set reference handy. Here's an optimization report:

%0:i32 = var 
%1:i32 = mul 4294967294:i32, %0
%2:i1 = eq 1:i32, %1
cand %2 0:i1

The first line tells us that %0 has type i32 -- a 32-bit integer -- corresponding to a signed or unsigned int in C/C++, and that it is a variable: an input to the superoptimized code that may hold any value. Reasoning about any-valued variables is hard but solvers are good at it and that is the entire point of the superoptimizer.

The second line tells us that %1 is a new i32 computed by multiplying %0 by -2. The third line tells us that %2 is a new i1 -- a Boolean or 1-bit integer -- computed by seeing if %1 is equal to 1. The last line, starting with "cand", is Souper telling us that it believes %2 will always take the value 0. If Souper tells us this when running on optimized code, it has found a missed optimization. In this case LLVM has missed the fact that multiplying an arbitrary value by an even number can never result in an odd number. Is this a useful optimization to implement in LLVM? I don't know, but GCC does it, see the bottom of this page.

Souper finds many missed optimizations that fit this general pattern:

%0:i32 = var 
%1:i64 = sext %0
%2:i64 = sdiv 2036854775807:i64, %1
%3:i1 = ne 0:i64, %2
cand %3 1:i1

Here the observation is that if we divide a large constant by an arbitrary 32-bit value, the result cannot be zero. GCC does not find this one.

Some Souper output contains path constraints:

%0:i32 = var 
%1:i1 = eq 0:i32, %0
pc %1 1:i1
%2:i32 = addnsw 1:i32, %0
%3:i1 = slt %2, 2:i32
cand %3 1:i1

Here, at line 3, we learn that %1 must take the value 1 in the remaining code due to a path constraint. In the original LLVM code there was a conditional branch exiting if %1 had the value 0. Since %1 has the value 1, we can infer, in the remaining code, that %0 contains 0. Thus, %2 contains 1 and the expression %2 < 2 must evaluate to true. Finally, this charming example exploits the fact that if the product of two numbers is not zero, then neither of the numbers could have been zero:

%0:i32 = var 
%1:i32 = var 
%2:i32 = mul %0, %1
%3:i1 = eq 0:i32, %2
pc %3 0:i1
%4:i1 = eq 0:i32, %0
cand %4 0:i1

One more thing that you might see in the full set of results is an entry like this: %0 = block. This means (more or less) that %0 is a value that is going to pass through the code to a phi node without being otherwise used, this is useful for increasing Souper's precision.

I think that's about all you need to know in order to read the full set of results from a couple days of running Csmith, Souper, and C-Reduce in a loop. First, we wait for Souper to find a missed optimization and then second, we find a minimal C program that exhibits the missed optimization. The results have been ranked in a way that attempts to push more similar results (that are more likely to be duplicates) lower in the list.

So far, the most common pattern that comes out of Souper's findings is that LLVM needs an integer range analysis. Such an analysis would also help eliminate integer overflow checks, one of my hobby horses. LLVM also doesn't always propagate information that would be best represented at the bit level, such as the even/odd distinction required for the first optimization that I discussed. Finally, LLVM does not always learn from branches. My not-necessarily-educated guess is that all of this is a symptom of LLVM's heavy reliance on the instruction combiner, which is not so much an optimization pass as a loose federation of a few hundred peephole passes.

Some of the missing LLVM optimizations won't be hard to implement for people have passable C++ and who have spent some time becoming familiar with the instruction combiner. But here are a few things we need to keep in mind:

  • One might ask: Does it make sense to harvest missed optimizations from randomly generated code? My initial idea was that since Csmith's programs are free from undefined behaviors, the resulting optimizations would be less likely to be evil exploitation of undefined behaviors. But also I did it because it was easy and I was curious what the results would look like. My judgement is the the results are interesting enough to deserve a blog post. Perhaps an easier way to avoid exploiting undefined behavior would be to add a command line option telling Souper to avoid exploiting undefined behaviors.

  • For each missed optimization we should do a cost/benefit analysis. The cost of implementing a new optimization is making LLVM a bit bigger and a bit more likely to contain a bug. The benefit is potential speedup of code that contains the idioms.

  • Although the reduced C programs can be useful, you should look at Souper output first and the C code second. For one thing, the Boolean that Souper finds is sometimes a bit obscured in the C code. For another, the C-Reduce output is somewhat under-parenthesized -- it will test your knowledge of C's operator precedence rules. Finally C-Reduce has missed some opportunities to fold constants, so for example we see ~1 instead of -2 in the 2nd example from the top.

  • Each missed optimization found by Souper should be seen as a member of a class of missed optimizations. So the goal is obviously not to teach LLVM to recognize the specific cases found by Souper, but rather to teach it to be smart about some entire class of optimizations. My belief is that this generalization step can be somewhat automated, but that is a research problem.

  • Although all of the optimizations that I've looked at are correct, there's always the possibility that some of them are wrong, for example due to a bug in Souper or STP.

This article presents some very early results. I hope that it is the beginning of a virtuous circle where Souper and LLVM can both be strengthened over time. It will be particularly interesting to see what kinds of optimizations are missing in LLVM code emitted by rustc, GHC, or llgo.

UPDATE: Here are some bugs that people have filed or fixed in response to these results:

It's very cool that people are acting on this! Please let me know if you know of more results than are listed here.

This Is Not a Defect

In several previous blog entries I’ve mentioned that in some recent versions of C and C++, left-shifting a 1 bit into the high-order bit of a signed integer is an undefined behavior. In other words, if you have code that computes INT_MIN by evaluating 1<<31 (or 1<<(sizeof(int)*CHAR_BIT-1) if you want to be that way) or code that does a byte swap using a subexpression like c<<24, then your program most likely has no meaning according to the standards. And in fact, Clang's integer sanitizer confirms that most non-trivial codes, including several crypto libraries, are undefined according to this rule.

An obvious fix is to tighten up the standard a bit and specify that shifting a 1 into the sign bit has the expected effect, which is what all compilers that I am aware of already do. This is what the C++ standards committee is doing, though as far as I can tell the fix doesn't officially take effect until a TC, a "technical corrigendum," is issued -- and even that doesn't finalize the thing, but it seems near enough in practice.

Anyhow, today Nevin Liber pointed out that there's a bit of news here, which is that last month the C standards committee decided that this same issue is not a defect in C and that they'll reconsider it later on, which I guess is fine since compiler implementers are ignoring this particular undefined behavior, but it seems like a bit of a missed opportunity to (1) make the language slightly saner and (2) bring it into line with the existing practice. Also you might consider perusing the full set of defect reports if you want to be thankful that you did something other than attend a standards meeting last month.

Cedar Mesa

For years I’d heard people talk about Cedar Mesa, a remote part of southern Utah containing so many Anazazi ruins that it’s basically a huge outdoor museum. Recently my family spent a few days exploring this area. Despite the fact that Cedar Mesa is well-known — it was popularized, in large part, by a book by David Roberts in the 1990s — as far as we could tell nobody was camped within several miles of our campsite off of a high-clearance track near the head of Lime Canyon, seen here in the evening light:

April is a great time to be in the desert but this area is pretty high elevation (6400 feet or almost 2000 m) and it was well below freezing on our first night out. Here the sun is finally starting to warm us up the next morning:

Yep, the kids are wearing their snow pants. Later that morning we visited the Moon House, one of the larger ruins in the area. Although the hike to it is short, the route is circuitous, first dropping over a small pouroff, following a ledge around a corner, and then following a talus slope to the bottom of the canyon, passing between some huge boulders in the bottom of the canyon, climbing most of the way up the other side, and following another ledge behind a big pinnacle. There are good views along the way:

The ruins are impressive:

Life in the desert, though a bit sparse, is often pretty:

The next day we hiked in Natural Bridges National Monument; as you might expect it contains some big natural bridges:

And there were other things to see as well:

Although metates (stone mortars) are a common sight on the Colorado Plateau, something I haven’t seen elsewhere are the manos (grinding stones) since they are so easy to pick up and carry away:

We had great weather in the early part of the trip but got chased home a day early by a rainy night with a forecast for more rain: the roads in this part of the world tend to turn into grease that is impassable even with 4WD when they get wet enough, and we really did not want to get stuck while pulling our tent trailer:

All in all a nice short vacation.

I’m slowly ratcheting down the number of personal blog posts but I will continue to throw in this sort of thing every now and then.

Too Much Milk: The Secret Origin Story

When I first taught operating systems 12 years ago, I based my teaching materials on a set of slides inherited from John Carter, the previous instructor at Utah. I was generally happy with these slides, and I’ve continued to evolve them since then, but one thing I was always curious about was the “too much milk” example that is used to introduce concurrency problems. It goes something like: “You see that you’re out of milk and leave to go buy some. A few minutes later your roommate comes home and sees that there’s no milk. Later: OH NO THERE’S TOO MUCH MILK because of a synchronization error between humans.” This always struck me as a bit of an idiosyncratic way to go about introducing synchronization but I didn’t have any reason to replace it and didn’t think about it much until one time while procrastinating on lecture prep I Googled “too much milk” and learned the shocking truth: some large fraction of operating systems courses at research universities use this same example. See this for yourself.

The intriguing possibility is that too much milk could be used as sort of a mitochondrial DNA to trace the evolution of OS course notes. I started to envision creating a tool for analyzing the lineage of Powerpoint files, but then totally forgot about it until a few weeks ago. While chatting with Ryan Stutsman about teaching OS, for some reason I mentioned the Too Much Milk phenomenon. He thought that this had probably originated with his advisor John Ousterhout. John then explained:

I believe that the “too much milk” example owes its popularity to me, in that I have been using it ever since the early 1980s in my courses. For many years in the 1980s and 1990s, whenever a PhD student graduated in the systems area, Dave Patterson gave them a complete set of lecture videos for a couple of systems classes, including my CS 162 lectures. As a result, a lot of the material from my CS 162 lectures has spread all across the country through Berkeley graduates, including the “too much milk” example.

However, I did not invent this example. Before I taught my first operating systems course at Berkeley, Mike Powell gave me a copy of his notes and the “too much milk” example was there. So, it goes back at least to Mike Powell. I don’t know if he invented it, or if he got it from someone else.

So there you have it.

This post doesn’t have too much of a point, but I thought this was a nice story and also it provides a bit of insight into how teaching actually works: figuring out how to teach course material is hard and in practice, most of the time we borrow a framework and adapt it to fit our needs. Every now and then I’ll discuss this in class and invariably one or two undergraduates are so surprised that I’ll get dinged on the course evaluations with comments like “Lazy professor doesn’t even develop his own course materials.” However, I’ve done three or four courses the other way — starting from scratch — and have found that it takes multiple iterations before the material converges, so it’s not really clear to me that it’s better to start from scratch in cases where a good set of existing material can be found.

Hints for Computer System Design

On the last day of my advanced OS course this spring we discussed one of my all-time favorite computer science papers: Butler Lampson’s Hints for Computer System Design. Why is it so good?

  • It’s hard-won advice. Designing systems is not easy — a person can spend a lifetime learning to do it well — and any head start we can get is useful.

  • The advice is broadly applicable and is basically correct, there aren’t really any places where I need to tell students “Yeah… ignore Section 3.5.”

  • There are many, many examples illustrating the hints. Some of them require a bit of historical knowledge (the paper is 30 years old) but by and large the examples stand the test of time.

  • It seems to me that a lot of Lampson’s hints were routinely ignored by the developers of large codes that we use today. I think the reason is pretty obvious: the massive increases in throughput and storage capacity over the last 30 years have permitted a great deal of sloppy code to be created. It’s nice to read a collection of clear thinking about how things could be done better.

Something that bums me out is that it’s now impossible to publish a paper like this at a top conference such as SOSP.