Planning for Disaster

Alan Perlis once said:

I think that it’s extraordinarily important that we in computer science keep fun in computing. When it started out, it was an awful lot of fun. Of course, the paying customers got shafted every now and then, and after a while we began to take their complaints seriously. We began to feel as if we really were responsible for the successful, error-free perfect use of these machines. I don’t think we are.

This is a nice sentiment, perhaps even a defensible one if we interpret it as narrowly talking about academic computer science. On the other hand, probably within 20 years, there’s going to be a major disaster accompanied by the loss of many lives that is caused more or less directly by software defects. I don’t know exactly what will happen, but something related to transportation or SCADA seems likely. At that point we can expect things to hit the fan. I’m not optimistic that, as a group, computer scientists and computing professionals can prevent this disaster from happening: the economic forces driving automation and system integration are too strong. But of course we should try. We also need to think about what we’re going to do if, all of a sudden, a lot of people expect us to start producing computer systems that actually work, and perhaps hold us accountable when we fail to do so.

Obviously I don’t have the answers, but here are a few thoughts.

  • We know that it is possible to create safety-critical software that (largely) works. Generally this happens when the organizations creating the software are motivated not only by market forces, but also by significant regulatory pressure. Markets (and the humans that make them up) are not very good at analyzing low-probability risks. A huge amount of critical software is created by organizations that are subject to very little regulatory pressure.
  • It is difficult to tell people that something is a bad idea when they very much want it to be a good idea. We should get used to doing this, following Parnas’s courageous example.
  • It is difficult to tell people that something is going to be slow and expensive to create, when they very much want it to be quick and cheap. We need to get used to saying that as well.
  • We can and should take responsibility for our work. I was encouraged by the field’s generally very positive response to The Moral Character of Cryptographic Work. Computer scientists and computing professionals almost always think that their particular technology makes the world better or is at worst neutral — but that is clearly not always the case. Some of this could be taught.
  • We need to be educating CS students in methods for creating software that works: testing, specification, code review, debugging, and formal methods. You’d think this would be obvious, but students routinely get a CS degree without having done any sort of serious software testing.

Finally, let’s keep in mind that causes are tricky. A major human disaster usually has a complex network of causes, perhaps simply because any major disaster with a single cause would indicate that the system had been designed very poorly. Atomic Accidents makes it clear that most nuclear accidents have been the result of an anti-serendipitous collection of poor system design, unhappy evolution and maintenance, and failure-enhancing responses by humans.

Acknowledgments: Pascal Cuoq and Kelly Heller gave some great feedback on drafts of this piece.

Do Fiddly Buffer Overruns Matter?

by Pascal Cuoq and John Regehr

Using tis-interpreter we found a “fiddly” buffer overrun in OpenSSL: it only goes a few bytes out of bounds and the bytes it writes have just been read from the same locations. Fiddly undefined behaviors are common and it can be hard to convince developers to fix them, especially in a case such as this where the behavior is intentional. More accurately, the behavior was intentional when the code was written in the 1990s—a simpler age of more interesting computers. Nevertheless, we reported this bug and it was recently fixed in the master branch. The new version of the file is far nicer. The bug is still present in the release branches.

The bug is in the implementation of RC4: a comparatively old cipher that has had a useful life, but is reaching the end of it. Indeed, OpenSSL’s rc4_enc.c was — before the recent fix — some seriously 1990s C code with plentiful use of the preprocessor, design choices based on characteristics of Alphas and UltraSPARCs, and this dodgy OOB-based performance hack. A new paper “Analysing and Exploiting the Mantin Biases in RC4” might be the final nail in the coffin of RC4.

The encryption code in rc4_enc.c causes up to RC4_CHUNK - 1 bytes past the end of the output buffer to be overwritten with values that have been read from the same (out of bounds) locations. There are several variants delineated with #ifdefs. On an ordinary x86-64, RC4_CHUNK is 8 and execution will go through line 217:

for (; len & (0 - sizeof(RC4_CHUNK)); len -= sizeof(RC4_CHUNK)) {
    ichunk = *(RC4_CHUNK *) indata;
    otp = ...
    *(RC4_CHUNK *) outdata = otp ^ ichunk;
    indata += sizeof(RC4_CHUNK);
    outdata += sizeof(RC4_CHUNK);
}

This loop consumes all the complete 8-byte words found in the input buffer, and writes as many words into the output buffer. So far so good. Things get interesting after the loop:

if (len) {
    RC4_CHUNK mask = (RC4_CHUNK) - 1, ochunk;

    ichunk = *(RC4_CHUNK *) indata;  // !
    ochunk = *(RC4_CHUNK *) outdata; // !
    ...
    mask >>= (sizeof(RC4_CHUNK) - len) << 3;
               
    ... something about ultrix ...

    ochunk &= ~mask;
    ochunk |= (otp ^ ichunk) & mask;
    *(RC4_CHUNK *) outdata = ochunk; // !!

If there remain between one and seven unprocessed bytes, these are taken care of by reading an entire 8-byte word from indata, reading an entire word from outdata, computing a new word made from encrypted input and original out-of-bound values, and writing that word to outdata.
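
To make the tail masking concrete, here is a small standalone sketch (ours, not OpenSSL code) that computes the mask for a three-byte tail, assuming RC4_CHUNK is a 64-bit unsigned type as it is on x86-64:

#include <inttypes.h>
#include <stdio.h>

typedef uint64_t RC4_CHUNK; /* assumption: an 8-byte chunk, as on x86-64 */

int main(void) {
    size_t len = 3;                          /* three trailing bytes left to encrypt */
    RC4_CHUNK mask = (RC4_CHUNK)-1;          /* 0xffffffffffffffff */
    mask >>= (sizeof(RC4_CHUNK) - len) << 3; /* shift right by (8 - 3) * 8 = 40 bits */
    printf("mask = %016" PRIx64 "\n", mask); /* prints mask = 0000000000ffffff */
    return 0;
}

On a little-endian machine, the low three bytes of the final eight-byte store carry encrypted output, while the other five bytes are simply rewritten with the values that were just read from beyond the end of the buffer.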

What could go wrong? At first glance it seems like this overrun could trigger an unwanted page fault, but that isn’t the case on architectures where an aligned word never crosses a page boundary. However, in a concurrent environment, the illegal writing of illegally read data can be directly observed. We wrote a small program that shows this. The key is a struct that places important data directly after a buffer that is the destination of RC4-encrypted data:

struct stuff {
    unsigned char data[data_len];
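    /* data_len is a constant defined elsewhere in the test program; for the
       overrun to reach `important`, it must not be a multiple of 8 */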
    int important;
};

Then, one thread repeatedly puts RC4-encrypted data into the buffer:

void *thread2(void *arg) {
    RC4_KEY key;
    const char *key_data = "Hello there.";
    RC4_set_key(&key, strlen(key_data), (const unsigned char *)key_data);
    while (!stop)
        RC4(&key, data_len, src->data, dst->data);
    return 0;
}

While the other thread waits for the important field to be corrupted:

void *thread1(void *arg) {
    int x = 0;
    while (!stop) {
        dst->important = ++x;
        check(&dst->important, x);
    }
    return 0;
}

The check() function is compiled separately to help ensure that the value being checked comes from RAM rather than being cached in a register:

void check(int *addr, int val) { assert(*addr == val); }

Our complete code is on GitHub. If you run it, you should see the assertion being violated almost immediately. We’ve tested against OpenSSL 1.0.2f on Linux and OS X. On Linux you must make sure to get the C implementation of RC4 instead of the assembly one:

./Configure no-asm linux-x86_64
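
With the library built that way, linking the test program against the static libcrypto might look something like this (the file names and paths here are hypothetical):

gcc -O2 -pthread rc4_overrun_test.c -I/path/to/openssl/include /path/to/openssl/libcrypto.a -ldl -o rc4_overrun_test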

Although any of tis-interpreter, Valgrind, or ASan can find the OpenSSL RC4 bug when the encrypted data bumps up against the end of an allocated block of memory, none of them finds the bug here, since the “important” field absorbs the OOB accesses. There is still undefined behavior, but it’s subtle: accessing the buffer memory via a pointer-to-long violates C’s strict aliasing rules.
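
For contrast, here is a minimal sketch (our construction, not part of the test program) in which the destination buffer is an exactly-sized heap allocation, so the word-sized tail accesses run past the end of the malloc’d blocks; built against the no-asm library with the sanitizer instrumenting libcrypto as well, or run under Valgrind, this variant should be flagged as a heap buffer overflow:

#include <openssl/rc4.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const size_t n = 41;            /* deliberately not a multiple of 8 */
    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n); /* exactly sized: no slack after the buffer */
    memset(src, 'A', n);

    RC4_KEY key;
    const char *key_data = "Hello there.";
    RC4_set_key(&key, (int)strlen(key_data), (const unsigned char *)key_data);
    RC4(&key, n, src, dst);         /* the tail code reads and writes past src + n and dst + n */

    free(src);
    free(dst);
    return 0;
}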

So, do fiddly buffer overruns matter? Well, this particular bug is unlikely to have an observable effect on programs written in good faith, though one of us considered submitting, as an Underhanded Crypto Challenge entry, a backdoored chat program with end-to-end encryption that, each time the race condition triggers, sends a copy of the private message being processed to Eve, encrypted with Eve’s key. Alas, the race window is very short. A fiddly OOB can also be a problem if the compiler notices it. In an example from Raymond Chen’s blog, GCC uses the OOB array reference as an excuse to compile this function to “return true”:

int table[4];
bool exists_in_table(int v) {
    for (int i = 0; i <= 4; i++) {
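        /* when i == 4 this reads table[4], one element past the end of the
           array; the compiler is allowed to assume that this UB never
           happens, i.e. that some earlier iteration already returned true */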
        if (table[i] == v) return true;
    }
    return false;
}

That could only happen here if we used link-time optimization to give the compiler access to both the application and the OpenSSL code at once.