Do Fiddly Buffer Overruns Matter?

by Pascal Cuoq and John Regehr

Using tis-interpreter we found a “fiddly” buffer overrun in OpenSSL: it only goes a few bytes out of bounds, and the bytes it writes have just been read from the same locations. Fiddly undefined behaviors are common, and it can be hard to convince developers to fix them, especially in a case such as this one, where the behavior is intentional. More accurately, the behavior was intentional when the code was written in the 1990s — a simpler age of more interesting computers. Nevertheless, we reported this bug and it was recently fixed in the master branch. The new version of the file is far nicer. The bug is still present in the release branches.

The bug is in the implementation of RC4: a comparatively old cipher that has had a useful life, but is reaching the end of it. Indeed, OpenSSL’s rc4_enc.c was — before the recent fix — some seriously 1990s C code with plentiful use of the preprocessor, design choices based on characteristics of Alphas and UltraSPARCs, and this dodgy OOB-based performance hack. A new paper “Analysing and Exploiting the Mantin Biases in RC4” might be the final nail in the coffin of RC4.

The encryption code in rc4_enc.c causes up to RC4_CHUNK - 1 bytes past the end of the output buffer to be overwritten with values that have been read from the same (out of bounds) locations. There are several variants delineated with #ifdefs. On an ordinary x86-64, RC4_CHUNK is 8 and execution will go through line 217:

for (; len & (0 - sizeof(RC4_CHUNK)); len -= sizeof(RC4_CHUNK)) {
    ichunk = *(RC4_CHUNK *) indata;
    otp = ...
    *(RC4_CHUNK *) outdata = otp ^ ichunk;
    indata += sizeof(RC4_CHUNK);
    outdata += sizeof(RC4_CHUNK);
}

This loop consumes all the complete 8-byte words found in the input buffer, and writes as many words into the output buffer. So far so good. Things get interesting after the loop:

if (len) {
    RC4_CHUNK mask = (RC4_CHUNK) - 1, ochunk;

    ichunk = *(RC4_CHUNK *) indata;  // !
    ochunk = *(RC4_CHUNK *) outdata; // !
    ...
    mask >>= (sizeof(RC4_CHUNK) - len) << 3;
               
    ... something about ultrix ...

    ochunk &= ~mask;
    ochunk |= (otp ^ ichunk) & mask;
    *(RC4_CHUNK *) outdata = ochunk; // !!
}

If there remain between one and seven unprocessed bytes, these are taken care of by reading an entire 8-byte word from indata, reading an entire word from outdata, computing a new word made from encrypted input and original out-of-bound values, and writing that word to outdata.
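
To make the mask arithmetic concrete, here is a hedged, standalone sketch (assuming a little-endian machine and a 64-bit RC4_CHUNK; the byte values are made up) of what happens when three bytes of input remain:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t RC4_CHUNK;

int main(void) {
    size_t len = 3;                          /* three bytes left to encrypt */
    RC4_CHUNK mask = (RC4_CHUNK)-1;
    mask >>= (sizeof(RC4_CHUNK) - len) << 3; /* 0x0000000000ffffff */

    RC4_CHUNK ochunk = 0x1122334455667788;   /* word read from outdata: low
                                                bytes 88 77 66 are real output,
                                                high bytes 55..11 are OOB */
    RC4_CHUNK enc = 0xaabbccddeeff0102;      /* otp ^ ichunk, made up here */

    ochunk &= ~mask;                         /* keep the five OOB bytes */
    ochunk |= enc & mask;                    /* encrypt the three real bytes */

    printf("mask   = %016llx\n", (unsigned long long)mask);
    printf("ochunk = %016llx\n", (unsigned long long)ochunk);
    /* prints:  mask   = 0000000000ffffff
                ochunk = 1122334455ff0102   */
    return 0;
}

The final store of ochunk then writes all eight bytes back, including the five bytes beyond the end of the buffer, which were preserved bit-for-bit. That preservation is what makes the overrun look harmless, and it is exactly what the race described below undermines.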

What could go wrong? At first glance it seems like this overrun could trigger an unwanted page fault, but that isn’t the case on architectures where an aligned word never crosses a page boundary. However, in a concurrent environment, the illegal writing of illegally read data can be directly observed. We wrote a small program that shows this. The key is a struct that places important data directly after a buffer that is the destination of RC4-encrypted data:

struct stuff {
    unsigned char data[data_len];
    int important;
};

Then, one thread repeatedly puts RC4-encrypted data into the buffer:

void *thread2(void *arg) {
    RC4_KEY key;
    const char *key_data = "Hello there.";
    RC4_set_key(&key, strlen(key_data), (const unsigned char *)key_data);
    while (!stop)
        RC4(&key, data_len, src->data, dst->data);
    return 0;
}

While the other thread waits for the important field to be corrupted:

void *thread1(void *arg) {
    int x = 0;
    while (!stop) {
        dst->important = ++x;
        check(&dst->important, x);
    }
    return 0;
}

The check() function is compiled separately to help ensure that the value being checked comes from RAM rather than being cached in a register:

void check(int *addr, int val) { assert(*addr == val); }

Our complete code is on GitHub. If you run it, you should see the assertion being violated almost immediately. We’ve tested against OpenSSL 1.0.2f on Linux and OS X. On Linux you must make sure to get the C implementation of RC4 instead of the assembly one:

./Configure no-asm linux-x86_64

Although any of tis-interpreter, Valgrind, or ASan can find the OpenSSL RC4 bug when the encrypted data bumps up against the end of an allocated block of memory, none of them finds the bug here, since the “important” field absorbs the OOB accesses. There’s still undefined behavior, but it’s subtle: accessing buffer memory via a pointer-to-long is a violation of C’s strict aliasing rules.
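
For reference, the tail can be handled without either problem. Here is a hedged sketch (not the actual fix, which rewrote the file); it assumes the keystream word otp has been computed just as in the code above:

if (len) {
    RC4_CHUNK ichunk = 0, ochunk;
    memcpy(&ichunk, indata, len);  /* read only the len valid bytes */
    ochunk = otp ^ ichunk;
    memcpy(outdata, &ochunk, len); /* write only the len valid bytes */
}

Because memcpy stays within the buffers and accesses memory as unsigned char, this version has neither the overrun nor the aliasing violation.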

So, do fiddly buffer overruns matter? Well, this particular bug is unlikely to have an observable effect on programs written in good faith, though one of us considered submitting an Underhanded Crypto Challenge entry: a backdoored chat program with end-to-end encryption that, each time the race condition triggers, sends Eve a copy of the private message being processed, encrypted with Eve’s key. Alas, the race window is very short. A fiddly OOB can also be a problem if the compiler notices it. In an example from Raymond Chen’s blog, GCC uses the OOB array reference as an excuse to compile this function to “return true”:

int table[4];
bool exists_in_table(int v) {
    for (int i = 0; i <= 4; i++) {
        if (table[i] == v) return true;
    }
    return false;
}

That could only happen here if we used link-time optimization to give the compiler access to both the application and the OpenSSL code at once.

The Problem with Friendly C

I’ll assume you’re familiar with the Proposal for Friendly C and perhaps also Dan Bernstein’s recent call for a Boring C compiler. Both proposals are reactions to the creeping exploitation of undefined behaviors as C/C++ compilers get better optimizers. In contrast to that trend, we want old code to just keep working, with latent bugs remaining latent.

After publishing the Friendly C Proposal, I spent some time discussing its design with people, and eventually I came to the depressing conclusion that there’s no way to get a group of C experts — even if they are knowledgeable, intelligent, and otherwise reasonable — to agree on the Friendly C dialect. There are just too many variations, each with its own set of performance tradeoffs, for consensus to be possible. To get a taste of this, notice that in the comments for the Friendly C post, several people disagree with what I would consider an extremely non-controversial design choice for Friendly C: memcpy() should have memmove() semantics. Another example is what should be done when a 32-bit integer is shifted by 32 places (this is undefined behavior in C and C++). Stephen Canon pointed out on Twitter that there are many programs typically compiled for ARM that would fail if this produced something besides 0, and there are also many programs typically compiled for x86 that would fail when this evaluates to something other than the original value. So what is the friendly thing to do here? I’m not even sure; perhaps the baseline should be “do what gcc 3 would do on that target platform.” The situation gets worse when we start talking about a friendly semantics for races and OOB array accesses.
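
To see the platform split concretely, here is a hedged example; the behaviors in the comments are what the hardware typically does, not anything the C standard promises:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    volatile uint32_t amount = 32;      /* volatile keeps the compiler from
                                           folding the shift at compile time */
    uint32_t r = (uint32_t)1 << amount; /* undefined: count equals bit width */
    /* x86 shift instructions use only the low 5 bits of a 32-bit shift
       count, so this typically prints 1; ARM's barrel shifter uses the low
       byte of the register, so it typically prints 0. */
    printf("%u\n", r);
    return 0;
}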

I don’t even want to entertain the idea of a big family of Friendly C dialects, each making some niche audience happy; that is not really an improvement over our current situation.

Luckily there’s an easy way forward, which is to skip the step where we try to get consensus. Rather, an influential group such as the Android team could create a friendly C dialect and use it to build the C code (or at least the security-sensitive C code) in their project. My guess is that if they did a good job choosing the dialect, others would start to use it, and at some point it becomes important enough that the broader compiler community can start to help figure out how to better optimize Friendly C without breaking its guarantees, and maybe eventually the thing even gets standardized. There’s precedent for organizations providing friendly semantics; Microsoft, for example, provides stronger-than-specified semantics for volatile variables by default on platforms other than ARM.

Since we published the Friendly C proposal, people have been asking me how it’s going. This post is a long-winded way of saying that I lost faith in my ability to push the work forward. However, I still think it’s a great idea and that there are people besides me who can make it happen.

Reducers are Fuzzers

A test case isn’t just a test case: it lives in the (generally extremely large) space of inputs for the software system you are testing. If we have a test case that triggers a bug, here’s one way we can look at it: the test case is a single point inside the subset of the input space whose elements trigger that bug.

The set of test cases triggering a bug is a useful notion since we can search it. For example, a test case reducer is a program that searches for the smallest test case triggering a bug. It requires a way to transform a test case into a smaller one, for example by deleting part of it. The new variant of the test case may or may not trigger the bug. The process goes like this: produce a smaller variant, check whether it still triggers the bug, keep it only if it does, and repeat until no transformation makes progress.
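
Here is a hedged sketch of the skeleton of such a search: a greedy loop that tries deleting chunks of decreasing size and keeps any deletion that is still “interesting,” i.e., still triggers the bug. Real reducers such as C-Reduce apply many smarter transformations, but they share this shape:

#include <string.h>

typedef int (*interesting_fn)(const char *test, size_t len);

size_t reduce(char *test, size_t len, interesting_fn interesting) {
    for (size_t chunk = len / 2; chunk > 0; chunk /= 2) {
        size_t pos = 0;
        while (pos + chunk <= len) {
            char saved[chunk]; /* a VLA is fine for a sketch */
            memcpy(saved, test + pos, chunk);
            /* tentatively delete chunk bytes at pos */
            memmove(test + pos, test + pos + chunk, len - pos - chunk);
            if (interesting(test, len - chunk)) {
                len -= chunk;  /* keep the smaller variant */
            } else {
                /* undo the deletion and move on */
                memmove(test + pos + chunk, test + pos, len - pos - chunk);
                memcpy(test + pos, saved, chunk);
                pos += chunk;
            }
        }
    }
    return len;
}

Every variant that fails the interestingness test is thrown away, but as we are about to see, those discarded variants have a life of their own.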

I’ve spent a lot of time watching reducers run, and one thing I’ve noticed is that the reduction process often triggers bugs unrelated to the bug that is the subject of the reduction.

Sometimes this is undesirable, such as when a lax interestingness test permits the reduction of one bug to get hijacked by a different bug. This happens all the time when reducing segfaults, which are hard to tell apart. But on the other hand, if we’re looking for bugs then this phenomenon is a useful one.

It seems a bit counterintuitive that test case reduction would lead to the discovery of new bugs since we might expect that the space of inputs to a well-tested software system is mostly non-bug-triggering with a few isolated pockets of bug-triggering inputs scattered here and there. I am afraid that that view might not be realistic. Rather, all of the inputs we usually see occupy a tiny portion of the space of inputs, and that tiny portion is surrounded by huge overlapping clouds of bug-triggering inputs. Fuzzers can push the boundaries of the space of inputs that we can test, but not by as much as people generally think. Proofs remain the only way to actually show that a piece of software does the right thing on any significant chunk of its input space. But I digress. The important fact is that reducers are decently effective mutation-based fuzzers.

In the rest of this post I’d like to push that idea a bit farther by doing a reduction that doesn’t correspond to a bug and seeing if we can find some bugs along the way. We’ll start with this exciting C++ program:

#include <iostream>

int main() {
  std::cout << "Hello World!++" << std::endl;
}

and we’ll reduce it under the criterion that it remains a viable Hello World implementation. First we’ll preprocess and then use delta’s topformflat to smoosh everything together nicely:

g++ -std=c++11 -E -P -O3 hello-orig.cpp | topformflat > hello.cpp

You might be saying to yourself something like “it sure is stupid that John is passing an optimization flag to the preprocessor,” but trust me that it actually does change the emitted code. I didn’t check if it makes a difference here but I’ve learned to avoid trouble by just passing the same flags to the preprocessor as to the compiler.

Anyhow, the result is 550 KB of goodness.

Here’s the code that checks if a variant is interesting — that is, if it acts like a Hello World implementation:

g++ -O3 -std=c++11 -w hello.cpp >/dev/null 2>&1 &&
ulimit -t 1 &&
./a.out | grep Hello

The ulimit is necessary because infinite loops sometimes get into the program that is being reduced.

To find compiler crashes we’ll need a somewhat more elaborate test:

if
  g++ -O3 -std=c++11 -w hello.cpp >compiler.out 2>&1
then
  ulimit -t 1 &&
  ./a.out | grep Hello
else
  if
    grep 'internal compiler error' compiler.out
  then
    exit 101
  else
    exit 1
  fi
fi

When the compiler fails we look at its output and, if it contains evidence of a compiler bug, exit with code 101, which will tell C-Reduce that it should save a copy of the input files that made this happen.

The compiler we’ll use is g++ r231221, the development head from December 3, 2015. Let’s get things going:

creduce --nokill --also-interesting 101 --no-default-passes \
  --add-pass pass_clex rm-tok-pattern-4 10 ../test.sh hello.cpp

The --also-interesting 101 option indicates that the interestingness test will use process exit code 101 to tell C-Reduce to make a snapshot of the directory containing the files being reduced, so we can look at it later. --no-default-passes clears C-Reduce’s pass schedule and --add-pass pass_clex rm-tok-pattern-4 10 adds a single pass that uses a small sliding window to remove tokens from the test case. The issue here is that not all of C-Reduce’s passes are equally effective at finding bugs. Some passes, such as the one that removes dead variables and the one that removes dead functions, will probably never trigger a compiler bug. Other passes, such as the one that removes chunks of lines from a test case, eliminate text from the test case so rapidly that effective fuzzing doesn’t happen. There are various ways to deal with this problem, such as probabilistically rejecting improvements or rejecting improvements that are too large, but for this post I’ve chosen the simple expedient of running just one pass that (1) makes progress very slowly and (2) seems to be a good fuzzer.

The dynamics of this sort of run are interesting: as the test case walks around the space of programs, you can actually see it brush up against a compiler bug and then wander off into the weeds again.

The results are decent: after about 24 hours, C-Reduce caused many segfaults in g++ and triggered six different internal compiler errors: 1, 2, 3, 4, 5, 6. One of these was already reported, another looks like a probable duplicate, and four appear to be new.

I did a similar run against Clang++ r254595, again the development head from December 3. This produced segfaults and also triggered 25 different assertions:

llvm/include/llvm/Support/Casting.h:230: typename cast_retty::ret_type llvm::cast(Y &) [X = clang::FunctionProtoType, Y = clang::QualType]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed.
llvm/include/llvm/Support/Casting.h:95: static bool llvm::isa_impl_cl::doit(const From *) [To = clang::FunctionTemplateDecl, From = const clang::Decl *]: Assertion `Val && "isa<> used on a null pointer"' failed.
llvm/tools/clang/lib/AST/Decl.cpp:2134: clang::APValue *clang::VarDecl::evaluateValue(SmallVectorImpl &) const: Assertion `!Init->isValueDependent()' failed.
llvm/tools/clang/lib/AST/Decl.cpp:2181: bool clang::VarDecl::checkInitIsICE() const: Assertion `!Init->isValueDependent()' failed.
llvm/tools/clang/lib/AST/ExprCXX.cpp:451: static clang::DependentScopeDeclRefExpr *clang::DependentScopeDeclRefExpr::Create(const clang::ASTContext &, clang::NestedNameSpecifierLoc, clang::SourceLocation, const clang::DeclarationNameInfo &, const clang::TemplateArgumentListInfo *): Assertion `QualifierLoc && "should be created for dependent qualifiers"' failed.
llvm/tools/clang/lib/AST/../../include/clang/AST/TypeNodes.def:98: clang::TypeInfo clang::ASTContext::getTypeInfoImpl(const clang::Type *) const: Assertion `!T->isDependentType() && "should not see dependent types here"' failed.
llvm/tools/clang/lib/CodeGen/CodeGenModule.cpp:623: llvm::StringRef clang::CodeGen::CodeGenModule::getMangledName(clang::GlobalDecl): Assertion `II && "Attempt to mangle unnamed decl."' failed.
llvm/tools/clang/lib/CodeGen/../../include/clang/AST/Expr.h:134: void clang::Expr::setType(clang::QualType): Assertion `(t.isNull() || !t->isReferenceType()) && "Expressions can't have reference type"' failed.
llvm/tools/clang/lib/Lex/PPCaching.cpp:101: void clang::Preprocessor::AnnotatePreviousCachedTokens(const clang::Token &): Assertion `CachedTokens[CachedLexPos-1].getLastLoc() == Tok.getAnnotationEndLoc() && "The annotation should be until the most recent cached token"' failed.
llvm/tools/clang/lib/Parse/../../include/clang/Parse/Parser.h:2256: void clang::Parser::DeclaratorScopeObj::EnterDeclaratorScope(): Assertion `!EnteredScope && "Already entered the scope!"' failed.
llvm/tools/clang/lib/Sema/../../include/clang/AST/DeclTemplate.h:1707: void clang::ClassTemplateSpecializationDecl::setInstantiationOf(clang::ClassTemplatePartialSpecializationDecl *, const clang::TemplateArgumentList *): Assertion `!SpecializedTemplate.is() && "Already set to a class template partial specialization!"' failed.
llvm/tools/clang/lib/Sema/../../include/clang/Sema/Lookup.h:460: clang::NamedDecl *clang::LookupResult::getFoundDecl() const: Assertion `getResultKind() == Found && "getFoundDecl called on non-unique result"' failed.
llvm/tools/clang/lib/Sema/SemaDecl.cpp:10455: clang::Decl *clang::Sema::ActOnParamDeclarator(clang::Scope *, clang::Declarator &): Assertion `S->isFunctionPrototypeScope()' failed.
llvm/tools/clang/lib/Sema/SemaDeclCXX.cpp:11373: ExprResult clang::Sema::BuildCXXDefaultInitExpr(clang::SourceLocation, clang::FieldDecl *): Assertion `Lookup.size() == 1' failed.
llvm/tools/clang/lib/Sema/SemaExpr.cpp:2274: ExprResult clang::Sema::ActOnIdExpression(clang::Scope *, clang::CXXScopeSpec &, clang::SourceLocation, clang::UnqualifiedId &, bool, bool, std::unique_ptr, bool, clang::Token *): Assertion `R.getAsSingle() && "There should only be one declaration found."' failed.
llvm/tools/clang/lib/Sema/SemaExprCXX.cpp:2272: clang::FunctionDecl *clang::Sema::FindUsualDeallocationFunction(clang::SourceLocation, bool, clang::DeclarationName): Assertion `Matches.size() == 1 && "unexpectedly have multiple usual deallocation functions"' failed.
llvm/tools/clang/lib/Sema/SemaExprCXX.cpp:6663: ExprResult clang::Sema::CorrectDelayedTyposInExpr(clang::Expr *, clang::VarDecl *, llvm::function_ref): Assertion `TyposInContext < ~0U && "Recursive call of CorrectDelayedTyposInExpr"' failed.
llvm/tools/clang/lib/Sema/SemaExprMember.cpp:91: IMAKind ClassifyImplicitMemberAccess(clang::Sema &, const clang::LookupResult &): Assertion `!R.empty() && (*R.begin())->isCXXClassMember()' failed.
llvm/tools/clang/lib/Sema/SemaLookup.cpp:1904: bool clang::Sema::LookupQualifiedName(clang::LookupResult &, clang::DeclContext *, bool): Assertion `(!isa(LookupCtx) || LookupCtx->isDependentContext() || cast(LookupCtx)->isCompleteDefinition() || cast(LookupCtx)->isBeingDefined()) && "Declaration context must already be complete!"' failed.
llvm/tools/clang/lib/Sema/SemaLookup.cpp:2729: Sema::SpecialMemberOverloadResult *clang::Sema::LookupSpecialMember(clang::CXXRecordDecl *, clang::Sema::CXXSpecialMember, bool, bool, bool, bool, bool): Assertion `CanDeclareSpecialMemberFunction(RD) && "doing special member lookup into record that isn't fully complete"' failed.
llvm/tools/clang/lib/Sema/SemaOverload.cpp:11671: ExprResult clang::Sema::CreateOverloadedBinOp(clang::SourceLocation, unsigned int, const clang::UnresolvedSetImpl &, clang::Expr *, clang::Expr *): Assertion `Result.isInvalid() && "C++ binary operator overloading is missing candidates!"' failed.
llvm/tools/clang/lib/Sema/SemaTemplate.cpp:2906: ExprResult clang::Sema::BuildTemplateIdExpr(const clang::CXXScopeSpec &, clang::SourceLocation, clang::LookupResult &, bool, const clang::TemplateArgumentListInfo *): Assertion `!R.empty() && "empty lookup results when building templateid"' failed.
llvm/tools/clang/lib/Sema/SemaTemplateDeduction.cpp:609: (anonymous namespace)::PackDeductionScope::PackDeductionScope(clang::Sema &, clang::TemplateParameterList *, SmallVectorImpl &, clang::sema::TemplateDeductionInfo &, clang::TemplateArgument): Assertion `!Packs.empty() && "Pack expansion without unexpanded packs?"' failed.
llvm/tools/clang/lib/Sema/SemaTemplateInstantiate.cpp:2781: llvm::PointerUnion *clang::LocalInstantiationScope::findInstantiationOf(const clang::Decl *): Assertion `isa(D) && "declaration not instantiated in this scope"' failed.
llvm/tools/clang/lib/Sema/SemaTemplateVariadic.cpp:290: bool clang::Sema::DiagnoseUnexpandedParameterPack(clang::Expr *, clang::Sema::UnexpandedParameterPackContext): Assertion `!Unexpanded.empty() && "Unable to find unexpanded parameter packs"' failed.

I have to admit that I felt a bit overwhelmed by 25 potential bug reports, and I haven’t reported any of these yet. My guess is that a number of them are already in the bugzilla since people have been fuzzing Clang lately. Anyway, I’ll try to get around to reducing and reporting these. Really, this all needs to be automated so that when subsequent reductions find still more bugs, these just get added to the queue of reductions to run.

If you were interested in reproducing these results, or in trying something similar, you would want to use C-Reduce’s master branch. I ran everything on an Ubuntu 14.04 box. While preparing this post I found that different C-Reduce command line options produced widely varying numbers and kinds of crashes.

Regarding previous work, I believe — but couldn’t find written down — that the CERT BFF watches out for new crashes when running its reducer. In a couple of papers written by people like Alex Groce and me, we discussed the fact that reducers often slip from triggering one bug to another.

The new thing in this post is to show that triggering new bugs while reducing isn’t just some side effect. Rather, we can go looking for trouble, and we can do it without being given a bug to reduce in the first place. A key enabler for easy bug-finding with C-Reduce was finding a simple communication mechanism by which the interestingness test can give C-Reduce a bit of out-of-band information that a variant should be saved for subsequent inspection. I’m not trying to claim that reducers are awesome fuzzers, but on the other hand, it might be a little bit awesome that mutating Hello World resulted in triggering 25 different assertion violations in a mature and high-quality compiler. I bet we’d have done even better by starting with a nice big fat Boost application.

Latency Numbers Every Professor Should Know

    Email from student ............................ 20 sec
    Person at office door  ......................... 8 min
    Other interruption ............................ 20 min
    Twitter or something seems really important ... 45 min
    Anxiety about deadlines ........................ 1 hr
    A meeting ...................................... 2 hrs
    A meeting you forgot about ..................... 1 day
    A class to teach ............................... 2 days
    Request to review a paper ...................... 3 days
    Request to write evaluation letter ............. 6 days
    Stuff to grade ................................. 1 wk
    Unsolicited time management advice arrives ..... 2 wks
    Fire alarm clears building ..................... 3 wks
    Travel to conference ........................... 5 wks
    Paper deadline ................................. 6 wks
    Grades due .................................... 16 wks
    Grant proposals due ........................... 26 wks
    Summer ......................................... 1 yr
    Sabbatical ..................................... 7 yrs = 2.208e+17 ns

With apologies to the folks who published latency numbers every programmer should know.

Multi-Version Execution Defeats a Compiler-Bug-Based Backdoor

[This piece is jointly authored by Cristian Cadar, Luís Pina, and John Regehr]

What should you do if you’re worried that someone might have exploited a compiler bug to introduce a backdoor into code that you are running? One option is to find a bug-free compiler. Another is to run versions of the code produced by multiple compilers and to compare the results (of course, under the additional assumption that the same bug does not affect all the compilers). For some programs, such as those whose only effect is to produce a text file, comparing the output is easy. For others, such as servers, this is more difficult and specialized system support is required.

Today we’ll look at using Varan the Unbelievable to defeat the sudo backdoor from the PoC||GTFO article. Varan is a multi-version execution system that exploits the fact that if you have some unused cores, running additional copies of a program can be cheap. Varan designates a leader process whose system call activity is recorded in a shared ring buffer, and one or more follower processes that read results out of the ring buffer instead of actually issuing system calls.

Compilers have a lot of freedom while generating code, but the sequence of system calls executed by a program represents its external behaviour, and in most cases the compiler is not free to change it at all. There might be slight variations, e.g., due to different compilers using different libraries, but these can be easily handled by Varan. Since all correctly compiled variants of a program should have the same external behaviour, any divergence in the sequence of system calls across versions flags a potential security attack, in which case Varan stops the program before any harm is done.
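
Varan’s mechanism is far more efficient than this, but the core idea of syscall-level comparison can be sketched with plain ptrace. The toy monitor below (an illustration, emphatically not Varan) runs two executables and kills both at the first divergence in syscall numbers; it is Linux/x86-64 only and ignores signal stops and other ptrace corner cases:

#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t spawn_traced(char *path) {
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        execl(path, path, (char *)0);
        _exit(127);
    }
    waitpid(pid, 0, 0); /* child stops when it reaches execl() */
    return pid;
}

/* Run pid to its next system-call entry; return the syscall number,
   or -1 if the process is gone. */
static long next_syscall(pid_t pid) {
    int status;
    struct user_regs_struct regs;
    if (ptrace(PTRACE_SYSCALL, pid, 0, 0) == -1)
        return -1;
    waitpid(pid, &status, 0);
    if (!WIFSTOPPED(status))
        return -1;
    ptrace(PTRACE_GETREGS, pid, 0, &regs);
    long nr = (long)regs.orig_rax;
    ptrace(PTRACE_SYSCALL, pid, 0, 0); /* skip the syscall-exit stop */
    waitpid(pid, &status, 0);
    return nr;
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s version1 version2\n", argv[0]);
        return 1;
    }
    pid_t leader = spawn_traced(argv[1]);
    pid_t follower = spawn_traced(argv[2]);
    for (;;) {
        long a = next_syscall(leader);
        long b = next_syscall(follower);
        if (a != b) {
            fprintf(stderr, "divergence: syscall %ld vs %ld\n", a, b);
            kill(leader, SIGKILL);
            kill(follower, SIGKILL);
            return 2;
        }
        if (a == -1)
            return 0; /* both versions exited in lockstep */
    }
}

A real system also has to compare syscall arguments, tolerate benign divergences, and not run everything at ptrace speed, which is what Varan’s ring buffer design is about.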

Typically, Varan runs the leader process at full speed while also recording the results of its system calls into the ring buffer. However, when used in a security-sensitive setting, Varan can designate some system calls as blocking, meaning that the leader cannot execute those syscalls until all followers have reached that same program point without diverging. For sudo, we designate execve as blocking, since that is a point at which sudo might perform an irrevocably bad action.

So here’s the setup:

  1. We have a patched version of sudo 1.8.13 from the PoC||GTFO article. It runs correctly and securely when compiled by a correct C compiler, but improperly gives away root privileges when compiled by Clang 3.3 because the patch was designed to trigger a wrong-code bug in that compiler.
  2. We are going to pretend that we don’t know about the Clang bug and the backdoor. We compile two versions of the patched sudo: one with Clang 3.3, the other with the default system compiler, GCC 4.8.4.
  3. We run these executables under Varan. Since the critical system call execve is blocking, it doesn’t much matter which version is the leader and which is the follower.

Now let’s visit an Ubuntu 14.04 VM where both versions of sudo (setuid root, of course) and Varan are installed. We’re using a user account that is not in the sudoers file — it should not be allowed to get root privileges under any circumstances. First let’s make sure that a sudo that was properly compiled (using GCC) works as expected:

$ /home/varan/sudo-1.8.13/install/bin/sudo-gcc cat /etc/shadow
Password: 
test is not in the sudoers file.  This incident will be reported.

Next, we make sure that the backdoor is functioning as intended:

$ /home/varan/sudo-1.8.13/install/bin/sudo-clang cat /etc/shadow
Password: 
root:!:16693:0:99999:7:::
...
test:$6$nL9rosCD$A.prR28WWZYgDHy.m6ovWYaRhP/OhOX7V9nDDirm9ZffjTWhLdJ7C171wTfLBpQNC2EoeEKAJpcEz9gV9ravL/:16693:0:99999:7:::
messagebus:*:16757:0:99999:7:::
colord:*:16757:0:99999:7:::

So far so good. Next let’s try the gcc-compiled sudo as the leader with the backdoored sudo as the follower:

$ vx-suid /home/varan/sudo-1.8.13/install/bin/sudo-gcc \
          /home/varan/sudo-1.8.13/install/bin/sudo-clang -- cat /etc/shadow
Password: 
test is not in the sudoers file.  This incident will be reported.
$

What happened here is that the gcc-compiled leader runs as before, since it doesn’t ever try to execute an execve call. When the backdoored follower tries to execute the malicious execve call, Varan detects the divergence and terminates both processes safely.

Now let’s try switching around the leader and follower, i.e., run the backdoored sudo as the leader with the gcc-compiled sudo as the follower:

$ vx-suid /home/varan/sudo-1.8.13/install/bin/sudo-clang \
          /home/varan/sudo-1.8.13/install/bin/sudo-gcc -- cat /etc/shadow
Password: 
$

This time the leader tries to execute the malicious execve call, and Varan blocks its execution until the follower reaches the same system call or diverges. In this case, the follower tries to execute a write system call (to print “test is not in the sudoers file...”) and thus Varan detects divergence and again terminates execution safely.

In this example, we only ran two versions in parallel, but Varan can run more than two versions. In terms of performance and resource utilization, security applications like sudo are a great match for multi-version execution: they are not CPU-bound, so any performance degradation is imperceptible to the user, and the extra cores are needed only briefly, during the critical security validation checks. We are looking into applying this approach to other critical security applications (e.g., ssh-agent and password managers), and are investigating a way of hardening executables by generating a single binary with Varan and a bunch of versions, each version generated by a different compiler. We can then deploy this hardened executable instead of the original program.

Of course, Varan can detect misbehavior other than compiler-bug-based backdoors. Divergence could be caused by a memory or CPU glitch, by a plain old compiler bug that is triggered unintentionally instead of being triggered by an adversarial patch, or by a situation where an application-level undefined behavior bug has been exploited by only one of the compilers, or even where both compilers exploited the bug but not in precisely the same way. A nice thing about N-version programming at the system call level is that it won’t bother us about transient divergences that do not manifest as externally visible behaviour through a system call.

We’ll end by pointing out a piece of previous work along these lines: the Boeing 777 uses compiler-based and also hardware-based N-version diversity: there is a single version of the Ada avionics software that is compiled by three different compilers and then it runs on three different processors: a 486, a 68040, and an AMD 29050.

Testcase Reduction for Non-Preprocessed C and C++

C-Reduce takes a C or C++ file that triggers a bug in a compiler (or other tool that processes source code) and turns it into the smallest possible test case that still triggers the bug. Most often, we try to reduce code that has already been preprocessed. This post is about how to reduce non-preprocessed code, which is sometimes necessary when — due to use of an integrated preprocessor — the compiler bug goes away when a separate preprocessing step is used.

The first thing we need to do is get all of the necessary header files into one place. This is somewhat painful due to things like computed includes and #include_next. I wrote a script that follows the transitive includes, renaming files and copying them over; it works fine on Linux but sometimes fails on OS X, and I haven’t figured out why yet. Trust me that you do not want to look too closely at the Boost headers.

Second, we need to reduce multiple files at once, since they have not yet been glommed together by the preprocessor. C-Reduce, which is a collection of passes that iterate to fixpoint, does this by running each pass over each file before proceeding to the next pass. This seems to work well. A side benefit of implementing multi-file reduction is that it has other uses, such as reducing programs that trigger link-time optimization bugs, which inherently requires multiple files.

Non-preprocessed code contains comments, so C-Reduce has a special pass for stripping those out. We don’t want to do this before running C-Reduce because removing comments might make the bug we’re chasing go away. Another pass specifically removes #include directives, which tend to be deeply nested in some C++ libraries.

#ifdef … #endif pairs are hard to eliminate from first principles because they are often not located near each other in the file being reduced, but you still need to eliminate both at once. At first this sounded like a hard problem to solve, but then I found Tony Finch’s excellent unifdef tool and wrote a C-Reduce pass that simply calls it for every relevant preprocessor symbol.
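
For instance, given a hypothetical fragment like this, unifdef -UFOO deletes the two directives and the declaration they guard, while leaving the rest of the file untouched:

#ifdef FOO
int only_when_foo_is_defined;
#endif
int always_present;

If the file still triggers the bug after the pass runs, C-Reduce keeps the smaller variant.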

Finally, it is often the case that a collection of reduced header files contains long chains of trivial #includes. C-Reduce fixes these with a pass that replaces an #include with the included text when the included file is very small.

What’s left to do? The only obvious thing on my list is selectively evaluating the substitutions suggested by #define directives. I will probably only do this by shelling out to an external tool, should someone happen to write it.

In summary, reducing non-preprocessed code is not too hard, but some specific support is required in order to do a good job of it. If you have a testcase reduction problem that requires multi-file reduction or needs to run on non-preprocessed code, please try out C-Reduce and let us know — at the creduce-dev mailing list — how it worked. The features described in this post aren’t yet in a release, just build and install our master branch.

As a bonus, since you’re dying to know, I’ll show you what C-Reduce thinks is the minimal hello world program in C++11. From 127 header files + the original source file, it creates 126 empty files plus this hello.cpp:

#include "ostream"
namespace std {
basic_ostream < char > cout;
ios_base::Init x0;
}
int main() { std::cout << "Hello"; }

And this ostream:

namespace
{
typedef __SIZE_TYPE__ x0;
typedef __PTRDIFF_TYPE__ x1;
}
namespace std
{
template < class > struct char_traits;
}
typedef x1 x2;
namespace std
{
template < typename x3, typename = char_traits < x3 > >struct basic_ostream;
template <> struct char_traits < char >
{
    typedef char x4;
    static x0 x5 ( const x4 * x6 )
    {
        return __builtin_strlen ( x6 );
    }
};
template < typename x3, typename x7 > basic_ostream < x3,
         x7 > &__ostream_insert ( basic_ostream < x3, x7 > &, const x3 *, x2 );
struct ios_base
{
    struct Init
    {
        Init (  );
    };
};
template < typename, typename > struct basic_ostream
{
};
template < typename x3,
         typename x7 > basic_ostream < x3 > operator<< ( basic_ostream < x3,
                 x7 > &x8, const x3 * x6 )
{
    __ostream_insert ( x8, x6, x7::x5 ( x6 ) );
    return x8;
}
}

It's obviously not minimal, but believe me, we have already put a lot of work into domain-specific C++ reduction tricks. If you see something here that can be gotten rid of and want to take a crack at teaching our clang_delta tool to do the transformation, we'll take a pull request.

Classic Bug Reports

A bug report is sometimes entertaining either because of the personalities involved or because of the bug itself. Here is a collection of links into public bug trackers; I learned about most of these in a recent Twitter thread.

I hope you enjoy these as much as I do. Thanks to everyone who contributed links.


API Fuzzing vs. File Fuzzing: A Cautionary Tale

Libraries that provide APIs should be rock solid, and so should file parsers. Although we can use fuzzing to ensure the solidity of both kinds of software, there are some big differences in how we do that.

  1. A file parser should be fully robust: it isn’t allowed to crash even if presented with a corrupted file or the wrong kind of file. Although a library that implements an API usually does some validation of arguments, this is generally a debugging aid rather than an attempt to withstand hostility.
  2. A file usually originates outside of a trust boundary. There generally is not a trust boundary between a library and the program that uses the library.
  3. A file parser can perform arbitrary checks over its inputs. A C/C++ library providing an API cannot check some properties of its input: for example, it cannot check that a pointer refers to valid storage (a sketch of this asymmetry follows the list). Moreover, APIs are often so much in the critical path that full error checking would not be desirable even if it were possible.
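
To make the last point concrete, here is a hypothetical pair of functions (not from any real library):

#include <stddef.h>

struct object { char name[32]; };

/* A parser can fully validate its input bytes before trusting them. */
int looks_like_png(const unsigned char *buf, size_t len) {
    return len >= 4 && buf[0] == 0x89 && buf[1] == 'P' &&
           buf[2] == 'N' && buf[3] == 'G';
}

/* An API function cannot: there is no portable way to check that obj
   points to a live object rather than to freed or random memory, so it
   has to trust its caller. */
void object_set_name(struct object *obj, const char *name) {
    if (obj == NULL || name == NULL) /* about the best it can do */
        return;
    size_t i;
    for (i = 0; i + 1 < sizeof obj->name && name[i] != '\0'; i++)
        obj->name[i] = name[i];
    obj->name[i] = '\0'; /* always leave name terminated */
}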

For fuzzing campaigns the point is this:

The sole concern of a file fuzzer is to generate interesting files that expose bugs in the file parser or in subsequent logic. An API fuzzer must balance two goals. Like the file fuzzer, it needs to expose bugs by using the API in interesting ways. However, at the same time, it must stay within the usage prescribed by the API — careful reading of the documentation is required, as well as a bit of finesse and good taste.

As a concrete example, let’s look at libpng, which provides both a file parser and an API. To fuzz the file parsing side, we generate evil pngish files using something like afl-fuzz and we wait for the library to do something wrong. But what about fuzzing the API? We’ll need to look at the documentation first, where we find text like this:

To write a PNG file using the simplified API:

  1) Declare a 'png_image' structure on the stack and memset()
     it to all zero.

  2) Initialize the members of the structure that describe the
     image, setting the 'format' member to the format of the
     image in memory.

  3) Call the appropriate png_image_write... function with a
     pointer to the image to write the PNG data.
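
Following those three steps gives something like this; a hedged sketch in which the dimensions and pixel contents are made up and error handling is minimal:

#include <png.h>
#include <string.h>

int write_tiny_png(const char *path) {
    static unsigned char pixels[16 * 16 * 4]; /* 16x16 RGBA, all zeros */

    png_image image;
    memset(&image, 0, sizeof image);   /* step 1: zero the struct */
    image.version = PNG_IMAGE_VERSION; /* step 2: describe the image */
    image.width = 16;
    image.height = 16;
    image.format = PNG_FORMAT_RGBA;

    /* step 3: a row_stride of 0 means the rows are tightly packed */
    return png_image_write_to_file(&image, path, 0 /* convert_to_8bit */,
                                   pixels, 0 /* row_stride */, NULL);
}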

Hopefully it is immediately obvious that calling random API functions and passing random crap to them is not going to work. This will, of course, cause libpng to crash, but the crashes will not be interesting because they will not come from valid usage of the library.

You might ask why anyone would care to fuzz an API. Aren’t trust boundaries the things that matter? The answer is easy: If you are looking to find or prevent exploitable vulnerabilities, you should always focus on fuzzing at trust boundaries. On the other hand, if you are more generally interested in reliable software, you should fuzz APIs as well. We might use API fuzzing to ensure that printf prints floats correctly and that the red-black tree you coded up late last night doesn’t spuriously drop elements.

Now let’s look at a blog post by GDS from last week about fuzzing mbed TLS. There’s something kind of interesting going on here: they are doing API fuzzing (of the mbed TLS library) but they are doing it using afl-fuzz, a file fuzzer. This works well because afl-fuzz provides the data that is “sent” between the client and server (“sent” is in quotes because this fuzzing effort gains speed and determinism by putting the client and server in the same process).

At the bottom of the post we read this:

Our fuzzing scans discovered client-side NULL pointer dereference vulnerabilities in mbed TLS which affect all maintained versions of mbed TLS. ARM fixed the issues in versions 2.1.1, 1.3.13 and PolarSSL version 1.2.16. See the release notes for details.

However, the corresponding text in the release notes is this:

Fabian Foerg of Gotham Digital Science found a possible client-side NULL pointer dereference, using the AFL Fuzzer. This dereference can only occur when misusing the API, although a fix has still been implemented.

In other words, the blog post author disagrees with the mbed TLS maintainers about whether a bug has been discovered or not. It is a matter of debate whether or not this kind of crash represents a bug to be fixed. I would tend to agree with the mbed TLS maintainers’ decision to err on the side of defensive programming here. But was there a vulnerability? That seems like a stretch.

In summary, API fuzzing is different from file fuzzing. Good API fuzzing requires exploration of every legal corner of the API and no illegal corners. Ambiguous situations will come up, necessitating judgement calls and perhaps even discussions with the providers of the API.

Comments on a Formal Verification of PolarSSL

The C language has given the world many enduring gifts such as buffer overflows, uninitialized variables, and use-after-free errors. Since rewriting a code base in some other language is not easy, we’re often stuck trying to eliminate bugs in legacy C before they bite us, and of course bugs in network-facing code sometimes bite us very hard. In this post I’ll be discussing TrustInSoft’s Verification Kit for PolarSSL 1.1.8, which I’ll call “the report” throughout this post. Just to be clear, I’m not unbiased — I’m working for these folks this year — so I’ll try to keep the cheerleading to a minimum.

The crux is:

This report states the total immunity of the SSL server implemented by the PolarSSL 1.1.8 library to [CWE-119, CWE-120, CWE-121, CWE-122, CWE-123, CWE-124, CWE-125, CWE-126, CWE-127, CWE-369, CWE-415, CWE-416, CWE-457, CWE-476, CWE-562, and CWE-690] if it is deployed according to the Secure Deployment Guide detailed in section 4.4.

In other words, a collection of nasty undefined behaviors that are so hard to get rid of, have been gotten rid of — if you are careful to follow the instructions.

The thing that struck me while reading the report is that I’ve never seen a document quite like it. I’ll next try to nail down that idea by explaining what the report is not.

It’s not a piece of formal methods research, which would need to demonstrate the novelty and utility of a verification technique. Rather, the verification techniques are completely deemphasized in the report in order to concentrate on the code that is being verified. In fact, one of the things that bothers me about formal verification research is that so much of it is concerned with new notations and clever algorithms instead of focusing on the thing I’m interested in: the interplay between the verification techniques and the code being verified.

It’s neither a formal proof nor a code review. Rather, nearly all of the reasoning about code has been delegated to the analysis tool, which must be trusted to come to the right conclusions. Why would we trust a formal verification tool? For one thing, the analyzer has been validated on many other codes including programs generated by Csmith (we wrote a short paper about that work). For another, the analyzer identified some legitimate shortcomings in PolarSSL that needed to be fixed before the verification would go through (these are in Section 4.4.3). Additionally, a much more serious PolarSSL bug was found earlier in the verification process. It is not mentioned in this report since it had been fixed before version 1.1.8 was released.

Because it makes a guarantee, it is fundamentally different from a report based on unsound static analysis. By early 2014, OpenSSL had been examined by various unsound analyzers and yet none of them found Heartbleed. In contrast, if someone had used sound static analysis to verify OpenSSL 1.0.1f, it is a certainty that Heartbleed would have been among the defects that were discovered. (We have to be a little bit careful when making this kind of statement. For example, the bug won’t be discovered if the OpenSSL being verified was compiled using -DOPENSSL_NO_HEARTBEATS. The buggy code must lie within the verification perimeter that I’m about to discuss.)

So what is this report? It is, mainly, three things:

The definition of a perimeter within which its claims apply. This perimeter has three aspects. First, it excludes some parts of PolarSSL, such as X.509, that were not verified. Second, it excludes some ways in which the code could be used by defining drivers that exercise API calls in certain ways. For example, the server driver that spans pages 16 and 17 initializes the library, makes an arbitrary number of read and write calls, and then shuts it down. The guarantees in the report would not apply to a different usage, such as one that started writing data without properly initializing the library. Nor would they apply to usage of the library by a buggy program that corrupted PolarSSL’s state using a stray indirect write. Third, some configurations of the code are placed outside of the perimeter. The issue is that C code is implicitly parameterized by a collection of implementation defined behaviors, such as the representation of integers, and it is also typically explicitly parameterized by preprocessor directives. A real C program is actually a shorthand for a huge family of different programs; any formal verification effort either has to pin down which member of the family is being verified or else reason about the entire family. Reasoning about the entire family in a sound and precise way is an open problem.

A sketch of the verification strategy, which is divide and conquer. Each of chapters 5-10 describes the verification of a component. The interfaces between components are specified in ACSL, which is designed in such a way that when two components are verified against opposite sides of an interface, their overall verification can be taken for granted.
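
ACSL contracts live in special comments next to the C declarations they constrain. As a hedged illustration (this contract is mine, not the report’s), an interface function might be annotated like this:

/*@ requires \valid(ssl);
  @ requires \valid(buf + (0 .. len - 1));
  @ assigns  *ssl, buf[0 .. len - 1];
  @ ensures  \result <= (int)len;
  @*/
int ssl_read(ssl_context *ssl, unsigned char *buf, size_t len);

A component that calls ssl_read is verified under the assumption that this contract holds, and the component implementing ssl_read is separately verified to satisfy it; the two results then compose.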

A justification of alarms. The analysis tool signals an alarm any time it fails to prove that an operation (for example, incrementing a signed integer) is safe. Each alarm must be investigated to determine if it is a true alarm (exposing a defect in the code) or a false alarm (exposing an imprecision in the analysis). Appendix A of the report contains this analysis. It turns out to be easy, in theory, to avoid the imprecisions that caused these alarms: the necessary techniques have been known for decades. The issue is that there are engineering reasons for avoiding these static analysis techniques: their results can be very difficult for people to understand and also they can be computationally expensive. An entertaining exercise would be to pick a few parts of Appendix A, download PolarSSL 1.1.8, and make sure you agree with the reasoning from the report.

That’s all there is to it, though if you read the report you’ll see that the reasoning sometimes becomes quite detailed and technical.

Finally I’ll try to anticipate a few questions that you might have at this point.

How much effort would be required to track the upstream PolarSSL (now mbed TLS)? A majority of local changes will just go through the verification automatically. Larger-scale changes might raise one or more new alarms that need to be investigated.

Under what conditions does the claim of “total immunity to UBs” apply? In addition to the conditions described in the report, the intellectual analyses in Appendix A must be correct, the verification tool must not have made a mistake, and the C compiler and the platform (OS + standard libs + hardware) must not make mistakes.

If we don’t want UBs, why not just use a TLS written in Java or C# or some other safe language? Of course that’s a fine idea! But we’ll have to somehow shoehorn a Java runtime onto those embedded processors with ~64 KB of RAM where PolarSSL / mbed TLS run. Also, I’m not sure that it is clear how to implement crypto in higher level languages in such a way that it resists timing attacks (this is hard enough in C). Finally, safe languages are only dynamically safe: a buffer overrun bug in a Java program will lead to an exception, and perhaps to a DoS attack. The report makes a guarantee of static safety, meaning that buffer overrun bugs do not exist.

Where is the corresponding document for OpenSSL? It doesn’t exist. Based on lines of code, it would probably be about four times longer than the PolarSSL report, and would require at least four times the effort to create.

What is the future of this kind of report? It’s all about economics. Given a piece of C code, we can choose to inspect it, to test it by hand, to fuzz it, to subject it to unsound static analysis, and to subject it to sound static analysis. Obviously, all of these validation techniques have merit — the question is how many resources to allocate to each of them. Larger, more rapidly evolving code bases tend to favor cheaper validation methods whereas smaller, security-critical codes are amenable to formal verification. The reach of formal verification techniques is increasing as techniques improve, but also software keeps getting bloatier.

How should I think or act differently based on the information in this report (and others like it)? Well, this is the central question. Formal verification has made major progress in the last 10-15 years and we now have extremely impressive technical results including CompCert and seL4. However, I think it’s fair to say that almost everyone is having trouble figuring out how to fit these results into the big picture. From the client’s point of view, formal verification tends to be somewhat mysterious, as if a mathematically adept wizard had sprinkled some magic verification dust over a source code repository, which now glows with correctness. Let’s take a quick look at this article about seL4 that appeared in New Scientist a few days ago. On Twitter, Gerwin Klein — the leader of the L4.verified project — said:

The sad truth is that this article is not that bad compared to others. I’ve seen way worse. Nuance is hard..

What he says is true: the article, while being sort of decent, is just extremely far from conveying accurate technical information.

The PolarSSL report is best understood as providing details that are necessary for the next level of people to crack open the black box surrounding this particular formal verification effort. We didn’t need another article about abstract interpretation; we needed specific technical claims about a particular piece of source code, followed by details that back up those claims. Much more work on communicating verification results is needed, but this is a good piece of progress.