American fuzzy lop is a polished and effective fuzzing tool. It has found tons of bugs and there are any number of blog posts talking about that. Here we’re going to take a quick look at what it isn’t good at. For example, here’s a program that’s trivial to crash by hand, that afl-fuzz isn’t likely to crash in an amount of time you’re prepared to wait:
#include <stdlib.h>
#include <stdio.h>

int main(void) {
  char input[32];
  if (fgets(input, 32, stdin)) {
    long n = strtol(input, 0, 10);
    printf("%ld\n", 3 / (n + 1000000));
  }
  return 0;
}
The problem, of course, is that we’ve asked afl-fuzz to find a needle in a haystack: only inputs that parse to -1000000 make the denominator zero, and afl-fuzz’s built-in feedback mechanism does nothing to guide the search towards that value. There are really two parts to the problem. First, something needs to recognize that divide-by-zero is a crash behavior that should be targeted. Second, the search must be guided towards inputs that produce a zero denominator. A finer-grained feedback mechanism, such as how far the denominator is from zero, would probably do the job here. Alternatively, we could switch to a different generation technology such as concolic testing, where a solver is used to generate test inputs.
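To make the “how far from zero” idea concrete, here is a minimal sketch of the kind of feedback an instrumenter could emit just before the division. The function and map names are made up for illustration, and this is not how afl-fuzz’s instrumentation actually works: instead of a binary crash/no-crash signal, it buckets the denominator by its magnitude, so a mutation that moves n closer to -1000000 hits a new map entry and looks like progress.

#include <stdint.h>

/* Hypothetical feedback map, playing the role of AFL's coverage bitmap. */
static uint8_t distance_map[65];

/* Bucket a would-be denominator by roughly log2(|denom|). Values nearer to
   zero land in lower buckets, so partial progress towards the divide-by-zero
   is visible long before the crash itself. */
static void feedback_distance_to_zero(long denom) {
  uint64_t v = denom < 0 ? -(uint64_t)denom : (uint64_t)denom;
  int bucket = 0;
  while (v) {
    v >>= 1;
    bucket++;
  }
  distance_map[bucket]++;  /* bucket 0 means the denominator is exactly zero */
}

The instrumented program would call feedback_distance_to_zero(n + 1000000) right before computing the quotient; any input whose denominator is smaller in magnitude than anything seen so far then registers as new coverage.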
The real question is how to get the benefits of these techniques without making afl-fuzz into something that it isn’t — making it worse at things that it is already good at. To see why concolic testing might make things worse, consider that it spends a lot of time making solver calls instead of actually running tests: the base testing throughput is a small fraction of what you get from plain old fuzzing. A reasonable solution would be to divide up the available CPU time among strategies; for example, devote one core to concolic testing and seven to regular old afl-fuzz. Alternatively, we could wait for afl-fuzz to become stuck — not hitting any new coverage targets within the last hour, perhaps — and then switch to an alternate technique for a little while before returning to afl-fuzz. Obviously the various search strategies should share a pool of testcases so they can interact (afl-fuzz already has good support for communication among instances). Hopefully some concolic testing people will spend a bit of time bolting their tools onto afl-fuzz so we can play with these ideas.
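The “has it stopped hitting new coverage?” check is easy to prototype against the stats afl-fuzz already writes out. Here’s a rough sketch that assumes the classic fuzzer_stats file contains a last_path line holding the unix time of the most recent new path; treat that field name as an assumption and check it against your AFL version before relying on it.

#include <stdio.h>
#include <time.h>

/* Return 1 if the afl-fuzz instance behind stats_path has found no new
   path in the last `window` seconds, 0 otherwise (or if we can't tell). */
static int afl_is_stuck(const char *stats_path, time_t window) {
  FILE *f = fopen(stats_path, "r");
  if (!f) return 0;
  char line[256];
  long last_path = 0;
  while (fgets(line, sizeof line, f))
    if (sscanf(line, "last_path : %ld", &last_path) == 1)
      break;
  fclose(f);
  return last_path != 0 && time(NULL) - last_path > window;
}

/* e.g. if (afl_is_stuck("out/fuzzer01/fuzzer_stats", 3600)) hand the
   queue over to the concolic tester for a while, then switch back. */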
A variant of the needle-in-a-haystack problem occurs when we have low-entropy input such as the C++ programs we would use to test a C++ compiler. Very few ASCII strings are C++ programs, and concolic testing is of very little help in generating interesting C++. The solution is to build more awareness of the structure of C++ into the testcase generator. Ideally, this would be done without requiring the somewhat elaborate effort we put into Csmith. Something like LangFuzz might be a good compromise. (Of course I am aware that afl-fuzz has been used against clang and has found plenty of ways to crash it! This is great, but the bugs being found are not the kind of semantic bugs that Csmith was designed to find. Different testcase generators solve different problems.)
So we have this great fuzzing tool that’s easy to use, but it also commonly runs up against testing problems that it can’t solve. A reasonable way forward will be for afl-fuzz to provide the basic framework and then we can all build extensions that help it reach into domains that it couldn’t before. Perhaps Valgrind is a good analogy: it provides not only an awesome baseline tool but also a rich infrastructure for building new tools.
[Pascal Cuoq gave me some feedback on this post including a bunch of ideas about afl-fuzz-related things that he’ll hopefully blog about in the near future.]
14 responses to “What afl-fuzz Is Bad At”
afl seems to benefit greatly from seed test cases. I’d regard the SMT solver strategy as an automated way to find additional seeds.
Does afl have any mechanisms to steer it away from known bugs? I.e. once I have 8-10 crashes from the same bug I’d rather not spend time generating another 100 if that could be spent generating cases for another bug.
If I need more characterization, I’d throw the first 8 at a reducer and log all the transition pairs.
bcs, I don’t think so, but it does look like there’s some crash triage code for trying to weed out the redundant tests later on.
Hey,
Although afl-fuzz isn’t very modular internally [*], it actually does provide a fairly convenient way to sync coverage and test cases with other tools. It’s not documented as such, and I’ll try to fix that. But the mechanism used to synchronize instances of afl-fuzz actually gives you a very simple tool: pull in queue/ files from a running AFL instance, extend them and write them back to another subdirectory, have AFL pull them in.
[*] It’s been on my TODO list, but it requires ripping out and redoing most of the guts and I’m not sure how many people would genuinely benefit from it. I always worry about giving folks too much flexibility in a setting where it’s a lot easier to come up with bad ideas and settings than with good ones. Curiosity killed the cat and all =)
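Concretely, the “extend them and write them back” half of that dance could look something like the untested sketch below. It assumes the parallel-fuzzing layout (sync_dir/<instance>/queue/ with id:NNNNNN-prefixed file names); the exact naming and the missing error handling are illustrative, not lifted from afl-fuzz.

#include <stdio.h>

/* Drop one externally generated test case where a running AFL instance
   syncing against the same -o directory should eventually import it.
   The sync_dir/<tool_name>/queue/ directory must already exist, and id
   should increase monotonically across calls. */
static int export_testcase(const char *sync_dir, const char *tool_name,
                           unsigned id, const void *data, size_t len) {
  char path[4096];
  snprintf(path, sizeof path, "%s/%s/queue/id:%06u,orig:external",
           sync_dir, tool_name, id);
  FILE *f = fopen(path, "wb");
  if (!f) return -1;
  fwrite(data, 1, len, f);
  fclose(f);
  return 0;
}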
Hi Michal, communication through the filesystem sounds like the right solution for most use cases I can think of.
Regarding internal control, one thing I’d like is to be able to turn off mutation strategies that I’m fairly certain will not help solve a particular problem (e.g., bit flips when generating C programs).
BTW when I was writing programs that afl is bad at finding bugs in, it surprised me a few times by generating test cases I didn’t expect it to get in a short time.
+1 for communication through the filesystem.
http://llvm.org/docs/LibFuzzer.html#afl-compatibility documents the compatibility between AFL and LLVM’s libFuzzer.
Turns out this is not the 100% correct description of the current state (which is a bit more complex) but I’d love to make it correct. 🙂
Kostya, looks great, except I don’t want to have to restart the fuzzers to get them to talk!
One stupid-simple thing AFL could do that would eventually work in this case is to harvest integer literals from the code to add to its pool of magic numbers.
It’s not obvious to me that this strategy would pay its way, but it seems worth trying.
Sean, I’ve seen that done but it seems a little too “this one weird trick” to me.
But along the same lines, running strings on a binary could easily be useful in finding the tokens you should use to fuzz that binary, assuming it contains some sort of parser.
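Something like the following untested sketch would do for the strings half: scan the target binary for printable runs and print them in the name="value" form that afl-fuzz’s -x dictionary option accepts (check the dictionaries/ examples in the AFL tree for the exact format; the length cutoffs here are arbitrary).

#include <ctype.h>
#include <stdio.h>

#define MIN_LEN 4   /* ignore very short runs */
#define MAX_LEN 32  /* split longer runs into chunks */

int main(int argc, char **argv) {
  if (argc != 2) { fprintf(stderr, "usage: %s target_binary\n", argv[0]); return 1; }
  FILE *f = fopen(argv[1], "rb");
  if (!f) { perror("fopen"); return 1; }

  char tok[MAX_LEN + 1];
  int len = 0, c, n = 0;

  while ((c = fgetc(f)) != EOF) {
    /* skip quotes and backslashes so the output never needs escaping */
    int ok = isprint(c) && c != '"' && c != '\\';
    if (ok && len < MAX_LEN) {
      tok[len++] = (char)c;
    } else {
      if (len >= MIN_LEN) { tok[len] = 0; printf("tok_%d=\"%s\"\n", n++, tok); }
      len = 0;
      if (ok) tok[len++] = (char)c;
    }
  }
  if (len >= MIN_LEN) { tok[len] = 0; printf("tok_%d=\"%s\"\n", n++, tok); }
  fclose(f);
  return 0;
}

Run it as ./harvest ./target > target.dict and pass -x target.dict to afl-fuzz; the same loop pointed at the source tree and restricted to digit runs would cover Sean’s integer-literal idea too.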
I haven’t tried it, but wouldn’t it be easy to hook AFL up to KLEE’s zesti version (http://srg.doc.ic.ac.uk/projects/zesti/)? It shouldn’t be hard to set up a scan that looks for new files in AFL’s output directory and launches a make-test-zesti instance on them. That’s basically what we did here (http://www.cs.cmu.edu/~agroce/issta14.pdf) except offline, since instead of AFL we were waiting for a delta-debugger to finish minimizing things. You could actually put in the whole pipeline we had in that paper:
1. Scan for new files arriving in AFL’s output directory (a rough sketch of this step follows the list).
2. When you see one, minimize it by coverage. AFL already does some kind of minimization on request, so you could use that, or plain old Zeller DD; Zeller will require more work, since I think AFL “knows” what a crash and a valid minimization are, etc., but you might want more semantics at work in your minimization.
3. Drop it in the KLEE-zesti queue, perhaps prioritizing by an FPF ranking like we did in the paper if AFL is getting ahead of KLEE (which seems plausible even if we give KLEE several cores).
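The directory-watching part (step 1) is about as simple as it sounds. Rough sketch below, with the polling interval and the ./handle-new-testcase handler made up; the handler is where the minimization and the KLEE-zesti hand-off would go.

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Poll an AFL queue directory and pass each file we haven't seen before
   to an external handler script. */
int main(int argc, char **argv) {
  static char seen[4096][256];  /* crude "already processed" list */
  int nseen = 0;

  if (argc != 2) { fprintf(stderr, "usage: %s afl_queue_dir\n", argv[0]); return 1; }

  for (;;) {
    DIR *d = opendir(argv[1]);
    if (d) {
      struct dirent *e;
      while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.') continue;
        int is_new = 1;
        for (int i = 0; i < nseen; i++)
          if (!strcmp(seen[i], e->d_name)) { is_new = 0; break; }
        if (is_new && nseen < 4096) {
          snprintf(seen[nseen], sizeof seen[nseen], "%s", e->d_name);
          nseen++;
          char cmd[8192];
          snprintf(cmd, sizeof cmd, "./handle-new-testcase '%s/%s'",
                   argv[1], e->d_name);  /* minimize, then queue for zesti */
          system(cmd);
        }
      }
      closedir(d);
    }
    sleep(5);  /* polling is plenty fast at fuzzing timescales */
  }
}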
I haven’t played with AFL enough to know: does it handle closing the feedback loop? Can I drop new files into the AFL input testcase directory, and expect AFL to notice them and add them to a running fuzzer instance?
Alex see Michal’s comment #4 above — sounds like the documentation for syncing up testcases is coming soon.
Is Zesti the right tool here? How does it differ from Klee?
zesti’s a bit of “KLEE made easy” if you have existing test cases lying around to base your symbolic exploration on. I haven’t directly played with it myself, just KLEE, but Super found using it for the ISSTA paper and his thesis work since then quite pleasant, I think. For existing test cases vs. building a harness I think it may be The Right Thing right now.
Given what KLEE and company are doing, saying “elaborate on this existing test case” vs. “let me right a harness with symbolic holes” definitely has some nice aspects.
er. “write” a harness, you can tell I’m describing how to determine if a specification is right in another window…