Fuzzer Puzzle


A recent post on the CERT/CC blog says:

… BFF 2.7 and FOE 2.1 do have an optional feature that in our testing drastically increases the number of uniquely-crashing test cases: crasher recycling. When enabled, the frameworks will add [a] uniquely-crashing test case back into the pool of seed files for further fuzzing. When crashing test cases are fuzzed more, this can uncover both new vulnerabilities as well as new ways in which an already-known vulnerability can cause a crash.

It seems pretty clear why adding a crasher back into the seed pool will find new ways to trigger the bug that the crasher already triggers: fuzzing this test case will generate test cases that are near the crasher (in some abstract test case space that we won’t try to define too closely). These nearby test cases are presumably more likely to find different ways to trigger the same bug.
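Concretely, "fuzzing a crasher" just means drawing mutants from the neighborhood of the crashing input. Here is a minimal sketch of that idea (an illustration only, not BFF's or FOE's actual mutator; all names are hypothetical):

```python
import random

def mutate(seed: bytes, n_flips: int = 4, rng=random.Random(0)) -> bytes:
    """Return a copy of `seed` with a few randomly chosen bytes changed.

    Each mutant is "near" the seed in the simplest possible sense:
    at most `n_flips` byte positions differ.
    """
    buf = bytearray(seed)
    for _ in range(n_flips):
        i = rng.randrange(len(buf))
        buf[i] = rng.randrange(256)
    return bytes(buf)

# Crasher recycling: once a mutant crashes the target, it goes back into
# the seed pool, so later mutants are drawn from the crasher's neighborhood.
pool = [b"a plausible input file"]
crasher = mutate(pool[0])   # pretend this one crashed the target
pool.append(crasher)        # recycled: future fuzzing starts from it too
```

Because each mutant differs from its seed in only a handful of bytes, mutants of a crasher are likely to exercise the same broken code region, whatever the right distance metric turns out to be.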

But why is it the case that this technique “drastically increases the number of uniquely-crashing test cases”? I believe the authors intend “uniquely-crashing test case” to mean a test case that triggers a new bug, one that is distinct from any bugs already found during fuzzing. I’ll avoid speculating for now so I can see what you folks think.

UPDATE: Allen Householder from CERT has written some more about this.


15 responses to “Fuzzer Puzzle”

  1. Maybe because crasher bugs are ‘clustered’ around certain functionalities or code areas and fuzzing based on already found bugs tends to stay within these areas of higher bug incidence?

    Unfortunately I don’t have anything to back this speculation up except the anecdotal observation that when there’s a serious bug somewhere, there’s usually another one quite close.

  2. Mario, I don’t think anyone has anything more than anecdotal evidence about this sort of thing. Characteristics of distance functions for test cases (which is clearly the important concept here) are very poorly understood for interesting software systems.

    My guess is that if we look at each bug in a big system individually, it’s probably possible to define a distance function that places test cases that trigger the bug close together. However, this distance function is probably different for every single bug.

  3. I didn’t mean to imply that crasher recycling uncovers mostly new underlying bugs. When I use the term uniquely-crashing test cases, I mean that they are uniquely crashing according to the way we currently determine crash uniqueness, which is based on the stack trace: !exploitable with FOE, and similar fuzzy stack hashing with BFF. And yes, recycled crashers that do things like trash the stack are likely to produce more uniquely-crashing test cases, if the stack is used to determine uniqueness.

    However, there are important things to consider:
    1) If I have 10, 100, 1000000 crashing test cases that turn out to be the same underlying bug, does it *really* matter if one of them just happens to have the “home run” crash properties that demonstrate exploitability? Or if as a developer, I fix that one bug and the 1000000 cases now all go away when I do my verify run?
    2) There really may be actual new underlying bugs uncovered by crasher recycling and fuzzing that quasi-valid file.

    But both points indicate the need for further investigation and testing in both of those areas. In the case of 1), it demonstrates perhaps the need for better uniqueness determination and also the need for the ability to handle huge piles of crashers that may or may not be from unique underlying bugs. In the case of 2), it demonstrates the need to perform tests to show whether or not any of the uniquely-crashing test cases that were derived from fuzzing quasi-valid (crashing) files are indeed new underlying bugs.

    Both of these are definitely on our radar. But in my testing, crasher recycling *drastically* (think more than an order of magnitude in a short amount of time) increases the uniquely-crashing test cases that are produced. And there’s a reason that the feature is turned off by default. Neither security researchers nor software vendors seem to be prepared to handle that sort of volume.
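The stack-based uniqueness determination Will describes can be sketched roughly as follows. This is a simplified stand-in for what !exploitable and BFF's fuzzy stack hashing do, not the real algorithms; the frame depth and hashing choices here are assumptions:

```python
import hashlib

def fuzzy_stack_hash(frames: list[str], depth: int = 5) -> str:
    """Bucket a crash by the top few frames of its backtrace.

    Two crashes with the same top-`depth` frames land in the same
    "uniquely-crashing" bucket; deeper frames are ignored so that minor
    variation far from the crash site does not split buckets.
    """
    top = "|".join(frames[:depth])
    return hashlib.sha256(top.encode()).hexdigest()[:12]

# Two crashes that agree on their top frames fall into one bucket...
a = fuzzy_stack_hash(["memcpy", "parse_chunk", "load_file", "main"])
b = fuzzy_stack_hash(["memcpy", "parse_chunk", "load_file", "main"])
assert a == b
# ...while a crasher that trashes the stack yields garbage frames and
# thus a fresh bucket, even if the underlying bug is the same. This is
# one way recycling can inflate the uniquely-crashing count.
c = fuzzy_stack_hash(["??", "0xdeadbeef", "??"])
assert c != a
```

The key property is that the bucket count measures distinct-looking crashes, not distinct underlying bugs, which is exactly the gap Will's points 1) and 2) are about.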

  4. It sounds extremely plausible to me. I’ve often tracked down a bug on a long-lived system and found other bugs in the same place. Either the overall quality of the code is poor, or the test case illustrates a deficiency in how we had thought about the relevant code in the past.

    I would distinguish “new bugs” from “new test cases”. “New test cases” means test cases that are a large distance from the other ones, whatever “distance” means for program inputs.

    New test cases are good, but new bugs are better. A new bug means that when you fix the first test case, the second test case is still failing.

  5. Hi Will, thanks for commenting!

    It’s great when you have to dial down a fuzzer in order to avoid overwhelming people. I’ve noticed a similar effect: report 5 bugs and they get fixed, report 50 and people will find something else to do with their time.

    A closely related issue is the propensity of test case reducers to discover new bugs. I’ve noticed this, Alex Groce has also seen it, and in a conversation with Allen Householder earlier this summer, he mentioned that you folks have gotten good mileage out of this as well. This is something I’m really interested in investigating further.

  6. Lex, definitely — this idea of disambiguating the mapping from test cases to bugs is a crucial aspect of scaling up fuzzing efforts (not coincidentally, the topic of an upcoming blog post here).

  7. My best guess here is that the target has “hoops” to jump through, and that test cases that expose bug #1 often have to jump through some hoops that are also preconditions for ever exposing bugs #2, #3, and so on. Presumably there are parts of code that are complicated, and broken in multiple ways, and close in source terms. But it’s a guess, still. It could be there’s no source proximity, just some kind of state space proximity.

  8. I would guess there are certain classes of bugs that are likely to be triggered in multiple ways. One that comes to mind is use-after-free bugs where the problem is that you deleted some object and a pointer remains to it in some datastructure — there will likely be many consumers of the datastructure that is now corrupted, so you can trigger the same underlying problem at many unique points in the program.
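Morten's scenario can be simulated even in a memory-safe language: one stale reference planted in a shared structure produces failures at several distinct call sites, and hence several distinct stack traces, from a single root cause. A hypothetical sketch, not tied to any real target:

```python
class Freed:
    """Stand-in for freed memory: any attribute access fails."""
    def __getattr__(self, name):
        raise RuntimeError("use after free")

class Node:
    def __init__(self, value):
        self.value = value

# One underlying bug: the object is "freed", but a dangling reference
# is left behind in the shared data structure.
table = {"config": Node(42)}
table["config"] = Freed()

# Many consumers of the corrupted structure -> many distinct crash
# sites (and many distinct stack hashes) from one root cause.
def render():   return table["config"].value
def validate(): return table["config"].value > 0
def export():   return str(table["config"].value)

crash_sites = []
for consumer in (render, validate, export):
    try:
        consumer()
    except RuntimeError:
        crash_sites.append(consumer.__name__)
# Three "unique" crashes, one bug.
```

Under stack-trace-based deduplication, each consumer would count as a separate uniquely-crashing result, which matches the inflation effect being discussed.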

  9. It might be interesting to analyze the correlation between fuzzy stack hashes and underlying bugs. OSS vendors like Red Hat are already triaging (including de-duplicating) large volumes of bugs, at least some of which have a stack hash available (via uploaded core files), so a data set for this analysis might already be publicly available.

  10. Fixing bugs is one of the most difficult and error-prone parts of programming (often, the person writing the bug fix doesn’t know the full context of the code). So it is very likely that if you get a test case which triggers a bug, the fix will be the minimal code change that defeats that particular test case. But the underlying bug has not really gone away and may even be worse. So it is natural that a slightly changed test case may still trigger a new bug.

  11. Hi David, I’m sure that is true in some cases, and in fact this is one area where I think automated test case reduction has a lot to offer. That is, a test case reducer produces a large series of test cases, some of which trigger the bug and some do not. These could be replayed against the fixed system as part of verifying the fix.

    I’ve heard stories of people who tried to fix integer overflow bugs three or four times before getting it right…
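The replay idea above might look like this in outline. The names are hypothetical; `still_crashes` stands in for whatever harness runs one test case against the patched build:

```python
def replay_variants(variants, still_crashes):
    """Replay every intermediate test case a reducer produced against a
    patched build; any survivor suggests the fix was too narrow.

    `variants` is the series of test cases the reducer generated on its
    way to the minimal one; `still_crashes(t)` returns True if test
    case `t` still crashes the fixed system.
    """
    return [v for v in variants if still_crashes(v)]

# Hypothetical example: the fix only handles the exact minimized input,
# in the spirit of the "minimal code change" failure mode above.
variants = [b"AAAB", b"AAB", b"AB", b"B"]
patched = lambda t: t != b"B"   # only the minimal case was defeated
survivors = replay_variants(variants, patched)
# survivors == [b"AAAB", b"AAB", b"AB"]: the underlying bug persists.
```

A non-empty survivor list is cheap evidence that the patch addressed one symptom rather than the root cause.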

  12. Jonathan, I agree. Also one thing we have faced while doing compiler testing is a large collection of programs that trigger an unknown number of wrong code bugs. Since there is no crash, there is no stack trace, so we are forced to disambiguate in other ways.

    In our experience, the extremely hard part of this work is getting a reliable version of ground truth. For our compiler work (link below) this took one of my students a couple of months.

    http://blog.regehr.org/archives/925

  13. Alex, I like the hoops idea. Intuitively, some parts of the test case space (or ~equivalently, application state space) are hard to reach.

    Morten, definitely. This is one reason why memory safety bugs are frustrating to do triage on: their effects show up much later than their causes.

  14. My immediate thought is to blame copy-pasted code: there are two paths to the bug, one at each paste site. Since the code was copied, it’s doing the same thing, just with a different path to reach it, so it’s reasonable that mutations of test cases reaching one paste site will reach other paste sites. That is, the actual bug trigger in the input might be the same, but mutating the input before the trigger hits the other paste site.

  15. If you’ve found a bug, you have likely also found an area which wasn’t already tested well. So it is quite plausible to me that the above approach would find multiple distinct bugs.