Oracles for Random Testing


Random testing is a powerful way to find bugs in software systems. However, to actually find a bug we need to automatically tell the difference between a correct and an incorrect execution of the system under test. Sometimes this is easy: we’re just looking for crashes. But plenty of incorrect executions don’t crash, and we’d like to find those too. A test oracle is an automated way to flag erroneous executions. This post is intended to be a public brainstorm: I’ll list all the kinds of oracles I’ve thought of, and hopefully you readers will leave comments filling in the gaps.

Programming environment rule violations. Most programming environments have some amount of built-in checking for rule violations. For example, the Linux execution environment will crash a process that attempts to access an unmapped part of the address space, and Java throws exceptions for lots of reasons. In many cases, the execution environment can be asked or forced to act as a more effective oracle. For example, Perl code can be run under “use warnings” and the Linux environment can enforce a dramatically expanded set of rules by running a program under Valgrind. Some libraries that you use can be compiled with extra input validation turned on. Also, the code being tested has to cooperate by not hiding rule violations. For example, if my Linux program masks all signals or my Java program catches and ignores all exceptions, these oracles are going to have a hard time revealing anything interesting.
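
As a concrete sketch, here is roughly what driving one of these oracles from a test harness might look like in Python; the binary path and the details are placeholders:

    import subprocess

    # Sketch: run the program under test beneath Valgrind so that memory
    # errors the OS alone would not catch become hard failures.
    def run_one_test(binary, test_input):
        result = subprocess.run(
            ["valgrind", "--quiet", "--error-exitcode=99", binary],
            input=test_input, capture_output=True)
        if result.returncode == 99:
            print("valgrind reported a rule violation:")
            print(result.stderr.decode())
        return result.returncode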

Function/inverse pairs. If we can find a pair of operations that undo each other, we can generate a random test case, apply the function to it, apply the inverse function to the result, and compare the output with the original test case. I wrote about this earlier.
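
A minimal sketch of the pattern, using serialization and deserialization as the pair:

    import json, random

    # Sketch: generate a random value, push it through a function/inverse
    # pair (serialize, then deserialize), and compare with the original.
    def random_value(depth=0):
        if depth > 2 or random.random() < 0.4:
            return random.choice([None, True, random.randint(-1000, 1000)])
        return {str(i): random_value(depth + 1)
                for i in range(random.randint(0, 3))}

    for _ in range(10000):
        x = random_value()
        assert json.loads(json.dumps(x)) == x, x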

Nops. Many systems have semantic nops that can be exploited to find bugs (a function/inverse pair is a kind of nop, but it seemed important enough to split out into its own category). For example, if we’re testing a transaction processing system we might begin a lot of transactions, force them all to abort, and then check that the data has not been changed. If we’re testing a JVM we can force the garbage collector to execute after every instruction and then check that the program’s output is not changed. At the hardware level, flushing the pipeline or caches should not affect the ongoing computation. Concurrent software (usually) should not change its result if we put random sleeps inside critical sections or force the thread scheduler to perform many more context switches than it usually would. A system that is designed to recover from transient faults, such as dropped or corrupted network messages, should not change its behavior when such a fault is artificially induced.
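
Here is one way such an oracle might be sketched in code; gc.collect() plays the role of the semantic nop, and the operations are whatever the random tester generated:

    import gc

    # Sketch: run the same random operations twice, once with a semantic
    # nop (a forced garbage collection) after every step; the observable
    # results should be identical.
    def run_ops(ops, make_state, with_nop):
        state = make_state()
        for op in ops:
            op(state)
            if with_nop:
                gc.collect()  # should not affect the computation
        return repr(state)

    def check_nop_oracle(ops, make_state):
        assert run_ops(ops, make_state, False) == run_ops(ops, make_state, True)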

Nullspace transformations. If I’m testing a compiler, I might take some test case and replace every occurrence of x with (1+x-1) or (1.0 * x) or perhaps a call to a silly getter or setter function. The transformed program, when compiled, should behave the same as the original, as long as I’m super careful not to run afoul of overflows and such. If I’m testing a spellchecker, I might introduce whitespace changes or additional correctly-spelled words and make sure its behavior is not changed. The basic idea is that even if we can’t predict the behavior of a system, we can probably think of ways to change inputs to that system in such a way that the output of a correct system will not change. I’m grabbing the name of this kind of oracle from this paper.
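
A toy version of this oracle for integer expressions might look like the following sketch, with Python’s eval standing in for the system under test:

    import random

    # Sketch: replacing x with (x + 1 - 1) is a nullspace transformation
    # for integer expressions, so evaluation results must not change.
    def random_expr(depth=0):
        if depth > 3 or random.random() < 0.3:
            return random.choice(["x", str(random.randint(0, 9))])
        op = random.choice(["+", "-", "*"])
        return "(%s %s %s)" % (random_expr(depth + 1), op, random_expr(depth + 1))

    for _ in range(1000):
        e = random_expr()
        t = e.replace("x", "(x + 1 - 1)")
        env = {"x": random.randint(-100, 100)}
        assert eval(e, env) == eval(t, env), (e, t)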

Multiple versions. Occasionally we have multiple independent implementations of the same specification, such as a collection of Java compilers, and they can be tested against each other. Even when this is not the case, we can often test today’s version of our system against yesterday’s version or perhaps the previous release. Some kinds of program, such as optimizing compilers, effectively contain many versions in one: the -O0 version of GCC, the -O1 version, the -O2 version, etc.
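
As a sketch of the compiler case (test.c is assumed to be a freshly generated random program that prints something deterministic):

    import subprocess

    # Sketch: compile the same random program at several optimization
    # levels; any disagreement in output points at a compiler bug.
    outputs = []
    for level in ["-O0", "-O1", "-O2"]:
        subprocess.run(["gcc", level, "-o", "test_bin", "test.c"], check=True)
        r = subprocess.run(["./test_bin"], capture_output=True, timeout=60)
        outputs.append(r.stdout)
    assert all(o == outputs[0] for o in outputs), "optimization levels disagree"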

Assertions. Internal consistency checks can be a very powerful kind of oracle. Here, we run our code on a random input and simply wait for the code to report an error.
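
For example, in deliberately tiny sketch form:

    import bisect, random

    # Sketch: the code under test carries its own consistency check; the
    # random tester just supplies inputs and waits for an AssertionError.
    class SortedList:
        def __init__(self):
            self.items = []
        def insert(self, x):
            bisect.insort(self.items, x)
            assert all(a <= b for a, b in zip(self.items, self.items[1:]))

    s = SortedList()
    for _ in range(1000):
        s.insert(random.randint(0, 100))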

Resource limits. Sometimes we can be pretty sure that our code should not, for example, use more than 1 GB of RAM or 1 minute of CPU time on simple random inputs. In that case, we can use a facility such as ulimit to arrange for the OS to signal an error when a resource bound is exceeded. Finding good resource limits can be hard.
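
For instance, Python’s resource module exposes the same mechanism as the ulimit shell builtin; the numbers here are illustrative:

    import resource

    # Sketch: let the OS enforce the oracle. Exceeding the CPU limit gets
    # the process a SIGXCPU; exceeding the address-space limit makes
    # allocations start failing.
    resource.setrlimit(resource.RLIMIT_CPU, (60, 70))           # seconds
    resource.setrlimit(resource.RLIMIT_AS, (1 << 30, 1 << 30))  # ~1 GB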

 


14 responses to “Oracles for Random Testing”

  1. Busy with a paper submission, but off hand, metamorphic testing comes to mind — I know that making a certain change to the input x will cause a predictable change in the output F(x). A generalization of nullspace, in a sense — predictable changes in output based on a class of input transformations.

  2. A generalization of the function/inverse and nop ideas is fixed-point iteration: applying a function F to the program x, then to F(x), and so on, should converge “quickly” to some y with F(y) = y. A classic example is that a parser and AST pretty-printer pair should converge within two cycles; another is the multi-stage bootstrap trick for compilers (a compiler, compiled by a copy of itself, should produce byte-for-byte the same binary regardless of whether that copy was built by a host compiler or by itself: the classic stage-2/stage-3 comparison).
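
    A sketch of the parser/pretty-printer case, with parse and pretty as stand-ins for the pair under test:

        # Sketch: iterating pretty(parse(...)) should reach a fixed point
        # within a couple of cycles; parse and pretty are stand-ins.
        def converges(source, parse, pretty, max_cycles=2):
            for _ in range(max_cycles):
                printed = pretty(parse(source))
                if printed == source:
                    return True   # fixed point reached
                source = printed
            return False          # oracle fires: no fixed point in time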

  3. A generalization of my earlier comment on Function/inverse pairs: I’d suggest generalizing the term to “Identities with Fudge Factors.” The obvious one is the traditional sine squared plus cosine squared, though we didn’t get much mileage out of that. A slightly more complicated test did find a genuine bug in the PalmSource compiler’s hypotf.
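
    In sketch form, with a tolerance standing in for the fudge factor:

        import math, random

        # Sketch: check sin^2(x) + cos^2(x) == 1 up to a fudge factor
        # chosen to absorb legitimate rounding error.
        for _ in range(100000):
            x = random.uniform(-1e6, 1e6)
            s, c = math.sin(x), math.cos(x)
            assert abs(s * s + c * c - 1.0) < 1e-12, x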

  4. Re inverse functions: we (Hex-Rays) started using Csmith to test our decompiler. We generate a “good” (non-erroring and non-crashing) program with Csmith, compile and run it, decompile the object code, compile the decompiled code and (if this step succeeded) check that the re-compiled program produces the same output. We are not using it extensively yet but we already found many subtle (and not so subtle) decompilation bugs. Many thanks for the great program!

  5. Multiple Versions reminds me of a technique I learned from a Wise Old Programmer. Let’s say you need to re-implement an interface, e.g. to make it faster or use less memory. Keep the old implementation available. When you are ready to validate the new one, run your entire test suite; under the covers, use both the new and old implementations, and verify that the results are the same. When you’re sure everything is working properly, cut away the old code. You can save your customers a lot of pain this way.
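
    In sketch form (old_lookup and new_lookup are invented names):

        # Sketch of the dual-run shim: consult both implementations behind
        # the interface and insist they agree before retiring the old one.
        def lookup(key):
            new_result = new_lookup(key)   # the faster reimplementation
            old_result = old_lookup(key)   # the trusted original
            assert old_result == new_result, (key, old_result, new_result)
            return new_result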

  6. Generalizing assertions, of course, there is the huge area of checking properties of logs. Temporal logic if you come from a fancy model-checking background (maybe with instrumentation for runtime verification), but regexps over logs work for everyone — I’ve used this on actual code, not just in papers. (A small sketch appears below.)

    LogScope was an attempt to make log properties fairly easy to write and check; we used it some on Curiosity software, before the test team was disbanded while the mission waited for its delayed launch (http://www.cs.cmu.edu/~agroce/icse10.pdf).
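
    For instance, a regexp-level log property might be sketched like this (the log format is made up): every transaction that begins must eventually end.

        import re

        # Sketch: a temporal property checked over a log with regexps.
        # Property: every "begin txn N" is eventually followed by
        # "end txn N".
        def check_log(lines):
            open_txns = set()
            for line in lines:
                m = re.match(r"(begin|end) txn (\d+)", line)
                if m and m.group(1) == "begin":
                    open_txns.add(m.group(2))
                elif m:
                    open_txns.discard(m.group(2))
            assert not open_txns, "never completed: %s" % sorted(open_txns)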

  7. “Nullspace transformations” and “nops” seem like the same principle to me. In both cases, you’re testing your implementation of F by finding some function g that preserves all the interesting properties of its argument (which is to say, g(x) ≡ x) and then verifying that F(g(x)) ≡ F(x).

    “Multiple versions” testing (e.g. testing -O1 against -O2) is useful when you have implementations of F and F’ that are supposed to be equivalent; in that case you can verify that F(x) ≡ F'(x).

    “Function-inverse pairs” are distinct from “nullspace transformations” (for the record I think those are both horrible obfuscatory names): in this case, you’re testing F by verifying that x ≡ h(F(x)) for some “inverse function” h.

    In all the cases above, note that “x ≡ y” really means “f(x) = f(y) for some conveniently chosen hash function f”. Maybe that hash function is “run the code and take its output”, or maybe it’s “dump the contents of the text section through od”, or maybe it’s just “grep the output for a certain message”.

    And by “verify …” I really mean “provide some concrete evidence in favor of the hypothesis that …”. 🙂

    Compiling a compiler with itself is sort of a special case, I think. In that case your function F is supposed to have the special property that F(p) ≡ p for any program p. So you can do “multiple versions” testing with your original F and with an F’ := F(F); if the two versions ever differ in their behavior, then you’ve found a program p (namely F itself) for which F(p) is distinguishable from p.

  8. Hi Arthur, please consider supplying some non-horrible, non-obfuscated names! Alex and I are going to be putting this material in a book at some point so now’s your chance to stop us from putting these into print.

  9. By the time I wrote my third garbage collector, I knew that debugging it was going to suck, so before I wrote the collector I wrote a pretty full set of invariant-checking assertions for the heap, and then ran the checks before and after every GC.
    I used the same trick in the SoC-C runtime system (a load of coroutines, fifos, irq handlers, etc.) – very effective.

    You had already mentioned assertions, but I think the special case of (quite expensive, recursive functions that check) invariants is worth particular attention.
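
    An outline of what such a check might look like (the heap interface here is invented):

        # Sketch: a deep, recursive heap invariant check run before and
        # after every collection.
        def check_heap_invariants(heap):
            reachable = set()
            stack = list(heap.roots)
            while stack:
                obj = stack.pop()
                if obj in reachable:
                    continue
                reachable.add(obj)
                assert obj in heap.allocated, "dangling reference"
                stack.extend(heap.refs(obj))
            assert reachable.isdisjoint(heap.free_list), \
                "reachable object on the free list"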

  10. Hi Alastair, I like to make a distinction between shallow assertions (only look at metadata, run in small constant time) and deep assertions (may look at all data in a data structure, may have runtime even worse than that, for example to verify a sort routine).
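
    Concretely, for a sorted container the two flavors might look like this sketch (t is a hypothetical container with a cached count):

        # Sketch: a shallow assertion inspects metadata in roughly
        # constant time; a deep assertion walks all the data.
        def shallow_check(t):
            assert t.count == len(t.items)    # metadata only

        def deep_check(t):
            shallow_check(t)
            assert all(a <= b                 # full O(n) scan
                       for a, b in zip(t.items, t.items[1:]))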

  11. @John #10: I don’t have any shrinkwrapped suggestions, but I think good names would reflect the mathematical equivalences going on here. For example, I’d describe “nullspace transformations” by the phrase “Verifying that equivalent inputs produce equivalent outputs”. For “function-inverse pairs” the best I’ve got is “Verifying that a function of the output is equivalent to the input”. Then, for “multiple versions testing” (which I think is already a good name, btw), it would be “Verifying that the output is equivalent to a [known] function of the input.”

    It *is* cool (now that I really notice it) that you’ve cleanly separated “testing methodology” – black-box, white-box, fuzz, coverage, etc. – from the much simpler question of “When we run any given test, how do we know if it passed or failed?”, to which there really *are* just a few possible answers. (It segfaults; it assert-fails; or it doesn’t match the expected output, for only a few possible definitions of “match”.)

    Maybe there’s a useful parallel to the categorization of weaknesses in cryptography: a cipher might be attackable if C(g(k)) ≡ C(k), or if h(C(k)) ≡ k,… okay, so the parallels aren’t very exact, but still there might be some useful terminology you could steal from that field.

  12. For parallel programs, what works great for us is a scheduler that randomizes execution order. That is, you run the same program many times and check that it produces the same output, but each time you use a different random seed, so the scheduler runs things in a different order. (A tiny sketch of the idea appears at the end of this comment.)

    I haven’t tried replacing a generic thread scheduler that way (a tool called Jinx does/did something like that, I believe); I usually work against a higher-level task creation interface (not unlike OpenMP or TBB). Here’s an open source framework with a randomizing scheduler that I’m working on now (the code isn’t that great, nor is the entire repository… it’s a couple of days old). The scheduler is here; it’s really dumb – it just runs the iterations of a parallel_for in random order, which is good enough to detect non-commutative side effects where there’s a race between different iterations:

    https://github.com/yosefk/checkedthreads/blob/master/src/shuffle_imp.c

    At work, we have something more elaborate dealing with arbitrary (but statically-known in our case) task graphs. With a task graph, it seems desirable to generate all schedules which could cause the order of independent tasks to flip (A before B in one order, B before A in another order; it’s equally desirable with iterations of parallel_for, it’s just so easy that it’s less worth mentioning). Here’s a write-up describing this:

    http://www.yosefk.com/blog/making-data-races-manifest-themselves.html

    A related oracle is access to data stored by another task/thread, however that’s defined in the parallelism framework – that is, the access itself regardless of how it affects the output. An easy way to track ownership (ID of the last storing task) is using something like Valgrind. Here’s a Valgrind tool which will flag access to data stored by someone else at runtime; it should be used together with a randomizing scheduler and then it will detect commutative side effects like accumulator updates or “almost commutative” side effects like allocation from thread-unsafe pools, which will otherwise go undetected (again the code is half-baked):

    https://github.com/yosefk/checkedthreads/blob/master/valgrind/checkedthreads_main.c

    I think this works great with programs which are coded in a parallel fashion for performance reasons (to compute something faster). I’m not sure it extends that well to inherently concurrent programs (like banking or websites), because the success of these methods seems to me to depend, among other things, on being able to limit the interface to parallelism in ways that may be too inflexible for such applications; but I haven’t really thought it through (I work on computer vision, not banking or websites…)

    (All of this is related to manipulating the scheduler as you mentioned, except it’s specifically directed at making the order of particular events different; which apparently has the downside of requiring more understanding of the app structure.)
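
    In miniature, the idea looks something like this sketch (a sequential emulation of different schedules, not the real scheduler):

        import random

        # Sketch: run the iterations of a parallel_for-style loop in a
        # different random order on every run; if iterations have racy,
        # non-commutative side effects, the final state will differ.
        def shuffled_for(n, body, state, seed):
            order = list(range(n))
            random.Random(seed).shuffle(order)
            for i in order:
                body(i, state)
            return state

        def check_order_independence(n, body, make_state, runs=100):
            baseline = shuffled_for(n, body, make_state(), seed=0)
            for seed in range(1, runs):
                assert shuffled_for(n, body, make_state(), seed) == baseline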