Operant Conditioning by Software Bugs

Have you ever used a new program or system and found it to be obnoxiously buggy, but then after a while you didn’t notice the bugs anymore? If so, then congratulations: you have been trained by the computer to avoid some of its problems. For example, I used to have a laptop that would lock up, to the point where the battery needed to be removed, when I scrolled down a web page for too long (I’m guessing the video driver’s logic for handling a full command queue was defective). Messing with the driver version did not solve the problem and I soon learned to take little breaks when scrolling down a long web page. To this day I occasionally feel a twinge of guilt or fear when rapidly scrolling a web page.

The extent to which we technical people have become conditioned by computers became apparent to me when one of my kids, probably three years old at the time, sat down at a Windows machine and within minutes rendered the GUI unresponsive. Even after watching which keys he pressed, I was unable to reproduce this behavior, at least partially because decades of training in how to use a computer have made it very hard for me to use one in such an inappropriate fashion. By now, this child (at 8 years old) has been brought into the fold: like millions of other people he can use a Windows machine for hours at a time without killing it.

Operant conditioning describes the way that humans (and of course other organisms) adapt their behavior in response to the consequences resulting from that behavior. I drink beer at least partially because this has made me feel good, and I avoid drinking 16 beers at least partially because that has made me feel bad. Conditioning is a powerful guiding force on our actions and it can happen without our being aware of it. One of my favorite stories is the one in which a psychology class trained the professor to lecture from one corner of the room by paying attention when he stood there and looking elsewhere when he did not. (This may be only an urban legend, but it seems plausible even so.) Surely our lives are filled with little examples of this kind of unconscious guidance from consequences.

How have software bugs trained us? The core lesson that most of us have learned is to stay in the well-tested regime and stay out of corner cases. Specifically, we will:

  • periodically restart operating systems and applications to avoid software aging effects,
  • avoid interrupting the computer when it is working (especially when it is installing or updating programs) since early-exit code is pretty much always wrong,
  • do things more slowly when the computer appears overloaded—in contrast, computer novices often make overload worse by clicking on things more and more times,
  • avoid too much multitasking,
  • avoid esoteric configuration options,
  • avoid relying on implicit operations, such as trusting MS Word to ask whether we want to save a document with unsaved changes when we quit.

I have a hunch that one of the reasons people mistrust Windows is that these tactics are more necessary there. For example, I never let my wife’s Windows 7 machine go for more than about two weeks without restarting it, whereas I reboot my Linux box only every few months. One time I had a job doing software development on Windows 3.1 and my computer generally had to be rebooted at least twice a day if it was to continue working. Of the half-dozen Windows laptops that I’ve owned, none of them could reliably suspend/resume ten times without being rebooted. I didn’t start this post intending to pick on Microsoft, but their systems have been involved with all of my most brutal conditioning sessions.

Boris Beizer, in his entertaining but long-forgotten book The Frozen Keyboard, tells this story:

My wife’s had no computer training. She had a big writing chore to do and a word-processor was the tool of choice. The package was good, but like most, it had bugs. We used the same hardware and software (she for her notes and I for my books) over a period of several months. The program would occasionally crash for her, but not for me. I couldn’t understand it. My typing is faster than hers. I’m more abusive of the equipment than she. And I used the equipment for about ten hours for each of hers. By any measure, I should have had the problems far more often. Yet, something she did triggered bugs which I couldn’t trigger by trying. How do we explain this mystery? What do we learn from it?

The answer came only after I spent hours watching her use of the system and comparing it to mine. She didn’t know which operations were difficult for the software and consequently her pattern of usage and keystrokes did not avoid potentially troublesome areas. I did understand and I unconsciously avoided the trouble spots. I wasn’t testing that software, so I had no stake in making it fail—I just wanted to get my work done with the least trouble. Programmers are notoriously poor at finding their own bugs—especially subtle bugs—partially because of this immunity. Finding bugs in your own work is a form of self-immolation. We can extend this concept to explain why it is that some thoroughly tested software gets into the field and only then displays a host of bugs never before seen: the programmers achieve immunity to the bugs by subconsciously avoiding the trouble spots while testing.

Beizer’s observations lead me to the first of three reasons why I wrote this piece, which is that I think it’s useful for people who are interested in software testing to know that you can generate interesting test cases by inverting the actions we have been conditioned to take. For example, we can run the OS or application for a very long time, we can interrupt the computer while it is installing or updating something, and we can attempt to overload the computer when its response time is suffering. It is perhaps instructive that Beizer is an expert on software testing, despite also being a successful anti-tester, as described in his anecdote.
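To make this concrete, here is the sort of harness I have in mind for the “interrupt it while it’s updating” case: repeatedly kill a worker process partway through a simulated update, then check that the data it was writing still passes an integrity check. This is only a sketch in Python, assuming a POSIX system; the SQLite workload and the timing constants are invented for illustration.

    import os, random, sqlite3, subprocess, sys, tempfile, time

    # A small "update" workload: many tiny transactions against a SQLite database.
    WORKER = '''
    import sqlite3, sys
    db = sqlite3.connect(sys.argv[1])
    db.execute("CREATE TABLE IF NOT EXISTS t (k INTEGER PRIMARY KEY, v TEXT)")
    for i in range(100000):
        with db:  # one transaction per insert
            db.execute("INSERT OR REPLACE INTO t VALUES (?, ?)", (i, "x" * 100))
    '''

    def one_round(path):
        # Start the update, let it run for a random interval, then kill it hard.
        p = subprocess.Popen([sys.executable, "-c", WORKER, path])
        time.sleep(random.uniform(0.01, 0.5))
        p.kill()
        p.wait()
        # The database should still be readable and internally consistent.
        db = sqlite3.connect(path)
        status = db.execute("PRAGMA integrity_check").fetchone()[0]
        db.close()
        assert status == "ok", "integrity check failed: " + status

    if __name__ == "__main__":
        path = os.path.join(tempfile.mkdtemp(), "test.db")
        for _ in range(20):
            one_round(path)
        print("survived 20 interrupted updates")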

The second reason I wrote this piece is that I think operant conditioning provides a partial explanation for the apparent paradox where many people believe that most software works pretty well most of the time, while others believe that software is basically crap. People in the latter camp, I believe, are somehow able to resist or discard their conditioning in order to use software in a more unbiased way. Or maybe they’re just slow learners. Either way, those people would make amazing members of a software testing team.

Finally, I think that operant conditioning by software bugs is perhaps worthy of some actual research, as opposed to my idle observations here. An HCI researcher could examine these effects by seeding a program with bugs and observing the resulting usage patterns. Another nice experiment would be to provide random negative reinforcement by injecting failures at different rates for different users and observing the resulting behaviors. Anyone who has been in a tech support role has seen the bizarre cargo cult rituals that result from unpredictable failures.

In summary, computers are Skinner Boxes and we’re the lab rats—sometimes we get a little food, other times we get a shock.

Acknowledgment: This piece benefited from discussions with my mother, who previously worked as a software tester.

Is the Browser the New OS?

Yes, this is an old question. I still think it’s interesting. Disclaimer: I haven’t tried out a Chromebook yet.

First, let’s look at the situation as of late 2012. The applications I use generally fall into three categories:

  1. Web-based.
  2. Native, but easily available on Windows, Mac, and Linux. These include a file browser, a shell, Emacs, and LaTeX.
  3. Native and less portable: Photoshop, development tools such as compilers and debuggers, high-performance games.

A quick look at the Chrome store indicates that most or all of the applications in category #2 are already available in browser versions (though I currently use the native versions out of inertia). Category #3 is trickier; it’s not clear that it really makes sense to port, for example, Photoshop to the browser. On the other hand, category #3 is basically power tools, and many people (including me, much of the time) can get by using weaker web-based versions. Even embedded software development tools, which might at first seem to be the antithesis of web applications, have web-based versions. In summary, to a first approximation, “the browser is the new OS” has mostly happened already, though we do see an interesting reverse trend in the mobile space (even if quite a few of those native apps are thin wrappers around a browser).

Second, how did we get to this point? The first application I ever saw that used a browser as its primary interface was SATAN in 1995. I remember thinking this was an elegant and striking design decision; it permitted the implementors to focus on their application logic instead of wasting time on what would likely have ended up being the kind of crappy X windows GUI that was standard at the time. Not long after 1995, Java-enabled browsers got everyone (for a short time, at least) interested in the potential for platform-independent applications delivered over the web. But this potential remained unrealized. Around 2008 it became clear that the following three efforts, taken together, constituted a serious attempt to create a new platform. First, Google Chrome, with a fast JavaScript engine and significantly increased robustness (and perhaps also security) due to the process-per-tab model. Second, Google Gears, which facilitated disconnected operation for web applications. And third, Google Docs, which implements at least 80% of the useful stuff in MS Office. Of course Gears is gone, but similar functionality is available elsewhere. More recently, Chrome OS and the Chromebook make it clear what Google wants to see. I’m painting a rather Google-centric picture here, and in fact I use their stuff a lot. However, most of the time I could get by using Firefox, Bing, and other alternative technologies.

Finally, what is the prognosis for browser-is-the-OS, looking forward? What will it take for people to really not care about which OS they’re running? First, we want to minimize the number of applications in category #3, but realistically I think most casual users don’t have that many apps in this category, and we power users are willing to compromise. For example, the special-purpose tools I use for research are probably always going to run outside the browser and that is fine; they’ll run on workstation-class machines at the office and in the cloud. Photoshop and high-quality video processing are not likely to move into the cloud anytime soon, but then again, these are special-purpose, workstation-grade applications. Most people already do their photo editing online.

The second thing that I require is that any web application that looks remotely like an editor (whether for programs, documents, presentations, web pages, or whatever) has to support rock-solid disconnected operation and have a sensible conflict resolution mechanism. Google Docs’ disconnected operation has seemed pretty good since last summer, but I worry that lots of applications are going to need special-purpose logic to handle this nicely, and it’s going to be a long slog since many developers won’t consider disconnected operation to be a priority.

Third, we need near-native performance in the browser. Plenty of JavaScript work has been done and it’s getting pretty fast. For more compute-intensive workloads Google Native Client seems like a good option.

In summary, we seem to be headed towards a segregated world where most people don’t need a PC, but rather an appliance that runs a browser. On the other hand, people who use computers more seriously will keep using workstation-style machines that run native apps when considerations such as performance, dealing with large amounts of data, security, reliability, and local control are paramount. I’ll go ahead and predict that in 2022, the number of PCs sold will be 25% of what it was in 2012. By PC I mean something like “non-Android, non-iOS machine primarily intended to be used by a single person to run native applications,” a category that includes all laptops except Chromebooks.

Software Testing Using Easy Cases

Chapter 2 of Street Fighting Mathematics is called Easy Cases. The idea is very simple:

A correct solution works in all cases, including the easy ones.

Solution 2 in this writeup of the napkin ring problem is a fun example of applying this principle.

Easy cases can be exploited for software testing as well as in mathematics. Many instances of this are trivial in the sense that any test case written by a human is likely to be an easy case when compared to the universe of all possible inputs. What I’m interested in here are clever and non-obvious applications of easy cases.

Let’s take the example of an FIR filter. Since these eat up a lot of CPU cycles in some systems, people create fiendishly optimized FIRs in assembly that are not so easy to verify by inspection. Chapter 8 of the ARM System Developer’s Guide has many great examples (code here). An FIR filter implementation can be tested using the following easy cases:

  • Feed the filter an impulse signal (…,0,0,0,1,0,0,0,…) and it should output its coefficients in order.
  • Feed the filter a step signal (…,0,0,0,1,1,1,…) and its output should settle to the sum of its coefficients.
  • Feed the filter a sine wave and its output should be a sine wave with the expected amplitude given the filter’s design.

Can you write an FIR that is wrong, but that passes these tests? Sure, but these will weed out a lot of wrong implementations. I learned about these tests here. In general, a lot of codes, particularly mathy ones, have these easy cases that can and should be exploited for simple sanity checks.
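To make these checks concrete, here is a rough Python sketch of the first two, run against a deliberately naive reference FIR; in practice fir() would be replaced by a call into the optimized implementation you actually care about, and the sine-wave check is omitted for brevity:

    import math, random

    def fir(coeffs, signal):
        """Naive direct-form FIR: y[n] = sum over k of coeffs[k] * x[n-k]."""
        out = []
        for n in range(len(signal)):
            acc = 0.0
            for k, c in enumerate(coeffs):
                if n - k >= 0:
                    acc += c * signal[n - k]
            out.append(acc)
        return out

    def test_impulse(coeffs):
        # An impulse should produce the coefficients, in order.
        impulse = [1.0] + [0.0] * (len(coeffs) - 1)
        assert fir(coeffs, impulse) == list(coeffs)

    def test_step(coeffs):
        # A long step should settle to the sum of the coefficients.
        step = [1.0] * (4 * len(coeffs))
        assert math.isclose(fir(coeffs, step)[-1], sum(coeffs))

    if __name__ == "__main__":
        coeffs = [random.uniform(-1.0, 1.0) for _ in range(16)]
        test_impulse(coeffs)
        test_step(coeffs)
        print("easy cases passed")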

Oracles for Random Testing

Random testing is a powerful way to find bugs in software systems. However, to actually find a bug it’s necessary to be able to automatically tell the difference between a correct and an incorrect execution of the system being tested. Sometimes this is easy: we’re just looking for crashes. On the other hand, there are lots of incorrect executions that don’t crash and we’d like to find those too. A test oracle is what we call a way to automatically find erroneous executions. This post is intended to be a public brainstorm: I’ll list all the kinds of oracles I’ve thought of and hopefully you readers will leave comments filling in the gaps.

Programming environment rule violations. Most programming environments have some amount of built-in checking for rule violations. For example, the Linux execution environment will crash a process that attempts to access an unmapped part of the address space. Java has lots of reasons to throw exceptions. In many cases, the execution environment can be asked or forced to act as a more effective oracle. For example, Perl code can be run under “use warnings” and the Linux environment can enforce a dramatically expanded set of rules by running a program under Valgrind. Some libraries that you use can be compiled with extra input validation turned on. Also, the code being tested has to cooperate by not hiding rule violations. For example, if my Linux program masks all signals or my Java program catches and ignores all exceptions, these oracles are going to have a hard time revealing anything interesting.
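For instance, Valgrind’s memcheck can be turned into a crisp pass/fail oracle by asking it to change the exit code of any run in which it reports errors. A minimal sketch, assuming a POSIX system with Valgrind installed; ./target is a placeholder for whatever binary is being fuzzed:

    import os, random, subprocess

    def memcheck_clean(binary, test_input):
        """Return True if memcheck reported no errors for this input."""
        result = subprocess.run(
            ["valgrind", "--quiet", "--error-exitcode=99", binary],
            input=test_input,
            capture_output=True,
        )
        # Exit code 99 means memcheck reported at least one error;
        # anything else is the program's own exit status.
        return result.returncode != 99

    if __name__ == "__main__":
        for i in range(100):
            fuzz = os.urandom(random.randint(0, 4096))
            if not memcheck_clean("./target", fuzz):
                with open("failing-input-%d.bin" % i, "wb") as f:
                    f.write(fuzz)
                print("memcheck flagged input", i)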

Function/inverse pairs. If we can find a pair of operations that undo each other, we can generate a random test case, apply the function to it, apply the inverse function to the result, and compare the output with the original test case. I wrote about this earlier.
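Here is about the smallest possible Python sketch of the idea, using compression as the function and decompression as its inverse:

    import os, random, zlib

    def roundtrip_ok(data):
        # Compress followed by decompress should be the identity function.
        return zlib.decompress(zlib.compress(data)) == data

    if __name__ == "__main__":
        for _ in range(1000):
            blob = os.urandom(random.randint(0, 65536))
            assert roundtrip_ok(blob), "round trip changed the data"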

Nops. Many codes have semantic nops that can be exploited to find bugs (a function/inverse pair is a kind of nop but it seemed important enough to split out into its own category). For example, if we’re testing a transaction processing system we might begin a lot of transactions, force them all to abort, and then check that the data has not been changed. If we’re testing a JVM we can force the garbage collector to execute after every instruction and then check that the program’s output is not changed. At the hardware level, flushing the pipeline or caches should not affect the ongoing computation. Concurrent software (usually) should not change its result if we put random sleeps inside critical sections or force the thread scheduler to perform many more context switches than it usually would. A system that is designed to recover from transient faults, such as dropped or corrupted network messages, should not change its behavior when such a fault is artificially induced.
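Here is a sketch of the garbage-collector flavor of this oracle, translated into Python terms: run a computation once normally and once with a forced collection after every step, and require identical results. The workload and the per-step granularity are invented for illustration.

    import gc, random

    def workload(seed, force_gc=False):
        """A deterministic computation built out of many small steps."""
        rng = random.Random(seed)
        table = {}
        for _ in range(10000):
            table[rng.randrange(1000)] = [rng.random() for _ in range(5)]
            if force_gc:
                gc.collect()  # the semantic nop: collecting here should not matter
        return sorted(table.items())

    if __name__ == "__main__":
        for seed in range(20):
            assert workload(seed) == workload(seed, force_gc=True)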

Nullspace transformations. If I’m testing a compiler, I might take some test case and replace every occurrence of x with (1+x-1) or (1.0 * x) or perhaps a call to a silly getter or setter function. The transformed program, when compiled, should behave the same as the original, as long as I’m super careful not to run afoul of overflows and such. If I’m testing a spellchecker, I might introduce whitespace changes or additional correctly-spelled words and make sure its behavior is not changed. The basic idea is that even if we can’t predict the behavior of a system, we can probably think of ways to change inputs to that system in such a way that the output of a correct system will not change. I’m grabbing the name of this kind of oracle from this paper.
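In the spellchecker flavor, for example, the transformation might insert extra whitespace and known-good words, and the oracle checks that the set of flagged words is unchanged. A sketch, where spellcheck() is a hypothetical function standing in for the system under test and is assumed to return the misspelled words it finds:

    import random

    FILLER_WORDS = ["the", "and", "ocean", "purple"]  # assumed correctly spelled

    def nullspace_transform(text, rng):
        """Mutate the input in ways that should not change which words get flagged."""
        out = []
        for w in text.split():
            out.append(w)
            if rng.random() < 0.2:
                out.append(rng.choice(FILLER_WORDS))
        # Also vary the amount of whitespace between words.
        return "".join(w + " " * rng.randint(1, 4) for w in out)

    def check_nullspace(spellcheck, text, seed=0):
        rng = random.Random(seed)
        original = set(spellcheck(text))
        transformed = set(spellcheck(nullspace_transform(text, rng)))
        # The same words should be flagged before and after the transformation.
        assert original == transformed, (original, transformed)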

Multiple versions. Occasionally we have multiple independent implementations of the same specification, such as a collection of Java compilers, and they can be tested against each other. Even when this is not the case, we can often test today’s version of our system against yesterday’s version or perhaps the previous release. Some kinds of program, such as optimizing compilers, effectively contain many versions in one: the -O0 version of GCC, the -O1 version, the -O2 version, etc.
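A minimal sketch of this kind of differential testing, with Python’s built-in sorted() standing in for the trusted old version and a hand-rolled insertion sort standing in for the code under test:

    import random

    def my_sort(xs):
        """The implementation under test (a simple insertion sort here)."""
        xs = list(xs)
        for i in range(1, len(xs)):
            j = i
            while j > 0 and xs[j - 1] > xs[j]:
                xs[j - 1], xs[j] = xs[j], xs[j - 1]
                j -= 1
        return xs

    if __name__ == "__main__":
        for _ in range(1000):
            data = [random.randint(-100, 100) for _ in range(random.randint(0, 50))]
            assert my_sort(data) == sorted(data), data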

Assertions. Internal consistency checks can be a very powerful kind of oracle. Here, we run our code on a random input and simply wait for the code to report an error.
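A small sketch of what this looks like in practice: a container that checks its own invariant after every operation, driven by a random sequence of operations (the data structure here is invented purely for illustration):

    import bisect, random

    class SortedBag:
        def __init__(self):
            self.items = []

        def _check(self):
            # Internal consistency: the list must actually be in sorted order.
            assert all(a <= b for a, b in zip(self.items, self.items[1:]))

        def add(self, x):
            bisect.insort(self.items, x)
            self._check()

        def pop_smallest(self):
            if self.items:
                self.items.pop(0)
            self._check()

    if __name__ == "__main__":
        bag = SortedBag()
        for _ in range(100000):
            if random.random() < 0.7:
                bag.add(random.randint(0, 1000))
            else:
                bag.pop_smallest()
        print("no assertion fired")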

Resource limits. Sometimes we can be pretty sure that our code should not, for example, use more than 1 GB of RAM or 1 minute of CPU time on simple random inputs. In that case, we can use a mechanism such as ulimit to arrange for the OS to signal an error when a resource bound is exceeded. Finding good resource limits can be hard.
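On a POSIX system, the Python resource module gives a test harness the same leverage as ulimit. A sketch; the particular limits, and the idea of treating death-by-signal as the failure indication, are just one way to set this up:

    import resource, subprocess

    def limited(cpu_seconds=60, mem_bytes=1 << 30):
        """Build a preexec_fn that applies resource limits to the child process."""
        def apply_limits():
            resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
            resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        return apply_limits

    def run_one(binary, test_input):
        result = subprocess.run(
            [binary], input=test_input, capture_output=True,
            preexec_fn=limited(),
        )
        # A negative return code means the child died from a signal,
        # e.g. SIGXCPU after blowing through the CPU limit.
        return result.returncode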


How Does Formal Verification Affect Software Testing?

This has been a difficult piece to write and I’ve already deleted everything and started over more than once. So, I’m going to take the easy way out and structure it as a sequence of questions and answers.

What does formal verification mean? Something like “using mathematical techniques to convincingly argue that a piece of software implements a specification.” The specification may be very domain-specific, as in “the robot may not injure a human being or, through inaction, allow a human being to come to harm,” or it may be generic, as in “no execution of the system dereferences a null pointer.” So far, we know extremely little about what might be involved in verifying a specification such as Asimov’s Laws that identifies entities in the real world, as opposed to entities inside the software system.

Do we need both testing and formal verification? Yes. It’s absurd to think we could do without testing. Testing is how we create systems that work. On the other hand, testing alone suffers from needle-in-haystack problems where critical software almost certainly has surprising behaviors that we need to know about, but we don’t know how to create test inputs that trigger them.

Why would testing and formal verification interact at all? Aren’t these separate activities? Today, testing and formal verification are largely independent activities. I’ve heard of a very small number of cases where formal verification is considered an acceptable substitute for running some kinds of tests. In the long run we can hope to see more of this.

Under what conditions can formal verification reduce the amount of testing that is required? This is difficult. The basic observation is that formal verification tends to eliminate the possibility of certain types of bugs in a narrow layer of the system—the one that is being verified. To be concrete, let’s say that I prove that my robot software faithfully follows Asimov’s laws, but then I compile it with a buggy compiler, run it on a buggy operating system, run it on a buggy processor, or run it inside a robot with a defective camera that cannot reliably recognize human beings. This robot, of course, will fail to follow Asimov’s laws; the proofs about its software are rendered irrelevant by violated assumptions. These problems need to be discovered via testing (which, of course, may also fail to reveal them). To make formal verification apply to a wider range of layers, we need verified operating systems, compilers, libraries, chips, and more. Of course, people are working on all of these things, though at present the chaining together of correctness proofs for a realistic software stack has not yet happened.

Another problem with formal verification is that the specifications themselves are often quite complicated and difficult to debug; gaining confidence in a specification requires a testing process not very different from how we gain confidence in code. Even if we have non-buggy formal specifications, it can be unclear what the specification means in practice, and it can be unclear that the specification describes all of the behaviors that we might care about. Usability considerations are an example of a property of software that we care about, that can be tested, but that may not be possible to formally specify and verify.

But let’s get back to the question: Under what conditions can formal verification replace testing? My answer would be:

  • The formal verification tool is considered to be trustworthy.
  • The specification is considered to be trustworthy.
  • The specification is considered to capture all properties of interest.
  • Layers of the system not being verified (compiler, OS, etc.) are considered to be trustworthy.

Of course, it may be difficult to satisfy all of these criteria. In contrast, testing makes far fewer assumptions and provides end-to-end checks of the compiler, the OS, the code being developed, and so on.

Why do we want formal verification to replace some kinds of testing? The reason is simple: cost. Although formal verification is currently quite expensive, techniques are improving every year. Testing techniques are also improving, but not necessarily at a very rapid rate. The fundamental problem with testing is that we can’t do a very good job of it in any realistic amount of time. The great hope of formal verification is that once the large initial investment is made to get it going, incremental re-verification will be relatively inexpensive, permitting high-quality software to be delivered rapidly. As the technologies improve, even the initial costs are likely to come down. In the long run, reduction in cost, not reduction in defects, will be the killer app for formal methods.

How about the other way around: When can testing replace formal verification? Testing all possible behaviors of a system is formal verification; it is a kind of proof by exhaustion. Unfortunately, exhaustive testing is only feasible in very narrow circumstances.
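Here is a scale model of proof by exhaustion: a branch-free averaging trick checked against its specification for every possible pair of 8-bit unsigned inputs, all 65,536 of them (the function under test is just an example):

    def avg_floor(a, b):
        """Intended to compute (a + b) // 2 without ever overflowing 8 bits."""
        return (a & b) + ((a ^ b) >> 1)

    if __name__ == "__main__":
        # Exhaustive over the whole input space, so this really is a proof
        # (for 8-bit inputs and this model of the arithmetic).
        for a in range(256):
            for b in range(256):
                assert avg_floor(a, b) == (a + b) // 2, (a, b)
        print("verified for all 8-bit input pairs")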

Can testing and formal verification interact in other ways? Definitely, and I think the possibilities here are very exciting. We are already seeing excellent systems like Klee and SAGE where formal methods are used to produce test cases. In the future we will start to see patchwork verification efforts where part of the code is verified to be correct, part of it is verified only to be free of certain bugs, and part of it is not verified at all. At that point we will need Klee-like tools that take these different levels of verification into account and generate test cases that maximally increase our confidence in the correctness of the unverified code. Even a fully formally verified software stack may contain some surprises, for example discontinuities in the functions it implements or perhaps unwarranted sensitivity to changes in input variables. We want test case generators for these too.

Raspberry Rockets

One of the things I most enjoy about teaching embedded systems is that the students show up with a very diverse set of skills. Some are straight-up CS, meaning they can hack but probably are intimidated by a breadboard, logic analyzer, or UART. Others are EE, meaning that they can design a noise-free circuit or lay out a PCB, but probably can’t write an advanced data structure. Even the computer engineering students (at Utah this isn’t a department but rather a degree program that spans departments), who know both hardware and software, have plenty to learn.

The projects in my class are always done on some embedded development board; I’ve used four or five different ones. This semester we used the Raspberry Pi, which has a very different feel than all previous boards we’ve used because you can just shell into the thing and run emacs and gcc.

This fall I did something a bit different than usual, which was to structure all of the programming projects around a single specification taken from Knight and Leveson’s 1986 paper about n-version programming. The specification is for a (fake) launch interceptor system that takes a collection of parameters and a collection of points from a radar, applies a bunch of rules, and outputs a launch / no-launch decision for a missile interceptor. It’s a good fit for class projects because it’s generally pretty straightforward code but has a few wrinkles that most people (including me) aren’t going to get right on the first try. The students divided into groups and implemented my slightly adapted version of the specification, then we did several rounds of testing and fixing. All of this ate up a lot of the semester, and we didn’t end up having time to use the nice little fuzzer that I built. For another assignment I asked each group to compute the worst-case stack memory usage of their code.

For the final project, with just a few weeks left in the semester, we put the whole class together on a project where multiple boards cooperated to make launch decisions. First, a master node, running Linux, loaded launch interceptor conditions and sent them to a couple of subordinate nodes using SPI. Although good SPI master drivers are available for the RPi, we were unable to use the BCM2835’s SPI device in slave mode since the appropriate pins are not exposed to the external connectors (also the SoC is a BGA so we can’t “cheat” and get access to the pins a different way). Therefore, a group implemented a bit-banged SPI slave for the subordinate nodes. For fun, and to avoid problems with the tight timing requirements for bit-banged SPI, these nodes dropped Linux and ran on bare metal. When a subordinate node received a collection of launch interceptor data it computed the launch or no-launch decision and then provided the answer to the master node when queried. The master, then, would only make a launch decision when both subordinate nodes said that a launch should happen, at which point it sent a signal to a high-power relay that sent a generous amount of current to the rocket igniter.

Since the students did an awesome job getting all of this working, we spent the last day of classes launching rockets in a field at the edge of campus; pictures follow. The system shown in the pictures actually includes two different implementations of the subordinate node: one on RPi and the other on a Maple board, which uses an ARM Cortex-M3 chip. I like this modification since it’s more in the spirit of the original n-version paper anyway.

Minimum Pubs for a PhD in CS?

Some of the faculty in my department would prefer that we don’t award a PhD to any candidate who hasn’t published at least three good papers. I’m curious: is this kind of requirement common, and do people generally have strong feelings about it either way? Some web searching didn’t turn up much information: UConn requires three papers, Cornell does not have a requirement, and Phil Guo reports that Stanford CS has an informal ~3-paper requirement. It’s pretty easy to put forth both pros and cons for this kind of requirement. I’ll omit my own views on this and maybe write about them later on.

Also see: Discussion at Hacker News.