ISSTA 2011

Earlier this week I gave one of the keynote talks at ISSTA, the International Symposium on Software Testing and Analysis. A year ago Matt Dwyer, the general chair, sent me the following invitation:

I would like to invite you to give a keynote talk to the meeting about the challenges in testing, dynamic and static analysis aimed at fault detection for embedded software and particularly sensor network applications. I believe that as sensor network apps continue to mature that new, perhaps domain-specific, V&V techniques will be needed in order to field reliable systems. This topic has received very little attention…

I thought this was pretty cool because it sounds almost exactly like something that I’d have written. The premise of my talk was that there’s a huge amount of interesting research on testing that still needs to be done in the embedded domain, and that — unlike in the past — there are now a number of really nice open-source embedded platforms such as Arduino, TinyOS, Android, and ROS that should provide ready-made audiences for solid tool work. Here are the slides:

Issta11: http://www.slideshare.net/regehr/issta11

Of the ISSTA talks I saw (which unfortunately wasn’t all that many, due to the multi-track nature of the event and the fact that I had to skip a couple of sessions to finish preparing my own talk), one that I really liked was Philip Guo’s talk about a hacked Python interpreter that makes computations persistent and also transparently memoizes them when this is possible and profitable. The idea is that a lot of people use languages like Python to do data analysis; they end up spending a lot of time parsing and printing temporary files, and they have a hard time figuring out which scripts need to be re-run when something changes. Persistence means you don’t have to explicitly write data to files, and memoization means you can just invoke all of the code every time you want the results and get a cached result unless something has changed that actually necessitates re-running. Other than manually deciding what to re-run (which is what I do), the alternative is to write a makefile, and that is a big pain. His infrastructure takes care of this stuff automatically. I’d use it except that I do all of this kind of work in Perl. Oh well. Philip is also the person who wrote the excellent CDE packager for Linux applications.

Lionel Briand gave a great talk about some problems with the idea of adaptive random testing (ART), a variation of black-box fuzzing where test inputs are selected to be “far” from previous test inputs under some distance metric. The hypothesis is that if we look at the space of test inputs under this metric, buggy regions will be somewhat compact; therefore, we should be trying to spread test cases as widely as possible over the input space. Briand’s paper makes several points, but the main criticism is that an ART implementation requires a number of distance comparisons that grows quadratically in the number of tests executed. Therefore, for realistic choices of testing parameters, a distance check has to be something like 1e5 times cheaper than running a test for ART to have any chance of paying off. The point isn’t that ART is bad, but rather that its proponents had better think of ways to avoid getting killed by the distance checks. In general I love this kind of talk, which takes a critical look at previous work. I feel sure that it caused contention among the program committee and I’m glad they decided to accept it. Even the targets of this work (the people who wrote the previous ART papers) should be flattered. What I mean is that since most academic work goes unread, we should feel lucky if someone criticizes our work, because it means they read it and thought about it carefully.
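
To make the cost concrete, here is a minimal sketch of the common fixed-size-candidate-set formulation of ART. The scalar input domain, the absolute-difference distance metric, and the function names are my own simplifications for illustration; real implementations operate on much richer inputs and metrics.

#include <math.h>
#include <stdlib.h>

#define CANDIDATES 10  /* random candidates generated per selection */

/* Toy distance metric over a scalar input domain. */
static double distance(double a, double b) { return fabs(a - b); }

/* Pick the next test input: generate some random candidates and keep
   the one whose nearest previously-executed test is farthest away.
   Selecting the nth input costs CANDIDATES * (n-1) distance
   computations, so a run of N tests performs O(N^2) distance checks. */
static double art_next(const double *executed, size_t n_executed) {
  double best = drand48(), best_score = -1.0;
  for (int c = 0; c < CANDIDATES; c++) {
    double cand = drand48();
    double nearest = INFINITY;
    for (size_t i = 0; i < n_executed; i++) {
      double d = distance(cand, executed[i]);
      if (d < nearest)
        nearest = d;
    }
    if (nearest > best_score) {
      best_score = nearest;
      best = cand;
    }
  }
  return best;
}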

Another talk I liked very much was the other keynote by Laurie Hendren, whose current project is to provide support for the large number of scientists and engineers who do most of their work in Matlab. Matlab is one of those programming languages whose specification is encoded in a single implementation and she entertainingly described the process of reverse-engineering things like how Matlab looks up a name, which should be — but is not at all — simple.

Overall I found ISSTA (which I had never attended before) to be a very friendly conference with a lot of smart people and interesting work. Actually, “smart” doesn’t make it stand out, since every computer science conference is mostly full of smart people. The thing that I liked most was the practical focus on solving real software problems.


Split Vote

In my group’s recent compiler testing paper we wrote:

We have never seen an “interesting” split vote where randomized differential testing of a collection of C compilers fails to produce a clear consensus answer

Randomized differential testing is just a fancy way of describing this process:

  1. Randomly generate a test input
  2. Run it through several different implementations of the same specification
  3. Check if all implementations produced equivalent output
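
In code, one round of this process looks something like the sketch below. The function-pointer setup is a hypothetical stand-in for the real thing; in our compiler work the “implementations” are compilers and the input is a randomly generated C program.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for several implementations of the same
   specification; for compiler testing the implementations are
   compilers and the input is a generated C program. */
typedef long (*impl_fn)(long input);

/* One round of randomized differential testing: returns 1 if all
   implementations agree on a random input, 0 if they disagree. */
static int differential_test(impl_fn *impls, size_t n_impls) {
  long input = random();               /* 1. random test input        */
  long reference = impls[0](input);    /* 2. run each implementation  */
  for (size_t i = 1; i < n_impls; i++) {
    if (impls[i](input) != reference) {/* 3. check for equivalence    */
      fprintf(stderr, "disagreement on input %ld\n", input);
      return 0;
    }
  }
  return 1;
}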

Today we saw our first split vote using a program generated by Csmith. The reduced test case is:

#include <stdio.h>
struct S0 {
  unsigned f1 : 1;
};

struct S0 s;
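/* s has static storage duration, so s.f1 is implicitly zero-initialized
   (see the July 14 update below) */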

int main (void) {
  int x = -3;
  int y = x >= (0, s.f1);
  printf ("%d\n", y);
  return 0;
}

GCC, KCC, and CompCert all print “0\n”. MSVC 2010, Intel CC 12.0.2, and today’s development snapshot of Clang (all for x86-64) print “1\n”. All compilers have optimizations turned off. There are two possibilities:

  1. The test case is ill-formed or ambiguous.
  2. Three of the six tools are wrong.

I’m pretty sure that (using the C99 standard) the test case is fine and the correct value for y is 0. The reasoning is that s.f1, an unsigned bit-field holding the value 0, is promoted to signed int before the comparison is performed, making the comparison a signed one: -3 >= 0, which is false, i.e. zero. The type and value of the left operand to the comma operator should be irrelevant.
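
The disagreement comes down to whether the right operand of >= acts like an int (after promotion) or like an unsigned int. Here is a minimal illustration of why that matters, using an ordinary unsigned variable instead of the bit-field from the test case; it sketches the two possible readings and is not a claim about what any particular compiler does internally.

#include <stdio.h>

int main (void) {
  int x = -3;
  unsigned int u = 0;

  /* Reading 1: the operand is promoted to (signed) int, so this is the
     signed comparison -3 >= 0, which is false. */
  printf ("%d\n", x >= (int)u);  /* prints 0 */

  /* Reading 2: the operand keeps type unsigned int, so x is converted
     to unsigned, -3 becomes a huge value, and the comparison is true. */
  printf ("%d\n", x >= u);       /* prints 1 */

  return 0;
}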

There are a few interesting things going on here:

  • Two of the three (apparently) correct results were produced by relatively lightly-tested research compilers.
  • Most compiler bugs are in the optimizers. Therefore, problems like this that show up with optimizations disabled are relatively rare.
  • C is not the simple “portable assembly language” that people like to claim it is. Nobody gets all of its corner cases right, even for something relatively simple like integers.
  • Just yesterday, Xuejun — the main Csmith hacker — added support for the comma operator. Most compilers’ implementations of it are probably not very well tested.
  • Intuitively, n-version programming should work. Knight and Leveson famously showed that it may not.

Related previous posts from this blog are here and here.

Update: I should have added that I’m interested to see if there are any other compilers that get this wrong. If you have access to a compiler not on my list above and wouldn’t mind running the test and posting the result in a comment, I’d appreciate it.

Update from July 13: The behavior of this program is a bit more subtle than my explanation above indicates. John McCall’s comment has the best explanation I’ve seen so far.

Update from July 14: In C, a global variable DOES NOT need an explicit initializer. It is automatically initialized to zero.

Wanted: Frank Exchange of Views

On his blog today, Bertrand Meyer responded to a soon-to-appear paper that criticizes some of his previous work. Frank, open debates about the merits of various research approaches and results are important, and yet, for various reasons, the vast majority of these debates are hidden inside program committee meetings, hallway discussions, reading groups, and (to a lesser extent) the Q&A periods following conference talks.

The problem with keeping these debates private is that they are inaccessible to the people who probably stand to most benefit from seeing them: students and other new practitioners trying to understand the dynamics of a field. In fact, a few public debates made up some of the most interesting and useful reading I encountered as a grad student:

I’m definitely not advocating flame wars, personal attacks, or even rudeness. In general the debates I’ve listed have stayed pretty civil, and certainly Meyer’s response above is a great example of how this kind of thing can be done in a friendly way. Obviously criticism can quickly become personal when careers and reputations are at stake, but that risk is a poor reason not to air technical criticism.

I’ve recently written a couple of critical posts, here and here. Also, some of my own work has been criticized publicly, for example in this thread. It would be fair to say that regardless of which side of the debate I find myself on, I’m struggling a bit to find the right tone. It’s very easy to write a quick blog post or dash off a mailing-list response that one isn’t 100% happy to see immortalized on the Internet, where anyone can dig it up with a search engine.

It used to be the case that a public debate had to happen in the “letters” part of some professional publication. Today, obviously, blogs and reddits and such have created a completely different environment. We should take advantage of it.

Update: Coincidentally, today Derek Jones posted an article criticizing use of statistics in software engineering papers. I have not read the papers he describes, but it’s fair to say that there are some systematic problems with the quality of statistical analysis in the CS literature.

Review Correlation Again

Not long ago I was surprised to find that there was a (slightly) negative correlation between my review scores and the average of the other reviewers’ scores for a collection of papers submitted to a conference. A few days ago I attended the program committee meeting for SenSys 2011, where I again reviewed around 20 papers. This time, the correlation between my score and the average of the other reviewers’ scores was very high: 0.89.
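
For concreteness, the number I’m quoting is just the ordinary Pearson correlation between two lists: my score for each paper and the average of the other reviewers’ scores for that paper. A minimal sketch of the arithmetic (the function name and calling convention are mine):

#include <math.h>
#include <stddef.h>

/* Pearson correlation between two score vectors of length n, e.g. my
   review scores and the per-paper averages of the other reviewers. */
static double pearson(const double *x, const double *y, size_t n) {
  double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
  for (size_t i = 0; i < n; i++) {
    sx  += x[i];
    sy  += y[i];
    sxx += x[i] * x[i];
    syy += y[i] * y[i];
    sxy += x[i] * y[i];
  }
  double cov  = sxy - sx * sy / n;
  double varx = sxx - sx * sx / n;
  double vary = syy - sy * sy / n;
  return cov / sqrt(varx * vary);
}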

What is the ideal level of correlation between reviewers? This is a little tricky. From the program chair’s point of view, perhaps perfectly correlated scores would be best: since everyone agrees, selecting papers to accept should be trivial. On the other hand, from a selfish point of view I don’t want the correlation to be too high, because that is boring — it means I’m not adding anything to the discussion. However, if correlation is too low, every discussion becomes a fight and that’s no fun either. I’m guessing that a higher level of correlation than what I saw at CAV, and a slightly lower one than I saw at SenSys, would be ideal, maybe around 0.7.

Red Baldy Again

With some travel coming up and hot weather rapidly eroding our epic snowpack, I decided to sneak in a quick snow climb on July 1. Since I couldn’t convince anyone else to go along, I made a conservative choice and climbed Red Baldy, whose northwest slopes present one of the easier and safer routes found on a local 11,000′ mountain.

The only problem with Red Baldy is that the approach is on the White Pine trail which — due to being an old mining road — has a lot of switchbacks and is just long. This year it was more of a pain than usual with a ton of trees down due to avalanches, big snowbanks starting quite low, and plenty of standing and running water from the snowmelt.

Though I started walking kind of late (8:15) I brought crampons in case the snow had frozen up harder than I thought it would. This was definitely not the case, though the snow was nicely consolidated and I only punched through onto talus in a few places. It took about 3.5 hours to get to the top, which seemed really slow. I blame this on the obstacles and the fact that I was breaking trail the whole way — lots and lots of steps were kicked.

I was surprised to arrive at the summit at around the same time as two other guys. It is very uncommon to share the top of an 11,000′ mountain around here with anyone. One of them was on skis and didn’t stay long, the other was camped on the highest dry ground in White Pine and had come up the west ridge, from near White Pine Lake.

The most fun part of this hike was a monster (probably 1300 vertical feet) sitting glissade starting a little below the summit. Glissading is an easy way to shred pants and skin — my secret weapon is a pair of armored shorts I bought years ago after losing too much flesh during my (thankfully brief) career as a mountain biker. With the shorts, an axe, and a good runout I felt safe doing this. After the glissade, it was just a long walk out.