Compilation and Hyperthreading

Hyperthreading (HT) may or may not be a performance win, depending on the workload. I had poor luck with HT in the Pentium 4 era and have disabled it in the BIOS ever since, on the theory that the kind of software I typically wait around for—compilers and SMT solvers—is going to get hurt if its L1 and L2 cache resources are halved. This post contains some data about that. I’ll start off by saying that for at least one combination of CPU and workload, I was wrong.

The benchmark is compilation of LLVM, Clang, and compiler-rt r279412 using Ninja on an Intel i7-5820K, a reasonably modern but by no means new Haswell-E processor with six real cores. The compiler doing the compilation is a Clang 3.8.1 binary from the LLVM web site. The machine is running Ubuntu 14.04 in 64-bit mode.

Full details about the machine are here. As an inexpensive CPU workhorse I think it stands the test of time, though if you were building one today you would double (or more) the RAM and SSD sizes and of course choose newer versions of everything. I’m particularly proud of the crappy fanless video cards I found for these machines.

This is the build configuration command:

cmake -G Ninja -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_ASSERTIONS=1 -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_BUILD_TYPE=Release ..

Then, on an otherwise idle machine, I built LLVM five times for each degree of parallelism up to 16, both with and without hyperthreading. Here are the results. Since the variation between runs was very low—a few seconds at worst—I’m not worrying about statistics.
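(The benchmarking script isn’t included in this post; the timing loop amounts to something like this sketch, which cleans the tree before every timed build.)

# five timed builds at each degree of parallelism
for j in $(seq 1 16); do
  for run in 1 2 3 4 5; do
    ninja -t clean > /dev/null
    /usr/bin/time -f "-j$j: %e seconds" ninja -j $j
  done
done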

What can we take away from this graph? The main conclusion is that hyperthreading wins handily, reducing the best-case build time from 11.75 minutes to 10.04 minutes: an improvement of 1 minute and 42 seconds. Also, I had been worried that simply enabling HT would be detrimental since Linux would sometimes schedule two threads on the same real core when a different core was idle. The graph shows that either this happens only rarely or else it doesn’t hurt much when it happens. Overloading the system (forking more compilers than there are processors) hurts performance by just a very small amount. Of course, at some point the extra processes would use all RAM and performance would suffer significantly. Finally, the speedup is impressively close to linear until we start running more than one thread per core:

I don’t know how much of the nonlinearity comes from resource contention and how much comes from lack of available parallelism.

Here are the first and second graphs as PDF.

Looking at the bigger picture, a huge amount of variation is possible in the compiler, the software being compiled, and the hardware platform. I’d be interested to hear about more data points if people have them.

Isolating a Free-Range Miscompilation

If we say that a compiler is buggy, we need to be able to back up that claim with reproducible, compelling, and understandable evidence. Usually, this evidence centers on a test case that triggers the buggy behavior; we’ll say something like “given this test case, compiler A produces an executable that prints 0 whereas compiler B produces an executable that prints 1, therefore there’s a bug.” (In practice compilers A and B are usually different versions or different optimization levels of the same compiler — this doesn’t matter.)

The problem is that when compilers A and B emit code that behaves differently, the divergence can happen for reasons other than compiler bugs:

  • the program might rely on undefined behavior (UB) or unspecified behavior,
  • the program might read non-constant data from its environment, for example by looking at the clock or at its process id,
  • the program might be concurrent and its output influenced by scheduling decisions.

A big part of convincing someone that the compiler is buggy is convincing them that the test program is free of problems such as these. Since concurrency and interactions with the environment are easy to spot, the problem is often undefined behavior. You can find a few examples in my previous post about invalid GCC bug reports.
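To make the UB case concrete, here’s a tiny example (hypothetical, unrelated to the bug discussed below) where two optimization levels of the same compiler can legitimately disagree:

$ cat > overflow.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
  int x = atoi(argv[1]);
  /* if x is INT_MAX, x + 1 overflows a signed int: undefined behavior */
  printf("%d\n", x + 1 > x);
  return 0;
}
EOF
$ cc -O0 overflow.c && ./a.out 2147483647   # typically prints 0: the addition wraps
$ cc -O2 overflow.c && ./a.out 2147483647   # typically prints 1: x + 1 > x folds to true

Neither output is evidence of a compiler bug; the program itself is at fault, which is exactly why a reducer that ignores UB converges on junk like this.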

This post is about partially automating the process of coming up with a small test case for a miscompilation bug. The bug that we’ll be studying shows up when you compile the latest version of GMP, 6.1.1, using Clang+LLVM revision 135022, from about five years ago. The bug itself is irrelevant and long fixed, I’m just using it to illustrate the process.

As a slight digression, I’ll add that out of the 90 versions of LLVM, spanning revisions 90,000-250,000, that I used to compile GMP (in its C-only mode, disabling use of assembly), only those in the 135,000-139,000 range miscompiled GMP 6.1.1, according to its test suite. The GMP web page says:

Most problems with GMP these days are due to problems not in GMP, but with the compiler used for compiling the GMP sources. This is a major concern to the GMP project, since an incorrect computation is an incorrect computation, whether caused by a GMP bug or a compiler bug. We fight this by making the GMP testsuite have great coverage, so that it should catch every possible miscompilation.

This is all well and good but GMP used to contain some integer undefined behaviors and my guess is that some of the apparent miscompilations were actually due to compiler exploitation of these UBs. Finding some evidence one way or the other about this might be a fun project for a student.
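As an aside, that 90-version experiment is conceptually just a loop. A sketch, where the revision list, the compiler install paths, and GMP’s --disable-assembly configure flag as the way to get C-only mode are my assumptions:

# build GMP with each stored compiler and run its test suite
for rev in $(cat revisions.txt); do
  CC=$HOME/compilers/llvm-$rev/bin/clang ./configure --disable-assembly > /dev/null
  make -j12 > /dev/null && make check > check-r$rev.log 2>&1 \
    || echo "r$rev: FAIL"
  make distclean > /dev/null
done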

Anyway, back to our compiler bug. The process for isolating a test case is:

  1. Grab a failing unit test.
  2. Use tis-interpreter to verify that it is free of undefined behavior and anything else that might trigger non-deterministic execution.
  3. Use C-Reduce to create a minimal program triggering the bug.

This sounds really easy — and it would be if we were working with a tidy little test case generated by Csmith — but real software is messy and there are plenty of complications. Let’s get started. The unit test we’re using is called mpz_lucnum_ui. Here are the 128 C files that are required to run it.

A First Try

We have to choose what exactly to reduce. Usually we want to reduce preprocessed code, but given the painfully slow interestingness test that we have here (almost 40 seconds, argh), that route would make very slow progress, since C-Reduce would need to eliminate a lot of redundant included junk from each of the 128 files. So instead, let’s reduce the files without preprocessing and see what happens.

I had to make a few easy changes to get the GMP files to go through tis-interpreter. In an ideal future, the GMP project (and everyone else) will ensure that their unit tests are clean with respect to tis-interpreter.

Finally we’re ready to run the reduction:

$ creduce ./test1.sh *.c
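test1.sh itself isn’t reproduced here, but an interestingness test for a miscompilation needs to encode two conditions: the current variant must still be clean under tis-interpreter, and the two compilers must still disagree about its behavior. A rough sketch (the compiler names and the way the tis-interpreter output gets filtered are made up for illustration):

#!/bin/sh
# reject any variant that tis-interpreter complains about; UB would
# make the apparent miscompilation meaningless
tis-interpreter.sh *.c > tis.log 2>&1 || exit 1
grep -q 'warning\|alarm' tis.log && exit 1
# the variant is interesting only if the two compilers still produce
# executables that behave differently
good-clang -O2 *.c -o good.exe || exit 1
bad-clang -O2 *.c -o bad.exe || exit 1
./good.exe > good.out 2>&1
./bad.exe > bad.out 2>&1
! cmp -s good.out bad.out

C-Reduce treats exit status 0 as interesting, which is what the final negated cmp provides when the two outputs differ.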

After a few days the reduction finishes; see the results here. 119 of the 128 C files have become empty and the remaining 9 files contain a total of about 5 KB of code. A bit more than 99% of the code has been eliminated. Perhaps surprisingly, C-Reduce has managed to eliminate all of the conditional compilation directives, but some #defines remain, as do a few #includes.

We’ve succeeded in creating a modestly small test case, but it isn’t yet standalone (due to the includes) and really we would like everything to live in a single file. We can kill two birds with one stone by using CIL or Frama-C to merge the C files into a single compilation unit.

$ frama-c -print *.c > merged.c

The merged file isn’t quite buildable — there’s some junk at the top and there’s a minor problem handling a builtin — but this is easy to fix. You can find the result here. Unfortunately, the merged file doesn’t trigger the bug any longer. Either the testcase relied on separate compilation or else Frama-C’s rewriting of the source code perturbed things badly. That’s sort of a bummer. I could easily have omitted this part of the post, but I wanted to show the whole process here, including missteps.

A Second Try

This time we’ll preprocess and merge the 128 files first, and reduce second. Again, some manual patching was necessary since the merger doesn’t quite work. The result is an 800 KB compilation unit.
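In outline, that first step looks like this (the exact preprocessor flags and include paths are assumptions):

# preprocess each translation unit, then merge the preprocessed
# files into a single compilation unit
for f in *.c; do clang -E -P -I. "$f" > "pp_$f"; done
frama-c -print pp_*.c > merged.c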

Again, the reduction goes slowly:

The final result is a bit less than 4 KB. It’s still too big to be easily understood. Since we’ve run up against the limits of our tooling, further reduction will have to be done by hand. Because I’m not actually going to report this long-ago-fixed bug, I didn’t bother with this step. (Back before C-Reduce existed I did a lot of manual testcase reduction and, while it was a somewhat satisfying and mindless activity, I ran out of patience for it.) But in any case, 4 KB of self-contained C code is quite manageable.

Is Undefined Behavior Checking Necessary?

Is it really necessary to worry about undefined behavior? In my experience, reducing a miscompilation while disregarding UB is roughly 100% likely to result in a testcase that misbehaves due to UB instead of triggering a compiler bug. Here you can see our interestingness test without the UB checking. If we run a reduction using it, the resulting 1 KB testcase (hacked a bit by hand so that we can use tis-interpreter to discover its fatal flaw) misbehaves via an out-of-bounds store:

$ tis-interpreter.sh merged.c
merged.c:53:[kernel] warning: Calling undeclared function cu. Old style K&R code?
[value] Analyzing a complete application starting at main
[value] Computing initial state
merged.c:11:[value] warning: during initialization of variable 'cm', size of type 'struct a []' cannot be
                 computed (Size of array without number of elements.)
merged.c:11:[kernel] imprecise size for variable cm
[value] Initial state computed
merged.c:51:[kernel] warning: out of bounds write. assert \valid(&cp->b);
                  stack: main
[value] Stopping at nth alarm
[value] user error: Degeneration occurred:
                    results are not correct for lines of code that can be reached from the degeneration point.

So we definitely need some sort of UB detection. But do we need something as heavyweight as tis-interpreter? I ran a few reductions using UBSan + ASan (see one of them here) and didn’t have good luck in getting a defined final testcase. The reduction kept introducing issues such as uninitialized storage, function declarators with empty parentheses, and incompatible uses of function pointers, all of which can lead to real trouble. Most likely there is a combination of compilers and flags that would have let me run this reduction successfully, but I ran out of energy to find it. UBSan, ASan, and MSan are excellent, but fundamentally they do not add up to a completely reliable UB detector.
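To illustrate the kind of thing the sanitizers let through, here’s a contrived example (not from the actual reduction): a declaration with empty parentheses provides no prototype, so a call with the wrong number of arguments is UB that no dynamic check catches.

$ cat > miss.c << 'EOF'
#include <stdio.h>
int f();   /* empty parentheses: no prototype */
int main(void) {
  /* UB: calling f with more arguments than its definition accepts;
     neither UBSan nor ASan reliably flags this */
  printf("%d\n", f(1, 2, 3));
  return 0;
}
int f(int a) { return a; }
EOF
$ clang -O2 -fsanitize=undefined,address miss.c && ./a.out   # typically prints 1, no report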

Recommendations

More developers should:

  • Make sure unit tests go through tis-interpreter. Though not always practical (tis-interpreter doesn’t understand assembly or C++), this has many benefits, since tis-interpreter implements very thorough checking. Also, when a change in compiler or compiler options breaks a test case that is clean with respect to tis-interpreter, the compiler can be blamed with high reliability.
  • Instead of working around compiler bugs, reduce and report them. This can be a lot of work but tools like tis-interpreter and C-Reduce make it easier and when these bugs get fixed life is better for everyone.

Obelisk

Classes start next week, so I sneaked out for a quick hike on Tuesday, climbing a minor local peak informally called The Obelisk. This one had eluded me for years, so it felt nice to finally stand on top. Summitpost says “Obelisk is rarely climbed during the summer and provides ample solitude,” and I found out why: the ascent involves a long, steep boulder field, with many of the rocks just barely balanced. Ugh — the next time I climb this peak it’ll be when the boulders are covered by snow.

Looking out over Cottonwood Ridge.

A couple of friends.

Wild and wonderful Hogum Fork, which sits just a few miles from half a million people and is hardly visited.

The Obelisk at the top of Obelisk Peak. I could hear the music from 2001.

You can see my summit beer in the background.

Maybird Gulch with the Pfeifferhorn looming beyond.

A Month of Invalid GCC Bug Reports, and How to Eliminate Some of Them

During July 2016 the GCC developers marked 38 bug reports as INVALID. Here’s the full list. They fall into these (subjective) categories:

  • 8 bug reports stemmed from undefined behavior in the test case (71753, 71780, 71803, 71813, 71885, 71955, 71957, 71746)
  • 1 bug report was complaining about UB exploitation in general (71892)
  • 15 bug reports came from a misunderstanding (or disagreement) about the non-UB semantics of a programming language, usually C++ but also C and Fortran (71786, 71788, 71794, 71804, 71809, 71890, 71914, 71939, 71963, 71967, 72580, 72750, 72761, 71844, 71772)
  • 4 bug reports stemmed from a misunderstanding of something besides the language semantics, such as command line flags (72736, 71729, 71995, 71777)
  • 5 bug reports were caused by an unrelated problem on the reporter’s system such as an out-of-memory error, a borked Cygwin installation, out-of-date files in a build tree, etc (71735, 71770, 71903, 71918, 71978)
  • 1 bug report was about a bug that the devs didn’t want to fix since it was in an inactive branch and had been fixed in all active branches (72051)
  • 4 bug reports didn’t end up demonstrating any reproducible problem (71940, 71944, 71986, 72076)

I’ve often thought that it would be nice for a compiler bug reporting system to be active instead of passively serving up files and discussions. An active bug reporting system would be able to run:

  • a wide variety of compiler versions,
  • the compiler’s output (that is, the compiled program), and
  • tools for finding undefined behaviors.

Of course not all bug reports would be able to make use of these features. However, one can imagine that there is a significant subset of compiler bug reports where the reporter, cooperating with the system, would be able to conclusively demonstrate that the compiler crashes or generates wrong code. In cases where this cannot be demonstrated, the process of interacting with an active bug reporting system will help the reporter understand what the actual issue is without wasting a compiler developer’s time.

An active bug reporting system can run lots of experiments to determine how many compiler versions, how many target platforms, and how many optimization levels are affected by the bug. It can also determine which revision introduced the problem and who committed the breaking change — suggesting an initial owner for the bug. It can run a testcase reducer to make the program triggering the bug smaller. All of these things will help compiler developers prioritize among reported bugs. The system should also be able to automatically flag duplicate bug reports. When a bug stops reproducing, the bug reporting system will notice this and flag the PR as being ready to close, and could also help out by packaging up an addition to the regression test suite.
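None of this requires exotic machinery; the experimental core is a loop over compiler versions and flags, along these lines (the compiler names here are placeholders):

# probe which compiler versions and optimization levels reproduce
# a reported divergence on a submitted testcase
for cc in gcc-4.9 gcc-5 gcc-6 clang-3.7 clang-3.8; do
  for opt in -O0 -O1 -O2 -O3 -Os; do
    $cc $opt testcase.c -o testcase.exe 2> /dev/null || continue
    echo "$cc $opt => $(./testcase.exe)"
  done
done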

Update:

A few additional details. During July a total of 328 bugs were reported, ignoring those marked as spam. 143 of these were resolved: 22 as duplicates, 81 as fixed, 38 as invalid, 1 as wontfix, and 1 as worksforme. Out of the remaining 185 unresolved bugs, 15 are assigned, 86 are new, 1 is reopened, 74 are unconfirmed, and 9 are waiting.

I believe that an active bug reporting system will make many of these 290 non-invalid bug reports easier to deal with, as opposed to only helping with the invalid ones!

Note: In the initial version of this post I mentioned 36 invalid bugs, not 38, because I was only searching for bugs that were marked as resolved. Also searching for closed bugs brings the total to 38.

Perseids

Matthew Flatt, my 9-year-old son, and I stayed out last night watching the Perseid meteor shower. To find some dark skies we drove out to the Utah-Nevada border, along the way passing a sign that said “NEXT GAS 130 MILES” — always a good sign on a road trip. We arrived around 12:30 and the show started right away; we saw several meteors before even getting out of the vehicle. Here’s Matthew in the moonlight along with a very faint meteor:

After the moon set things got better, although the sky never got as dark as it should have in that location — I don’t think the air was very clear. Even so, the Milky Way was pretty jaw-dropping and we saw hundreds of meteors. I know nothing about astrophotography and don’t really have the right gear, but I did manage to capture a few meteors.

After a while Isaac timed out and went to sleep in the truck; Matthew and I started to get really cold, so we left, getting home before dawn. Definitely worth losing a little sleep to see this.