A Quick Look at Code Size

As part of some recent work we compiled 1,000,000 randomly generated C programs to x86 using a variety of versions of GCC and LLVM. Over the next few weeks I’ll post about some of the different things we learned from this experiment; today’s post is about code size. Each program was compiled at -O0, -O1, -O2, -Os, and -O3. We computed the code size for each compiled program using the size command. For each program that was compiled successfully by all compilers (about 10% of the programs crashed one or more compilers; more on that later) we considered the code size for a given compiler to be the smallest code size across all optimization options. The graph below shows the average code size for each compiler across all of the test cases. While nobody (besides us) is actually interested in compiling random C programs, the size does seem to be a least vaguely correlated with the optimizer’s ability to spot opportunities to propagate constants, eliminate dead code, destroy useless loops, etc.

The main difference between real and random C code is that real programs are often fairly tight: the optimizer’s opportunities are mainly in low-level transformations like inlining, register allocation, common subexpression elimination, etc. On the other hand, random programs contain a huge amount of dead or useless code that the compiler can eliminate. A C compiler that does a good job optimizing random programs is not necessarily a better compiler, but it probably is one that can be relied upon to clean up junky code.

It’s interesting that LLVM-GCC generates smaller code than Clang. Of course, these compilers share middle- and back-ends. Most likely what is happening is that the GCC frontend is performing cleanups and random optimizations that have not yet been implemented in LLVM proper. Note that performing optimizations in the frontend is not necessarily desirable.

Verifying a Driver

Last week my student Jianjun Duan gave a talk about his PhD work at the Workshop for Systems Software Verification in Vancouver. Although preliminary — he verified only three small functions — this work is pretty cool. Jianjun started with a model for the ARM instruction set in HOL4 and augmented it so that memory-mapped hardware peripherals could be plugged into it. The basic idea is that real systems are not just CPU+memory. This is obvious, but almost all existing results on formal verification of software ignore devices. This doesn’t work for embedded software that is closely tied to dozens of hardware devices integrated into a system on chip. Jianjun proved some basic sanity results like “if you have a system that works, it still works after plugging in a new hardware device.” Once you are dealing with a theorem prover, even simple results like this become nontrivial due to the extraordinary level of detail that is required.

The case study in this paper was to:

  1. Formalize one of the UART (serial port) devices found on the NXP LPC2129, a microcontroller built around the ARM7TDMI CPU
  2. After plugging the UART model into the ARM model, prove full correctness for an open-source device driver that was compiled to ARM assembly from C

“Full correctness” means something like:

The driver terminates, correctly transmits the characters it is asked to transmit, correctly receives characters that arrive over the wire, respects memory safety, and never puts the UART into a bad state.

Perhaps the coolest thing about Jianjun’s proofs are that they include timing properties. For example, the proof does not work if the UART’s clock divider is not set high enough. Reasoning about timing made the proofs a lot harder, even for a very simple timing model (which is reasonable for this class of CPU). The slides from the talk give a pretty good idea of what is going on, and also the paper is here.

Maze Rejection #2

Bill and I planned a backpacking trip to the Fins area of the Maze District during Spring Break 2010, but were foiled when several feet of abnormally late snow made the Flint Trail (the “roughest routinely traveled road in Utah”) impassable. This week, during the University of Utah’s Fall Break, we tried again, but were stopped by a serious puddle about 25 miles down the dirt road that leads to the Hans Flat ranger station. Normally, passenger cars can drive this section of road but torrential rains Tuesday, Wednesday, and Thursday of last week caused a lot of local flooding. I put on sandals and walked the 100′ long puddle and at the deepest part it came up almost to my knees; this gave us misgivings. While we were sitting there figuring out what to do, a big commercial pickup with knobbier tires and a lot more clearance than Bill’s 4runner drove through and came close to getting stuck — that helped us decide not to give it a try. There wasn’t any easy way to bypass the puddle so we turned around. Also Bill (who started out the trip with the beginnings of a cold) was starting to feel sicker. Not wanting to head home, we drove 2.5 hours to Boulder Utah where our friend John Veranth spends part of the year. The last time we crashed with John, two years ago, the cabin he was building was framed but lacked power and water. This time, it was all finished up and is very livable. Bill rested while John, Martha, and I went on a nice hike to a lake at 9600′ on Boulder Mountain, getting back around dark. In the morning Bill felt even crappier and he relaxed with the truck while John and I hiked up onto Durffey Mesa, a couple miles out of Boulder on the Burr Trail. John loves obscure routes; this time the goal was to scout out old cattle trails giving access to this mesa (like most mesas in this area, it is well-protected by sandstone cliffs) and down to Boulder Creek on the other side. By the time we got back it was past noon and Bill was ready to get home and sleep, so we left. Maybe next time.

Because I was bummed, I didn’t take very many pictures. But here’s the killer puddle:

It doesn’t look so big here but it was knee deep, I swear.

Here’s the stuffed sheriff in Loa, he gets me to slow down every time I drive through that town.

His name tag reads “Parker Bogus Dolittle” and he’s wearing a “Wayne County Junior Deputy Sheriff” badge.

Bill is trying to stay on his good side.

Temperature and Longevity

The other day I was talking to a friend whose PhD work (many years ago) involved experiments to determine whether it was safe to expose humans to ultrasound, for example in sonography. The hypothesis was that heating effects are the primary way that ultrasound causes tissue damage. One of the experiments involved determining the length of time required to create lesions in soft tissue when it is heated to various temperatures. As expected, a nice curve emerged. One of the students in the lab had the bright idea of extrapolating this curve to see how long it took for tissue damage to occur at body temperature; the result was not far from the average human lifespan. The implication is that at 98.6 degrees, we are slowly cooking ourselves to death.

More Saturating Arithmetic

In a previous post I guessed that 91 bytes was close to the minimum size for implementing signed and unsigned saturating 32-bit addition and subtraction on x86. A commenter rightly pointed out that it should be possible to do better. Attached is my best shot: 83 bytes. Given the crappy x86 calling convention I’m going to guess it’s hard to do much better unless the saturating SSE instructions offer a way. Or can the branches in the signed functions be avoided? It seems like it, but I couldn’t figure out how.

sat_unsigned_add:
	movl	4(%esp), %eax
	xor     %edx, %edx
	not     %edx
	addl	8(%esp), %eax
	cmovb	%edx, %eax
	ret

sat_unsigned_sub:
	movl	4(%esp), %eax
	xor     %edx, %edx
	subl	8(%esp), %eax
	cmovb	%edx, %eax
	ret

sat_signed_add:
	movl	4(%esp), %eax
	movl	$2147483647, %edx
	movl	$-2147483648, %ecx
	addl	8(%esp), %eax
	jno     out1
	cmovg	%edx, %eax
	cmovl	%ecx, %eax
out1:	ret

sat_signed_sub:
	movl	4(%esp), %eax
	movl	$2147483647, %edx
	movl	$-2147483648, %ecx
	subl	8(%esp), %eax
	jno     out2
	cmovg	%edx, %eax
	cmovl	%ecx, %eax
out2:	ret

Another question: How difficult is it to create a superoptimizer that turns obvious C formulations of saturating operations into the code shown here? This is beyond the capabilities of current superoptimizers, I believe, while being obviously feasible.

Feedback is appreciated.

Fun With Saturating Arithmetic

An assignment that I often give my Advanced Embedded Systems class early in the semester is to implement saturating versions of signed and unsigned addition and subtraction. Saturating operations “stick” at the maximum or minimum value. (INT_MAX +sat 1) therefore evaluates to INT_MAX. The only wrinkle in the assignment is that solutions must be capable of being compiled (via a typedef) for integers of 8, 16, 32, or 64 bits. I give this assignment for several reasons:

  • I like people to see saturating arithmetic since it is useful in some embedded applications
  • it should be easy and fun, but ends up being a bit tricky
  • the (typically) poor performance of the class helps motivate later course material on testing and on integer problems

Over the years I’ve ended up with nearly 100 submissions from students. Tonight I ran them all through my grading script and found that 23 students got all 4 functions right, for all integer widths. Also I rediscovered a few gems:

Segfaults From Integer Code

Two students managed to create solutions that segfault. How is this impressive result even possible, you ask? It was accomplished by implementing addition and subtraction as mutually recursive functions and failing to stop the recursion. Thus, the crash is a stack overflow.

Exponential Running Time

Two more students implemented iterative solutions with running time exponential in the size of the inputs. This can be accomplished for unsigned addition by, for example, repeatedly incrementing one of the arguments and decrementing the other, stopping when the first saturates or the second reaches zero. While spectacularly inefficient, one of these solutions was correct, as far as I could tell. I had to give the student full points because the assignment failed to specify that the functions should terminate in a reasonable time in the 64-bit case.

The Most Creative Way To Pass Tests

My test harness invokes the students’ saturating operations with random inputs as well as values at or near corner cases. By far the most creative wrong solution came from a student who implemented three of the four functions correctly, but forgot to return a value from the fourth. Of course this is an undefined behavior in C, but not a compile time error. Astoundingly, Intel CC, a recent Clang snapshot, and GCC 3.4 all report zero errors from this student’s code, when compiling with optimizations. The reason for this is entertaining:

  1. The compiler, seeing a function that always relies on undefined behavior (I call these Type 3 functions), turns it into a nop.
  2. My test harness calls the reference version of the saturating function, which leaves its result in eax.
  3. The value is copied into ebx, but eax still retains a copy.
  4. The student’s code is called. It is supposed to store its result in eax, but remember that it is now a nop.
  5. The test harness compares eax against ebx. Of course they are equal, and so the test passes.

Have I said before that the way a C compiler silently destroys Type 3 functions is effectively evil? I have, and I’m saying it again.

A Corner Case

Each year a few people implement saturating signed subtraction like this:

mysint sat_signed_sub (mysint a, mysint b)
{
  return sat_signed_add (a, -b);
}

This is tantalizingly simple, but wrong.

Big Code == Wrong Code

After being compiled (for the 32-bit integer case, using a recent version of gcc for x86 at -Os) solutions I’ve received range from 76 to 1159 bytes of object code, totaled across all four functions. The largest correct solution is 226 bytes. Bigger solutions tend to be more complicated, and complicated solutions are usually wrong.

The Best Code

I tried quite a few compilers and could not manage to generate a correct solution smaller than 91 bytes of x86 code (again, for the 32-bit case). This is from a recent GCC snapshot. The C code for this solution was produced by one of the brightest undergrads I’ve had the pleasure of teaching; it succeeds by using bit operators instead of arithmetic. I’m sure a good assembly programmer could shave some bytes off of this, but I suspect that the optimal solution is not a whole lot smaller. I’d love to include the code here but would like to avoid invalidating this entertaining assignment.