Scary Compiler Code Motion

I ran across this blog post about reader-writer locks and thought it would be fun to present them to the Advanced Operating Systems class that I’m running this spring. Here’s my version of the code, which needed to be adapted a bit to work with GCC:

struct RWSpinLock
{
  volatile uint32_t lock;
};

void LockReader(struct RWSpinLock *l)
{
  while (1) {
    uint32_t OldLock = l->lock & 0x7fffffff;
    uint32_t NewLock = OldLock + 1;   
    if (__sync_val_compare_and_swap(&l->lock, OldLock, NewLock) == OldLock) 
      return;
  }
}
  
void UnlockReader(struct RWSpinLock *l)
{
  __sync_sub_and_fetch(&l->lock, 1);
}

void LockWriter(struct RWSpinLock *l)
{
 again:
  {
    uint32_t OldLock = l->lock & 0x7fffffff;
    uint32_t NewLock = OldLock | 0x80000000;    
    if (!(__sync_val_compare_and_swap(&l->lock, OldLock, NewLock) == OldLock)) 
      goto again;
    while (l->lock & 0x7fffffff) { }
  }
}

void UnlockWriter(struct RWSpinLock *l)
{
  l->lock = 0;
}

At first I was suspicious that some sort of fence is required in the UnlockWriter() function. However, I believe that that is not the case: TSO will naturally make sure that all cores see operations inside the locked region prior to seeing the lock release. On some non-TSO platforms this unlock function would be too weak.

But the hardware is only one source of reordering that can mess up locks — the compiler is another. The store to the volatile lock field in UnlockWriter() cannot be moved around another access to a volatile-qualified object, but (in my reading of the standard) stores to non-volatile objects can be moved around stores to volatiles.

Can we get a compiler to translate the (in principle) broken code into assembly that actually doesn’t work? Turns out this is easy:

int important_data;
struct RWSpinLock l;

void update (void)
{
  LockWriter (&l);
  important_data /= 20;
  UnlockWriter (&l);
}

Using the default compiler on an x86-64 Ubuntu 12.04 machine (GCC 4.6.3) at -O2, the resulting assembly is:

update: movl $l, %edi call LockWriter movl important_data(%rip), %ecx movl $1717986919, %edx movl $0, l(%rip) movl %ecx, %eax sarl $31, %ecx imull %edx sarl $3, %edx subl %ecx, %edx movl %edx, important_data(%rip) ret

Notice that important_data is read inside the lock and written well outside of it. Yikes! Here’s how to stop the compiler from doing that:

void UnlockWriter(struct RWSpinLock *l)
{
  // this is necessary or else GCC will move operations out of the locked region
  asm volatile("" ::: "memory");
  l->lock = 0;
}

Versions of Clang that I tried (3.3 and 3.4 for x86-64) didn’t want to move the store to important_data outside of the lock, but it’s possible that I just wasn’t asking the right way.

February 4, 2014

regehr

Compilers, Computer Science, Software Correctness

21 responses to “Scary Compiler Code Motion”

Jesse Ruderman says:

February 4, 2014 at 3:27 pm

TSO stands for “total store order”.
mcmcc says:

February 4, 2014 at 3:59 pm

The use of ‘volatile’ is a red herring. ‘volatile’ has nothing to do with thread synchronization. Therein lies the problem.
regehr says:

February 4, 2014 at 4:28 pm

Hi mcmcc, I would argue that the volatile variable + the volatile asm touching “memory” are necessary and sufficient to create a correct reader-writer lock for TSO, assuming that we wish to avoid instructions with barrier/fence semantics in the writer unlock code.

Also see:

http://blog.regehr.org/archives/28

Jesse, yep.
kme says:

February 4, 2014 at 7:59 pm

The volatile asm with a memory clobber acts as a compiler barrier regardless of whether l->lock is volatile itself or not – the compiler must assume that the asm might write to l->lock.
Joshua Cranmer says:

February 4, 2014 at 8:33 pm

No, volatile has nothing to do with thread synchronization, as C11 and C++11 reaffirm. You need to use the atomics support to truly get that.
regehr says:

February 4, 2014 at 8:43 pm

Hi Joshua, who said anything about threads? These locks are going to be used in a multicore OS we’re running in class, it runs on bare metal and has no threads.

The volatile qualifier places an obligation on the compiler to generate loads and stores at certain points in the program. We can use these loads and stores for whatever purpose we want.
mcmcc says:

February 4, 2014 at 10:12 pm

Put it this way, I think it is a mistake to teach young programmers that ‘volatile’ is in any sense a substitute for real synch. primitives. In general, it is neither necessary or sufficient for the job. Concurrent programming is difficult enough without getting caught up in non-portable premature optimizations.

Why would you not teach C11/C++11’s atomic_* API instead? It’s logical, well-reasoned, flexible, and most importantly, standard.
regehr says:

February 4, 2014 at 11:01 pm

Hi mcmcc, I think the code makes it pretty clear that volatile isn’t substituting for anything. Sure, the C/C++ atomic stuff is great. But nobody has ported it to the teaching OS we’re using in class that — as I said in an earlier comment — doesn’t even have threads. In contract, the functions above are exactly the ones we want if we replace some of xv6’s spinlocks with R/W locks — a project I’ve been thinking about handing to the class.
Joshua Cranmer says:

February 4, 2014 at 11:17 pm

It is worth pointing out that the C and C++ specifications, IIRC, make it perfectly legal to introduce spurious writes to volatile objects. In the case of a lock, it is perfectly legal therefore the compiler to introduce a write that unlocks the lock immediately after locking it. The only requirement on preventing spurious reads/writes is that compilers cannot introduce data races that wouldn’t occur otherwise, and it’s worth noting that having a data race means your program is undefined.

The synchronization that says whether or not data race occurs flows not from volatile variables but from atomic variables. As a result, your program is (under C11 and C++11) undefined behavior, and the compiler could choose to launch the nukes.

The only thing volatile does in the standard is tell the compiler to emit the relevant load or store. It does not tell the compiler to not introduce speculative or spurious accesses. It does not tell the compiler to not reorder them (at least, not with respect to non-volatile variables). It does not tell the compiler to insert appropriate memory fences to prevent the hardware from reordering them. It does not tell the compiler that simultaneous access by two threads of execution (or two separate cores) does not constitute a data race. It does not tell the compiler that it can’t split the load or store into byte or bit-sized chunks that happen five seconds apart from another. If you want any of those features, you need to use atomic variables.

On the other hand, it is great fodder for a compiler to do truly evil things to code under the label of “undefined behavior.”
Jeff Walden says:

February 5, 2014 at 2:00 am

I’m not so sure about spurious volatile writes being permitted. In some version of C++, I seem to remember the observable behavior consisting of the sequence of reads and writes to volatile objects, and the system calls made. Spurious writes wouldn’t be allowed under that understanding. But I just checked N3337 and can’t find that there, in that sort of description, so I think that might have been in C++98 and/or C99. I’m not familiar enough with C++11 wording to divine whether it permits spurious volatile writes or not, but I’d be pretty surprised if it does. After all, spurious writes to volatile objects would probably have broken the original device-register use case.

Unless I’m missing something in all this, of course. Which is entirely possible!
Jeroen Mostert says:

February 5, 2014 at 3:04 am

I’m pretty sure “spurious” writes to volatiles are prohibited by the standard. This is because writes to volatiles are observable side effects of the abstract machine — allowing the compiler to introduce arbitrary new side effects would violate the semantics of translation. The (C) standard does say “in the abstract machine, all expressions are evaluated as specified by the semantics” but there are two sides to that coin: the compiler is allowed to omit evaluation of expressions if it follows from the semantics that their values are never used and no side effects happen, but at the same time it is not allowed to introduce side effects that cannot be implied by the evaluation of any expression. The same thing applies to spurious reads, by the way.

But what’s “spurious”? Keep in mind that the standard does not completely define how objects map to the underlying implementation’s memory/registers/what have you. If “long” is 64 bits on my implementation but it has only 32 bit accesses, and I write to a “volatile long”, it is perfectly OK for that to be two store operations, because it corresponds to only one write to the volatile object. The compiler is not required to turn this into an atomic operation (if it even could). It is also perfectly OK if the compiler splits these stores and intersperses any number of writes to non-volatiles if they are between the same sequence points — this is not observable on the abstract machine.

What exactly happens when you access a volatile depends a great deal on your implementation. So yes, using volatile in synchronization code is fraught with peril as the standard’s guarantees are fairly weak — but sometimes it’s all you have, and synchronization code tends to be highly implementation dependent anyway, so it’s not like we’re making everything a lot worse.
RadosÅ‚aw Cybulski says:

February 5, 2014 at 5:28 am

What is the purpose of this lock without threads? Shared memory and processes as threads?

Besides doesnt C++ standard mark simultaneous access to object from different threads as undefined behavior? Also store to unsigned int variable doesnt have to be atomic as far as i know.

I think, original coder should use atomic operations to access variable all the time, rather than doing premature optimization.
Francesco Zappa says:

February 5, 2014 at 7:00 am

Hi John

I was wondering if compilers reorder non-volatile accesses with volatile ones; it seems you have found the answer (and in real code!). Nice.

-francesco
bcs says:

February 5, 2014 at 7:53 am

@mcmcc: one reason for not using the standard provided primitives is where the point of the assignment is to gain a better understanding of how such primitives are made to work by actually implementing them.

Yet another, generally non-academic, reason is when you are relegated to a version of C/C++ that doesn’t even have them.
mcmcc says:

February 5, 2014 at 8:54 am

> the code makes it pretty clear that volatile isnâ€™t substituting for anything

Except that is — in fact, you just said it is because your compiler doesn’t support atomics. If it wasn’t, it wouldn’t be in the code.

‘volatile’ has a long history (as I’m sure you know) of being misused/abused because it _happens_ to have _some_ useful side-effects that make it useful on _some_ platforms under _certain_ circumstances. Caveats galore… It’s time to head that insanity off at the pass — we have better tools to work with now. Let’s use them.

My recommendation: if your compiler doesn’t offer , write your own and teach your students the correct idioms for portable synchronization. I have no doubt you have the technical know-how to do it (at least for the one platform you care about).

http://en.cppreference.com/w/c/atomic
mcmcc says:

February 5, 2014 at 8:58 am

[Edit: it ate my #include brackets]

My recommendation: if your compiler doesnâ€™t offer stdatomic.h, write your own and teach your students the correct idioms for portable synchronization. I have no doubt you have the technical know-how to do it (at least for the one platform you care about).
mcmcc says:

February 5, 2014 at 10:06 am

… and if you want to add a section to your Advanced OS course entitled “Zen & the Art of Emulating stdatomic.h”, I approve 100%. ;^)
donbock says:

February 7, 2014 at 3:21 pm

Any use of volatile reminds of this paper:
Eide E, Regehr J. Volatiles Are Miscompiled, and What to Do about It. Proceedings of the Eighth ACM and IEEE International Conference on Embedded Software (EMSOFT). Atlanta, Georgia USA. Oct 2008.

I notice that you used function wrappers for each volatile access. That suggests you’re familiar with the paper.
regehr says:

February 7, 2014 at 5:01 pm

Hi donbock, I can’t tell if you’re being ironic or not! If not, of course I was one of the authors of that paper.
Glenn Strauss says:

February 13, 2014 at 12:30 am

http://www.kernel.org/doc/Documentation/memory-barriers.txt provides a nice (technical) discussion of memory barriers. Highly recommended reading for an advanced operating systems course.

Not yet mentioned in comments above is the difference between memory barriers for typical program usage in a hosted environment (using cachable memory) and memory barriers required when accessing other memory types (e.g. uncachable, write-combining, writeback, writethrough), such as when writing device drivers or accessing GPU video memory.

Portable macros wrapping memory barriers and lots of additional links, including to vendor CPU architecure manuals can be found in https://github.com/gstrauss/plasma/blob/master/plasma_membar.h
regehr says:

February 18, 2014 at 12:00 pm

Glenn, thanks for the links!