Scary Compiler Code Motion


I ran across this blog post about reader-writer locks and thought it would be fun to present them to the Advanced Operating Systems class that I’m running this spring. Here’s my version of the code, which needed to be adapted a bit to work with GCC:

struct RWSpinLock
{
  volatile uint32_t lock;
};

void LockReader(struct RWSpinLock *l)
{
  while (1) {
    uint32_t OldLock = l->lock & 0x7fffffff;
    uint32_t NewLock = OldLock + 1;   
    if (__sync_val_compare_and_swap(&l->lock, OldLock, NewLock) == OldLock) 
      return;
  }
}
  
void UnlockReader(struct RWSpinLock *l)
{
  __sync_sub_and_fetch(&l->lock, 1);
}

void LockWriter(struct RWSpinLock *l)
{
 again:
  {
    uint32_t OldLock = l->lock & 0x7fffffff;
    uint32_t NewLock = OldLock | 0x80000000;    
    if (!(__sync_val_compare_and_swap(&l->lock, OldLock, NewLock) == OldLock)) 
      goto again;
    while (l->lock & 0x7fffffff) { }
  }
}

void UnlockWriter(struct RWSpinLock *l)
{
  l->lock = 0;
}

At first I was suspicious that some sort of fence is required in the UnlockWriter() function. However, I believe that that is not the case: TSO will naturally make sure that all cores see operations inside the locked region prior to seeing the lock release. On some non-TSO platforms this unlock function would be too weak.

But the hardware is only one source of reordering that can mess up locks — the compiler is another. The store to the volatile lock field in UnlockWriter() cannot be moved around another access to a volatile-qualified object, but (in my reading of the standard) stores to non-volatile objects can be moved around stores to volatiles.

Can we get a compiler to translate the (in principle) broken code into assembly that actually doesn’t work? Turns out this is easy:

int important_data;
struct RWSpinLock l;

void update (void)
{
  LockWriter (&l);
  important_data /= 20;
  UnlockWriter (&l);
}

Using the default compiler on an x86-64 Ubuntu 12.04 machine (GCC 4.6.3) at -O2, the resulting assembly is:

update:
	movl	$l, %edi
	call	LockWriter
	movl	important_data(%rip), %ecx
	movl	$1717986919, %edx
	movl	$0, l(%rip)
	movl	%ecx, %eax
	sarl	$31, %ecx
	imull	%edx
	sarl	$3, %edx
	subl	%ecx, %edx
	movl	%edx, important_data(%rip)
	ret

Notice that important_data is read inside the lock and written well outside of it. Yikes! Here’s how to stop the compiler from doing that:

void UnlockWriter(struct RWSpinLock *l)
{
  // this is necessary or else GCC will move operations out of the locked region
  asm volatile("" ::: "memory");
  l->lock = 0;
}

Versions of Clang that I tried (3.3 and 3.4 for x86-64) didn’t want to move the store to important_data outside of the lock, but it’s possible that I just wasn’t asking the right way.


21 responses to “Scary Compiler Code Motion”

  1. The use of ‘volatile’ is a red herring. ‘volatile’ has nothing to do with thread synchronization. Therein lies the problem.

  2. Hi mcmcc, I would argue that the volatile variable + the volatile asm touching “memory” are necessary and sufficient to create a correct reader-writer lock for TSO, assuming that we wish to avoid instructions with barrier/fence semantics in the writer unlock code.

    Also see:

    http://blog.regehr.org/archives/28

    Jesse, yep.

  3. The volatile asm with a memory clobber acts as a compiler barrier regardless of whether l->lock is volatile itself or not – the compiler must assume that the asm might write to l->lock.

  4. Hi Joshua, who said anything about threads? These locks are going to be used in a multicore OS we’re running in class, it runs on bare metal and has no threads.

    The volatile qualifier places an obligation on the compiler to generate loads and stores at certain points in the program. We can use these loads and stores for whatever purpose we want.

  5. Put it this way, I think it is a mistake to teach young programmers that ‘volatile’ is in any sense a substitute for real synch. primitives. In general, it is neither necessary or sufficient for the job. Concurrent programming is difficult enough without getting caught up in non-portable premature optimizations.

    Why would you not teach C11/C++11’s atomic_* API instead? It’s logical, well-reasoned, flexible, and most importantly, standard.

  6. Hi mcmcc, I think the code makes it pretty clear that volatile isn’t substituting for anything. Sure, the C/C++ atomic stuff is great. But nobody has ported it to the teaching OS we’re using in class that — as I said in an earlier comment — doesn’t even have threads. In contract, the functions above are exactly the ones we want if we replace some of xv6’s spinlocks with R/W locks — a project I’ve been thinking about handing to the class.

  7. It is worth pointing out that the C and C++ specifications, IIRC, make it perfectly legal to introduce spurious writes to volatile objects. In the case of a lock, it is perfectly legal therefore the compiler to introduce a write that unlocks the lock immediately after locking it. The only requirement on preventing spurious reads/writes is that compilers cannot introduce data races that wouldn’t occur otherwise, and it’s worth noting that having a data race means your program is undefined.

    The synchronization that says whether or not data race occurs flows not from volatile variables but from atomic variables. As a result, your program is (under C11 and C++11) undefined behavior, and the compiler could choose to launch the nukes.

    The only thing volatile does in the standard is tell the compiler to emit the relevant load or store. It does not tell the compiler to not introduce speculative or spurious accesses. It does not tell the compiler to not reorder them (at least, not with respect to non-volatile variables). It does not tell the compiler to insert appropriate memory fences to prevent the hardware from reordering them. It does not tell the compiler that simultaneous access by two threads of execution (or two separate cores) does not constitute a data race. It does not tell the compiler that it can’t split the load or store into byte or bit-sized chunks that happen five seconds apart from another. If you want any of those features, you need to use atomic variables.

    On the other hand, it is great fodder for a compiler to do truly evil things to code under the label of “undefined behavior.”

  8. I’m not so sure about spurious volatile writes being permitted. In some version of C++, I seem to remember the observable behavior consisting of the sequence of reads and writes to volatile objects, and the system calls made. Spurious writes wouldn’t be allowed under that understanding. But I just checked N3337 and can’t find that there, in that sort of description, so I think that might have been in C++98 and/or C99. I’m not familiar enough with C++11 wording to divine whether it permits spurious volatile writes or not, but I’d be pretty surprised if it does. After all, spurious writes to volatile objects would probably have broken the original device-register use case.

    Unless I’m missing something in all this, of course. Which is entirely possible!

  9. I’m pretty sure “spurious” writes to volatiles are prohibited by the standard. This is because writes to volatiles are observable side effects of the abstract machine — allowing the compiler to introduce arbitrary new side effects would violate the semantics of translation. The (C) standard does say “in the abstract machine, all expressions are evaluated as specified by the semantics” but there are two sides to that coin: the compiler is allowed to omit evaluation of expressions if it follows from the semantics that their values are never used and no side effects happen, but at the same time it is not allowed to introduce side effects that cannot be implied by the evaluation of any expression. The same thing applies to spurious reads, by the way.

    But what’s “spurious”? Keep in mind that the standard does not completely define how objects map to the underlying implementation’s memory/registers/what have you. If “long” is 64 bits on my implementation but it has only 32 bit accesses, and I write to a “volatile long”, it is perfectly OK for that to be two store operations, because it corresponds to only one write to the volatile object. The compiler is not required to turn this into an atomic operation (if it even could). It is also perfectly OK if the compiler splits these stores and intersperses any number of writes to non-volatiles if they are between the same sequence points — this is not observable on the abstract machine.

    What exactly happens when you access a volatile depends a great deal on your implementation. So yes, using volatile in synchronization code is fraught with peril as the standard’s guarantees are fairly weak — but sometimes it’s all you have, and synchronization code tends to be highly implementation dependent anyway, so it’s not like we’re making everything a lot worse.

  10. What is the purpose of this lock without threads? Shared memory and processes as threads?

    Besides doesnt C++ standard mark simultaneous access to object from different threads as undefined behavior? Also store to unsigned int variable doesnt have to be atomic as far as i know.

    I think, original coder should use atomic operations to access variable all the time, rather than doing premature optimization.

  11. Hi John

    I was wondering if compilers reorder non-volatile accesses with volatile ones; it seems you have found the answer (and in real code!). Nice.

    -francesco

  12. @mcmcc: one reason for not using the standard provided primitives is where the point of the assignment is to gain a better understanding of how such primitives are made to work by actually implementing them.

    Yet another, generally non-academic, reason is when you are relegated to a version of C/C++ that doesn’t even have them.

  13. > the code makes it pretty clear that volatile isn’t substituting for anything

    Except that is — in fact, you just said it is because your compiler doesn’t support atomics. If it wasn’t, it wouldn’t be in the code.

    ‘volatile’ has a long history (as I’m sure you know) of being misused/abused because it _happens_ to have _some_ useful side-effects that make it useful on _some_ platforms under _certain_ circumstances. Caveats galore… It’s time to head that insanity off at the pass — we have better tools to work with now. Let’s use them.

    My recommendation: if your compiler doesn’t offer , write your own and teach your students the correct idioms for portable synchronization. I have no doubt you have the technical know-how to do it (at least for the one platform you care about).

    http://en.cppreference.com/w/c/atomic

  14. [Edit: it ate my #include brackets]

    My recommendation: if your compiler doesn’t offer stdatomic.h, write your own and teach your students the correct idioms for portable synchronization. I have no doubt you have the technical know-how to do it (at least for the one platform you care about).

  15. … and if you want to add a section to your Advanced OS course entitled “Zen & the Art of Emulating stdatomic.h”, I approve 100%. ;^)

  16. Any use of volatile reminds of this paper:
    Eide E, Regehr J. Volatiles Are Miscompiled, and What to Do about It. Proceedings of the Eighth ACM and IEEE International Conference on Embedded Software (EMSOFT). Atlanta, Georgia USA. Oct 2008.

    I notice that you used function wrappers for each volatile access. That suggests you’re familiar with the paper.

  17. http://www.kernel.org/doc/Documentation/memory-barriers.txt provides a nice (technical) discussion of memory barriers. Highly recommended reading for an advanced operating systems course.

    Not yet mentioned in comments above is the difference between memory barriers for typical program usage in a hosted environment (using cachable memory) and memory barriers required when accessing other memory types (e.g. uncachable, write-combining, writeback, writethrough), such as when writing device drivers or accessing GPU video memory.

    Portable macros wrapping memory barriers and lots of additional links, including to vendor CPU architecure manuals can be found in https://github.com/gstrauss/plasma/blob/master/plasma_membar.h