Simplest Program Showing the Difference Between Sequential Consistency and TSO

The effects of weak memory models on programs running on shared-memory multiprocessors can be very difficult to understand. Today I was looking for the simplest program that illustrates the difference between sequentially consistent execution and TSO (the memory model provided by x86 and x86-64). Based on an example from Section 8.2 of the big Intel reference manual, I came up with this:

#include <pthread.h>
#include <stdio.h>
#include <assert.h>

volatile int x, y, tmp1, tmp2;

void *t0 (void *arg)
{
  x = 1;
  tmp1 = y;
  return 0;
}

void *t1 (void *arg)
{
  y = 1;
  tmp2 = x;
  return 0;
}

int main (void)
{
  while (1) {
    int res;
    pthread_t thread0, thread1;
    x = y = tmp1 = tmp2 = 0;
    res = pthread_create (&thread0, NULL, t0, NULL);
    assert (res==0);
    res = pthread_create (&thread1, NULL, t1, NULL);
    assert (res==0);
    res = pthread_join (thread0, NULL);
    assert (res==0);
    res = pthread_join (thread1, NULL);
    assert (res==0);
    printf ("%d %d\n", tmp1, tmp2);
    if (tmp1==0 && tmp2==0) break;
  }
  return 0;
}

It is easy to see that this program cannot terminate on a sequentially consistent multiprocessor (assuming that all system calls succeed). On the other hand, the program can and does terminate under TSO. Because each individual processor is sequentially consistent, the program will also fail to terminate on a multicore Linux machine if you pin it to a single core:

$ gcc -O tso.c -o tso -pthread
$ taskset -c 1 ./tso

It’s easy to “fix” this code so that it does not terminate in the multicore case by adding one memory fence per thread, between the store and the subsequent load. In GCC on x86 and x86-64 a memory fence is:

asm volatile ("mfence");

In some cases this version is preferable since it also stops some kinds of reordering done by the compiler:

asm volatile ("mfence" ::: "memory");

The stronger version should be unnecessary here: since the relevant reads and writes are to volatile-qualified objects, the compiler is already not allowed to reorder them across sequence points.

Thinking abstractly about this stuff always messes with my head; it’s better to just write some code. I also recommend Section 8.2 of the Intel manual; it contains a lot of excellent material.


11 responses to “Simplest Program Showing the Difference Between Sequential Consistency and TSO”

  1. This is a small nit, but it can be important:

    For portability one should not compile with -lpthread but with -pthread as the former does not set preprocessor flags or link in every required library.

  2. I was under the impression that in the new C/C++11 world order, volatile means absolutely nothing with respect to sharing memory between threads. So this program is full of data races, which means it is undefined. This may be completely tangential to the point you’re trying to make, but I wanted to point it out.

  3. Hi Ben, I wasn’t really thinking about the new memory models when I wrote this, but that’s a good point. As far as I can tell, it’s not possible to write this program in a race-free way since destroying the kind of behavior I’m showing here is kind of the point of DRF. Perhaps the way to make this point in (somewhat) conforming C11/C++11 is to write a bit of inline assembly language.

  4. Hi John

    if “volatile” is replaced by “atomic” and the memory accesses are labelled as “relaxed”, the resulting program will be data-race free (and well-defined according to C11/C++11) but it will still exhibit the 0,0 behaviour when executed on x86 (or Power/ARM) hardware.
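    A sketch of that variant (my own transcription of the suggestion, assuming a compiler with C11 `<stdatomic.h>` support): the racy volatile accesses become relaxed atomic operations, so the program is data-race free and well-defined, yet all four outcomes, including 0 0, remain allowed.

    ```c
    #include <pthread.h>
    #include <stdio.h>
    #include <stdatomic.h>

    atomic_int x, y;
    int tmp1, tmp2;

    /* Relaxed atomics: no data race, but no ordering guarantees either,
       so the store/load pairs may still appear reordered. */
    void *t0 (void *arg)
    {
      atomic_store_explicit (&x, 1, memory_order_relaxed);
      tmp1 = atomic_load_explicit (&y, memory_order_relaxed);
      return 0;
    }

    void *t1 (void *arg)
    {
      atomic_store_explicit (&y, 1, memory_order_relaxed);
      tmp2 = atomic_load_explicit (&x, memory_order_relaxed);
      return 0;
    }

    int main (void)
    {
      pthread_t thread0, thread1;
      atomic_init (&x, 0);
      atomic_init (&y, 0);
      pthread_create (&thread0, NULL, t0, NULL);
      pthread_create (&thread1, NULL, t1, NULL);
      pthread_join (thread0, NULL);
      pthread_join (thread1, NULL);
      /* Any of "0 0", "0 1", "1 0", "1 1" is a permitted outcome. */
      printf ("%d %d\n", tmp1, tmp2);
      return 0;
    }
    ```

    (`tmp1` and `tmp2` are plain ints here; that is fine because the joins synchronize with `main` before it reads them.)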

    And, ahem, some advertising based on my experience:

    To explore the hardware memory model I suggest to rely on the “litmus” tool, which takes care of the proper thread synchronisation (and adds some noise in the memory subsystem to increase the probability of observing non-sc behaviours). The “litmus” tool can be downloaded from

    To compute the behaviours allowed by a C11/C++11 program, then the interactive “cppmem” tool can be handy (especially when exploring some corners of the semantics of low-level atomics):


  5. Well there are the weak atomic operations, which: (1) I know very little about and (2) I suspect have pretty spotty support at this point. In theory those let you write programs that don’t violate the C/C++11 memory model, but do have a kind of data race.

  6. There are several interesting variations of this example in C/C++11 using different flavours of relaxed atomics – e.g., with relaxed atomics or release/acquire pairs the non-SC behaviour is allowed, while with SC atomics it isn’t. In the terminology of the standard, the intuitive races don’t count as “data races”. If you want to experiment with the examples, you can try the “SB” tests in this tool: (built from Mark Batty et al.’s formal model for C/C++11 concurrency).

    If you instead want to illustrate the raw x86 hardware behaviour, without the possibility of confusion from compiler reordering, then one nice way is with the “litmus” tool by Luc Maranget, Susmit Sarkar and others:
    (the first example in the doc there is basically the same as the one you have above), which runs a litmus-test assembly program in a test harness. I quite often do this in lectures – it’s nice to see the non-SC behaviour for real, and also to see the variation in how often it shows up.

  7. Thanks Peter! In fact I was looking for the simplest program in order to better explain these issues to my operating systems class; I’ll think about using litmus in this capacity as well.

  8. Another interesting question is: “What’s the simplest C++11 program that shows a difference between putting seq_cst fences between every operation and making the operations themselves seq_cst?”

    The simplest one I know of is “IRIW”, “Independent Reads of Independent Writes”.
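    For reference, here is a sketch of the IRIW shape itself with all operations seq_cst (my own harness, not the fence-vs-operation comparison): two threads write independent locations, two readers read them in opposite orders, and with seq_cst operations the "readers disagree on the order of the writes" outcome r1=1, r2=0, r3=1, r4=0 is forbidden.

    ```c
    #include <pthread.h>
    #include <stdio.h>
    #include <assert.h>
    #include <stdatomic.h>

    atomic_int x, y;            /* atomic_store/atomic_load default to seq_cst */
    int r1, r2, r3, r4;

    void *wx (void *arg) { atomic_store (&x, 1); return 0; }
    void *wy (void *arg) { atomic_store (&y, 1); return 0; }

    void *rxy (void *arg)       /* reads x then y */
    {
      r1 = atomic_load (&x);
      r2 = atomic_load (&y);
      return 0;
    }

    void *ryx (void *arg)       /* reads y then x */
    {
      r3 = atomic_load (&y);
      r4 = atomic_load (&x);
      return 0;
    }

    int main (void)
    {
      int forbidden = 0;
      for (int i = 0; i < 200; i++) {
        pthread_t t[4];
        atomic_store (&x, 0);
        atomic_store (&y, 0);
        r1 = r2 = r3 = r4 = 0;
        pthread_create (&t[0], NULL, wx, NULL);
        pthread_create (&t[1], NULL, wy, NULL);
        pthread_create (&t[2], NULL, rxy, NULL);
        pthread_create (&t[3], NULL, ryx, NULL);
        for (int j = 0; j < 4; j++)
          pthread_join (t[j], NULL);
        /* The two readers seeing the independent writes in opposite
           orders is ruled out when every access is seq_cst. */
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0)
          forbidden++;
      }
      printf ("forbidden outcomes: %d\n", forbidden);
      assert (forbidden == 0);
      return 0;
    }
    ```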