Nibble Sort Programming Contest

The problem is to sort the 4-bit pieces of a 64-bit word with (unsigned) smaller values towards the small end of the word. The nibble sort of 0xbadbeef is 0xfeedbba000000000. The function you implement will perform this sorting operation on a buffer of 1024 64-bit integers:

void nibble_sort(unsigned long *buf); 

I’ll give a small prize to the submitter of the entry that performs this operation the fastest on my Core i7-4770. If the winner makes non-trivial use of SIMD, I’ll give a prize to that entry and also to the fastest non-SIMD entry. I’ll use GCC 4.9.2 to compile your C99 code, which may use SIMD intrinsics and also inline assembly if you choose. There’s no requirement for portability. The machine runs Ubuntu 14.04 in 64-bit mode. If you require any command line options other than “gcc -std=gnu99 -O3” you need to tell me that. You may assume that the buffer starts at the start of a cache line. The buffer will be loaded with uniformly chosen random values. Submit code by emailing me. Submit entries before the end of January 2015.

Here’s a slow (but hopefully correct) reference implementation takes about 180 380 microseconds to do the job:

int read_nibble(unsigned long w, int i) {
  assert(i >= 0 && i < 16);
  unsigned long res = w >> (i * 4);
  return res & 0xf;

void write_nibble(unsigned long *w, int i, int v) {
  assert(i >= 0 && i < 16);
  unsigned long mask = 0xf;
  mask <<= (i * 4);
  *w &= ~mask;
  unsigned long prom = v;
  prom <<= (i * 4);
  *w |= prom;

unsigned long nibble_sort_word(unsigned long arg) {
  for (int i = 0; i < 16; ++i) {
    int min = i;
    for (int j = i+1; j < 16; ++j) {
      if (read_nibble(arg, j) < read_nibble(arg, min))
        min = j;
    if (min != i) {
      int tmp = read_nibble(arg, i);
      write_nibble(&arg, i, read_nibble(arg, min));
      write_nibble(&arg, min, tmp);
  return arg;

void nibble_sort(unsigned long *buf) {
  for (int i=0; i<1024; i++)
    buf[i] = nibble_sort_word(buf[i]);

Update from Feb 1: Ok, contest is over. Thanks for the very impressive submissions, folks! Provisional results with and w/o SIMD are below. These are with turbo boost turned off. I’m now working on the blog post explaining the results, which is going to take some time because there are a lot of solutions here and some of them are very clever. I’m also still working to secure permission to use codes so I can put as many as possible in Github. There are fewer entries in the no-SIMD category because many codes had explicit use of vectors. Please let me know if you think I’ve made any mistakes.

Update from Feb 2: There were a few mistakes and omissions. Results that I hope are final are below.

regehr@regehr-M51AC:nibble_sort$ ./ 
              ns     ns / 8 bytes       entry name           errors
            1115             1.09       alexander4                0
            1238             1.21          arseny2                0
            1596             1.56          arseny1                0
            1755             1.71              uwe                0
            2630             2.57        pdewacht2                0
            4205             4.11         beekman2                0
            4466             4.36         beekman1                0
            4867             4.75       alexander3                0
            6084             5.94       alexander1                0
            6364             6.21         koorogi2                0
            8472             8.27             falk                0
            8520             8.32           jerome                0
           10331            10.09          vetter4                0
           10374            10.13           jepler                0
           10950            10.69       alexander2                0
           12096            11.81          vetter3               64
           12714            12.42         koorogi1                0
           14185            13.85              tom                0
           15532            15.17             rjk9                0
           16829            16.43          parnell                0
           16890            16.49        pdewacht1                0
           18804            18.36           chucky                0
           20041            19.57          burton2                0
           20649            20.17            nadav                0
           24908            24.32         bosmans1               60
           25104            24.52         bosmans2                0
           25486            24.89          vetter2                0
           28957            28.28             hans                0
           29928            29.23             anon                0
           30228            29.52            jrose                0
           31007            30.28           carlos                0
           32952            32.18            joris                0
           34562            33.75          vetter1                0
           45440            44.38             mats                0
           47511            46.40            frank                0
           50162            48.99            robin                0
           72762            71.06          grayson                0
           74465            72.72            rosen                0
           79752            77.88            payer                0
           92970            90.79          burton1                0
           94343            92.13           mentre                0
           95522            93.28             vlad                0
           97877            95.58           rogers                0
           99077            96.75             mike                0
          101913            99.52            bloom                0
          103945           101.51           jarkko                0
          109191           106.63            karim                0
          120928           118.09           justin                0
          160599           156.83           mikael                0
          200943           196.23           scotty                0
          416207           406.45              ref                0
regehr@regehr-M51AC:nibble_sort$ ./ NOSIMD
              ns     ns / 8 bytes       entry name           errors
            8593             8.39           jerome                0
           10321            10.08           jepler                0
           10346            10.10          vetter4                0
           10959            10.70       alexander2                0
           12105            11.82          vetter3               64
           15537            15.17             rjk9                0
           16842            16.45          parnell                0
           18847            18.41           chucky                0
           19995            19.53        pdewacht1                0
           21631            21.12            nadav                0
           25516            24.92          vetter2                0
           29418            28.73             hans                0
           29928            29.23             anon                0
           30529            29.81            jrose                0
           32956            32.18            joris                0
           34286            33.48             falk                0
           41028            40.07          burton2                0
           42308            41.32          vetter1                0
           44415            43.37         bosmans2                0
           45480            44.41             mats                0
           50179            49.00            robin                0
           53189            51.94            frank                0
           71285            69.61          grayson                0
           73883            72.15            rosen                0
           79744            77.88            payer                0
           94780            92.56          burton1                0
           94970            92.74           mentre                0
           95509            93.27             vlad                0
           98658            96.35           rogers                0
           98922            96.60             mike                0
          101722            99.34            bloom                0
          104880           102.42           jarkko                0
          113147           110.50            karim                0
          119168           116.38           justin                0
          415529           405.79              ref                0

Update from Feb 3: Here’s the Github repo containing all of the entries — let me know how it looks.

91 replies on “Nibble Sort Programming Contest”

  1. Just to clarify: does the sort need to operate on each 64bit word independently or on all nibbles of the buffer?

    i.e. given input [0xbadbeef, 0xbadbeef]
    does it produce [0xfeedbba000000000, 0xfeedbba000000000]
    or [0xfefeededbbbba0a0, 0000000000000000] ?

    The later case seems much less interesting because it can be done very simply by counting nibbles and then writing them in order.

  2. Here is a quick test framework I made for writing and testing your contest entries.

    Just put your algorithn in the “my_nibble_sort” function and the program will show you how fast it is compared to the Regehr algorithm. Regehr’s time will be on the first line of the output, and your time will be on the second.

    Does anyone have suggestions on how to do the timing more accurately? I’m just using clock() right now. I made the size of the buffers be 1024*128 instead of 1024 so the timing would be more accurate.

  3. Use clock_gettime() with CLOCK_MONOTONIC for benchmarking. On older systems you must link with -lrt. With that change my benchmarking setup looks a lot like yours, but instead of extending the size of the buffer I repeat the test a number of times and take the smallest observed elapsed time.

  4. Here is my current timing code (not very pretty):

    I’d appreciate feedback. I can switch from CLOCK_REALTIME to CLOCK_MONOTONIC if that’s a better idea, but I’m more interested in figuring out how to defeat my smart processor. I already turned off frequency scaling. I probably need to disable turbo-boost or whatever it’s called in the BIOS. I probably should pin the test program to a single processor. Anything else?

  5. A whole legion of Windows programmer are being left out of this challenge. Is there anything specific to gcc you are looking for or is it just infrastructural/toolset limitations?

  6. Dilip, sorry, I’m not trying to exclude people, I just don’t happen to have a Windows machine available for testing. My wife does have a Windows machine but it’s old and doesn’t have Visual Studio installed. It does appear that gcc-4.9.2 is available for Cygwin if that works for you.

  7. To add to John’s suggestion about Cygwin, there’s also a lighter option: MinGW-w64, and you can also examine gcc-4.9’s assembly output online at I think disclosure of testing environment helps to focus participants and thus boost competition. How dull would it be if the compiler and cflags were not announced, “forcing” people to submit their entries as inline assembly blobs (in case cflags is “-O0”)?

    It’s true that if you have access to same software and hardware as John, you’ll have an easier time preparing your entry, but if not, you’re not excluded.

  8. Jason, I get less than 20ns elapsed time for back-to-back calls to clock_gettime(). This is far less than the time required to do the nibble sort of 1024 words, so I’m not inclined to pursue a lower-level timing strategy.

  9. So clock_gettime() isn’t even dropping into the kernel, it looks like. Oddly, when a program using clock_gettime() is linked statically, it gets a different implementation that does use the kernel.

  10. Scotty, I’m taking the best time though that is only correct when we’re 100% free of low outliers, which could happen due to moving between processors with unsynchronized timestamp counters.

  11. I have two entries. First, a serious one: which does a counting sort, putting the nibble counts themselves in nibbles so more work is done in registers. (around 20000ns on Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz with gcc 4.9.1)

    Second, an entry that is deliberately destructive of play, based on knowledge of the driver program: (times under 50ns? achieved!)

    One aspect of my serious submission is that I realized it initially had a bug for all-nibbles-same, which is currently fixed; I’m not aware of other buggy inputs. maybe your test harness should pass in some specific values likely to hit bugs. Though the next guy probably has different buggy cases than I do.

  12. I have no solution to propose, but can verify (literally) John’s reference is correct (for buffers of 10, not buffers of 1024, admittedly). I cross-checked by also confirming at a byte granularity to make sure no bugs in the nibble extractor code which looked a lot like the code verified.

    Of course, good example of the pointlessness of verification sometimes. Hard to see how this is wrong unless assumptions behind any verification are also off.

  13. Alex, very cool. What tool are you using here? But take my word for it that some of the other entries are *much* harder to verify since they drop down a level of abstraction into SIMD land. I’ll publish them in a github repo in early Feb (if the authors agree this is ok).

  14. Jeff, does the null entry really pass my validator? I didn’t check, but have seen exactly this happen before, since the correct answer is just sitting there in a register, and it’s not 100% clear how to defend against this in C. Maybe also run the tests on code compiled at -O0, or just run an UB detector.

  15. I’m just using CBMC, which is why I dropped the buffer size down from 1024 to 10 (I don’t see how it can cause a problem, and that’s a lot of bits for SAT — takes it a few minutes even at 10).

    SIMD or inline assembler… I’m not sure there’s really a tool out there that can handle it. I know some prototype commercial stuff that does bit level processor bounded MC, but AFAIK it’s just PowerPC, and not ready for prime time.

  16. At buffer size 10 it’s not too bad:

    1591318 variables, 3702680 clauses
    SAT checker: negated claim is UNSATISFIABLE, i.e., holds
    Runtime decision procedure: 683.586s

  17. Is it possible to use clang instead of GCC ? GCC register allocator seem to hate my code and clang is more than 3 times faster (Same options: just -O3 -std=c99).

  18. Out of curiosity, is (potentially) undefined behavior allowed provided that, on the system you specified, the results are predictable (and correct)?

  19. John, bad.c replaces rand() with a function that always returns zero. Since for any all-nibbles-equal input, the output is equal to the input, the function which doesn’t even touch its buffer it passes the test.

  20. Alex: nice! I need to use CBMC, have never done more than played with it briefly.

    Jeff: Cool! I’ve run into this while grading assignments where a bunch of students who didn’t implement some function were mysteriously getting all the points.

  21. I’ve always felt that “SAT checker: negated claim is UNSATISFIABLE, i.e., holds” is one of those lines that undermines any claim we actually want normal developers to use these tools.

  22. Why use clock_gettime() when one can actually count the cycles by reading the clock-counter via the rdtsc-instruction (several runs may be required to get the minumum)?

    BTW: Is there an actual application of the nibble-sorting or is it “just academic”? I’m curious.


  23. Alex, is it possible for you to post your use of CBMC for nibble sort verification somewhere?

    I am new to CBMC, so I’m not sure how to properly use it.

    So far, I did use it to prove incorrect the variant of my counting sort which I knew had a bug for all-nibbles-same numbers. However, with 150 CPU minutes under its belt it hasn’t finished chewing on the version which I think is correct.

    My test harness is at — note that I’ve ensured that only one number is being “sorted” (-DBUFSIZE=1)

    Tle last output from cbmc was:
    Solving with MiniSAT 2.2.0 with simplifier
    180519 variables, 644810 clauses
    .. which is fewer variables and clauses than you reported when you did your testing.

  24. Nobody, as comment #22 says the overhead of clock_gettime() is <20ns, which is absolutely negligible in the context. I'd certainly use the TSC if its use was warranted.

    I know of no particular use case for sorting the nibbles in a word! It's just a fun problem.

    I'll eventually make available a github repo containing all of the entries (assuming the authors agree) and then it would be great to see if different timing harnesses all get the same results.

  25. John, I updated my gist to license the code explicitly as Apache 2.0, and also explicitly grant you the right to aggregate it for benchmarking/pedagogical purposes.

    Not that I would have complained if you’d done so without the explicit license, but it’s always nicer when these things are clear and explicit.

  26. John, the contest is about halfway through, how about publishing some sort of “current standings” to make things a bit more interesting to onlookers, and give some idea of contest level for people considering to join in the rest of the week? Does that sound like a good idea?

Comments are closed.