Nibble Sort Programming Contest

The problem is to sort the 4-bit pieces of a 64-bit word with (unsigned) smaller values towards the small end of the word. The nibble sort of 0xbadbeef is 0xfeedbba000000000. The function you implement will perform this sorting operation on a buffer of 1024 64-bit integers:

void nibble_sort(unsigned long *buf);

I’ll give a small prize to the submitter of the entry that performs this operation the fastest on my Core i7-4770. If the winner makes non-trivial use of SIMD, I’ll give a prize to that entry and also to the fastest non-SIMD entry. I’ll use GCC 4.9.2 to compile your C99 code, which may use SIMD intrinsics and also inline assembly if you choose. There’s no requirement for portability. The machine runs Ubuntu 14.04 in 64-bit mode. If you require any command line options other than “gcc -std=gnu99 -O3” you need to tell me that. You may assume that the buffer starts at the start of a cache line. The buffer will be loaded with uniformly chosen random values. Submit code by emailing me. Submit entries before the end of January 2015.

Here’s a slow (but hopefully correct) reference implementation takes about ~~180~~ 380 microseconds to do the job:

int read_nibble(unsigned long w, int i) {
  assert(i >= 0 && i < 16);
  unsigned long res = w >> (i * 4);
  return res & 0xf;
}

void write_nibble(unsigned long *w, int i, int v) {
  assert(i >= 0 && i < 16);
  unsigned long mask = 0xf;
  mask <<= (i * 4);
  *w &= ~mask;
  unsigned long prom = v;
  prom <<= (i * 4);
  *w |= prom;
}

unsigned long nibble_sort_word(unsigned long arg) {
  for (int i = 0; i < 16; ++i) {
    int min = i;
    for (int j = i+1; j < 16; ++j) {
      if (read_nibble(arg, j) < read_nibble(arg, min))
        min = j;
    }
    if (min != i) {
      int tmp = read_nibble(arg, i);
      write_nibble(&arg, i, read_nibble(arg, min));
      write_nibble(&arg, min, tmp);
    }
  }
  return arg;
}

void nibble_sort(unsigned long *buf) {
  for (int i=0; i<1024; i++)
    buf[i] = nibble_sort_word(buf[i]);
}

Update from Feb 1: Ok, contest is over. Thanks for the very impressive submissions, folks! Provisional results with and w/o SIMD are below. These are with turbo boost turned off. I’m now working on the blog post explaining the results, which is going to take some time because there are a lot of solutions here and some of them are very clever. I’m also still working to secure permission to use codes so I can put as many as possible in Github. There are fewer entries in the no-SIMD category because many codes had explicit use of vectors. Please let me know if you think I’ve made any mistakes.

Update from Feb 2: There were a few mistakes and omissions. Results that I hope are final are below.

regehr@regehr-M51AC:nibble_sort$ ./go.sh 
              ns     ns / 8 bytes       entry name           errors
            1115             1.09       alexander4                0
            1238             1.21          arseny2                0
            1596             1.56          arseny1                0
            1755             1.71              uwe                0
            2630             2.57        pdewacht2                0
            4205             4.11         beekman2                0
            4466             4.36         beekman1                0
            4867             4.75       alexander3                0
            6084             5.94       alexander1                0
            6364             6.21         koorogi2                0
            8472             8.27             falk                0
            8520             8.32           jerome                0
           10331            10.09          vetter4                0
           10374            10.13           jepler                0
           10950            10.69       alexander2                0
           12096            11.81          vetter3               64
           12714            12.42         koorogi1                0
           14185            13.85              tom                0
           15532            15.17             rjk9                0
           16829            16.43          parnell                0
           16890            16.49        pdewacht1                0
           18804            18.36           chucky                0
           20041            19.57          burton2                0
           20649            20.17            nadav                0
           24908            24.32         bosmans1               60
           25104            24.52         bosmans2                0
           25486            24.89          vetter2                0
           28957            28.28             hans                0
           29928            29.23             anon                0
           30228            29.52            jrose                0
           31007            30.28           carlos                0
           32952            32.18            joris                0
           34562            33.75          vetter1                0
           45440            44.38             mats                0
           47511            46.40            frank                0
           50162            48.99            robin                0
           72762            71.06          grayson                0
           74465            72.72            rosen                0
           79752            77.88            payer                0
           92970            90.79          burton1                0
           94343            92.13           mentre                0
           95522            93.28             vlad                0
           97877            95.58           rogers                0
           99077            96.75             mike                0
          101913            99.52            bloom                0
          103945           101.51           jarkko                0
          109191           106.63            karim                0
          120928           118.09           justin                0
          160599           156.83           mikael                0
          200943           196.23           scotty                0
          416207           406.45              ref                0
regehr@regehr-M51AC:nibble_sort$ ./go.sh NOSIMD
              ns     ns / 8 bytes       entry name           errors
            8593             8.39           jerome                0
           10321            10.08           jepler                0
           10346            10.10          vetter4                0
           10959            10.70       alexander2                0
           12105            11.82          vetter3               64
           15537            15.17             rjk9                0
           16842            16.45          parnell                0
           18847            18.41           chucky                0
           19995            19.53        pdewacht1                0
           21631            21.12            nadav                0
           25516            24.92          vetter2                0
           29418            28.73             hans                0
           29928            29.23             anon                0
           30529            29.81            jrose                0
           32956            32.18            joris                0
           34286            33.48             falk                0
           41028            40.07          burton2                0
           42308            41.32          vetter1                0
           44415            43.37         bosmans2                0
           45480            44.41             mats                0
           50179            49.00            robin                0
           53189            51.94            frank                0
           71285            69.61          grayson                0
           73883            72.15            rosen                0
           79744            77.88            payer                0
           94780            92.56          burton1                0
           94970            92.74           mentre                0
           95509            93.27             vlad                0
           98658            96.35           rogers                0
           98922            96.60             mike                0
          101722            99.34            bloom                0
          104880           102.42           jarkko                0
          113147           110.50            karim                0
          119168           116.38           justin                0
          415529           405.79              ref                0

Update from Feb 3: Here’s the Github repo containing all of the entries — let me know how it looks.

January 24, 2015

regehr

Computer Science

91 responses to “Nibble Sort Programming Contest”

Hugues says:

January 24, 2015 at 11:47 pm

Just to clarify: does the sort need to operate on each 64bit word independently or on all nibbles of the buffer?

i.e. given input [0xbadbeef, 0xbadbeef]
does it produce [0xfeedbba000000000, 0xfeedbba000000000]
or [0xfefeededbbbba0a0, 0000000000000000] ?

The later case seems much less interesting because it can be done very simply by counting nibbles and then writing them in order.
regehr says:

January 25, 2015 at 12:41 am

Sorry to be unclear– I meant the first one, words are nibble-sorted individually.
Richard Kettlewell says:

January 25, 2015 at 5:16 am

More clarification: Do you want the function signature as given at the top, or as in the reference version?
Richard Kettlewell says:

January 25, 2015 at 8:11 am

Oh, I think you must want the former, and I’m just confused by the coincidence of names.
regehr says:

January 25, 2015 at 9:40 am

Richard, sorry, I’ve updated the code to be more clear. I can’t believe how slow my code is.
Carlos says:

January 25, 2015 at 11:45 am

Damn it, John. I had a lecture to prepare! 🙂
regehr says:

January 25, 2015 at 12:25 pm

Carlos, current record is about 30 microseconds to sort the 1024 words, if that helps :).
David Grayson says:

January 25, 2015 at 2:42 pm

Here is a quick test framework I made for writing and testing your contest entries.

https://gist.github.com/DavidEGrayson/f331d8863ef46fa343f8

Just put your algorithn in the “my_nibble_sort” function and the program will show you how fast it is compared to the Regehr algorithm. Regehr’s time will be on the first line of the output, and your time will be on the second.

Does anyone have suggestions on how to do the timing more accurately? I’m just using clock() right now. I made the size of the buffers be 1024*128 instead of 1024 so the timing would be more accurate.
anonymous says:

January 25, 2015 at 9:48 pm

Use clock_gettime() with CLOCK_MONOTONIC for benchmarking. On older systems you must link with -lrt. With that change my benchmarking setup looks a lot like yours, but instead of extending the size of the buffer I repeat the test a number of times and take the smallest observed elapsed time.
regehr says:

January 25, 2015 at 10:46 pm

Here is my current timing code (not very pretty):

http://pastebin.com/34559reG

I’d appreciate feedback. I can switch from CLOCK_REALTIME to CLOCK_MONOTONIC if that’s a better idea, but I’m more interested in figuring out how to defeat my smart processor. I already turned off frequency scaling. I probably need to disable turbo-boost or whatever it’s called in the BIOS. I probably should pin the test program to a single processor. Anything else?
regehr says:

January 25, 2015 at 10:57 pm

Just a tiny bit prettier version:

http://pastebin.com/bw7SR9SP
regehr says:

January 25, 2015 at 10:57 pm

Also I never use hyperthreading.
regehr says:

January 25, 2015 at 11:14 pm

And my driver script if anyone cares:

http://pastebin.com/QpbDRuhL
regehr says:

January 25, 2015 at 11:30 pm

If anyone has time and energy to create an optimized nibble sort using STOKE I would be really interested to see the results:

https://github.com/eschkufz/stoke-release
xakz says:

January 26, 2015 at 3:04 am

Multi-threading allowed ?
regehr says:

January 26, 2015 at 7:32 am

Just one thread please.
Dilip says:

January 26, 2015 at 8:24 am

A whole legion of Windows programmer are being left out of this challenge. Is there anything specific to gcc you are looking for or is it just infrastructural/toolset limitations?
regehr says:

January 26, 2015 at 8:58 am

Dilip, sorry, I’m not trying to exclude people, I just don’t happen to have a Windows machine available for testing. My wife does have a Windows machine but it’s old and doesn’t have Visual Studio installed. It does appear that gcc-4.9.2 is available for Cygwin if that works for you.
Jason Schulz says:

January 26, 2015 at 11:36 am

I’m not sure if there’s any overhead associated with clock() or clock_gettime(), but rdtsc, or probably rdtscp may give more accurate results. I’ll post a link to my scaffold later.
Alexander Monakov says:

January 26, 2015 at 11:56 am

To add to John’s suggestion about Cygwin, there’s also a lighter option: MinGW-w64, and you can also examine gcc-4.9’s assembly output online at gcc.godbolt.org. I think disclosure of testing environment helps to focus participants and thus boost competition. How dull would it be if the compiler and cflags were not announced, “forcing” people to submit their entries as inline assembly blobs (in case cflags is “-O0”)?

It’s true that if you have access to same software and hardware as John, you’ll have an easier time preparing your entry, but if not, you’re not excluded.
David Grayson says:

January 26, 2015 at 11:59 am

You can still develop the code for this contest in Windows as long as it is cross-platform.
regehr says:

January 26, 2015 at 2:39 pm

Jason, I get less than 20ns elapsed time for back-to-back calls to clock_gettime(). This is far less than the time required to do the nibble sort of 1024 words, so I’m not inclined to pursue a lower-level timing strategy.
regehr says:

January 26, 2015 at 2:56 pm

So clock_gettime() isn’t even dropping into the kernel, it looks like. Oddly, when a program using clock_gettime() is linked statically, it gets a different implementation that does use the kernel.
Mike G. says:

January 26, 2015 at 3:12 pm

Here’s a fairly naive implementation – no ASM, no SIMD…

How does it do?

https://gist.github.com/RMGiroux/89dfd9eb3582627b293e
Scotty says:

January 26, 2015 at 3:57 pm

How is timing done? Best time you ever see or average of N test runs?
regehr says:

January 26, 2015 at 4:09 pm

Scotty, I’m taking the best time though that is only correct when we’re 100% free of low outliers, which could happen due to moving between processors with unsynchronized timestamp counters.
regehr says:

January 26, 2015 at 4:29 pm

Hi Mike, about 87 microseconds on my machine.
Jeff Epler says:

January 26, 2015 at 9:50 pm

I have two entries. First, a serious one: http://emergent.unpythonic.net/files/sandbox/nibblesort/jepler.c which does a counting sort, putting the nibble counts themselves in nibbles so more work is done in registers. (around 20000ns on Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz with gcc 4.9.1)

Second, an entry that is deliberately destructive of play, based on knowledge of the driver program: http://emergent.unpythonic.net/files/sandbox/nibblesort/bad.c (times under 50ns? achieved!)

One aspect of my serious submission is that I realized it initially had a bug for all-nibbles-same, which is currently fixed; I’m not aware of other buggy inputs. maybe your test harness should pass in some specific values likely to hit bugs. Though the next guy probably has different buggy cases than I do.
Alex Groce says:

January 26, 2015 at 10:09 pm

I have no solution to propose, but can verify (literally) John’s reference is correct (for buffers of 10, not buffers of 1024, admittedly). I cross-checked by also confirming at a byte granularity to make sure no bugs in the nibble extractor code which looked a lot like the code verified.

Of course, good example of the pointlessness of verification sometimes. Hard to see how this is wrong unless assumptions behind any verification are also off.
regehr says:

January 27, 2015 at 12:32 am

Alex, very cool. What tool are you using here? But take my word for it that some of the other entries are *much* harder to verify since they drop down a level of abstraction into SIMD land. I’ll publish them in a github repo in early Feb (if the authors agree this is ok).
regehr says:

January 27, 2015 at 12:33 am

Jeff, does the null entry really pass my validator? I didn’t check, but have seen exactly this happen before, since the correct answer is just sitting there in a register, and it’s not 100% clear how to defend against this in C. Maybe also run the tests on code compiled at -O0, or just run an UB detector.
Alex Groce says:

January 27, 2015 at 12:55 am

I’m just using CBMC, which is why I dropped the buffer size down from 1024 to 10 (I don’t see how it can cause a problem, and that’s a lot of bits for SAT — takes it a few minutes even at 10).

SIMD or inline assembler… I’m not sure there’s really a tool out there that can handle it. I know some prototype commercial stuff that does bit level processor bounded MC, but AFAIK it’s just PowerPC, and not ready for prime time.
Alex Groce says:

January 27, 2015 at 1:03 am

At buffer size 10 it’s not too bad:

1591318 variables, 3702680 clauses
SAT checker: negated claim is UNSATISFIABLE, i.e., holds
Runtime decision procedure: 683.586s
VERIFICATION SUCCESSFUL
Robin Morisset says:

January 27, 2015 at 5:00 am

Is it possible to use clang instead of GCC ? GCC register allocator seem to hate my code and clang is more than 3 times faster (Same options: just -O3 -std=c99).
Joshua Green says:

January 27, 2015 at 5:15 am

Out of curiosity, is (potentially) undefined behavior allowed provided that, on the system you specified, the results are predictable (and correct)?
Jeff Epler says:

January 27, 2015 at 6:10 am

John, bad.c replaces rand() with a function that always returns zero. Since for any all-nibbles-equal input, the output is equal to the input, the function which doesn’t even touch its buffer it passes the test.
regehr says:

January 27, 2015 at 7:36 am

Robin, sure, but to keep things simple let’s (like gcc) use the last released version: 3.5.1.
regehr says:

January 27, 2015 at 7:37 am

Joshua, UB is fine! I’ll just turn off warnings and hold my nose 🙂
regehr says:

January 27, 2015 at 7:39 am

Alex: nice! I need to use CBMC, have never done more than played with it briefly.

Jeff: Cool! I’ve run into this while grading assignments where a bunch of students who didn’t implement some function were mysteriously getting all the points.
Alex Groce says:

January 27, 2015 at 11:01 am

I’ve always felt that “SAT checker: negated claim is UNSATISFIABLE, i.e., holds” is one of those lines that undermines any claim we actually want normal developers to use these tools.
nobody says:

January 27, 2015 at 2:20 pm

Why use clock_gettime() when one can actually count the cycles by reading the clock-counter via the rdtsc-instruction (several runs may be required to get the minumum)?

BTW: Is there an actual application of the nibble-sorting or is it “just academic”? I’m curious.

Thanks.
Jeff Epler says:

January 27, 2015 at 5:39 pm

Alex, is it possible for you to post your use of CBMC for nibble sort verification somewhere?

I am new to CBMC, so I’m not sure how to properly use it.

So far, I did use it to prove incorrect the variant of my counting sort which I knew had a bug for all-nibbles-same numbers. However, with 150 CPU minutes under its belt it hasn’t finished chewing on the version which I think is correct.

My test harness is at https://gist.github.com/jepler/baf8e7da47f24e50bc2e — note that I’ve ensured that only one number is being “sorted” (-DBUFSIZE=1)

Tle last output from cbmc was:
Solving with MiniSAT 2.2.0 with simplifier
180519 variables, 644810 clauses
.. which is fewer variables and clauses than you reported when you did your testing.
regehr says:

January 27, 2015 at 6:27 pm

Jeff, sometimes you need to try different solvers, Z3, STP, and Boolector are all good choices.
regehr says:

January 27, 2015 at 6:29 pm

Nobody, as comment #22 says the overhead of clock_gettime() is <20ns, which is absolutely negligible in the context. I’d certainly use the TSC if its use was warranted.

I know of no particular use case for sorting the nibbles in a word! It’s just a fun problem.

I’ll eventually make available a github repo containing all of the entries (assuming the authors agree) and then it would be great to see if different timing harnesses all get the same results.
Mike G. says:

January 27, 2015 at 8:43 pm

John, I updated my gist to license the code explicitly as Apache 2.0, and also explicitly grant you the right to aggregate it for benchmarking/pedagogical purposes.

Not that I would have complained if you’d done so without the explicit license, but it’s always nicer when these things are clear and explicit.
ja says:

January 27, 2015 at 11:36 pm

since there are 1024 words, then i need to sort 8×1024 nibbles? meaning a nibble on one word can be placed on another word?
regehr says:

January 28, 2015 at 8:58 am

ja the problem is to sort the nibbles within each word, see the reference implementation.
Alexander Monakov says:

January 28, 2015 at 10:04 am

John, the contest is about halfway through, how about publishing some sort of “current standings” to make things a bit more interesting to onlookers, and give some idea of contest level for people considering to join in the rest of the week? Does that sound like a good idea?
regehr says:

January 28, 2015 at 11:00 am

Alexander, yes, a good idea, I’ll do this tonight.

I should add that I’ve been amazed by the variety and quality of the submissions!
Anders Waldenborg says:

January 28, 2015 at 1:10 pm

regehr, nobody: Here is a practical application: nibble-based BWT.