The problem is to sort the 4-bit pieces of a 64-bit word with (unsigned) smaller values towards the small end of the word. The nibble sort of 0xbadbeef is 0xfeedbba000000000. The function you implement will perform this sorting operation on a buffer of 1024 64-bit integers:
void nibble_sort(unsigned long *buf);
I’ll give a small prize to the submitter of the entry that performs this operation the fastest on my Core i7-4770. If the winner makes non-trivial use of SIMD, I’ll give a prize to that entry and also to the fastest non-SIMD entry. I’ll use GCC 4.9.2 to compile your C99 code, which may use SIMD intrinsics and also inline assembly if you choose. There’s no requirement for portability. The machine runs Ubuntu 14.04 in 64-bit mode. If you require any command line options other than “gcc -std=gnu99 -O3” you need to tell me that. You may assume that the buffer starts at the start of a cache line. The buffer will be loaded with uniformly chosen random values. Submit code by emailing me. Submit entries before the end of January 2015.
Here’s a slow (but hopefully correct) reference implementation takes about 180 380 microseconds to do the job:
int read_nibble(unsigned long w, int i) { assert(i >= 0 && i < 16); unsigned long res = w >> (i * 4); return res & 0xf; } void write_nibble(unsigned long *w, int i, int v) { assert(i >= 0 && i < 16); unsigned long mask = 0xf; mask <<= (i * 4); *w &= ~mask; unsigned long prom = v; prom <<= (i * 4); *w |= prom; } unsigned long nibble_sort_word(unsigned long arg) { for (int i = 0; i < 16; ++i) { int min = i; for (int j = i+1; j < 16; ++j) { if (read_nibble(arg, j) < read_nibble(arg, min)) min = j; } if (min != i) { int tmp = read_nibble(arg, i); write_nibble(&arg, i, read_nibble(arg, min)); write_nibble(&arg, min, tmp); } } return arg; } void nibble_sort(unsigned long *buf) { for (int i=0; i<1024; i++) buf[i] = nibble_sort_word(buf[i]); }
Update from Feb 1: Ok, contest is over. Thanks for the very impressive submissions, folks! Provisional results with and w/o SIMD are below. These are with turbo boost turned off. I’m now working on the blog post explaining the results, which is going to take some time because there are a lot of solutions here and some of them are very clever. I’m also still working to secure permission to use codes so I can put as many as possible in Github. There are fewer entries in the no-SIMD category because many codes had explicit use of vectors. Please let me know if you think I’ve made any mistakes.
Update from Feb 2: There were a few mistakes and omissions. Results that I hope are final are below.
regehr@regehr-M51AC:nibble_sort$ ./go.sh ns ns / 8 bytes entry name errors 1115 1.09 alexander4 0 1238 1.21 arseny2 0 1596 1.56 arseny1 0 1755 1.71 uwe 0 2630 2.57 pdewacht2 0 4205 4.11 beekman2 0 4466 4.36 beekman1 0 4867 4.75 alexander3 0 6084 5.94 alexander1 0 6364 6.21 koorogi2 0 8472 8.27 falk 0 8520 8.32 jerome 0 10331 10.09 vetter4 0 10374 10.13 jepler 0 10950 10.69 alexander2 0 12096 11.81 vetter3 64 12714 12.42 koorogi1 0 14185 13.85 tom 0 15532 15.17 rjk9 0 16829 16.43 parnell 0 16890 16.49 pdewacht1 0 18804 18.36 chucky 0 20041 19.57 burton2 0 20649 20.17 nadav 0 24908 24.32 bosmans1 60 25104 24.52 bosmans2 0 25486 24.89 vetter2 0 28957 28.28 hans 0 29928 29.23 anon 0 30228 29.52 jrose 0 31007 30.28 carlos 0 32952 32.18 joris 0 34562 33.75 vetter1 0 45440 44.38 mats 0 47511 46.40 frank 0 50162 48.99 robin 0 72762 71.06 grayson 0 74465 72.72 rosen 0 79752 77.88 payer 0 92970 90.79 burton1 0 94343 92.13 mentre 0 95522 93.28 vlad 0 97877 95.58 rogers 0 99077 96.75 mike 0 101913 99.52 bloom 0 103945 101.51 jarkko 0 109191 106.63 karim 0 120928 118.09 justin 0 160599 156.83 mikael 0 200943 196.23 scotty 0 416207 406.45 ref 0 regehr@regehr-M51AC:nibble_sort$ ./go.sh NOSIMD ns ns / 8 bytes entry name errors 8593 8.39 jerome 0 10321 10.08 jepler 0 10346 10.10 vetter4 0 10959 10.70 alexander2 0 12105 11.82 vetter3 64 15537 15.17 rjk9 0 16842 16.45 parnell 0 18847 18.41 chucky 0 19995 19.53 pdewacht1 0 21631 21.12 nadav 0 25516 24.92 vetter2 0 29418 28.73 hans 0 29928 29.23 anon 0 30529 29.81 jrose 0 32956 32.18 joris 0 34286 33.48 falk 0 41028 40.07 burton2 0 42308 41.32 vetter1 0 44415 43.37 bosmans2 0 45480 44.41 mats 0 50179 49.00 robin 0 53189 51.94 frank 0 71285 69.61 grayson 0 73883 72.15 rosen 0 79744 77.88 payer 0 94780 92.56 burton1 0 94970 92.74 mentre 0 95509 93.27 vlad 0 98658 96.35 rogers 0 98922 96.60 mike 0 101722 99.34 bloom 0 104880 102.42 jarkko 0 113147 110.50 karim 0 119168 116.38 justin 0 415529 405.79 ref 0
Update from Feb 3: Here’s the Github repo containing all of the entries — let me know how it looks.
91 responses to “Nibble Sort Programming Contest”
Just to clarify: does the sort need to operate on each 64bit word independently or on all nibbles of the buffer?
i.e. given input [0xbadbeef, 0xbadbeef]
does it produce [0xfeedbba000000000, 0xfeedbba000000000]
or [0xfefeededbbbba0a0, 0000000000000000] ?
The later case seems much less interesting because it can be done very simply by counting nibbles and then writing them in order.
Sorry to be unclear– I meant the first one, words are nibble-sorted individually.
More clarification: Do you want the function signature as given at the top, or as in the reference version?
Oh, I think you must want the former, and I’m just confused by the coincidence of names.
Richard, sorry, I’ve updated the code to be more clear. I can’t believe how slow my code is.
Damn it, John. I had a lecture to prepare! 🙂
Carlos, current record is about 30 microseconds to sort the 1024 words, if that helps :).
Here is a quick test framework I made for writing and testing your contest entries.
https://gist.github.com/DavidEGrayson/f331d8863ef46fa343f8
Just put your algorithn in the “my_nibble_sort” function and the program will show you how fast it is compared to the Regehr algorithm. Regehr’s time will be on the first line of the output, and your time will be on the second.
Does anyone have suggestions on how to do the timing more accurately? I’m just using clock() right now. I made the size of the buffers be 1024*128 instead of 1024 so the timing would be more accurate.
Use clock_gettime() with CLOCK_MONOTONIC for benchmarking. On older systems you must link with -lrt. With that change my benchmarking setup looks a lot like yours, but instead of extending the size of the buffer I repeat the test a number of times and take the smallest observed elapsed time.
Here is my current timing code (not very pretty):
http://pastebin.com/34559reG
I’d appreciate feedback. I can switch from CLOCK_REALTIME to CLOCK_MONOTONIC if that’s a better idea, but I’m more interested in figuring out how to defeat my smart processor. I already turned off frequency scaling. I probably need to disable turbo-boost or whatever it’s called in the BIOS. I probably should pin the test program to a single processor. Anything else?
Just a tiny bit prettier version:
http://pastebin.com/bw7SR9SP
Also I never use hyperthreading.
And my driver script if anyone cares:
http://pastebin.com/QpbDRuhL
If anyone has time and energy to create an optimized nibble sort using STOKE I would be really interested to see the results:
https://github.com/eschkufz/stoke-release
Multi-threading allowed ?
Just one thread please.
A whole legion of Windows programmer are being left out of this challenge. Is there anything specific to gcc you are looking for or is it just infrastructural/toolset limitations?
Dilip, sorry, I’m not trying to exclude people, I just don’t happen to have a Windows machine available for testing. My wife does have a Windows machine but it’s old and doesn’t have Visual Studio installed. It does appear that gcc-4.9.2 is available for Cygwin if that works for you.
I’m not sure if there’s any overhead associated with clock() or clock_gettime(), but rdtsc, or probably rdtscp may give more accurate results. I’ll post a link to my scaffold later.
To add to John’s suggestion about Cygwin, there’s also a lighter option: MinGW-w64, and you can also examine gcc-4.9’s assembly output online at gcc.godbolt.org. I think disclosure of testing environment helps to focus participants and thus boost competition. How dull would it be if the compiler and cflags were not announced, “forcing” people to submit their entries as inline assembly blobs (in case cflags is “-O0”)?
It’s true that if you have access to same software and hardware as John, you’ll have an easier time preparing your entry, but if not, you’re not excluded.
You can still develop the code for this contest in Windows as long as it is cross-platform.
Jason, I get less than 20ns elapsed time for back-to-back calls to clock_gettime(). This is far less than the time required to do the nibble sort of 1024 words, so I’m not inclined to pursue a lower-level timing strategy.
So clock_gettime() isn’t even dropping into the kernel, it looks like. Oddly, when a program using clock_gettime() is linked statically, it gets a different implementation that does use the kernel.
Here’s a fairly naive implementation – no ASM, no SIMD…
How does it do?
https://gist.github.com/RMGiroux/89dfd9eb3582627b293e
How is timing done? Best time you ever see or average of N test runs?
Scotty, I’m taking the best time though that is only correct when we’re 100% free of low outliers, which could happen due to moving between processors with unsynchronized timestamp counters.
Hi Mike, about 87 microseconds on my machine.
I have two entries. First, a serious one: http://emergent.unpythonic.net/files/sandbox/nibblesort/jepler.c which does a counting sort, putting the nibble counts themselves in nibbles so more work is done in registers. (around 20000ns on Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz with gcc 4.9.1)
Second, an entry that is deliberately destructive of play, based on knowledge of the driver program: http://emergent.unpythonic.net/files/sandbox/nibblesort/bad.c (times under 50ns? achieved!)
One aspect of my serious submission is that I realized it initially had a bug for all-nibbles-same, which is currently fixed; I’m not aware of other buggy inputs. maybe your test harness should pass in some specific values likely to hit bugs. Though the next guy probably has different buggy cases than I do.
I have no solution to propose, but can verify (literally) John’s reference is correct (for buffers of 10, not buffers of 1024, admittedly). I cross-checked by also confirming at a byte granularity to make sure no bugs in the nibble extractor code which looked a lot like the code verified.
Of course, good example of the pointlessness of verification sometimes. Hard to see how this is wrong unless assumptions behind any verification are also off.
Alex, very cool. What tool are you using here? But take my word for it that some of the other entries are *much* harder to verify since they drop down a level of abstraction into SIMD land. I’ll publish them in a github repo in early Feb (if the authors agree this is ok).
Jeff, does the null entry really pass my validator? I didn’t check, but have seen exactly this happen before, since the correct answer is just sitting there in a register, and it’s not 100% clear how to defend against this in C. Maybe also run the tests on code compiled at -O0, or just run an UB detector.
I’m just using CBMC, which is why I dropped the buffer size down from 1024 to 10 (I don’t see how it can cause a problem, and that’s a lot of bits for SAT — takes it a few minutes even at 10).
SIMD or inline assembler… I’m not sure there’s really a tool out there that can handle it. I know some prototype commercial stuff that does bit level processor bounded MC, but AFAIK it’s just PowerPC, and not ready for prime time.
At buffer size 10 it’s not too bad:
1591318 variables, 3702680 clauses
SAT checker: negated claim is UNSATISFIABLE, i.e., holds
Runtime decision procedure: 683.586s
VERIFICATION SUCCESSFUL
Is it possible to use clang instead of GCC ? GCC register allocator seem to hate my code and clang is more than 3 times faster (Same options: just -O3 -std=c99).
Out of curiosity, is (potentially) undefined behavior allowed provided that, on the system you specified, the results are predictable (and correct)?
John, bad.c replaces rand() with a function that always returns zero. Since for any all-nibbles-equal input, the output is equal to the input, the function which doesn’t even touch its buffer it passes the test.
Robin, sure, but to keep things simple let’s (like gcc) use the last released version: 3.5.1.
Joshua, UB is fine! I’ll just turn off warnings and hold my nose 🙂
Alex: nice! I need to use CBMC, have never done more than played with it briefly.
Jeff: Cool! I’ve run into this while grading assignments where a bunch of students who didn’t implement some function were mysteriously getting all the points.
I’ve always felt that “SAT checker: negated claim is UNSATISFIABLE, i.e., holds” is one of those lines that undermines any claim we actually want normal developers to use these tools.
Why use clock_gettime() when one can actually count the cycles by reading the clock-counter via the rdtsc-instruction (several runs may be required to get the minumum)?
BTW: Is there an actual application of the nibble-sorting or is it “just academic”? I’m curious.
Thanks.
Alex, is it possible for you to post your use of CBMC for nibble sort verification somewhere?
I am new to CBMC, so I’m not sure how to properly use it.
So far, I did use it to prove incorrect the variant of my counting sort which I knew had a bug for all-nibbles-same numbers. However, with 150 CPU minutes under its belt it hasn’t finished chewing on the version which I think is correct.
My test harness is at https://gist.github.com/jepler/baf8e7da47f24e50bc2e — note that I’ve ensured that only one number is being “sorted” (-DBUFSIZE=1)
Tle last output from cbmc was:
Solving with MiniSAT 2.2.0 with simplifier
180519 variables, 644810 clauses
.. which is fewer variables and clauses than you reported when you did your testing.
Jeff, sometimes you need to try different solvers, Z3, STP, and Boolector are all good choices.
Nobody, as comment #22 says the overhead of clock_gettime() is <20ns, which is absolutely negligible in the context. I’d certainly use the TSC if its use was warranted.
I know of no particular use case for sorting the nibbles in a word! It’s just a fun problem.
I’ll eventually make available a github repo containing all of the entries (assuming the authors agree) and then it would be great to see if different timing harnesses all get the same results.
John, I updated my gist to license the code explicitly as Apache 2.0, and also explicitly grant you the right to aggregate it for benchmarking/pedagogical purposes.
Not that I would have complained if you’d done so without the explicit license, but it’s always nicer when these things are clear and explicit.
since there are 1024 words, then i need to sort 8×1024 nibbles? meaning a nibble on one word can be placed on another word?
ja the problem is to sort the nibbles within each word, see the reference implementation.
John, the contest is about halfway through, how about publishing some sort of “current standings” to make things a bit more interesting to onlookers, and give some idea of contest level for people considering to join in the rest of the week? Does that sound like a good idea?
Alexander, yes, a good idea, I’ll do this tonight.
I should add that I’ve been amazed by the variety and quality of the submissions!
regehr, nobody: Here is a practical application: nibble-based BWT.