The problem is to sort the 4-bit pieces of a 64-bit word with (unsigned) smaller values towards the small end of the word. The nibble sort of 0xbadbeef is 0xfeedbba000000000. The function you implement will perform this sorting operation on a buffer of 1024 64-bit integers:
void nibble_sort(unsigned long *buf);
I’ll give a small prize to the submitter of the entry that performs this operation the fastest on my Core i7-4770. If the winner makes non-trivial use of SIMD, I’ll give a prize to that entry and also to the fastest non-SIMD entry. I’ll use GCC 4.9.2 to compile your C99 code, which may use SIMD intrinsics and also inline assembly if you choose. There’s no requirement for portability. The machine runs Ubuntu 14.04 in 64-bit mode. If you require any command line options other than “gcc -std=gnu99 -O3” you need to tell me that. You may assume that the buffer starts at the start of a cache line. The buffer will be loaded with uniformly chosen random values. Submit code by emailing me. Submit entries before the end of January 2015.
Here’s a slow (but hopefully correct) reference implementation that takes about 380 microseconds to do the job:
int read_nibble(unsigned long w, int i) {
  assert(i >= 0 && i < 16);
  unsigned long res = w >> (i * 4);
  return res & 0xf;
}

void write_nibble(unsigned long *w, int i, int v) {
  assert(i >= 0 && i < 16);
  unsigned long mask = 0xf;
  mask <<= (i * 4);
  *w &= ~mask;
  unsigned long prom = v;
  prom <<= (i * 4);
  *w |= prom;
}

unsigned long nibble_sort_word(unsigned long arg) {
  for (int i = 0; i < 16; ++i) {
    int min = i;
    for (int j = i + 1; j < 16; ++j) {
      if (read_nibble(arg, j) < read_nibble(arg, min))
        min = j;
    }
    if (min != i) {
      int tmp = read_nibble(arg, i);
      write_nibble(&arg, i, read_nibble(arg, min));
      write_nibble(&arg, min, tmp);
    }
  }
  return arg;
}

void nibble_sort(unsigned long *buf) {
  for (int i = 0; i < 1024; i++)
    buf[i] = nibble_sort_word(buf[i]);
}
Update from Feb 1: Ok, contest is over. Thanks for the very impressive submissions, folks! Provisional results with and w/o SIMD are below. These are with turbo boost turned off. I’m now working on the blog post explaining the results, which is going to take some time because there are a lot of solutions here and some of them are very clever. I’m also still working to secure permission to use codes so I can put as many as possible in Github. There are fewer entries in the no-SIMD category because many codes had explicit use of vectors. Please let me know if you think I’ve made any mistakes.
Update from Feb 2: There were a few mistakes and omissions. Results that I hope are final are below.
regehr@regehr-M51AC:nibble_sort$ ./go.sh
    ns  ns / 8 bytes  entry name  errors
  1115      1.09      alexander4       0
  1238      1.21      arseny2          0
  1596      1.56      arseny1          0
  1755      1.71      uwe              0
  2630      2.57      pdewacht2        0
  4205      4.11      beekman2         0
  4466      4.36      beekman1         0
  4867      4.75      alexander3       0
  6084      5.94      alexander1       0
  6364      6.21      koorogi2         0
  8472      8.27      falk             0
  8520      8.32      jerome           0
 10331     10.09      vetter4          0
 10374     10.13      jepler           0
 10950     10.69      alexander2       0
 12096     11.81      vetter3         64
 12714     12.42      koorogi1         0
 14185     13.85      tom              0
 15532     15.17      rjk9             0
 16829     16.43      parnell          0
 16890     16.49      pdewacht1        0
 18804     18.36      chucky           0
 20041     19.57      burton2          0
 20649     20.17      nadav            0
 24908     24.32      bosmans1        60
 25104     24.52      bosmans2         0
 25486     24.89      vetter2          0
 28957     28.28      hans             0
 29928     29.23      anon             0
 30228     29.52      jrose            0
 31007     30.28      carlos           0
 32952     32.18      joris            0
 34562     33.75      vetter1          0
 45440     44.38      mats             0
 47511     46.40      frank            0
 50162     48.99      robin            0
 72762     71.06      grayson          0
 74465     72.72      rosen            0
 79752     77.88      payer            0
 92970     90.79      burton1          0
 94343     92.13      mentre           0
 95522     93.28      vlad             0
 97877     95.58      rogers           0
 99077     96.75      mike             0
101913     99.52      bloom            0
103945    101.51      jarkko           0
109191    106.63      karim            0
120928    118.09      justin           0
160599    156.83      mikael           0
200943    196.23      scotty           0
416207    406.45      ref              0

regehr@regehr-M51AC:nibble_sort$ ./go.sh NOSIMD
    ns  ns / 8 bytes  entry name  errors
  8593      8.39      jerome           0
 10321     10.08      jepler           0
 10346     10.10      vetter4          0
 10959     10.70      alexander2       0
 12105     11.82      vetter3         64
 15537     15.17      rjk9             0
 16842     16.45      parnell          0
 18847     18.41      chucky           0
 19995     19.53      pdewacht1        0
 21631     21.12      nadav            0
 25516     24.92      vetter2          0
 29418     28.73      hans             0
 29928     29.23      anon             0
 30529     29.81      jrose            0
 32956     32.18      joris            0
 34286     33.48      falk             0
 41028     40.07      burton2          0
 42308     41.32      vetter1          0
 44415     43.37      bosmans2         0
 45480     44.41      mats             0
 50179     49.00      robin            0
 53189     51.94      frank            0
 71285     69.61      grayson          0
 73883     72.15      rosen            0
 79744     77.88      payer            0
 94780     92.56      burton1          0
 94970     92.74      mentre           0
 95509     93.27      vlad             0
 98658     96.35      rogers           0
 98922     96.60      mike             0
101722     99.34      bloom            0
104880    102.42      jarkko           0
113147    110.50      karim            0
119168    116.38      justin           0
415529    405.79      ref              0
Update from Feb 3: Here’s the Github repo containing all of the entries — let me know how it looks.
91 responses to “Nibble Sort Programming Contest”
Anders, how so? In BWT you sort strings, not individual bytes/nibbles?
Replace your call to nibble_sort_word() in nibble_sort() with the code found here:
http://pastebin.com/JQTzyQuR
Should get a 2-2.5x speedup.
Justin, your entry is included in the preliminary results above.
Jeff,
I lost the code at the moment (it’s not in the directory I thought I developed in), but in the next week or so I’ll either find it or re-write it. Note that your harness is doing two sorts, one with John’s reference and one with your sort, which may mean fewer SAT vars but may be much more complex than my harness for just John’s reference. Instead of sorting twice (since I’m checking the ref itself, where would I get a second ref?) I:
0) Back up the buffer to be sorted
1) Call John’s code
2) Make sure John’s code produces sorted buffer items
3) Check each sorted buffer item to make sure it’s a permutation of the original
I did it by nibbles and also did the sort check (not permutations of course) by bytes since nibbles sorted implies byte-sorted. The SAT numbers I give are for bytes.
2 + 3 is simpler semantics than comparing two sorts, I suspect. My bigger SAT is just the 10 buffers, which probably are less a problem than the “proof” (UNSAT) construction of equivalence of the two
sorts. I’m also running an old CBMC, maybe that helps 🙂
Jeff, I think your (non-joke) solution is going to be tough to beat, though I’m currently having fun playing with some amusing alternatives. Your initial check for all-nibbles-same is likely to lead to a hardware cyclic shift, hence it’s probably optimal. Still, multiplies are pretty fast on x86, hence it might be fun to test an alternative:
if ((word * 0xfffffffffffffff1) < 16) {
return word;
}
“So clock_gettime() isn’t even dropping into the kernel, it looks like. Oddly, when a program using clock_gettime() is linked statically, it gets a different implementation that does use the kernel.”
I wonder if that function is implemented as a virtual syscall (vDSO), essentially an optimisation to avoid the overhead of a full syscall and the switch to kernel mode, which is getting pretty expensive today.
https://lwn.net/Articles/446528/
https://lwn.net/Articles/615809/
Ron, indeed it’s supplied in a vdso by the kernel. You can see the list of vdso functions (varies by architecture) here for example: http://man7.org/linux/man-pages/man7/vdso.7.html
Ron and Alexander, thanks! I had not seen this mechanism before.
Status update: I’m traveling and have fallen behind in my email responses. Nevertheless, I’ll post the winners on Sunday and write an entry examining the winning strategies as soon as I can.
Thanks Alexander, that manpage has lots of interesting details.
That STOKE tool is quite neat, but it is proving quite a challenge to go from the popcount hello-world example to nibble sort.
For performance measurement, I used PAPI (http://icl.cs.utk.edu/papi/):
#define NUM_EVENTS 2
#define NUM_WORDS 1024
char *counter_names[NUM_EVENTS] = { "PAPI_TOT_INS", "PAPI_TOT_CYC"};
int counter_events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};
long_long counter_values1[NUM_EVENTS], counter_values2[NUM_EVENTS];
assert(PAPI_start_counters(counter_events, NUM_EVENTS) == PAPI_OK);
assert(PAPI_read_counters(counter_values1, NUM_EVENTS) == PAPI_OK);
sort_array(random_word_array, NUM_WORDS);
assert(PAPI_read_counters(counter_values2, NUM_EVENTS) == PAPI_OK);
for (i=0; i < NUM_EVENTS; i++) {
printf("%s/word: %lld\n", counter_names[i],
(counter_values2[i] - counter_values1[i])/1024);
}
See provisional results in the updated post.
My last two submissions do not (yet) seem to be included?
Please note that the results may change a bit before the final announcement, for example my ‘alexander2’ entry should appear in the non-simd category, but nevertheless it shouldn’t change the winner there (so close!). Congratulations to Jeff Epler for securing impressive performance “the classic way”, and big thanks to everyone, it’s been great fun!
Christian, I’ll look into it on Monday.
Christian, sorry about that! This is fixed now. Your fourth entry is almost the fastest non-SIMD code (it sometimes wins against jepler, sometimes loses) but it fails some of my test cases (I didn’t look at which ones but I’m guessing it is the all-same-nibble case that caused trouble for several other entries).
I can’t wait to see (and have explained to me!) the SIMD code that beats my traditional code by nearly 10x.
I did try my hand at writing a SIMD version based on the bitonic sort, but I’m not very practiced at writing SIMD code, so I didn’t have anything working by the submission deadline (and my non-working code was only 25% faster than my counting-sort code anyway; a lot of time was spent getting bits in the right place).
Jeff, I’ve just submitted comments for the winning simd entry to John. I’m also very curious to see your entry and where those 7% come from! I also used counting sort, but with some different design choices compared to the code you posted openly (which, it appears, you later improved upon).
Hmm, strange, it passes on my machine (I ran all-same plus a few billion random inputs). Maybe an “unsigned long long” vs. “unsigned long” / missing-“ULL” kind of thing (I was mostly testing on Windows, and since an unsigned long is only 32 bits there, I used unsigned long long).
I don’t understand the timing difference for pdewacht1. It’s plain C code and the generated code doesn’t contain any SIMD instructions (at least not on my machine).
And the top entries really are stunning. I tried hard to improve on pdewacht2, but I couldn’t find any approach that worked.
Alexander, looking at Jeffs solution, it seems he did something similar to me, minus one optimization: I am reconstructing the solution two-nibble-types at-a-time ( http://pastebin.com/08PXcSF0 ). Maybe he did the same later on.
Peter I’ll take a look tonight, and also hopefully get the github repo published, which will remove all remaining mysteries.
Yes, the entries by Arseny and Alexander are sort of shockingly efficient.
Christian, Alexander, looking up the shift from a table was a trick I overlooked, actually. My base version is not identical to what I submitted, but this seems to save 20% on my system (11273 -> 8969 here, so maybe around 8200 on the test machine). The program I submitted counted 3 nibbles at a time and shifted 2. It should be in the git repository when regehr announces it.
I am definitely looking forward to the code. Also, it would be interesting to see how much more you could get out of it by combining a few tricks from different solutions.
Github repo is here:
https://github.com/regehr/nibble-sort
Blog post will take a few more days!
It looks like submissions were processed by a formatting tool (clang-format)? :/
Alexander, did it mess up your code? I thought uniform formatting would be nice. I can revert it if it made things messy…
On one hand, it’s not a big deal and I see value in uniform formatting, on the other hand I don’t like how some “paragraphs” of code now make a narrow page-long column of interleaved statements (imo making it even less comprehensible). I don’t like that the original authors’ style is not shown. I also feel like the fact that it came unannounced is making my feathers more ruffled.
I felt the need to express some disapproval, but I’m explicitly _not_ asking to revert (in any case, perhaps give others a chance to chime in?). On this point I absolutely won’t mind if you decide that I’m throwing a tantrum and move on; most important is the fact that you’re hosting this fun little thing, and it was good.
For the avoidance of doubt, if you decide to later post my commented version, alexander4c.c, I’m asking that such automatic formatting not be applied there.
Alexander, here’s what I’ll do: revert the formatting of each entry where someone asks me to do that. I have no wish to destroy the natural beauty of the code!
Awesome stuff, Alexander (and runners up)!
I see the efficient transpose is a non-trivial part of the instruction count. But after the sort is complete, have you considered only doing a partial transpose and early-scattering slightly smaller memory transactions to their proper addresses?
No idea if this would squeeze out a few more ns. or reduce performance.
I ask because that kind of trick works well with GPU transposes where tiny 32-byte transactions are still performant and performing a full 128+ byte transpose on data already in registers is fairly complex.
Jerome, my hat’s off to you.
I second that, really well done, Jerome. I hope you’re reading these comments because I have a question for you, have you tried other table sizes?
Allan, I don’t see a way to make your idea work, because unlike on NV GPUs, where you can sort of implicitly compact and/or cut virtual vector lanes to make a narrower memory access, on AVX the only way you get to do that, I think, is when you load/store a lower part of a register. Related to that, there’s no vector scatter insn, and vector gather insns are slow on Haswell.
However, as I said in an earlier email to John, we don’t need an exact transpose _before_ doing the sort, as we only need to get bytes/nibbles into vector lanes, in any order within a lane. So perhaps there’s some room for improvement that way, but I don’t see it.
[ Looking at Arseny’s second entry, his transpose approach (that I wasn’t aware of when writing my code) is tighter on instruction count, but I’m saved by the fact that my mix of shifts/blends/logic ops there can dual- or triple-issue. ]
Oh man, I forgot about VPMAXUB. Other than that, my solution is within an inch of alexander4. PBLEND instructions can have higher throughput than PUNPCK, but on the other hand I don’t need the shifts. Fascinating, thank you!
Is the 64-bit unpack stage actually required in arseny2.c?
I don’t have time to run it right now, so I could be completely off base here, but won’t jepler.c return the wrong value for the input 0x123456789abcdef?
If I’m right, then at least there appears to be a simple fix: replace line 91 with
return word & 0xf;
and lines 105 and 106 with
if (counts < 16) {
return counts * 0x1111111111111111;
Joshua, I realized that bug and submitted a revised version to John before the deadline, but I guess he didn’t incorporate it in the github version.
Alexander, I found that combining Jerome’s insertion method with code to count 3 nibbles at a time (32kB table) was faster. I think that doing it in two phases like my submission was also a win with the 32kB table.
Just for the sake of curiosity, I “ported” the code to ILP32 systems and benchmarked on an x86_64 desktop with gcc -m32, and also on a 32-bit ARM (Exynos 4xxx).
Results with NOSIMD -m32 on an x86_64 machine: http://paste.debian.net/143973/
Results on my ARM machine, NOSIMD: http://paste.debian.net/143974/ (vetter4 fails, probably because it relies on shift behavior that is undefined in C but happens to work on x86)
Jeff, I thought that I updated your submission but will double check tonight.
About the question of whether your entry wins the no-SIMD category or Jerome’s does, I’m not really sure: I had been allowing people to submit minor fixes after the deadline, and I don’t know if Jerome’s fastest submission counts as a minor fix to his earlier one or not! Let me know what you think.
Hi Jeff, it looks like the entry for you in github is (and has been all along) the correct updated one. Of course you should double check…
I took another look at
https://github.com/regehr/nibble-sort/blob/master/jepler.c
and I still don’t see what I’m missing. If word == 0x123456789abcdef, then the initial check in count_nibbles is bypassed, count_nibbles returns 0x1111111111111111, and fill_nibbles(0x1111111111111111) = 0x1111111111111111 due to its initial check.
Alexander, without a 64-bit unpack the transpose is incomplete: since unpack works within 16-byte portions of the 32-byte vector, you end up with an 8×8 matrix stored in a 16×4 region (4 16-byte parts of a vector).
I managed to convince myself when I was writing this code that it’s impossible to do a transpose “perfectly” (using 3 passes) with unpacks, although it would be possible to do it with a generic byte shuffle (akin to vec_perm from Altivec). Maybe I’m wrong.
It’s fascinating that we ended up using almost the same solution. It makes sense that blend would be faster than unpack but it’s interesting that the combination of blend+shift is still faster. Congratulations!