# Nibble Sort Programming Contest

The problem is to sort the 4-bit pieces of a 64-bit word with (unsigned) smaller values towards the small end of the word. The nibble sort of 0xbadbeef is 0xfeedbba000000000. The function you implement will perform this sorting operation on a buffer of 1024 64-bit integers:

```void nibble_sort(unsigned long *buf);
```

I’ll give a small prize to the submitter of the entry that performs this operation the fastest on my Core i7-4770. If the winner makes non-trivial use of SIMD, I’ll give a prize to that entry and also to the fastest non-SIMD entry. I’ll use GCC 4.9.2 to compile your C99 code, which may use SIMD intrinsics and also inline assembly if you choose. There’s no requirement for portability. The machine runs Ubuntu 14.04 in 64-bit mode. If you require any command line options other than “gcc -std=gnu99 -O3” you need to tell me that. You may assume that the buffer starts at the start of a cache line. The buffer will be loaded with uniformly chosen random values. Submit code by emailing me. Submit entries before the end of January 2015.

Here’s a slow (but hopefully correct) reference implementation takes about 180 380 microseconds to do the job:

```int read_nibble(unsigned long w, int i) {
assert(i >= 0 && i < 16);
unsigned long res = w >> (i * 4);
return res & 0xf;
}

void write_nibble(unsigned long *w, int i, int v) {
assert(i >= 0 && i < 16);
unsigned long prom = v;
prom <<= (i * 4);
*w |= prom;
}

unsigned long nibble_sort_word(unsigned long arg) {
for (int i = 0; i < 16; ++i) {
int min = i;
for (int j = i+1; j < 16; ++j) {
min = j;
}
if (min != i) {
write_nibble(&arg, min, tmp);
}
}
return arg;
}

void nibble_sort(unsigned long *buf) {
for (int i=0; i<1024; i++)
buf[i] = nibble_sort_word(buf[i]);
}
```

Update from Feb 1: Ok, contest is over. Thanks for the very impressive submissions, folks! Provisional results with and w/o SIMD are below. These are with turbo boost turned off. I’m now working on the blog post explaining the results, which is going to take some time because there are a lot of solutions here and some of them are very clever. I’m also still working to secure permission to use codes so I can put as many as possible in Github. There are fewer entries in the no-SIMD category because many codes had explicit use of vectors. Please let me know if you think I’ve made any mistakes.

Update from Feb 2: There were a few mistakes and omissions. Results that I hope are final are below.

```regehr@regehr-M51AC:nibble_sort\$ ./go.sh
ns     ns / 8 bytes       entry name           errors
1115             1.09       alexander4                0
1238             1.21          arseny2                0
1596             1.56          arseny1                0
1755             1.71              uwe                0
2630             2.57        pdewacht2                0
4205             4.11         beekman2                0
4466             4.36         beekman1                0
4867             4.75       alexander3                0
6084             5.94       alexander1                0
6364             6.21         koorogi2                0
8472             8.27             falk                0
8520             8.32           jerome                0
10331            10.09          vetter4                0
10374            10.13           jepler                0
10950            10.69       alexander2                0
12096            11.81          vetter3               64
12714            12.42         koorogi1                0
14185            13.85              tom                0
15532            15.17             rjk9                0
16829            16.43          parnell                0
16890            16.49        pdewacht1                0
18804            18.36           chucky                0
20041            19.57          burton2                0
24908            24.32         bosmans1               60
25104            24.52         bosmans2                0
25486            24.89          vetter2                0
28957            28.28             hans                0
29928            29.23             anon                0
30228            29.52            jrose                0
31007            30.28           carlos                0
32952            32.18            joris                0
34562            33.75          vetter1                0
45440            44.38             mats                0
47511            46.40            frank                0
50162            48.99            robin                0
72762            71.06          grayson                0
74465            72.72            rosen                0
79752            77.88            payer                0
92970            90.79          burton1                0
94343            92.13           mentre                0
97877            95.58           rogers                0
99077            96.75             mike                0
101913            99.52            bloom                0
103945           101.51           jarkko                0
109191           106.63            karim                0
120928           118.09           justin                0
160599           156.83           mikael                0
200943           196.23           scotty                0
416207           406.45              ref                0
regehr@regehr-M51AC:nibble_sort\$ ./go.sh NOSIMD
ns     ns / 8 bytes       entry name           errors
8593             8.39           jerome                0
10321            10.08           jepler                0
10346            10.10          vetter4                0
10959            10.70       alexander2                0
12105            11.82          vetter3               64
15537            15.17             rjk9                0
16842            16.45          parnell                0
18847            18.41           chucky                0
19995            19.53        pdewacht1                0
25516            24.92          vetter2                0
29418            28.73             hans                0
29928            29.23             anon                0
30529            29.81            jrose                0
32956            32.18            joris                0
34286            33.48             falk                0
41028            40.07          burton2                0
42308            41.32          vetter1                0
44415            43.37         bosmans2                0
45480            44.41             mats                0
50179            49.00            robin                0
53189            51.94            frank                0
71285            69.61          grayson                0
73883            72.15            rosen                0
79744            77.88            payer                0
94780            92.56          burton1                0
94970            92.74           mentre                0
98658            96.35           rogers                0
98922            96.60             mike                0
101722            99.34            bloom                0
104880           102.42           jarkko                0
113147           110.50            karim                0
119168           116.38           justin                0
415529           405.79              ref                0
```

Update from Feb 3: Here’s the Github repo containing all of the entries — let me know how it looks.

## 91 Replies to “Nibble Sort Programming Contest”

1. Alexander Monakov says:

Anders, how so? In BWT you sort strings, not individual bytes/nibbles?

2. Justin L. says:

Replace your call to nibble_sort_word() in nibble_sort() with the code found here:

http://pastebin.com/JQTzyQuR

Should get a 2-2.5x speedup.

3. Justin, your entry is included in the preliminary results above.

4. Jeff,

I lost the code at the moment (it’s not in the directory I thought I developed in) but in the next week or so I’ll either find it or re-write. Note that your harness is doing two sorts — one with John’s reference and one with your sort, which may be fewer SAT vars but may be much more complex than my harness for just John’s reference. Instead of sorting twice (since I’m checked the ref, so where do I get a ref?) I

0) Back up the buffer to be sorted
1) Call John’s code
2) Make sure John’s code produces sorted buffer items
3) Check each sorted buffer item to make sure it’s a permutation of the original

I did it by nibbles and also did the sort check (not permutations of course) by bytes since nibbles sorted implies byte-sorted. The SAT numbers I give are for bytes.

2 + 3 is simpler semantics than comparing two sorts, I suspect. My bigger SAT is just the 10 buffers, which probably are less a problem than the “proof” (UNSAT) construction of equivalence of the two
sorts. I’m also running an old CBMC, maybe that helps đź™‚

5. Joshua Green says:

Jeff, I think your (non-joke) solution is going to be tough to beat, though I’m currently having fun playing with some amusing alternatives. Your initial check for all-nibbles-same is likely to lead to a hardware cyclic shift, hence it’s probably optimal. Still, multiplies are pretty fast on x86, hence it might be fun to test an alternative:

if ((word * 0xfffffffffffffff1) < 16) {
return word;
}

6. Ron Collinson says:

“So clock_gettime() isnâ€™t even dropping into the kernel, it looks like. Oddly, when a program using clock_gettime() is linked statically, it gets a different implementation that does use the kernel.”

I wonder if that function is implemented as a virtual syscall (vDSO), essentially an optimisation to avoid the overhead of a full syscall and switch to kernel mode, which is getting pretty expensive today.

7. Ron and Alexander, thanks! I had not seen this mechanism before.

8. Status update: I’m traveling and have fallen behind in my email responses. Nevertheless, I’ll post the winners on Sunday and write an entry examining the winning strategies as soon as I can.

9. Ron Collinson says:

Thanks Alexander that manpage has lots of intersting details.

That STOKE tool is quite neat, but it is being quite a challenge to go from the popcount hello world example to nibblesort

For performance measurement, I used PAPI (http://icl.cs.utk.edu/papi/):

#define NUM_EVENTS 2
#define NUM_WORDS 1024

char *counter_names[NUM_EVENTS] = { “PAPI_TOT_INS”, “PAPI_TOT_CYC”};
int counter_events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};
long_long counter_values1[NUM_EVENTS], counter_values2[NUM_EVENTS];

assert(PAPI_start_counters(counter_events, NUM_EVENTS) == PAPI_OK);

sort_array(random_word_array, NUM_WORDS);

for (i=0; i < NUM_EVENTS; i++) {
printf("%s/word: %lld\n", counter_names[i],
(counter_values2[i] – counter_values1[i])/1024);
}

11. See provisional results in the updated post.

12. Christian Vetter says:

My last two submissions to not (yet) seem to be included?

13. Alexander Monakov says:

Please note that the results may change a bit before the final announcement, for example my ‘alexander2’ entry should appear in the non-simd category, but nevertheless it shouldn’t change the winner there (so close!). Congratulations to Jeff Epler for securing impressive performance “the classic way”, and big thanks to everyone, it’s been great fun!

14. Christian– I’ll look into it on Monday.

15. Christian, sorry about that! This is fixed now. Your fourth entry is almost the fastest non-SIMD code (it sometimes wins against jepler, sometimes loses) but it fails some of my test cases (I didn’t look at which ones but I’m guessing it is the all-same-nibble case that caused trouble for several other entries).

16. I can’t wait to see (and have explained to me!) the SIMD code that beats my traditional code by nearly 10x.

I did try my hand at writing a SIMD version based on the bitonic sort, but I’m not very practiced at writing SIMD code so I didn’t have anything working by the submission deadline (and my non-working code was only 25% faster than my counting sort code anyway, a lot of time spent getting bits in the right place)

17. Alexander Monakov says:

Jeff, I’ve just submitted comments for the winning simd entry to John. I’m also very curious to see your entry and where those 7% come from! I also used counting sort, but with some different design choices compared to the code you posted openly (which, it appears, you later improved upon).

18. Christian Vetter says:

Hmm, strange, passes on my machine (I ran all-same + a few billion random ones). Maybe a “unsigned long long” vs “unsigned long” / “ULL” missing kind of thing (I was mostly testing on Windows, and since an unsigned long is only 32 bit there, I used unsigned long long)

19. pdewacht says:

I don’t understand the timing difference for pdewacht1. It’s plain C code and the generated code doesn’t contain any SIMD instructions (at least not on my machine).

And the top entries really are stunning. I tried hard to improve on pdewacht2, but I couldn’t find any approach that worked.

20. Christian Vetter says:

Alexander, looking at Jeffs solution, it seems he did something similar to me, minus one optimization: I am reconstructing the solution two-nibble-types at-a-time ( http://pastebin.com/08PXcSF0 ). Maybe he did the same later on.

21. Peter I’ll take a look tonight, and also hopefully get the github repo published, which will remove all remaining mysteries.

Yes, the entries by Arseny and Alexander are sort of shockingly efficient.

22. Christian, Alexander, looking up the shift from a table was a trick I overlooked actually. My base versions is not identical to whta I submitted, but this seems to save 20% on my system (11273 -> 8969 here, so maybe around 8200 on the test machine). The program I submitted counted 3 nibbles at a time and shifted 2. It should be in the git repository when reghr announces it.

23. Christian Vetter says:

I am definitely looking forward to the code. Also it would be interesting how much more you could get out of it by combining a few tricks from different solutions.

24. Alexander Monakov says:

It looks like submissions were processed by a formatting tool (clang-format)? :/

25. Alexander, did it mess up your code? I thought uniform formatting would be nice. I can revert it if it made things messy…

26. Alexander Monakov says:

On one hand, it’s not a big deal and I see value in uniform formatting, on the other hand I don’t like how some “paragraphs” of code now make a narrow page-long column of interleaved statements (imo making it even less comprehensible). I don’t like that the original authors’ style is not shown. I also feel like the fact that it came unannounced is making my feathers more ruffled.

I felt the need to express some disapproval, but I’m explicitely _not_ asking to revert (in any case perhaps let others chance to chime in?). On this point I absolutely won’t mind if you decide that I’m throwing a tantrum and move on; most important is the fact that you’re hosting this fun little thing, and it was good.

For avoidance of doubt, if you decide to later post my commented version, alexander4c.c, I’m asking that such automatic formatting is not applied there.

27. Alexander, here’s what I’ll do: revert the formatting of each entry where someone asks me to do that. I have no wish to destroy the natural beauty of the code!

28. Awesome stuff, Alexander (and runners up)!

I see the efficient transpose is a non-trivial part of the instruction count. But after the sort is complete, have you considered only doing a partial transpose and early-scattering slightly smaller memory transactions to their proper addresses?

No idea if this would squeeze out a few more ns. or reduce performance.

I ask because that kind of trick works well with GPU transposes where tiny 32-byte transactions are still performant and performing a full 128+ byte transpose on data already in registers is fairly complex.

29. Alexander Monakov says:

I second that, really well done, Jerome. I hope you’re reading these comments because I have a question for you, have you tried other table sizes?

Allan, I don’t see a way to make your idea work, because unlike in NV GPUs where you can sort of implicitely compact and/or cut virtual vector lanes to make a narrower memory access, on AVX the only way you get to do that, I think, is when you load/store a lower part of a register. Related to that, there’s no vector scatter insn, and vector gather insns are slow on Haswell.

However, as I said in an earlier email to John, we don’t need an exact transpose _before_ doing the sort, as we only need to get bytes/nibbles into vector lanes, in any order within a lane. So perhaps there’s some room for improvement that way, but I don’t see it.

[ Looking at Arseny’s second entry, his transpose approach (that I wasn’t aware of when writing my code) is tighter on instruction count, but I’m saved by the fact that my mix of shifts/blends/logic ops there can dual- or triple-issue. ]

30. uwe says:

Oh man, I forgot about VPMAXUB. Other than that, my solution is within an inch of alexander4. PBLEND instructions can have higher throughput than PUNPCK, but on the other hand I don’t need the shifts. Fascinating, thank you!

31. Alexander Monakov says:

Is the 64-bit unpack stage actually required in arceny2.c?..

32. Joshua Green says:

I don’t have time to run it right now, so I could be completely off base here, but won’t jepler.c return the wrong value for the input 0x123456789abcdef?

33. Joshua Green says:

If I’m right, then at least there appears to be a simple fix: replace line 91 with

return word & 0xf;

and lines 105 and 106 with

if (counts < 16) {
return counts * 0x1111111111111111;

34. Joshua, I realized that bug and submitted a revised version to John before the deadline, but I guess he didn’t incorporate it in the github version.

Alexander, I found that combining Jerome’s insertion method with code to count 3 nibbles at a time (32kB table) was faster. I think that doing it in two phases like my submission was also a win with the 32kB table.

Just for the sake of curiosity, I “ported” the code to ILP32 systems and benchmarked on a x86_64 desktop with gcc -m32, and also on a 32-bit arm (exynos 4xxx).

Results with NOSIMD -m32 on an x86_64 machine: http://paste.debian.net/143973/

Results on my arm machine, NOSIMD: http://paste.debian.net/143974/ (vetter4 fails probably because it used undefined behavior of shifts on x86)

35. Jeff– I thought that I updated your submission but would double check tonight.

About the question of whether your entry wins the no-SIMD category or Jerome’s does, I’m not really sure — I had been allowing people to submit minor fixes after the deadline and I didn’t know if Jerome’s fastest submission counts as a minor fix to his earlier one or not! Let me know what you think.

36. Hi Jeff, it looks like the entry for you in github is (and has been all along) the correct updated one. Of course you should double check…

37. Joshua Green says:

I took another look at
https://github.com/regehr/nibble-sort/blob/master/jepler.c
and I still don’t see what I’m missing. If word == 0x123456789abcdef, then the initial check in count_nibbles is bypassed, count_nibbles returns 0x1111111111111111, and fill_nibbles(0x1111111111111111) = 0x1111111111111111 due to its initial check.

38. Arseny says:

Alexander, without a 64-bit unpack the transpose is incomplete – since unpack works with 16-byte long portions of the 32-byte vector, you end up with 8×8 matrix stored in 16×4 region (4 16-byte parts of a vector).

I managed to convince myself when I was writing this code that it’s impossible to do a transpose “perfectly” (using 3 passes) with unpacks, although it would be possible to do it with a generic byte shuffle (akin to vec_perm from Altivec). Maybe I’m wrong.

It’s fascinating that we ended up using almost the same solution. It makes sense that blend would be faster than unpack but it’s interesting that the combination of blend+shift is still faster. Congratulations!