Nibble Sort Denouement – Embedded in Academia

Back in January my nibble sort contest resulted in entries that dramatically exceeded my expectations. Since then I’ve been trying to write up a post explaining the various strategies that people used and since you don’t care about my excuses I won’t tell you them, but I never got it written. However! I want to point everyone to three great blog entries:

Jethro Beekman explains his approach based on a sorting network
Jordan Rose walks us through the series of optimizations that he implemented, and also explains the fastest non-SIMD entry
Hans Wennborg adds an explanation of Alexander Monakov’s tour de force which can sort the nibbles in a 64-bit word in just over 1ns

The thing that I liked most about this contest is that the fastest entries ended up being way faster than I predicted — mostly because I didn’t understand how excellent the Haswell vector unit is. I’ve sent my standard programming contest prize (Amazon gift certificate covering the cost of a paper copy of Hacker’s Delight) to Alexander and Jeff (Jerome’s non-SIMD entry, alas, arrived a bit after the contest expired). Thanks again everyone!

Final note: as a compiler researcher I think these contest entries could be very useful as a case study in how variance among algorithms and among implementations of algorithms can lead to extreme performance differences. In the long run we need compilers that understand how to automatically switch between different versions such as these. Regular compiler technology won’t cut it, nor will (at present) program synthesis. But it’s a good goal.

One response to “Nibble Sort Denouement”

Jeff Epler says:

July 20, 2015 at 4:50 pm

My hat remains off to Jerome, who figured out a neat trick I overlooked. (and of course to the SIMD entries!)

I recently got my hands on a 64-bit ARM machine (dragonboard 410c) and ran the nibble sort benchmark there. Interestingly, alexander2 and vetter4 pull ahead of jepler and jerome; jepler is faster than jerome after tweaking its main lookup table to operate 2 nibbles at a time instead of three. I didn’t look whether the same change can be easily done to jerome’s code. My results: https://gist.github.com/jepler/333668bb8a3a4e92580b