Rotating a computer integer is just like shifting, except that when a bit falls off one end of the register, it is used to fill the vacated bit position at the other end. Rotate is used in encryption and decryption, so we want it to be fast. The obvious C/C++ code for left rotate is:
uint32_t rotl32a (uint32_t x, uint32_t n) { return (x<<n) | (x>>(32-n)); }
Modern compilers will recognize this code and emit a rotate instruction, if available:
rotl32a: movb %sil, %cl roll %cl, %edi movl %edi, %eax ret
The source and assembly code for right rotate are analogous.
Although this x86-64 rotate instruction will have the expected behavior (acting as a nop) when asked to rotate by 0 or 32 places, this C/C++ function must only be called with n in the range 1-31. Outside of that range, the code has undefined behavior due to shifting a 32-bit value by 32 places. In general, crypto codes are designed to avoid rotating 32-bit words by 32 places. However, as seen in my last post, the current versions of five of the 10 open source crypto libraries that I examined perform rotations by 0 places — a potentially serious bug because C/C++ compilers have unpredictable results when shifting past bitwidth. Bear in mind that the well-definedness of the eventual instruction does not rescue the situation: it is the source code that places (or in this case, fails to place) obligations upon the compiler.
Here’s a safer rotate function:
uint32_t rotl32b (uint32_t x, uint32_t n) { assert (n<32); if (!n) return x; return (x<<n) | (x>>(32-n)); }
This one protects against the expected case where n is zero by avoiding the erroneous shift and also against the disallowed case of rotating by too many places. If we don’t want the overhead of the assert for production compiles, we can define NDEBUG in the preprocessor. The problem with this code is that (even with the assert compiled out) it contains a branch, which is sort of ugly. Clang generates the obvious branching code whereas GCC chooses to use a conditional move:
rotl32b: movl %edi, %eax movb %sil, %cl roll %cl, %eax testl %esi, %esi cmove %edi, %eax ret
I figured that was the end of the story but then I saw this version:
uint32_t rotl32c (uint32_t x, uint32_t n) { assert (n<32); return (x<<n) | (x>>(-n&31)); }
This one is branchless, meaning that it results in a single basic block of code, potentially making it easier to optimize. At the moment it defeats Clang’s rotate recognizer:
rotl32c: movl %esi, %ecx movl %edi, %eax shll %cl, %eax negl %ecx shrl %cl, %edi orl %eax, %edi movl %edi, %eax ret
However, they are working on this.
GCC (built today) has the desired output:
rotl32c: movl %edi, %eax movb %sil, %cl roll %cl, %eax ret
Here is their discussion of that issue.
Summary: If you want to write portable C/C++ code (that is, you don’t want to use intrinsics or inline assembly) and you want to correctly deal with rotating by zero places, the rotl32c() function from this post, and its obvious right-rotate counterpart, are probably the best choices. GCC already generates excellent code and the LLVM people are working on it.
28 responses to “Safe, Efficient, and Portable Rotate in C/C++”
How well do things handle:
uint32_t rotl32c (uint32_t x, uint32_t n)
{
n &= 31;
return (x<>(32-n));
}
And reformatted to avoid HTML mucking it up:
return (x>>(32-n)) | (x<<n);
bcs, this one (let’s call it rotl32d) may be the new winner. gcc gives the same code as rotl32c, clang gives better code than rotl32c, and IMO it’s clearer than rotl32c.
Wait, nevermind, roll32d executes undefined behavior in the zero case!
Shift by *zero* is UB? I’m not finding any sources for that.
If n==0, then n &= 31 leaves n unchanged. Then, 32-n = 32.
Oh!!!, then it needs to be:
return (x >> ((32-n)&31)) | x << (n&31);
The last one is a wash: basically the same code as rotl32c from each compiler.
Can I just say this situation is crazy? You have a simple, common programming task. In order to implement it in C, you have to write some unreadable bit-masking code. This code is not designed to be executed as such, but only be recognized as rotate by the compiler.
Hi David, I certainly see your point of view, and agree that the rules in C/C++ are heavily biased towards making the compiler’s job easier, as opposed to the programmer’s, but I might phrase it a bit differently.
How about this: The code is designed to unambiguously express the rotate function in the context of the somewhat arcane restrictions imposed by C/C++. Furthermore, it has the purpose of being recognizably a rotate that can be exploited by the compiler to generate an appropriate instruction should it be available (and certainly it will not be on some platforms).
So the purpose is not only to be recognized as a rotate. In that case, we have better ways to say the same thing, e.g. __builtin_rotate(x,n). The difference is that the code in this post gives the compiler enough information to implement a correct rotate in all circumstances, including those where it has an instruction and those where it doesn’t.
I don’t mean to be a C/C++ apologist, I hate all this stuff. But even so, I think it’s valuable to document the current best practice (as I understand it). If 5 out of 10 crypto library authors are doing it wrong, then clearly there’s a need for this information to be put out there.
John — I didn’t mean to downplay the importance of this blog post. I have certainly needed to write high-performance rotate code in C, and I definitely wish I had seen this posting when I struggled to do it the right way. I am angry that there is not a simple, direct way to express this instead of having to go through this complicated charade.
David, agreed.
Rotation constants in cryptographic code are in an overwhelming amount of cases known at compile time (RC6 is one of the few exceptions). This allows us to resolve this at compile time with either macros (C), or templates (C++). C++11 example:
template
using rot_t = std::integral_constant;
template
uint32_t rol(uint32_t x, rot_t)
{
return (x <> ((32-c)&31));
}
// x = rol(x, rot());
This way, the backend will be able to properly optimize the rotation. And since the type of the constant is distinct of plain uint32_t, this can be used side-by-side with a non-compile-time function that does the proper checks.
The (-n&31) trick to eliminate the branch is neat, but doesn’t C++ still permit one’s complement representation for signed integers, in which case it gets the wrong answer (e.g., for n=1 we get (-n&31) == 30)?
Hi pabigot, my understanding is that uint32_t has to be two’s complement, even on a machine where a regular int is ones’ complement.
Good timing for this post and the last one, as I just reviewed a patch adding centralized rotation methods to Mozilla’s code (to cut down on duplicates overall) last week, with the shift-by-0 error in them, then stumbled across these posts only a day or two later. 🙂
It looks like all Mozilla’s rotation uses rotate by constants. So there’s no issue with existing code in these ways, just a lurking one for future uses. (Does non-constant rotation have use cases? None spring immediately to mind.) I too agree with David that it’s kind of sad C/C++ make it so difficult to do this.
Note that negating an unsigned type triggers compiler warnings with some compilers (MSVC particularly), because negating a non-negative number and getting back a non-negative number is a bit confusing. (Notwithstanding that it’s well-defined in C/C++. But only sometimes — thanks, integral promotion!) So even if every compiler optimized rotl32c properly, I still probably wouldn’t use it.
regehr: uint32_t is unsigned so it can’t represent negative values and therefore isn’t two’s complement on any architecture. My thought was that taking the negative of it would undergo integral conversion to a signed value. I think 0-n would do that, but I see that unary negative applied to an unsigned integral value (in C++11 at least) does produce an unsigned integral value (defined to be 2^b – n when b is the number of bits in the promoted type of n).
So this trick should work on a one’s complement system (in at least C++11, and assuming integral promotion doesn’t somehow enter undefined behavior space).
You can avoid UB rotating a little each time:
unsigned int rol_ub(unsigned int value, unsigned int amount)
{
assert(amount > 0);
assert(amount < 32);
return (value << amount) | (value >> (32 – amount));
}
unsigned int rol_step(unsigned int value, unsigned int amount)
{
unsigned result = value;
result = rol_ub(result, 1 + amount/2);
result = rol_ub(result, 1 + (amount – amount/2));
result = rol_ub(result, 30);
return result;
}
Neither clang nor gcc fuse the rotations, though.
The language used in the standard says unsigned integral types behave as the integers modulo 2^n. This is equivalent to 2’s complement.
pabigot: 0-n works just as well as -n: the 0 is promoted to the same unsigned type as n and then the subtraction is done using the rules for unsigned types, which are that the result has one less than the maximum number representable in that type added or subtracted until the result is in range.
(This is as long as the type of n is at least as wide as ‘unsigned int’ – if it’s narrower, it may be promoted to signed int, but that problem also exists for the -n variant).
Err I meant “one more tahn the maximum number representable…” of course.
“The language used in the standard says unsigned integral types behave as the integers modulo 2^n. This is equivalent to 2′s complement.”
Yes, ISO/IEC 14882:2011 section 3.9.1 graf 4 specifies that unsigned integer types obey the laws of arithmetic modulo 2^N (first sentence above). Inferring from this any operational relationship to 2’s complement (second sentence above) is, I believe, a mistake. I suppose you could say that it’s equivalent to [arithmetic in] two’s complement as long as the resulting value remains within the non-negative portion of the two’s complement value range, but that’s not a useful generalization.
It is still permissible for implementations to use other representations for signed integer types (3.9.1 graf 7), and there are still situations where an unsigned operand is converted to a signed type. For example, had the code been (0-n)&0x1F I believe the wrong answer would be obtained on implementations that (a) had sizeof(int)>sizeof(uint32_t) (5 graf 10) and (b) used one’s complement representation for signed integral values. The fact that n happened to be an unsigned integral type is irrelevant because it would have been promoted to a signed value.
My error was in assuming -n was the same as 0-n, which involves arithmetic conversion and thus may produce a negative signed value. For unsigned integral types the expression -n (see 5.3.1 graf 8) is not the same as the expression 0-n (see 5.7 et alia). The code of rotl32c never introduces signed values so the question of signed representation is moot, and I’m sorry my confusion suggested the issue arose.
pabigot: The unary – operator performs the integer promotions on its operand. As you note, if int can represent all the values of uint32_t, the integer promotions will promote n to int.
In this situation though (where ‘int’ is wider than 32 bits), the left shift is also problematic because of the promotion of x! A fully portable version must convert the uint32_t x value to unsigned long before shifting (in fact the n function argument could/should simply be unsigned int).
kme: Yes, I hadn’t thought that all the way through. 4.5 graf 1 and 2 makes clear that “small” unsigned types will be promoted to signed int, potentially opening up a big can of mess.
However, I’m not entirely convinced for the case of -n versus 0-n because 5.3.1 graf 8 specifies the calculation for negative of an “unsigned quantity” to be subtraction from 2^n which will produce a positive value, unlike subtraction from 0 which produces a non-positive value. Does the fact the promoted type is signed override the fact that the operand is known to be an “unsigned quantity”? If not, the negation and subsequent bitmask still produces the right value regardless of representation of negative integral values.
(I also agree that the type of n would better have been unsigned int to begin with.)
That the value-to-be-rotated might become signed as a result of integral promotion is a different question that I hadn’t considered, and I agree (per 5.8 graf 2) that this means a left shift that overflows evokes undefined behavior. Though I believe a cast of x to unsigned int would be sufficient in the case in question (where the rank of uint32_t is less than the rank of int); no need to go all the way to unsigned long.
Fortunately C++11 has the expressive power to ensure the operation is always well-defined, through static_assert so the implementation of the function can verify at compile-time the conditions required, or std::enable_if or material in to specialize an alternative implementation that does any necessary casts.
> Furthermore, it has the purpose of being recognizably a rotate that can be exploited by the compiler to generate an appropriate instruction should it be available (and certainly it will not be on some platforms).
There are certainly platforms without rotate instructions, C is useful on these platform OK, now can these platforms really run C++?
I really doubt it!
So while C’s behaviour is understandable, C++ should provide a rotate operation.
Why should C++ provide a rotate operation when it’s so easy to do it yourself?
Potential irony aside, https://gist.github.com/pabigot/7550454 eliminates all sources of undefined behavior that I’ve identified, and will produce a compile-time error if the algorithm’s preconditions are not met. It also generates similarly optimal code under current gcc (-std=c++11 -O3) as is produced for rotl32c. I like C++11.
I believe comments on the approach can be made on the gist page if anybody finds a mistake.
I suggest casting x to unsigned long because that type must be at least 32 bits wide, whereas unsigned int does not.
pabigot, very cool. My C++ is getting further and further behind…