Left-shift of signed integers in C99, C11, and C++11 is difficult to use because shifting a 1 bit into or past the sign bit (assuming two’s complement, of course) is an undefined behavior. Many medium and large C and C++ programs do this. For example, many codes use 1<<31 for INT_MIN. IOC can detect this problem, but it happens so commonly, and developers are so uninterested in hearing about it, that we added command line flags that tell IOC to not check for it, in order to avoid annoying people.
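For concreteness, here is the idiom next to well-defined alternatives; a minimal sketch assuming 32-bit int:

```c
#include <limits.h>

/* The problematic idiom: shifting a 1 into the sign bit of a 32-bit int
   is undefined behavior in C99, C11, and C++11. */
int bad_min = 1 << 31;            /* undefined behavior */

/* Well-defined ways to get the same value: */
int good_min = INT_MIN;           /* portable and clear */
int also_min = (int)(1U << 31);   /* the unsigned shift is defined; the
                                     conversion back to int is merely
                                     implementation-defined */
```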
Today a reader pointed me to this active issue that the C++ standards committee is evaluating. Basically the proposal is to make it permissible to shift a 1 into the sign bit, but not past it. The effect will be to un-break a significant number of C/C++ programs that currently execute undefined behavior when compiled as C99, C11, or C++11. Issue 1457 is marked as “ready,” meaning (as far as I can tell) that it stands a good chance of being ratified and incorporated into future versions of the C11 and C++11 standards. This seems like a step in the right direction.
20 responses to “Slightly More Sensible Signed Left-Shifts in C11 and C++11”
It seems to me rather crazy to change the standard in this way:
why should (1<<30) * 2 and (1<<30) << 1 lead to different behavior?
@abramo: because multiplying by 2 and left-shifting by 1 are not the same operation? Multiplication respects sign (mathematically speaking, at least), bit-shifting doesn’t (the sign bit is the same as any other bit). One operates on the value, the other on the binary representation.
It seems peculiar to say “since (1 << 30) * 2 is undefined behavior, (1 << 30) << 1 must be undefined too”. You should rather be happy one of these is no longer undefined. Note also that *because* the signed overflow case is undefined, the compiler is allowed to exhibit the same behavior if it wants to.
Of course, all this is assuming 32-bit ints, but this assumption is one of those things “developers are uninterested in hearing about”. In fairness, INT_MIN is of course 1) more characters and 2) not clever.
The proposal would allow left shifts — essentially multiplications by 2ⁿ — to change the sign of a number, but not to shift an already negative number, even when the product would be perfectly representable. Isn’t this a little inconsistent?
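To make the asymmetry concrete, a sketch assuming 32-bit two’s complement int:

```c
int a = 1 << 31;   /* a 1 shifted into the sign bit: permitted under the
                      proposed resolution, with an implementation-defined
                      result */
int b = -2 << 1;   /* shifting an already negative value: still undefined,
                      even though the product -4 is representable */
```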
Changing the standard to permit shifting a 1 into the sign bit will do nothing to help software which does this by accident. All modern systems use two’s complement integer representation, and overflowing shifts behave the same way. With the de facto standard made a requirement, an accidental overflow will still produce just as incorrect a value.
This change to the spec would also have the effect of implicitly mandating two’s complement integers where C currently allows for other representations as well. While not necessarily a bad thing, effectively dropping support for certain classes of hardware should probably be done with some caution.
Something I’d much rather see is a standardised right shift of negative values so the sign extending behaviour seen in every modern implementation can be relied upon.
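One well-known workaround, a sketch assuming two’s complement (not something from the comment itself), builds a sign-extending right shift out of fully defined operations:

```c
/* Arithmetic (sign-extending) right shift without relying on the
   implementation-defined behavior of >> on negative operands.
   Assumes two's complement and 0 <= n < width of int. */
int asr(int x, int n)
{
    return x < 0 ? ~(~x >> n) : x >> n;  /* ~x is non-negative when x < 0,
                                            so ~x >> n is fully defined */
}
```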
Am I interpreting the specification incorrectly, or does it really say that the sign bit can get set just because there is a 1 somewhere in the value that gets shifted through the sign bit? I think that is what Mattias was saying, too.
If it really does, it seems to me that the proper thing to do is to save the state of the sign bit *before* any shifting is done, do the shift, and then restore the sign bit to the saved state. That way the sign bit does not get set merely because a 1 passes through it.
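A sketch of what that scheme might look like; the helper is hypothetical and assumes a 32-bit two’s complement int:

```c
int shift_keeping_sign(int x, int n)
{
    unsigned u = (unsigned)x;
    unsigned sign  = u & 0x80000000u;          /* save the sign bit */
    unsigned value = (u << n) & 0x7FFFFFFFu;   /* shift the value bits only */
    return (int)(value | sign);                /* restore the saved sign;
                                                  the conversion back to int
                                                  is implementation-defined
                                                  when the sign bit is set */
}
```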
@Jeroen Mostert: I guess you’re saying that to mandate in the C/C++ standard that << on signed does a logical shift (and not an arithmetic shift) is a good thing, and I might agree with you.
But IMHO the truth is that with the proposal contained in DR1457 we’d go nowhere: it is only a rather arbitrary exception to the current standard, where the intent was that you cannot know whether the shift performed is logical or arithmetic.
The arithmetic/logical distinction exists only for right shifts, not left.
@Mans: sorry I’ve abused terminology.
I meant that for left shift we’d have two families of sensible behaviours:
1) logical shift
2) arithmetic shift, i.e. x << y interpreted as x * 2^y
My opinion is that taking only a minimal, incomplete step toward 1) is not going to help and risks confusing things.
“Basically the proposal is to make it permissible to shift a 1 into the sign bit, but not past it.” — From the wording of the proposed resolution, it sounds like they are considering turning it from undefined behavior to implementation-defined behavior (because the conversion of 0x80000000u to signed int is implementation-defined). That certainly doesn’t help “unbreak” any real-world programs; those that have always worked will continue to work, and those that require portability to the DeathStation 9000 will continue to have behavior that is under-specified by the standard.
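To spell out the implementation-defined step, a sketch assuming 32-bit int:

```c
unsigned u = 0x80000000u;
int i = (int)u;   /* the value doesn't fit in int, so the result is
                     implementation-defined (or an implementation-defined
                     signal is raised); typical two's complement
                     implementations yield INT_MIN */
```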
IMHO as a compiler developer, the time is loooooong past for the C and C++ committees to mandate two’s complement for integer arithmetic. I really don’t understand what holds them back. The only effect of such a pronouncement would be that a handful of dangerous optimizations would have to be disabled by default, and the vendors of C-to-INTERCAL compilers would have to take the “ISO approved!” stickers off their packaging (and replace them with “Mostly ISO approved!” stickers). Both of these seem like really good things for the world.
@abramo: I still don’t understand what you’re trying to propose. A left-shift by N that does not overflow is exactly equivalent to a multiplication by 2^N. If it does overflow, then so would the multiplication, and you get undefined behaviour either way.
Arthur, I agree, specifying two’s complement would fix a lot of silly problems with C/C++.
For now, I’d be happy to use a compiler that considered shifting a signed value to be a compile-time error.
@Mans (10): I believe abramo’s option 2 corresponds to the current C89/C99 behavior of left-shift, and abramo’s option 1 corresponds to his desired behavior — in which for example (1 << 31 == INT_MIN) and (1 << 100 == 0).
C11 seems poised to take a *small* step toward option 1; namely, they’re going to (sort-of, not-really) define the behavior of (1 << 31). But the behavior of (1 << n) for n > 31 will remain completely undefined, as will the behavior of (1 << n) for n < 0. This small step will break a lot of implementations, but it won’t make the language significantly more well-defined: the absolute fraction of 32-bit operands for which the behavior of left-shift is well-defined will go from approximately 0.000000722 percent to 0.000000745 percent.
Side note: Mandating (1 << 100 == 0) would indeed have a performance penalty on at least one common architecture: x86. For hardware reasons, the x86 SHL instruction silently ignores the high bits of its right-hand operand; thus (1 << 100) will naturally equal (1 << 4), unless the compiler inserts tests and branches to handle the saturation case. However, IMHO this is a cost greatly outweighed by the benefit of predictable program behavior.
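A sketch of the check a compiler might have to emit for a variable shift if (1 << 100 == 0) were mandated (the helper name is hypothetical):

```c
unsigned shl_saturating(unsigned x, unsigned n)
{
    if (n >= 32)        /* the branch the x86 hardware doesn't provide:
                           SHL masks the count mod 32 instead */
        return 0;
    return x << n;      /* in-range counts map directly to SHL */
}
```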
Arthur, it’s worth noting that even Java got shifts badly wrong: shifting left by 100 is equivalent to shifting left by 4.
Arthur, adding a test and branch to every shift is *not* acceptable from a performance point of view. It is furthermore not necessary in many cases where it is more or less trivial to prove statically that the shift amount is within the defined range, even if the compiler is unable to do so. Besides, whatever rules are defined for overflowing arithmetic or shifts, if it actually happens, the application will most likely malfunction regardless.
Mans, many shifts are by constant amounts, so the branches wouldn’t necessarily be happening all that often.
It would be nice if processors supported a comprehensive set of trapping math operations, including shifts, to reduce the overhead of sane semantics.
I was talking about variable shifts; I should have made that more explicit.
There are many things it would be nice if processors could do. Until they actually do these things, we should not define languages (and certainly not the C language) in ways that rely on them for an efficient implementation.
In fact, I would argue that leaving the behaviour unspecified is a good thing as this allows us to create special tools, such as IOC, which trap these operations and warn us about them. The moment we give overflows defined semantics, programmers are free to use them intentionally, and such tools will cause perfectly valid programs to fail.
Mans, I agree that this is an obscure benefit of undefined behavior. Java, for example, mandates the shift-by-mod-32 behavior.
I remain unconvinced that very many real codes would suffer unduly from checked shifts. What kind of codes do you have in mind? Maybe I’ll hack IOC to check only shift amounts and we can see…
Software for any kind of data compression using variable-length codes comes to mind.
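A minimal sketch of that pattern, hypothetical rather than taken from any real codec: a bit reader whose hot path performs a variable shift for every decoded symbol.

```c
/* Read n bits (0 < n < 32) from a big-endian bit buffer. Word-boundary
   handling is omitted for brevity; the shift amount varies per call. */
unsigned read_bits(const unsigned *buf, unsigned *bitpos, unsigned n)
{
    unsigned word  = buf[*bitpos >> 5];          /* containing 32-bit word */
    unsigned shift = 32 - n - (*bitpos & 31);    /* variable shift amount  */
    *bitpos += n;
    return (word >> shift) & ((1u << n) - 1);
}
```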
@Mans (14, 18): I dare you to find some real code that would be significantly slowed down by the requirement that (1 << n) be well-defined for n > 31. I suspect that 99% of real code has constants on the RHS anyway, and at least 50% of the remainder uses explicit masks such as “x = (y << (z & 31))” which are trivial for the compiler to analyze. Finally, in order for the branch to cause a “significant” slowdown, you’d have to be doing the shift inside a tight loop — yet using a different shift amount each time, or else the branch would likely be hoisted out of the loop.
I know this argument sounds a lot like “Given a sufficiently smart compiler…”, but as a compiler dev, none of the optimizations I described above sounds far-fetched. I do *have* some far-fetched pet ideas, such as a compiler that could understand “assert(0 < mask && mask <= 31);” as an optimization hint, but I don’t think we need to invoke those ideas in order to deal with real-world shift code.
Arthur, did you read my last comment (18)? Decompressing huffman-coded (or any other variable-length-coded) data involves variable shifts in a tight loop.
You are probably right that the majority of shifts are constant, though I don’t think the ratio is quite as high as you suggest. In libavcodec about 1/3 of all shift statements are variable (I don’t know about runtime). Admittedly, libavcodec is probably higher than average in variable shifts (due to the compressed data handling), but it still serves as a data point.
You then state that most variable shifts would mask the amount to a valid range. Why on earth would anyone do that? Either you know that the value is safe and masking would have no effect, or you’ll get incorrect results regardless. Only in very special circumstances would you want to shift something by the low-order bits of a large value.
As for smart compilers, modern ones do understand assert-like hints.
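For instance, a sketch using the GCC/Clang __builtin_unreachable() extension (not something mentioned in the thread): a range hint that lets the compiler omit any out-of-range handling for the shift.

```c
unsigned shift_hinted(unsigned x, unsigned n)
{
    if (n >= 32)
        __builtin_unreachable();  /* promise to the compiler: n is always
                                     in [0, 31] */
    return x << n;                /* now free to compile to a bare shift */
}
```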