In a previous post I guessed that 91 bytes was close to the minimum size for implementing signed and unsigned saturating 32-bit addition and subtraction on x86. A commenter rightly pointed out that it should be possible to do better. Attached is my best shot: 83 bytes. Given the crappy x86 calling convention, I'm going to guess it's hard to do much better unless the saturating SSE instructions offer a way. Or can the branches in the signed functions be avoided? It seems like it, but I couldn't figure out how.
sat_unsigned_add:
movl 4(%esp), %eax
xor %edx, %edx
not %edx
addl 8(%esp), %eax
cmovb %edx, %eax
ret
sat_unsigned_sub:
movl 4(%esp), %eax
xor %edx, %edx
subl 8(%esp), %eax
cmovb %edx, %eax
ret
sat_signed_add:
movl 4(%esp), %eax
movl $2147483647, %edx
movl $-2147483648, %ecx
addl 8(%esp), %eax
jno out1
cmovg %edx, %eax
cmovl %ecx, %eax
out1: ret
sat_signed_sub:
movl 4(%esp), %eax
movl $2147483647, %edx
movl $-2147483648, %ecx
subl 8(%esp), %eax
jno out2
cmovg %edx, %eax
cmovl %ecx, %eax
out2: ret
Another question: How difficult is it to create a superoptimizer that turns obvious C formulations of saturating operations into the code shown here? This is beyond the capabilities of current superoptimizers, I believe, while being obviously feasible.
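For concreteness, here is roughly the kind of "obvious" C formulation I have in mind (a sketch; one way to write it that avoids signed-overflow undefined behavior, though the details could vary):

#include <stdint.h>

uint32_t sat_unsigned_add_c(uint32_t a, uint32_t b) {
    /* wraparound means the sum came out smaller than an operand */
    return (a + b < a) ? UINT32_MAX : a + b;
}

uint32_t sat_unsigned_sub_c(uint32_t a, uint32_t b) {
    return (a < b) ? 0 : a - b;
}

int32_t sat_signed_add_c(int32_t a, int32_t b) {
    if (b > 0 && a > INT32_MAX - b) return INT32_MAX;
    if (b < 0 && a < INT32_MIN - b) return INT32_MIN;
    return a + b;
}

int32_t sat_signed_sub_c(int32_t a, int32_t b) {
    if (b < 0 && a > INT32_MAX + b) return INT32_MAX;
    if (b > 0 && a < INT32_MIN + b) return INT32_MIN;
    return a - b;
}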
Feedback is appreciated.
17 responses to “More Saturating Arithmetic”
You can do better with two tricks: tail-merging the signed versions (saves a lot) and synthesizing MIN_INT from MAX_INT (saves a couple bytes). New signed versions shown below:
sat_signed_add:
movl 4(%esp), %eax
addl 8(%esp), %eax
jo 1f
ret
.globl sat_signed_sub
.type sat_signed_sub,%function
sat_signed_sub:
movl 4(%esp), %eax
subl 8(%esp), %eax
jno 2f
1:
movl $2147483647, %edx
leal 1(%edx), %ecx
cmovg %edx, %eax
cmovl %ecx, %eax
2: ret
You can save a byte in unsigned add, too:
sat_unsigned_add:
movl 4(%esp), %eax
xor %edx, %edx
dec %edx
addl 8(%esp), %eax
cmovb %edx, %eax
ret
dec is one byte shorter than not. That's 60 bytes.
Nice, Nathan!
Nice, I didn’t know x86 had cmovCC, thanks ๐
I’ve seen windows binaries that use fastcall as their calling convention internally, so I wouldn’t say the overhead is due to x86.
You could get rid of the branches in the signed arithmetic by simply cmovno'ing eax into both ecx and edx, but that makes it 6 bytes longer. Jumps are pretty short instructions, so I would expect any non-branching version to be larger.
I’m not sure if this would quite work, but I saw some results recently that, in effect, showed you could create compilers for really weird ISAs automatically. The two approaches I saw both used SMT solvers as their backend, and it seems that could take the place of a superoptimizer, but I’m not sure anyone would want to wait for such a compile unless you absolutely have to.
I count 84 bytes in John's solution, and 65 in Nathan's. With a bit of trivia knowledge (loop decrements %ecx and branches if the result is nonzero, all without touching the flags, so here it always branches and conveniently leaves 0xffffffff, the unsigned-add saturation value, in %ecx) you can get 62:
00000000 <sat_unsigned_add>:
0: 8b 44 24 04 mov 0x4(%esp),%eax
4: 31 c9 xor %ecx,%ecx
6: 03 44 24 08 add 0x8(%esp),%eax
a: e2 0a loop 16
0000000c <sat_unsigned_sub>:
c: 8b 44 24 04 mov 0x4(%esp),%eax
10: 31 c9 xor %ecx,%ecx
12: 2b 44 24 08 sub 0x8(%esp),%eax
16: 0f 42 c1 cmovb %ecx,%eax
19: c3 ret
0000001a <sat_signed_add>:
1a: 8b 44 24 04 mov 0x4(%esp),%eax
1e: 03 44 24 08 add 0x8(%esp),%eax
22: 70 0b jo 2f
24: c3 ret
00000025 <sat_signed_sub>:
25: 8b 44 24 04 mov 0x4(%esp),%eax
29: 2b 44 24 08 sub 0x8(%esp),%eax
2d: 71 0e jno 3d
2f: ba ff ff ff 7f mov $0x7fffffff,%edx
34: 8d 4a 01 lea 0x1(%edx),%ecx
37: 0f 48 c2 cmovs %edx,%eax
3a: 0f 49 c1 cmovns %ecx,%eax
3d: c3 ret
59 bytes! The trick is to notice that the part under Nathan's label "1" is executed only when overflow has definitely occurred, so the sign flag alone tells you which way to saturate.
sat_unsigned_add:
movl 4(%esp), %eax
xorl %ecx, %ecx
addl 8(%esp), %eax
loop 1f
sat_unsigned_sub:
movl 4(%esp), %eax
xorl %ecx, %ecx
subl 8(%esp), %eax
1:
cmovc %ecx, %eax
ret
sat_signed_add:
movl 4(%esp), %eax
addl 8(%esp), %eax
jo 1f
ret
sat_signed_sub:
movl 4(%esp), %eax
subl 8(%esp), %eax
jno 2f
1:
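# Overflow definitely occurred here, so SF says which way: setns
# leaves 1 in %al when the wrapped result is non-negative (two
# negatives overflowed), 0 when it is negative; the and clears the
# garbage upper bits of %eax, and adding 0x7fffffff then produces
# 0x80000000 or 0x7fffffff.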
setns %al
andl $1, %eax
addl $0x7fffffff, %eax
2:
ret
AC- Nice!
Also see Reddit, where sfuerst has proposed an even shorter solution: http://www.reddit.com/r/programming/comments/eqbey/smallest_saturating_arithmetic_functions_in_x86/.
55 bytes (I didn’t test code but seems fine to me)
Still “feels big” to me, my feeling is that there are still bytes to get by connecting the unsigned and signed parts together.
I rewrote the unsigned part and won 3 bytes
I found a one-byte optimisation in the signed part
(I’m using NASM assembler convention but NASM is a bad choice as it doesn’t optimize opcode size)
USE32
sat_unsigned_add:
stc ; cf = 1
db 0x73 ; opcode 0x73 = JNC (2 bytes)
; It doesn’t matter where it would jump as the jump is never taken because CF = 1
; The goal is that the next "clc" gets ignored because it becomes the displacement byte of this never-taken jump
sat_unsigned_sub:
clc ; cf = 0
; put parameters in eax and ecx
pop edx
pop eax
pop ecx
push ecx
push eax
push edx
; edx = 0xFFFFFFFF if unsigned_add, edx = 0 if unsigned_sub
sbb edx,edx
jnz perform_add
perform_sub:
sub eax,ecx
jmp short end_unsigned_function
perform_add:
add eax,ecx
end_unsigned_function:
cmovc eax,edx
ret
sat_signed_add:
mov eax,[esp+4]
add eax,[esp+8]
jo l1
ret
sat_signed_sub:
mov eax,[esp+4]
sub eax,[esp+8]
jno l2
l1:
sets al
dec al
adc eax,0x7fffffff
l2:
ret
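For anyone who wants to sanity-check snippets like these (several went untested), a minimal C driver along the following lines works; it assumes the routine has been assembled separately and linked in as a 32-bit cdecl function, and shows only the unsigned add:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Assembly routine under test, assembled separately and linked in
   (32-bit cdecl); the name is taken from the listings above. */
extern uint32_t sat_unsigned_add(uint32_t a, uint32_t b);

/* Reference implementation. */
static uint32_t ref_unsigned_add(uint32_t a, uint32_t b) {
    return (a + b < a) ? UINT32_MAX : a + b;
}

int main(void) {
    /* Edge cases around 0, INT32_MAX, INT32_MIN, and UINT32_MAX. */
    static const uint32_t v[] = { 0, 1, 2, 0x7ffffffeu, 0x7fffffffu,
                                  0x80000000u, 0x80000001u,
                                  0xfffffffeu, 0xffffffffu };
    unsigned n = sizeof v / sizeof v[0];
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < n; j++) {
            uint32_t got = sat_unsigned_add(v[i], v[j]);
            uint32_t want = ref_unsigned_add(v[i], v[j]);
            if (got != want)
                printf("FAIL: %08" PRIx32 " + %08" PRIx32
                       " -> %08" PRIx32 ", want %08" PRIx32 "\n",
                       v[i], v[j], got, want);
        }
    return 0;
}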
Hello I’m sfuerst from the Reddit post. My 45 byte solution is:
sat_unsigned_add:
pop %esi
pop %eax
xor %ecx, %ecx
add (%esp), %eax
loop 1f
sat_unsigned_sub:
xor %ecx, %ecx
pop %esi
pop %eax
sub (%esp), %eax
1: cmovc %ecx, %eax
2: push %eax
jmp *%esi
sat_signed_add:
pop %esi
pop %eax
add (%esp), %eax
3: jno 2b
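# On overflow the sign of the wrapped result is the opposite of the
# true sign: sar fills %eax with that sign bit (0 or -1), and btc
# then flips bit 31, producing 0x80000000 or 0x7fffffff respectively.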
sar $0x1f, %eax
btc $0x1f, %eax
jmp 2b
sat_signed_sub:
pop %esi
pop %eax
sub (%esp), %eax
jmp 3b
I’ve just written an article on this problem a few days ago… but with the different goal of making all of the functions (including saturating multiplication and division) branch free. The article is at http://locklessinc.com/articles/sat_arithmetic/ if anyone is interested.
And even better, at 40 bytes is:
sat_unsigned_add:
stc
.byte 0x73
sat_signed_add:
clc
sbb %ecx, %ecx
pop %esi
pop %eax
add (%esp), %eax
jecxz 3f
1: cmovc %ecx, %eax
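# call does the push plus the indirect jump in one two-byte
# instruction; the return address it pushes is never used, it merely
# puts %esp back where the caller expects it.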
2: call *%esi
sat_signed_sub:
stc
.byte 0x73
sat_unsigned_sub:
clc
sbb %ecx, %ecx
pop %esi
pop %eax
sub (%esp), %eax
jecxz 1b
3: jno 4f
sar $0x1f, %eax
btc $0x1f, %eax
4: call *%esi
Sfuerst- Amazing. Do you write asm as part of your day job?
A little bit. Most of the code I write for work is in C. However, some of the time I need to create things that can only be done at a lower level, and asm is the only tool that works.
Over on Reddit, Anonymous Cowherd found a way to shave a couple more bytes off by doing:
entry1:
.byte 0x0c
entry2:
stc
The machine code at entry1 decodes to "or $0xf9,%al", which clears the carry flag, whereas the other entry point sets it. This is a byte shorter than the 0x73-byte method.
Then I found a way to save a couple more bytes by size-optimizing the signed overflow code. The result is 36 bytes long:
sat_unsigned_sub:
.byte 0x0c
sat_signed_sub:
stc
sbb %ecx, %ecx
pop %esi
pop %eax
sub (%esp), %eax
jecxz 1f
cmc
3: jno 2f
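# cdq copies the wrapped result's sign into %edx; the xchg then
# leaves 0 or -1 in %eax, and rcr rotates CF into bit 31, producing
# 0x80000000 or 0x7fffffff. The cmc on the subtract path makes CF
# the complement of SF in the overflow cases; on the add path it
# already is.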
cdq
xchg %eax, %edx
rcr $1, %eax
jmp 2f
sat_signed_add:
.byte 0x0c
sat_unsigned_add:
stc
sbb %ecx, %ecx
pop %esi
pop %eax
add (%esp), %eax
jecxz 3b
1: cmovc %ecx, %eax
2: call *%esi
Cool. If this remains the winning entry, do I have to send apples to both of you? 🙂
The Stanford Superoptimizer is designed to find optimal x86 instruction sequences; I wonder if it could improve this code?
http://theory.stanford.edu/~sbansal/superoptimizer.html
It is highly unlikely that that superoptimizer would give code like the above. The first trick I don't think it handles is procedures with multiple entry points; a corollary is that overlapping instructions also aren't understood. I'm also unsure whether it uses register-specific instructions like jecxz and cdq.
A better bet may be the original superoptimizer. That one enumerates all possible instruction sequences. However, the drawback there is the wasted time inspecting “obviously” invalid/useless code fragments.
I don’t know who’s owed apples now, but that’s okay, I’ll forgo mine. This has been fun though. ๐
I also doubt that the Stanford Superoptimizer could improve the 36-byte code; from their paper it looks like they think it's only useful for automatically optimizing very short (2- to 5-instruction) sequences, and what we have here is a 20- or 21-instruction sequence, before you even get into the stuff about overlapping instructions, multiple entry points, multiple branch instructions (they only allow one conditional branch per sequence)… The Stanford program is designed more as a peephole optimizer, for automatically picking the low-hanging fruit from unoptimized compiler output.
The only place I see the *possibility* of peephole improvement in our code is "cdq; xchg %eax, %edx; rcr $1, %eax", where there are four bytes of uninterrupted straight-line control flow… but I suspect sfuerst's version of that code is probably optimal anyway.
OK, thanks everyone! This was instructive for me; I'm a pretty bad assembly hacker (this is one of the reasons I like doing compiler work so much).
AC, your criticisms are valid; there's no way that tool is going to deal with overlapping instructions (though certainly a different superoptimizer could). Even so, I'd hesitate to predict that it would fail to squeeze out a byte somewhere; x86 is pretty complicated. I'd give it a try, but a while ago one of my students couldn't get the damn thing to work; it appears to have suffered from premature bit rot.