More Saturating Arithmetic

In a previous post I guessed that 91 bytes was close to the minimum size for implementing signed and unsigned saturating 32-bit addition and subtraction on x86. A commenter rightly pointed out that it should be possible to do better. Attached is my best shot: 83 bytes. Given the crappy x86 calling convention I’m going to guess it’s hard to do much better unless the saturating SSE instructions offer a way. Or can the branches in the signed functions be avoided? It seems like it, but I couldn’t figure out how.

sat_unsigned_add:
	movl	4(%esp), %eax
	xor     %edx, %edx
	not     %edx			# %edx = 0xffffffff, the unsigned saturation value
	addl	8(%esp), %eax
	cmovb	%edx, %eax		# carry means the add wrapped: clamp to 0xffffffff
	ret

sat_unsigned_sub:
	movl	4(%esp), %eax
	xor     %edx, %edx		# %edx = 0, the saturation value for unsigned subtract
	subl	8(%esp), %eax
	cmovb	%edx, %eax		# borrow means the sub wrapped: clamp to 0
	ret

sat_signed_add:
	movl	4(%esp), %eax
	movl	$2147483647, %edx	# INT_MAX
	movl	$-2147483648, %ecx	# INT_MIN
	addl	8(%esp), %eax
	jno     out1			# no signed overflow: done
	cmovg	%edx, %eax		# positive overflow: clamp to INT_MAX
	cmovl	%ecx, %eax		# negative overflow: clamp to INT_MIN
out1:	ret

sat_signed_sub:
	movl	4(%esp), %eax
	movl	$2147483647, %edx	# INT_MAX
	movl	$-2147483648, %ecx	# INT_MIN
	subl	8(%esp), %eax
	jno     out2			# no signed overflow: done
	cmovg	%edx, %eax		# positive overflow: clamp to INT_MAX
	cmovl	%ecx, %eax		# negative overflow: clamp to INT_MIN
out2:	ret

Another question: How difficult is it to create a superoptimizer that turns obvious C formulations of saturating operations into the code shown here? This is beyond the capabilities of current superoptimizers, I believe, while being obviously feasible.
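For concreteness, here is one obvious C formulation of the four operations (my own transcription of the intended semantics, not code from the post); this is the kind of starting point a superoptimizer would be handed:

```c
#include <stdint.h>

/* Straightforward C versions of the four saturating operations the
 * assembly implements. */

uint32_t sat_unsigned_add(uint32_t a, uint32_t b) {
    uint32_t r = a + b;
    return r < a ? UINT32_MAX : r;   /* wrapped result means overflow */
}

uint32_t sat_unsigned_sub(uint32_t a, uint32_t b) {
    return a < b ? 0 : a - b;        /* clamp to 0 on borrow */
}

int32_t sat_signed_add(int32_t a, int32_t b) {
    int64_t r = (int64_t)a + b;      /* do the math in 64 bits, then clamp */
    if (r > INT32_MAX) return INT32_MAX;
    if (r < INT32_MIN) return INT32_MIN;
    return (int32_t)r;
}

int32_t sat_signed_sub(int32_t a, int32_t b) {
    int64_t r = (int64_t)a - b;
    if (r > INT32_MAX) return INT32_MAX;
    if (r < INT32_MIN) return INT32_MIN;
    return (int32_t)r;
}
```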

Feedback is appreciated.

17 Replies to “More Saturating Arithmetic”

  1. You can do better with two tricks: tail-merging the signed versions (saves a lot) and synthesizing MIN_INT from MAX_INT (saves a couple bytes). New signed versions shown below:

    sat_signed_add:
    movl 4(%esp), %eax
    addl 8(%esp), %eax
    jo 1f
    ret
    .globl sat_signed_sub
    .type sat_signed_sub,%function
    sat_signed_sub:
    movl 4(%esp), %eax
    subl 8(%esp), %eax
    jno 2f
    1:
    movl $2147483647, %edx
    leal 1(%edx), %ecx
    cmovg %edx, %eax
    cmovl %ecx, %eax
    2: ret

    You can save a byte in unsigned add, too:

    sat_unsigned_add:
    movl 4(%esp), %eax
    xor %edx, %edx
    dec %edx
    addl 8(%esp), %eax
    cmovb %edx, %eax
    ret

    dec is one byte shorter than not. That’s 60 bytes.

  2. Nice, I didn’t know x86 had cmovCC, thanks 😀

    I’ve seen Windows binaries that use fastcall as their calling convention internally, so I wouldn’t say the overhead is due to x86.

    You could get rid of the branches in the signed arithmetic by simply cmovno’ing eax into both ecx and edx, but that makes it 6 bytes longer. Jumps are pretty short instructions, so I would expect any non-branching version to come out larger.

    I’m not sure if this would quite work, but I saw some results recently showing that, in effect, you can create compilers for really weird ISAs automatically. The two approaches I saw both used SMT solvers as their backend, and it seems one of those could take the place of a superoptimizer, but I’m not sure anyone would want to wait for such a compile unless they absolutely had to.

  3. I count 84 bytes in John’s solution, and 65 in Nathan’s. With a bit of trivia knowledge you can get 62: loop is a two-byte instruction that decrements %ecx and branches while it is nonzero, so with %ecx freshly zeroed it always branches, and it conveniently leaves %ecx = -1, the unsigned saturation value:

    00000000 <sat_unsigned_add>:
    0: 8b 44 24 04 mov 0x4(%esp),%eax
    4: 31 c9 xor %ecx,%ecx
    6: 03 44 24 08 add 0x8(%esp),%eax
    a: e2 0a loop 16

    0000000c <sat_unsigned_sub>:
    c: 8b 44 24 04 mov 0x4(%esp),%eax
    10: 31 c9 xor %ecx,%ecx
    12: 2b 44 24 08 sub 0x8(%esp),%eax
    16: 0f 42 c1 cmovb %ecx,%eax
    19: c3 ret

    0000001a <sat_signed_add>:
    1a: 8b 44 24 04 mov 0x4(%esp),%eax
    1e: 03 44 24 08 add 0x8(%esp),%eax
    22: 70 0b jo 2f
    24: c3 ret

    00000025 <sat_signed_sub>:
    25: 8b 44 24 04 mov 0x4(%esp),%eax
    29: 2b 44 24 08 sub 0x8(%esp),%eax
    2d: 71 0e jno 3d
    2f: ba ff ff ff 7f mov $0x7fffffff,%edx
    34: 8d 4a 01 lea 0x1(%edx),%ecx
    37: 0f 48 c2 cmovs %edx,%eax
    3a: 0f 49 c1 cmovns %ecx,%eax
    3d: c3 ret

  4. 59 bytes! The trick is to notice that the part under Nathan’s label “1” is executed only when overflow has definitely occurred.

    sat_unsigned_add:
    movl 4(%esp), %eax
    xorl %ecx, %ecx
    addl 8(%esp), %eax
    loop 1f
    sat_unsigned_sub:
    movl 4(%esp), %eax
    xorl %ecx, %ecx
    subl 8(%esp), %eax
    1:
    cmovc %ecx, %eax
    ret
    sat_signed_add:
    movl 4(%esp), %eax
    addl 8(%esp), %eax
    jo 1f
    ret
    sat_signed_sub:
    movl 4(%esp), %eax
    subl 8(%esp), %eax
    jno 2f
    1:
    setns %al
    andl $1, %eax
    addl $0x7fffffff, %eax
    2:
    ret
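    In C terms, the trick under label 1 derives the clamp value from the sign of the wrapped result alone: on signed overflow, the wrapped result’s sign is the opposite of the true result’s. A sketch of what the setns/andl/addl sequence computes (function name is mine):

```c
#include <stdint.h>

/* Equivalent of "setns %al; andl $1, %eax; addl $0x7fffffff, %eax":
 * wrapped result non-negative -> 0x80000000 (INT_MIN bit pattern; negative overflow),
 * wrapped result negative     -> 0x7fffffff (INT_MAX; positive overflow). */
uint32_t overflow_clamp(uint32_t wrapped) {
    return ((int32_t)wrapped >= 0) + 0x7fffffffu;
}
```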

  5. 55 bytes (I didn’t test the code, but it seems fine to me)

    It still “feels big” to me; my feeling is that there are more bytes to be had by connecting the unsigned and signed parts together.

    I rewrote the unsigned part and saved 3 bytes, and found a one-byte optimisation in the signed part.

    (I’m using NASM syntax, but NASM is a bad choice here as it doesn’t optimize opcode sizes)

    USE32

    sat_unsigned_add:
    stc ; cf = 1
    db 0x73 ; 0x73 is the opcode of JNC (a 2-byte instruction)
    ; It doesn’t matter where the jump would go: it is never taken, because CF = 1
    ; The point is that the next “clc” gets swallowed as the displacement byte of this jump

    sat_unsigned_sub:
    clc ; cf = 0

    ; put parameters in eax and ecx
    pop edx
    pop eax
    pop ecx
    push ecx
    push eax
    push edx

    ; edx = 0xFFFFFFFF if unsigned_add, edx = 0 if unsigned_sub
    sbb edx,edx

    jnz perform_add

    perform_sub:
    sub eax,ecx
    jmp short end_unsigned_function

    perform_add:
    add eax,ecx

    end_unsigned_function:
    cmovc eax,edx
    ret

    sat_signed_add:
    mov eax,[esp+4]
    add eax,[esp+8]
    jo l1
    ret
    sat_signed_sub:
    mov eax,[esp+4]
    sub eax,[esp+8]
    jno l2
    l1:
    sets al
    dec al
    adc eax,0x7fffffff
    l2:
    ret

  6. Hello, I’m sfuerst from the Reddit thread. My 45-byte solution is:

    sat_unsigned_add:
    pop %esi
    pop %eax
    xor %ecx, %ecx
    add (%esp), %eax
    loop 1f

    sat_unsigned_sub:
    xor %ecx, %ecx
    pop %esi
    pop %eax
    sub (%esp), %eax
    1: cmovc %ecx, %eax
    2: push %eax
    jmp *%esi
    sat_signed_add:
    pop %esi
    pop %eax
    add (%esp), %eax
    3: jno 2b
    sar $0x1f, %eax
    btc $0x1f, %eax
    jmp 2b
    sat_signed_sub:
    pop %esi
    pop %eax
    sub (%esp), %eax
    jmp 3b

    I wrote an article on this problem just a few days ago… but with the different goal of making all of the functions (including saturating multiplication and division) branch-free. The article is at http://locklessinc.com/articles/sat_arithmetic/ if anyone is interested.
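    The branch-free unsigned cases can be sketched in C with the usual mask trick (this is the general technique, not necessarily the article’s exact code):

```c
#include <stdint.h>

/* Branch-free unsigned saturation via an all-ones/all-zeros mask. */
uint32_t sat_uadd_branchfree(uint32_t a, uint32_t b) {
    uint32_t r = a + b;
    return r | -(uint32_t)(r < a);   /* OR with 0xffffffff on wrap */
}

uint32_t sat_usub_branchfree(uint32_t a, uint32_t b) {
    uint32_t r = a - b;
    return r & -(uint32_t)(r <= a);  /* AND with 0 on borrow */
}
```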

  7. And even better, at 40 bytes is:

    sat_unsigned_add:
    stc
    .byte 0x73
    sat_signed_add:
    clc
    sbb %ecx, %ecx
    pop %esi
    pop %eax
    add (%esp), %eax
    jecxz 3f
    1: cmovc %ecx, %eax
    2: call *%esi

    sat_signed_sub:
    stc
    .byte 0x73
    sat_unsigned_sub:
    clc
    sbb %ecx, %ecx
    pop %esi
    pop %eax
    sub (%esp), %eax
    jecxz 1b
    3: jno 4f
    sar $0x1f, %eax
    btc $0x1f, %eax
    4: call *%esi

  8. A little bit. Most of the code I write for work is in C. However, some of the time I need to create things that can only be done at a lower level, and asm is the only tool that works.

  9. Over on Reddit, Anonymous Cowherd found a way to shave a couple more bytes off by doing:

    entry1:
    .byte 0x0c
    entry2:
    stc

    The machine code at entry1 decodes to “or $0xf9,%al”, which clears the carry flag. Whereas the other entry point sets it. This is a byte shorter than the 0x73 byte method.

    Then I found a way to save a couple more bytes by size-optimizing the signed overflow code. The result is 36 bytes long:

    sat_unsigned_sub:
    .byte 0x0c
    sat_signed_sub:
    stc
    sbb %ecx, %ecx
    pop %esi
    pop %eax
    sub (%esp), %eax
    jecxz 1f
    cmc
    3: jno 2f
    cdq
    xchg %eax, %edx
    rcr $1, %eax
    jmp 2f

    sat_signed_add:
    .byte 0x0c
    sat_unsigned_add:
    stc
    sbb %ecx, %ecx
    pop %esi
    pop %eax
    add (%esp), %eax
    jecxz 3b
    1: cmovc %ecx, %eax
    2: call *%esi

  10. It is highly unlikely that that superoptimizer would produce code like the above. The first trick I don’t think it handles is procedures with multiple entry points; a corollary is that overlapping instructions also aren’t understood. I’m also unsure whether it uses register-specific instructions like jecxz and cdq.

    A better bet may be the original superoptimizer. That one enumerates all possible instruction sequences. However, the drawback there is the wasted time inspecting “obviously” invalid/useless code fragments.

  11. I don’t know who’s owed apples now, but that’s okay, I’ll forgo mine. This has been fun though. 🙂

    I also doubt that the Stanford Superoptimizer could improve the 36-byte code; from their paper it looks like it is only useful for automatically optimizing very short (2- to 5-instruction) sequences, and what we have here is a 20- or 21-instruction sequence, before you even get into the overlapping instructions, the multiple entry points, and the multiple branch instructions (they allow only one conditional branch per sequence). The Stanford program is designed more as a peephole optimizer, for automatically picking the low-hanging fruit from unoptimized compiler output.

    The only place I see the *possibility* of peephole improvement in our code is “cdq; xchg %eax, %edx; rcr $1, %eax”, where there are four bytes of uninterrupted straight-line control flow… but I suspect sfuerst’s version of that code is optimal anyway.
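    For what it’s worth, here is a C model of what that three-instruction sequence computes, with the carry flag made an explicit parameter (names and framing are mine):

```c
#include <stdint.h>

/* C model of "cdq; xchg %eax, %edx; rcr $1, %eax": wrapped is the
 * overflowed sum in %eax, cf is the carry flag, which at that point
 * holds the sign of the true (unwrapped) result. */
uint32_t clamp_from_carry(uint32_t wrapped, unsigned cf) {
    uint32_t mask = (int32_t)wrapped < 0 ? 0xFFFFFFFFu : 0u; /* cdq */
    return (mask >> 1) | ((uint32_t)cf << 31);               /* rcr $1 */
}
```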

  12. Ok, thanks everyone! This was instructive for me; I’m a pretty bad assembly hacker (which is one of the reasons I like doing compiler work so much).

    AC, your criticisms are valid: there’s no way that tool is going to deal with overlapping instructions (though certainly a different superoptimizer could). Even so, I’d hesitate to predict that it would fail to squeeze out a byte somewhere; x86 is pretty complicated. I’d give it a try, but a while ago one of my students couldn’t get the damn thing to work; it appears to have suffered from premature bit rot.

Comments are closed.