More Saturating Arithmetic

In a previous post I guessed that 91 bytes was close to the minimum size for implementing signed and unsigned saturating 32-bit addition and subtraction on x86. A commenter rightly pointed out that it should be possible to do better. Attached is my best shot: 83 bytes. Given the crappy x86 calling convention I’m going to guess it’s hard to do much better unless the saturating SSE instructions offer a way. Or can the branches in the signed functions be avoided? It seems like it, but I couldn’t figure out how.

sat_unsigned_add:
	movl	4(%esp), %eax
	xor     %edx, %edx
	not     %edx			# %edx = 0xffffffff, the unsigned saturation value
	addl	8(%esp), %eax
	cmovb	%edx, %eax		# carry means the add wrapped: clamp to 0xffffffff
	ret

sat_unsigned_sub:
	movl	4(%esp), %eax
	xor     %edx, %edx		# %edx = 0, the saturation value for unsigned subtract
	subl	8(%esp), %eax
	cmovb	%edx, %eax		# borrow means the sub wrapped: clamp to 0
	ret

sat_signed_add:
	movl	4(%esp), %eax
	movl	$2147483647, %edx	# INT_MAX
	movl	$-2147483648, %ecx	# INT_MIN
	addl	8(%esp), %eax
	jno     out1			# no signed overflow: done
	cmovg	%edx, %eax		# positive overflow: clamp to INT_MAX
	cmovl	%ecx, %eax		# negative overflow: clamp to INT_MIN
out1:	ret

sat_signed_sub:
	movl	4(%esp), %eax
	movl	$2147483647, %edx	# INT_MAX
	movl	$-2147483648, %ecx	# INT_MIN
	subl	8(%esp), %eax
	jno     out2			# no signed overflow: done
	cmovg	%edx, %eax		# positive overflow: clamp to INT_MAX
	cmovl	%ecx, %eax		# negative overflow: clamp to INT_MIN
out2:	ret

Another question: How difficult is it to create a superoptimizer that turns obvious C formulations of saturating operations into the code shown here? This is beyond the capabilities of current superoptimizers, I believe, while being obviously feasible.
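For concreteness, here is one obvious C formulation of the four operations (my own transcription of the intended semantics, not code from the post); this is the kind of starting point a superoptimizer would be handed:

```c
#include <stdint.h>

/* Straightforward C versions of the four saturating operations the
 * assembly implements. */

uint32_t sat_unsigned_add(uint32_t a, uint32_t b) {
    uint32_t r = a + b;
    return r < a ? UINT32_MAX : r;   /* wrapped result means overflow */
}

uint32_t sat_unsigned_sub(uint32_t a, uint32_t b) {
    return a < b ? 0 : a - b;        /* clamp to 0 on borrow */
}

int32_t sat_signed_add(int32_t a, int32_t b) {
    int64_t r = (int64_t)a + b;      /* do the math in 64 bits, then clamp */
    if (r > INT32_MAX) return INT32_MAX;
    if (r < INT32_MIN) return INT32_MIN;
    return (int32_t)r;
}

int32_t sat_signed_sub(int32_t a, int32_t b) {
    int64_t r = (int64_t)a - b;
    if (r > INT32_MAX) return INT32_MAX;
    if (r < INT32_MIN) return INT32_MIN;
    return (int32_t)r;
}
```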

Feedback is appreciated.

17 Replies to “More Saturating Arithmetic”

  1. You can do better with two tricks: tail-merging the signed versions (saves a lot) and synthesizing MIN_INT from MAX_INT (saves a couple bytes). New signed versions shown below:

    sat_signed_add:
    movl 4(%esp), %eax
    addl 8(%esp), %eax
    jo 1f
    ret
    .globl sat_signed_sub
    .type sat_signed_sub,%function
    sat_signed_sub:
    movl 4(%esp), %eax
    subl 8(%esp), %eax
    jno 2f
    1:
    movl $2147483647, %edx
    leal 1(%edx), %ecx
    cmovg %edx, %eax
    cmovl %ecx, %eax
    2: ret

    You can save a byte in unsigned add, too:

    sat_unsigned_add:
    movl 4(%esp), %eax
    xor %edx, %edx
    dec %edx
    addl 8(%esp), %eax
    cmovb %edx, %eax
    ret

    dec is one byte shorter than not. That’s 60 bytes.

  2. Nice, I didn’t know x86 had cmovCC, thanks 😀

    I’ve seen Windows binaries that use fastcall as their calling convention internally, so I wouldn’t say the overhead is due to x86.

    You could get rid of the branches in the signed arithmetic by simply cmovno’ing eax into both ecx and edx, but that makes it 6 bytes longer. Jumps are pretty short instructions, so I would expect any non-branching version to come out larger.

    I’m not sure if this would quite work, but I saw some results recently showing that, in effect, you can create compilers for really weird ISAs automatically. The two approaches I saw both used SMT solvers as their backend, and it seems one of those could take the place of a superoptimizer, but I’m not sure anyone would want to wait for such a compile unless they absolutely had to.

  3. I count 84 bytes in John’s solution, and 65 in Nathan’s. With a bit of trivia knowledge you can get 62: loop is a two-byte instruction that decrements %ecx and branches while it is nonzero, so with %ecx freshly zeroed it always branches, and it conveniently leaves %ecx = -1, the unsigned saturation value:

    00000000 <sat_unsigned_add>:
    0: 8b 44 24 04 mov 0x4(%esp),%eax
    4: 31 c9 xor %ecx,%ecx
    6: 03 44 24 08 add 0x8(%esp),%eax
    a: e2 0a loop 16

    0000000c <sat_unsigned_sub>:
    c: 8b 44 24 04 mov 0x4(%esp),%eax
    10: 31 c9 xor %ecx,%ecx
    12: 2b 44 24 08 sub 0x8(%esp),%eax
    16: 0f 42 c1 cmovb %ecx,%eax
    19: c3 ret

    0000001a <sat_signed_add>:
    1a: 8b 44 24 04 mov 0x4(%esp),%eax
    1e: 03 44 24 08 add 0x8(%esp),%eax
    22: 70 0b jo 2f
    24: c3 ret

    00000025 <sat_signed_sub>:
    25: 8b 44 24 04 mov 0x4(%esp),%eax
    29: 2b 44 24 08 sub 0x8(%esp),%eax
    2d: 71 0e jno 3d
    2f: ba ff ff ff 7f mov $0x7fffffff,%edx
    34: 8d 4a 01 lea 0x1(%edx),%ecx
    37: 0f 48 c2 cmovs %edx,%eax
    3a: 0f 49 c1 cmovns %ecx,%eax
    3d: c3 ret

  4. 59 bytes! The trick is to notice that the part under Nathan’s label “1” is executed only when overflow has definitely occurred.

    sat_unsigned_add:
    movl 4(%esp), %eax
    xorl %ecx, %ecx
    addl 8(%esp), %eax
    loop 1f
    sat_unsigned_sub:
    movl 4(%esp), %eax
    xorl %ecx, %ecx
    subl 8(%esp), %eax
    1:
    cmovc %ecx, %eax
    ret
    sat_signed_add:
    movl 4(%esp), %eax
    addl 8(%esp), %eax
    jo 1f
    ret
    sat_signed_sub:
    movl 4(%esp), %eax
    subl 8(%esp), %eax
    jno 2f
    1:
    setns %al
    andl $1, %eax
    addl $0x7fffffff, %eax
    2:
    ret
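    In C terms, the trick under label 1 derives the clamp value from the sign of the wrapped result alone: on signed overflow, the wrapped result’s sign is the opposite of the true result’s. A sketch of what the setns/andl/addl sequence computes (function name is mine):

```c
#include <stdint.h>

/* Equivalent of "setns %al; andl $1, %eax; addl $0x7fffffff, %eax":
 * wrapped result non-negative -> 0x80000000 (INT_MIN bit pattern; negative overflow),
 * wrapped result negative     -> 0x7fffffff (INT_MAX; positive overflow). */
uint32_t overflow_clamp(uint32_t wrapped) {
    return ((int32_t)wrapped >= 0) + 0x7fffffffu;
}
```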

  5. 55 bytes (I didn’t test the code, but it seems fine to me)

    It still “feels big” to me; my feeling is that there are more bytes to be had by connecting the unsigned and signed parts together.

    I rewrote the unsigned part and saved 3 bytes, and found a one-byte optimisation in the signed part.

    (I’m using NASM syntax, but NASM is a bad choice here as it doesn’t optimize opcode sizes)

    USE32

    sat_unsigned_add:
    stc ; cf = 1
    db 0x73 ; 0x73 is the opcode of JNC (a 2-byte instruction)
    ; It doesn’t matter where the jump would go: it is never taken, because CF = 1
    ; The point is that the next “clc” gets swallowed as the displacement byte of this jump

    sat_unsigned_sub:
    clc ; cf = 0

    ; put parameters in eax and ecx
    pop edx
    pop eax
    pop ecx
    push ecx
    push eax
    push edx

    ; edx = 0xFFFFFFFF if unsigned_add, edx = 0 if unsigned_sub
    sbb edx,edx

    jnz perform_add

    perform_sub:
    sub eax,ecx
    jmp short end_unsigned_function

    perform_add:
    add eax,ecx

    end_unsigned_function:
    cmovc eax,edx
    ret

    sat_signed_add:
    mov eax,[esp+4]
    add eax,[esp+8]
    jo l1
    ret
    sat_signed_sub:
    mov eax,[esp+4]
    sub eax,[esp+8]
    jno l2
    l1:
    sets al
    dec al
    adc eax,0x7fffffff
    l2:
    ret

  6. Hello, I’m sfuerst from the Reddit thread. My 45-byte solution is:

    sat_unsigned_add:
    pop %esi
    pop %eax
    xor %ecx, %ecx
    add (%esp), %eax
    loop 1f

    sat_unsigned_sub:
    xor %ecx, %ecx
    pop %esi
    pop %eax
    sub (%esp), %eax
    1: cmovc %ecx, %eax
    2: push %eax
    jmp *%esi
    sat_signed_add:
    pop %esi
    pop %eax
    add (%esp), %eax
    3: jno 2b
    sar $0x1f, %eax
    btc $0x1f, %eax
    jmp 2b
    sat_signed_sub:
    pop %esi
    pop %eax
    sub (%esp), %eax
    jmp 3b

    I wrote an article on this problem just a few days ago… but with the different goal of making all of the functions (including saturating multiplication and division) branch-free. The article is at http://locklessinc.com/articles/sat_arithmetic/ if anyone is interested.
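    The branch-free unsigned cases can be sketched in C with the usual mask trick (this is the general technique, not necessarily the article’s exact code):

```c
#include <stdint.h>

/* Branch-free unsigned saturation via an all-ones/all-zeros mask. */
uint32_t sat_uadd_branchfree(uint32_t a, uint32_t b) {
    uint32_t r = a + b;
    return r | -(uint32_t)(r < a);   /* OR with 0xffffffff on wrap */
}

uint32_t sat_usub_branchfree(uint32_t a, uint32_t b) {
    uint32_t r = a - b;
    return r & -(uint32_t)(r <= a);  /* AND with 0 on borrow */
}
```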

  7. And even better, at 40 bytes is:

    sat_unsigned_add:
    stc
    .byte 0x73
    sat_signed_add:
    clc
    sbb %ecx, %ecx
    pop %esi
    pop %eax
    add (%esp), %eax
    jecxz 3f
    1: cmovc %ecx, %eax
    2: call *%esi

    sat_signed_sub:
    stc
    .byte 0x73
    sat_unsigned_sub:
    clc
    sbb %ecx, %ecx
    pop %esi
    pop %eax
    sub (%esp), %eax
    jecxz 1b
    3: jno 4f
    sar $0x1f, %eax
    btc $0x1f, %eax
    4: call *%esi

  8. A little bit. Most of the code I write for work is in C. However, some of the time I need to create things that can only be done at a lower level, and asm is the only tool that works.

  9. Over on Reddit, Anonymous Cowherd found a way to shave a couple more bytes off by doing:

    entry1:
    .byte 0x0c
    entry2:
    stc

    The machine code at entry1 decodes to “or $0xf9,%al”, which clears the carry flag. Whereas the other entry point sets it. This is a byte shorter than the 0x73 byte method.

    Then I found a way to save a couple more bytes by size-optimizing the signed overflow code. The result is 36 bytes long:

    sat_unsigned_sub:
    .byte 0x0c
    sat_signed_sub:
    stc
    sbb %ecx, %ecx
    pop %esi
    pop %eax
    sub (%esp), %eax
    jecxz 1f
    cmc
    3: jno 2f
    cdq
    xchg %eax, %edx
    rcr $1, %eax
    jmp 2f

    sat_signed_add:
    .byte 0x0c
    sat_unsigned_add:
    stc
    sbb %ecx, %ecx
    pop %esi
    pop %eax
    add (%esp), %eax
    jecxz 3b
    1: cmovc %ecx, %eax
    2: call *%esi

  10. It is highly unlikely that that superoptimizer would produce code like the above. The first trick I don’t think it handles is procedures with multiple entry points; a corollary is that overlapping instructions also aren’t understood. I’m also unsure whether it uses register-specific instructions like jecxz and cdq.

    A better bet may be the original superoptimizer. That one enumerates all possible instruction sequences. However, the drawback there is the wasted time inspecting “obviously” invalid/useless code fragments.

  11. I don’t know who’s owed apples now, but that’s okay, I’ll forgo mine. This has been fun though. 🙂

    I also doubt that the Stanford Superoptimizer could improve the 36-byte code; from their paper it looks like it is only useful for automatically optimizing very short (2- to 5-instruction) sequences, and what we have here is a 20- or 21-instruction sequence, before you even get into the overlapping instructions, the multiple entry points, and the multiple branch instructions (they allow only one conditional branch per sequence). The Stanford program is designed more as a peephole optimizer, for automatically picking the low-hanging fruit from unoptimized compiler output.

    The only place I see the *possibility* of peephole improvement in our code is “cdq; xchg %eax, %edx; rcr $1, %eax”, where there are four bytes of uninterrupted straight-line control flow… but I suspect sfuerst’s version of that code is optimal anyway.
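    For what it’s worth, here is a C model of what that three-instruction sequence computes, with the carry flag made an explicit parameter (names and framing are mine):

```c
#include <stdint.h>

/* C model of "cdq; xchg %eax, %edx; rcr $1, %eax": wrapped is the
 * overflowed sum in %eax, cf is the carry flag, which at that point
 * holds the sign of the true (unwrapped) result. */
uint32_t clamp_from_carry(uint32_t wrapped, unsigned cf) {
    uint32_t mask = (int32_t)wrapped < 0 ? 0xFFFFFFFFu : 0u; /* cdq */
    return (mask >> 1) | ((uint32_t)cf << 31);               /* rcr $1 */
}
```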

  12. Ok, thanks everyone! This was instructive for me; I’m a pretty bad assembly hacker (which is one of the reasons I like doing compiler work so much).

    AC, your criticisms are valid: there’s no way that tool is going to deal with overlapping instructions (though certainly a different superoptimizer could). Even so, I’d hesitate to predict that it would fail to squeeze out a byte somewhere; x86 is pretty complicated. I’d give it a try, but a while ago one of my students couldn’t get the damn thing to work; it appears to have suffered from premature bit rot.

Comments are closed.