Skip to content

A Quick Coding Contest: Convert String to Integer Without Overflow

UPDATE: As of Saturday March 9, the contest is closed. Results will be posted in a few days. I’ve created a Github repository containing all submissions and my test harness.

Regular readers will know that I’m obsessed with integer overflows. One apparently simple piece of code that many programmers get wrong in this respect is converting strings to integers. It is slightly tricky. The goal of this contest is to submit the nicest (judged by me) implementation in C for this specification:

/*
 * if input matches ^-?[0-9]+\0$ and the resulting integer is
 * representable as a long, return the integer; otherwise if
 * the input is a null-terminated string, set error to 1 and 
 * return any value; otherwise behavior is undefined
 */
extern int error;
long str2long (const char *);

I deliberately didn’t name the function atol or strtol since its semantics are slightly nonstandard. The regular expression above probably isn’t standard either; it is simply trying to specify a null-terminated string containing only a signed decimal integer of arbitrary size. No other characters (even whitespace) are permitted.

A few extra things:

  • Don’t call library functions; your code should perform the conversion itself
  • It is fine to use constants from limits.h
  • You should write conforming C code that works on x86 and x86-64. I’ll probably only test entries using GCC and Clang but even so you are not allowed to use extensions such as GCC’s 128-bit integers
  • You may assume that implementation-defined behaviors are the standard ones for the x86 and x86-64 platforms on Linux machines
  • Your code must not execute undefined behaviors (use a recent Clang to check for this)

Please mail entries to me, not only to avoid spoiling the fun but also because WordPress loves to mangle C code left in the comments. Feel free to leave comments that don’t include code, of course. I’ll announce the winner in a few days, and will send the winner a small prize. By submitting an entry into this contest you are agreeing to let me use your entry in a subsequent blog post, either as a positive or negative example.

NOTE: The best Clang flags to use are: -fsanitize=integer -fno-sanitize=unsigned-integer-overflow

UPDATE: As of Saturday March 9, the contest is closed. Results will be posted in a few days. I’ve created a Github repository containing all submissions and my test harness.

{ 58 } Comments

  1. Pascal Cuoq | March 5, 2013 at 4:44 pm | Permalink

    Of course, since it is impossible in nearly all dialects of C to detect that a const char * points to a zero-terminated string, what you actually mean when you write :

    if input matches ^-?[0-9]+$ and the resulting integer is
    representable as a long, return the integer; otherwise set
    error to 1 and return any value

    is:

    if input matches ^-?[0-9]+$ and the resulting integer is
    representable as a long, return the integer; otherwise if
    the input is a zero-terminated string, set
    error to 1 and return any value; otherwise do undefined behavior.

    right?

  2. regehr | March 5, 2013 at 5:13 pm | Permalink

    Right Pascal! Specification is updated to your version.

  3. Peter De Wachter | March 5, 2013 at 5:21 pm | Permalink

    “Slightly tricky”? I’ve seen more than one standard library that gets these functions wrong. For example, C++Builder has separate implementations for strtol and iostreams, both broken. Well, it seems strtol has been fixed now.

  4. Tennessee C-Veilleux | March 5, 2013 at 6:53 pm | Permalink

    Two questions:
    1- What’s the deadline ?
    2- I am guessing it’s OK for it performance not to be the primary factor. Is that the case ?

  5. regehr | March 5, 2013 at 7:27 pm | Permalink

    Hi Tennessee, I’m going to be traveling starting next Sunday, so let’s say the deadline is evening of Friday March 8 (giving me Saturday to process the results).

    Right, performance is not the primary factor, although I would consider extremely slow submissions to be undesirable.

  6. Andrew Hunter | March 5, 2013 at 7:28 pm | Permalink

    Are you looking for a bulletproof implementation on any weird processor, or just a carefully written one for reasonable some-number-of-bit 2s-complement processors that doesn’t rely on undefined behavior? i.e., would you throw a red flag at something that (say) assumed -LONG_MAX is representable?

  7. regehr | March 5, 2013 at 7:28 pm | Permalink

    Hi Peter, it would be fun to start a list of broken atoi-like functions.

    I took a quick look at the one from eglibc today. I wasn’t convinced that it is correct, but couldn’t conclusively show this without running it. Unfortunately eglibc is a mess of preprocessor crap and I could not easily extract this function for unit testing purposes….

  8. geek42 | March 5, 2013 at 8:42 pm | Permalink

    so if i have a stirng like this ‘9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999’

    should the function return error or just the modded value?

  9. regehr | March 5, 2013 at 9:17 pm | Permalink

    Hi geek42, this should be an error because the large number cannot be represented in a long.

  10. regehr | March 5, 2013 at 9:29 pm | Permalink

    Hi Andrew, the latter is what I have in mind– carefully written code that a modern compiler for real hardware cannot break, even after 5 or 10 years of optimizer improvements. So -LONG_MAX is perfectly acceptable but -LONG_MIN is way wrong!

  11. regehr | March 5, 2013 at 9:32 pm | Permalink

    So far, I have four submissions, all of which are prettier than my code. Two of them execute at least one undefined integer operation and two of them are correct as far as I can tell — but my tester needs more work.

  12. Franz Beaune | March 6, 2013 at 3:52 am | Permalink

    For those of us who don’t have access to a version of Clang that supports -fsanitize, here is a little C++ class that implements a checked long value: http://liveworkspace.org/code/4tL40u$0

    Simply replace long by checked_long in your implementation of str2long().

    Franz

  13. Jeff Epler | March 6, 2013 at 6:10 am | Permalink

    Is there a guide to building clang from svn with -fsanitize-integer enabled that has more hand-holding than “get the svn, run configure, make”? I apparently did something wrong, as clang is wrongly passing the -fsanitize=integer flag on to the assembler:

    $ ./Debug+Asserts/bin/clang -fsanitize=integer -c x.c
    gcc-4.7.real: error: unrecognized command line option ‘-fsanitize=integer’
    clang: error: assembler (via gcc) command failed with exit code 1 (use -v to see invocation)

    all are svn checkouts from yesterday as detailed at http://clang.llvm.org/get_started.html including a checkout of compiler-rt. Systems are debian wheezy amd64 and debian kfreebsd wheezy amd64.

  14. Mike G. | March 6, 2013 at 6:54 am | Permalink

    Minor nit, but shouldn’t the contract be

    /*
    * if input matches ^-?[0-9]+$ and the resulting integer is
    * representable as a long, set error to 0 and return the integer; otherwise if
    * the input is a null-terminated string, set error to 1 and
    * return any value; otherwise behavior is undefined
    */

    As you wrote it, there’s no guarantee that error gets cleared on success, which makes it hard to use after the first failure…

  15. tobi | March 6, 2013 at 7:48 am | Permalink

    Oh god, this is such a great contest. People will fail although they *know* what’s being asked of them and invest great care into achieving it: no undefinedness.

  16. Jeff Epler | March 6, 2013 at 9:59 am | Permalink

    Mike, in my test harness I set error to 0 before each call and check its value after return. This is not unlike the situation with errno, where calls which can set it in case of error leave it unchanged in the non-error case.

  17. kps | March 6, 2013 at 10:25 am | Permalink

    I’m looking forward to Part II: The Slightly Longer Coding Contest, which changes the requirement to *strictly* conforming C11 or C99, and then Part III that requires strictly conforming C89.

  18. David Grayson | March 6, 2013 at 10:58 am | Permalink

    I just submitted my entry. I might improve upon it in the near future though. I am excited for the results!

  19. Mike G. | March 6, 2013 at 12:26 pm | Permalink

    @Jeff, that’s definitely an argument for using an error return code and passing in a pointer to the location for the return value… The other argument is reentrance, of course.

    e.g. (just for discussion – I don’t think the contest rules should change now, of course):

    /*
    * if input matches ^-?[0-9]+$ and the resulting integer is
    * representable as a long, return 0 and set *result to the integer; otherwise if
    * the input is a null-terminated string, return 1 and
    * possibly modify *result; otherwise behavior is undefined. The behavior is also
    undefined unless result is a valid pointer to a writeable value of type long.
    */
    int str2long (int *result, const char *str);

  20. regehr | March 6, 2013 at 12:48 pm | Permalink

    Hi Franz, I do not believe that your checked class works in the case where long is 64 bits!

  21. regehr | March 6, 2013 at 12:51 pm | Permalink

    Hi Jeff, I have never seen the error that you report.

    However, I always build an optimized + asserts version of LLVM, not a debug + asserts. Doesn’t seem like that should make any difference.

    Did you build Compiler-RT in addition to Clang and LLVM? You must build all three for the sanitizer to work.

    Finally, you are providing an explicit path to Clang; it seems possible that you should rather install it, add it to your PATH, and then run it.

  22. regehr | March 6, 2013 at 12:53 pm | Permalink

    Jeff and Mike, I deliberately specified leaving error alone in the success case since this supports performing many conversions and then checking the error flag at the end. I’m not sure there’s a right answer, just preference.

    Regarding the reentrancy issue, I agree — but just wanted to keep things simple here.

  23. regehr | March 6, 2013 at 12:54 pm | Permalink

    Hi kps, that would be brutal. But I think it might be very hard to check code for strict conformance. I don’t even know of any work in that direction.

    For example, strictly conforming code would have to work properly on a machine where LONG_MAX is 2^31-1 and LONG_MIN is -2^666. I don’t believe anything in the C standard prohibits this.

  24. regehr | March 6, 2013 at 12:55 pm | Permalink

    Hi everyone, I have about a dozen submissions that I haven’t processed yet, I will do this on Weds evening, thanks for the code!

  25. Franz Beaune | March 6, 2013 at 1:05 pm | Permalink

    John,

    You’re right, my checked_long class only works when long is 32-bit, which is the case on my platform. Thanks for pointing this out.

    Franz

  26. kps | March 6, 2013 at 1:39 pm | Permalink

    In C99 and C11 I think §6.2.6.2 forces LONG_MIN = -LONG_MAX or -LONG_MAX-1. In C89, I presume it was *intended* from §3.1.2.5 with footnote 13, but they neglected to rule out crazy interpretations of ‘sign bit’.

  27. regehr | March 6, 2013 at 1:56 pm | Permalink

    Hi kps, you’re right. I wish they would go just a bit further and mandate two’s complement.

  28. Andrew J. | March 6, 2013 at 2:46 pm | Permalink

    Can we assume the character encoding is ASCII, or at least that the characters for the digits ‘0’ through ‘9’ are adjacent and sorted?

  29. Tennessee C-Veilleux | March 6, 2013 at 2:50 pm | Permalink

    I got my entry nearly done. Will finalize my testing tonight :) Hope that you can wait until Friday even if you have a lot of submissions.

    One thing that is not clear:

    If I have any of the following:
    000000000
    0000000123
    -00000000000000000
    -000000000000001234

    Are they all valid strings for your function, and should they return “0” on arbitrarily long strings of leading zeros ?

    I would expect: 0, 123, 0, -1234 as the expected for the four cases above, but just want to make sure leading zeros need to be ignored or if they are considered incorrect.

  30. regehr | March 6, 2013 at 3:28 pm | Permalink

    Hi Andrew, perfectly OK to assume either an ASCII encoding or an encoding with the weaker property that you mention.

  31. Magnus | March 6, 2013 at 4:25 pm | Permalink

    22 john:
    I set error = 0 when the parse succeeds, I hope I don’t have to change my entry?

    29 Tennessee C-Veilleux:
    According to the regex all your examples are valid (I hope! as they would be accepted by my entry).

  32. regehr | March 6, 2013 at 4:46 pm | Permalink

    Hi Magus, please update your entry, there is no penalty :)

    Tennessee, leading zeros should be ignored as they do not change whether the subsequent number is representable or not.

  33. Eitan Adler | March 6, 2013 at 9:57 pm | Permalink

    Andrew,

    ISO/IEC 9899:2011 5.2.1p3 guarantees that the digits ‘0’ through ‘9’ are 1 apart and sorted.

  34. regehr | March 6, 2013 at 11:28 pm | Permalink

    Hi all, I processed many entries tonight (Weds) but code kept arriving as I worked and now I need to get ready to teach tomorrow. I should be able to catch up on Thurs.

  35. erik | March 7, 2013 at 4:16 pm | Permalink

    Is it OK for code to overflow as part of overflow checking? Or do you want code to use constants like LONG_MIN and LONG_MAX and avoid any possible overflow?

  36. regehr | March 7, 2013 at 4:23 pm | Permalink

    Hi erik, the key here is to avoid undefined integer operations including any signed overflow, regardless of its purpose. Unsigned overflows are perfectly acceptable.

  37. regehr | March 8, 2013 at 12:49 am | Permalink

    Tonight (Thurs) I almost got through all entries that I’ve received so far, but not quite. I’ll finish up Friday AM.

    So far I have 29 working solutions, some of which are very nice.

  38. reno | March 8, 2013 at 8:09 am | Permalink

    > Jeff and Mike, I deliberately specified leaving error alone in the success case since[cut]

    Err, no, you think you did specify this but in fact you didn’t specify anything about the value of error in case of success!
    A classic case of ‘the spec is in my mind’..

  39. regehr | March 8, 2013 at 8:28 am | Permalink

    Hi reno, leaving the flag alone is implied. If you don’t like that I will have to tighten up the spec substantially, for example by specifying that your code is not to call fork(), open a socket, etc.

  40. Pascal Cuoq | March 8, 2013 at 9:39 am | Permalink

    Someone should invent some sort of language to specify in detail and without ambiguity what a C function is supposed to do.

    :-J

  41. reno | March 8, 2013 at 9:55 am | Permalink

    >Hi reno, leaving the flag alone is implied. If you don’t like that I will have to tighten up the spec substantially, for example by specifying that your code is not to call fork(), open a socket, etc.

    You’re showing your academic background here.
    In theory, you’re right. In practice a developer cannot distinguish a voluntary omission from a specification mistake, so more complete specifications are better (of course specifications can be too detailed too)..

  42. regehr | March 8, 2013 at 9:57 am | Permalink

    Hi Pascal, can you post an ACSL specification for this function? This might make both me and Reno happy!

    Reno, here’s what I read in the man page for fork() on my Linux machine:

    “On success, the PID of the child process is returned in the parent, and 0 is returned
    in the child. On failure, -1 is returned in the parent, no child process is created,
    and errno is set appropriately.”

    Perhaps you should find the author of this man page and suggest that his/her academic background has lead to a failure to specify that errno is unmolested in the non-error case.

  43. Peter De Wachter | March 8, 2013 at 10:55 am | Permalink

    In the case of Linux manpages, errno(3) already says: “Valid error numbers are all nonzero; errno is never set to zero by any system call or library function.”
    So reno should be happy with them.

  44. Pascal Cuoq | March 8, 2013 at 12:05 pm | Permalink

    John, I have a hunch that WordPress comments aren’t going to be any friendlier to ACSL than they are to C. I duplicated it at http://pastebin.com/Ys23KhPV

    The ACSL specification would be something like:

    /*@
    requires valid_string(s);
    behavior good_number:
    assumes \exists integer n; string_is_number(s, n) && LONG_MIN <= n <= LONG_MAX;
    assigns \result;
    ensures \exists integer n; string_is_number{Pre}(s, n) && \result == n;
    behavior bad_number:
    assumes \forall integer n; ! string_is_number(s, n);
    assigns error;
    ensures error==1;
    */
    long str2long (const char * s);

    The predicate valid_string is conveniently predefined :

    /*@ predicate valid_string{L}(char *s) =
    0 <= strlen(s) && \valid(s+(0..strlen(s)));
    */

    (The predicate strlen is itself defined in a way such that it makes sense to use it as it is used here)

    The predicate string_is_number would have to be defined by the specifier. Yes, it would be the same length as (and would like a) prolog implementation of that idea. On the plus side, the predicate could be re-used as-is for the specification of str2int and str2longlong.

    In effect, when verifying an implementation of str2long against such a specification, you would in some sense be comparing an implementation with another. If there is the same bug in both, it might go unnoticed. But I have to point out that, in most programming languages, and especially in C, nothing prevents you to put together a function f() and a function g() that, while each fine in itself, do not work well together:

    void f(int*p) // accepts NULL or a valid pointer to int
    {
    g(p);
    }

    void g(int *p) // expects a valid pointer
    {
    *p = 1;
    }

    By contrast, with ACSL specifications, the user would be warned when trying to put f() and g() together. Similarly, you my get predicate string_is_number wrong and thus manage to prove that a function that should be rejected, but hopefully you will catch that when str2long is used and the wrong definition for string_is_number prevents another property from being proved.

  45. regehr | March 8, 2013 at 12:27 pm | Permalink

    Thanks Pascal!

    Yes, it was the string_is_number() part that I was worried about, there just does not seem to be any easy way out. As you suggest, though, building up a library of such functions will dull the pain when others run across similar problems.

    Think how fun it would be to verify printf() or scanf().

  46. regehr | March 9, 2013 at 12:07 am | Permalink

    Hi everyone,

    Thanks for submitting some very nice codes!

    As of the end of Friday, I’m no longer taking new submissions.

    Currently I have 36 codes that pass all tests cases including the signed integer overflow checker, on both x86 and x86-64.

    I will work on finding the “best” one of these, but the resulting post is probably going to be a bit long, and I may not be able to finish it on Saturday. Failing that, it’ll have to be after I get back to town, probably Thursday the 14th.

    Thanks again!

  47. regehr | March 9, 2013 at 12:08 am | Permalink

    Also: I believe I have mailed everyone who sent me a submission. If you expected to hear from me but didn’t please let me know.

  48. Ryan Prichard | March 9, 2013 at 4:41 am | Permalink

    I think there’s a bug in the ACSL specification — if the string is a number outside the [LONG_MIN, LONG_MAX] range, then str2long is required to set error to 1, but this is not apparently required by the bad_number behavior. I think bad_number’s assumption should be the exact inverse of the good_number assumption.

  49. Pascal Cuoq | March 9, 2013 at 6:29 am | Permalink

    Ryan, you are right. I should have added, as an indication of my intention, …

    complete behaviors;
    disjoint behaviors;

    … meaning that under the precondition “valid_string(s)”, I expected exactly one of the “assumes” clauses on which behaviors “good_number” and “bad_number” are conditioned to hold. The system would have had a chance to warn me that there was an error in my specification then.

  50. Pascal Cuoq | March 9, 2013 at 6:47 am | Permalink

    John, yes, it would be fun to specify scanf() or printf(), and generally a workable subset of the standard library,. Actually the specifications distributed with Frama-C have been improving steadily for the last two or three versions (though they are still mostly partial specifications).

    However, while your str2long() is only too much work for me to enjoy doing, and by no way out of reach—it would be shorter than my C submission, for one thing—variadic functions are not supported at all in ACSL 1.6.

    On this subject, you may remember that I have a particular kind of contempt for type qualifiers such as const and restrict:

    http://blog.frama-c.com/index.php?tag/restrict

    Coincidentally, the StackOverflow question below, posted today, brought to my attention that these qualifiers have the same limitation as ACSL: they cannot be used to specify that sscanf() expects non-overlapping addresses to write to:

    http://stackoverflow.com/q/15308427/139746

  51. David Eisenstat | March 9, 2013 at 3:24 pm | Permalink

    My name links to my would-be entry, which was several hours too late. Probably the part of most interest is the line

    return n != 0 ? -1 – (long)(n – 1) : -(long)n;

    which computes -n for n in [0, -INT_MIN] with wrapping semantics and compiles under mild optimization with recent versions of GCC and Clang to a single instruction.

  52. David Eisenstat | March 9, 2013 at 3:34 pm | Permalink

    Note that n is an unsigned long and that Safari, in its infinite wisdom, has converted two of my minus signs to en dashes.

  53. Robert David Graham | March 11, 2013 at 12:57 pm | Permalink

    Ryan,

    While I’m quite happy with my submission and don’t want to change it, your performance tests skewed toward how the submissions deal with invalid input rather than valid input.

    To showcase this problem (NOT A NEW SUBMISSION) I’ve created a version of my submission that adds two checks. The first check is to early exit when an error occurs. The second skips leadings zeroes — a completely pointless check that hurts performance on normal input, and is only needed when you plan on sending a lot of input with leading zeroes.

    I’ve posted this file as “robert_3.c” at:
    https://gist.github.com/robertdavidgraham/5136591
    Again, this is NOT a new submission. I’m quite happy with my submission, and don’t want to change it. I don’t like redundant error checking, nor do I like the “implicit goto” caused by the return statement (which is why I removed it from my first submission robert.c).

    On my machine, the fastest submission is “yang_2.c”, which includes this error checking. My “robert_2.c” is twice as slow, but “robert_3.c” is only 10% slower. I suspect that if you made the same changes to the other submissions, the majority of them would cluster much closer in your tests.

  54. Magnus | March 11, 2013 at 8:22 pm | Permalink

    Had a lot of fun doing this — unfortunately the latest version of my entry that was committed to github still had the conditional compilation set up to use MSVC’s #define of LLONG_MAX (which I had to use because longs are only 32 bits even with the 64 bit compiler) which ends in “i64″ instead of “LL”, which causes the infinite loop.

    The correct version works fine as far as I know with only a single character difference. I sent that to John and he even replied before the 9th so I can prove it!

  55. Meddler | March 13, 2013 at 4:16 am | Permalink

    Many versions of itoa() have an overflow bug that makes them print non-digits when given INT_MIN. You don’t wanna see “if (n<0) { n = -n; …}"

  56. berdario | March 13, 2013 at 7:08 am | Permalink

    Uh,

    https://github.com/regehr/str2long_contest/blob/master/str2long.h

    so, I could’ve used strlen after all…

    “Don’t call library functions; your code should perform the conversion itself
    It is fine to use constants from limits.h”

    led me to think that anything outside limits.h was to be avoided… shuck, could’ve saved a few lines probably

  57. Thomas Fach-Pedersen | March 14, 2013 at 5:06 am | Permalink

    berdario: No, it’s still against the rules. It’s not like you needed to know the length of the input anyway ;-)

  58. berdario | March 14, 2013 at 1:25 pm | Permalink

    Uh… well, there’re some submissions that make use of such library functions :)