A Quick Coding Contest: Convert String to Integer Without Overflow


UPDATE: As of Saturday March 9, the contest is closed. Results will be posted in a few days. I’ve created a Github repository containing all submissions and my test harness.

Regular readers will know that I’m obsessed with integer overflows. One apparently simple piece of code that many programmers get wrong in this respect is converting strings to integers. It is slightly tricky. The goal of this contest is to submit the nicest (judged by me) implementation in C for this specification:

/*
 * if input matches ^-?[0-9]+\0$ and the resulting integer is
 * representable as a long, return the integer; otherwise if
 * the input is a null-terminated string, set error to 1 and 
 * return any value; otherwise behavior is undefined
 */
extern int error;
long str2long (const char *);

I deliberately didn’t name the function atol or strtol since its semantics are slightly nonstandard. The regular expression above probably isn’t standard either; it is simply trying to specify a null-terminated string containing only a signed decimal integer of arbitrary size. No other characters (even whitespace) are permitted.

A few extra things:

  • Don’t call library functions; your code should perform the conversion itself
  • It is fine to use constants from limits.h
  • You should write conforming C code that works on x86 and x86-64. I’ll probably only test entries using GCC and Clang but even so you are not allowed to use extensions such as GCC’s 128-bit integers
  • You may assume that implementation-defined behaviors are the standard ones for the x86 and x86-64 platforms on Linux machines
  • Your code must not execute undefined behaviors (use a recent Clang to check for this)

Please mail entries to me, not only to avoid spoiling the fun but also because WordPress loves to mangle C code left in the comments. Feel free to leave comments that don’t include code, of course. I’ll announce the winner in a few days, and will send the winner a small prize. By submitting an entry into this contest you are agreeing to let me use your entry in a subsequent blog post, either as a positive or negative example.

NOTE: The best Clang flags to use are: -fsanitize=integer -fno-sanitize=unsigned-integer-overflow

UPDATE: As of Saturday March 9, the contest is closed. Results will be posted in a few days. I’ve created a Github repository containing all submissions and my test harness.


58 responses to “A Quick Coding Contest: Convert String to Integer Without Overflow”

  1. Of course, since it is impossible in nearly all dialects of C to detect that a const char * points to a zero-terminated string, what you actually mean when you write :

    if input matches ^-?[0-9]+$ and the resulting integer is
    representable as a long, return the integer; otherwise set
    error to 1 and return any value

    is:

    if input matches ^-?[0-9]+$ and the resulting integer is
    representable as a long, return the integer; otherwise if
    the input is a zero-terminated string, set
    error to 1 and return any value; otherwise do undefined behavior.

    right?

  2. “Slightly tricky”? I’ve seen more than one standard library that gets these functions wrong. For example, C++Builder has separate implementations for strtol and iostreams, both broken. Well, it seems strtol has been fixed now.

  3. Hi Tennessee, I’m going to be traveling starting next Sunday, so let’s say the deadline is evening of Friday March 8 (giving me Saturday to process the results).

    Right, performance is not the primary factor, although I would consider extremely slow submissions to be undesirable.

  4. Are you looking for a bulletproof implementation on any weird processor, or just a carefully written one for reasonable some-number-of-bit 2s-complement processors that doesn’t rely on undefined behavior? i.e., would you throw a red flag at something that (say) assumed -LONG_MAX is representable?

  5. Hi Peter, it would be fun to start a list of broken atoi-like functions.

    I took a quick look at the one from eglibc today. I wasn’t convinced that it is correct, but couldn’t conclusively show this without running it. Unfortunately eglibc is a mess of preprocessor crap and I could not easily extract this function for unit testing purposes….

  6. so if i have a stirng like this ‘9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999’

    should the function return error or just the modded value?

  7. Hi Andrew, the latter is what I have in mind– carefully written code that a modern compiler for real hardware cannot break, even after 5 or 10 years of optimizer improvements. So -LONG_MAX is perfectly acceptable but -LONG_MIN is way wrong!

  8. So far, I have four submissions, all of which are prettier than my code. Two of them execute at least one undefined integer operation and two of them are correct as far as I can tell — but my tester needs more work.

  9. Is there a guide to building clang from svn with -fsanitize-integer enabled that has more hand-holding than “get the svn, run configure, make”? I apparently did something wrong, as clang is wrongly passing the -fsanitize=integer flag on to the assembler:

    $ ./Debug+Asserts/bin/clang -fsanitize=integer -c x.c
    gcc-4.7.real: error: unrecognized command line option ‘-fsanitize=integer’
    clang: error: assembler (via gcc) command failed with exit code 1 (use -v to see invocation)

    all are svn checkouts from yesterday as detailed at http://clang.llvm.org/get_started.html including a checkout of compiler-rt. Systems are debian wheezy amd64 and debian kfreebsd wheezy amd64.

  10. Minor nit, but shouldn’t the contract be

    /*
    * if input matches ^-?[0-9]+$ and the resulting integer is
    * representable as a long, set error to 0 and return the integer; otherwise if
    * the input is a null-terminated string, set error to 1 and
    * return any value; otherwise behavior is undefined
    */

    As you wrote it, there’s no guarantee that error gets cleared on success, which makes it hard to use after the first failure…

  11. Oh god, this is such a great contest. People will fail although they *know* what’s being asked of them and invest great care into achieving it: no undefinedness.

  12. Mike, in my test harness I set error to 0 before each call and check its value after return. This is not unlike the situation with errno, where calls which can set it in case of error leave it unchanged in the non-error case.

  13. I’m looking forward to Part II: The Slightly Longer Coding Contest, which changes the requirement to *strictly* conforming C11 or C99, and then Part III that requires strictly conforming C89.

  14. @Jeff, that’s definitely an argument for using an error return code and passing in a pointer to the location for the return value… The other argument is reentrance, of course.

    e.g. (just for discussion – I don’t think the contest rules should change now, of course):

    /*
    * if input matches ^-?[0-9]+$ and the resulting integer is
    * representable as a long, return 0 and set *result to the integer; otherwise if
    * the input is a null-terminated string, return 1 and
    * possibly modify *result; otherwise behavior is undefined. The behavior is also
    undefined unless result is a valid pointer to a writeable value of type long.
    */
    int str2long (int *result, const char *str);

  15. Hi Jeff, I have never seen the error that you report.

    However, I always build an optimized + asserts version of LLVM, not a debug + asserts. Doesn’t seem like that should make any difference.

    Did you build Compiler-RT in addition to Clang and LLVM? You must build all three for the sanitizer to work.

    Finally, you are providing an explicit path to Clang; it seems possible that you should rather install it, add it to your PATH, and then run it.

  16. Jeff and Mike, I deliberately specified leaving error alone in the success case since this supports performing many conversions and then checking the error flag at the end. I’m not sure there’s a right answer, just preference.

    Regarding the reentrancy issue, I agree — but just wanted to keep things simple here.

  17. Hi kps, that would be brutal. But I think it might be very hard to check code for strict conformance. I don’t even know of any work in that direction.

    For example, strictly conforming code would have to work properly on a machine where LONG_MAX is 2^31-1 and LONG_MIN is -2^666. I don’t believe anything in the C standard prohibits this.

  18. Hi everyone, I have about a dozen submissions that I haven’t processed yet, I will do this on Weds evening, thanks for the code!

  19. John,

    You’re right, my checked_long class only works when long is 32-bit, which is the case on my platform. Thanks for pointing this out.

    Franz

  20. In C99 and C11 I think §6.2.6.2 forces LONG_MIN = -LONG_MAX or -LONG_MAX-1. In C89, I presume it was *intended* from §3.1.2.5 with footnote 13, but they neglected to rule out crazy interpretations of ‘sign bit’.

  21. Can we assume the character encoding is ASCII, or at least that the characters for the digits ‘0’ through ‘9’ are adjacent and sorted?

  22. I got my entry nearly done. Will finalize my testing tonight 🙂 Hope that you can wait until Friday even if you have a lot of submissions.

    One thing that is not clear:

    If I have any of the following:
    000000000
    0000000123
    -00000000000000000
    -000000000000001234

    Are they all valid strings for your function, and should they return “0” on arbitrarily long strings of leading zeros ?

    I would expect: 0, 123, 0, -1234 as the expected for the four cases above, but just want to make sure leading zeros need to be ignored or if they are considered incorrect.

  23. Hi Andrew, perfectly OK to assume either an ASCII encoding or an encoding with the weaker property that you mention.

  24. 22 john:
    I set error = 0 when the parse succeeds, I hope I don’t have to change my entry?

    29 Tennessee C-Veilleux:
    According to the regex all your examples are valid (I hope! as they would be accepted by my entry).

  25. Hi Magus, please update your entry, there is no penalty 🙂

    Tennessee, leading zeros should be ignored as they do not change whether the subsequent number is representable or not.

  26. Andrew,

    ISO/IEC 9899:2011 5.2.1p3 guarantees that the digits ‘0’ through ‘9’ are 1 apart and sorted.

  27. Hi all, I processed many entries tonight (Weds) but code kept arriving as I worked and now I need to get ready to teach tomorrow. I should be able to catch up on Thurs.

  28. Is it OK for code to overflow as part of overflow checking? Or do you want code to use constants like LONG_MIN and LONG_MAX and avoid any possible overflow?

  29. Hi erik, the key here is to avoid undefined integer operations including any signed overflow, regardless of its purpose. Unsigned overflows are perfectly acceptable.

  30. Tonight (Thurs) I almost got through all entries that I’ve received so far, but not quite. I’ll finish up Friday AM.

    So far I have 29 working solutions, some of which are very nice.

  31. > Jeff and Mike, I deliberately specified leaving error alone in the success case since[cut]

    Err, no, you think you did specify this but in fact you didn’t specify anything about the value of error in case of success!
    A classic case of ‘the spec is in my mind’..

  32. Hi reno, leaving the flag alone is implied. If you don’t like that I will have to tighten up the spec substantially, for example by specifying that your code is not to call fork(), open a socket, etc.

  33. >Hi reno, leaving the flag alone is implied. If you don’t like that I will have to tighten up the spec substantially, for example by specifying that your code is not to call fork(), open a socket, etc.

    You’re showing your academic background here.
    In theory, you’re right. In practice a developer cannot distinguish a voluntary omission from a specification mistake, so more complete specifications are better (of course specifications can be too detailed too)..

  34. Hi Pascal, can you post an ACSL specification for this function? This might make both me and Reno happy!

    Reno, here’s what I read in the man page for fork() on my Linux machine:

    “On success, the PID of the child process is returned in the parent, and 0 is returned
    in the child. On failure, -1 is returned in the parent, no child process is created,
    and errno is set appropriately.”

    Perhaps you should find the author of this man page and suggest that his/her academic background has lead to a failure to specify that errno is unmolested in the non-error case.

  35. In the case of Linux manpages, errno(3) already says: “Valid error numbers are all nonzero; errno is never set to zero by any system call or library function.”
    So reno should be happy with them.

  36. John, I have a hunch that WordPress comments aren’t going to be any friendlier to ACSL than they are to C. I duplicated it at http://pastebin.com/Ys23KhPV

    The ACSL specification would be something like:

    /*@
    requires valid_string(s);
    behavior good_number:
    assumes \exists integer n; string_is_number(s, n) && LONG_MIN <= n <= LONG_MAX;
    assigns \result;
    ensures \exists integer n; string_is_number{Pre}(s, n) && \result == n;
    behavior bad_number:
    assumes \forall integer n; ! string_is_number(s, n);
    assigns error;
    ensures error==1;
    */
    long str2long (const char * s);

    The predicate valid_string is conveniently predefined :

    /*@ predicate valid_string{L}(char *s) =
    0 <= strlen(s) && \valid(s+(0..strlen(s)));
    */

    (The predicate strlen is itself defined in a way such that it makes sense to use it as it is used here)

    The predicate string_is_number would have to be defined by the specifier. Yes, it would be the same length as (and would like a) prolog implementation of that idea. On the plus side, the predicate could be re-used as-is for the specification of str2int and str2longlong.

    In effect, when verifying an implementation of str2long against such a specification, you would in some sense be comparing an implementation with another. If there is the same bug in both, it might go unnoticed. But I have to point out that, in most programming languages, and especially in C, nothing prevents you to put together a function f() and a function g() that, while each fine in itself, do not work well together:

    void f(int*p) // accepts NULL or a valid pointer to int
    {
    g(p);
    }

    void g(int *p) // expects a valid pointer
    {
    *p = 1;
    }

    By contrast, with ACSL specifications, the user would be warned when trying to put f() and g() together. Similarly, you my get predicate string_is_number wrong and thus manage to prove that a function that should be rejected, but hopefully you will catch that when str2long is used and the wrong definition for string_is_number prevents another property from being proved.

  37. Thanks Pascal!

    Yes, it was the string_is_number() part that I was worried about, there just does not seem to be any easy way out. As you suggest, though, building up a library of such functions will dull the pain when others run across similar problems.

    Think how fun it would be to verify printf() or scanf().

  38. Hi everyone,

    Thanks for submitting some very nice codes!

    As of the end of Friday, I’m no longer taking new submissions.

    Currently I have 36 codes that pass all tests cases including the signed integer overflow checker, on both x86 and x86-64.

    I will work on finding the “best” one of these, but the resulting post is probably going to be a bit long, and I may not be able to finish it on Saturday. Failing that, it’ll have to be after I get back to town, probably Thursday the 14th.

    Thanks again!

  39. Also: I believe I have mailed everyone who sent me a submission. If you expected to hear from me but didn’t please let me know.

  40. I think there’s a bug in the ACSL specification — if the string is a number outside the [LONG_MIN, LONG_MAX] range, then str2long is required to set error to 1, but this is not apparently required by the bad_number behavior. I think bad_number’s assumption should be the exact inverse of the good_number assumption.

  41. Ryan, you are right. I should have added, as an indication of my intention, …

    complete behaviors;
    disjoint behaviors;

    … meaning that under the precondition “valid_string(s)”, I expected exactly one of the “assumes” clauses on which behaviors “good_number” and “bad_number” are conditioned to hold. The system would have had a chance to warn me that there was an error in my specification then.

  42. John, yes, it would be fun to specify scanf() or printf(), and generally a workable subset of the standard library,. Actually the specifications distributed with Frama-C have been improving steadily for the last two or three versions (though they are still mostly partial specifications).

    However, while your str2long() is only too much work for me to enjoy doing, and by no way out of reach—it would be shorter than my C submission, for one thing—variadic functions are not supported at all in ACSL 1.6.

    On this subject, you may remember that I have a particular kind of contempt for type qualifiers such as const and restrict:

    http://blog.frama-c.com/index.php?tag/restrict

    Coincidentally, the StackOverflow question below, posted today, brought to my attention that these qualifiers have the same limitation as ACSL: they cannot be used to specify that sscanf() expects non-overlapping addresses to write to:

    http://stackoverflow.com/q/15308427/139746