A Variable Argument Hazard

Variable argument functions in C and C++ can be tricky to use correctly, and they typically only get compiler-based type checking for special cases such as printf(). Recently I ran across an entertaining variable argument bug using tis-interpreter.

Although I will use code from SQLite to illustrate this bug, this bug has not been found in SQLite itself. Rather, it was in some code used to test that database.

SQLite’s API has a configuration function that takes variable arguments:

int sqlite3_config(int op, ...);

Several of its configuration commands, such as this one, expect pointer-typed arguments:

case SQLITE_CONFIG_LOG: {
  typedef void(*LOGFUNC_t)(void*,int,const char*);
  sqlite3GlobalConfig.xLog = va_arg(ap, LOGFUNC_t);
  sqlite3GlobalConfig.pLogArg = va_arg(ap, void*);
  break;
}

We’re only allowed to execute this code when the first two variable arguments are respectively compatible with LOGFUNC_t and void *, otherwise the behavior is undefined.

The test code contains calls that look like this:

sqlite3_config(SQLITE_CONFIG_LOG, 0, pLog);

At first glance there’s no problem — normally it is fine to pass a 0 to a function that expects a pointer argument: the 0 is turned into an appropriately-typed null pointer. On the other hand, when a function takes variable arguments (or when it lacks a prototype) the compiler cannot do this kind of caller-side type conversion and instead a different set of rules, the “default argument promotions” are applied, resulting in a zero-valued int being passed to sqlite3_config() in the first variable argument position. Another fun aspect of the default argument promotions is that they force all float arguments to be promoted to double. These rules seem to be the same in C++ though they would fire less often due to C++’s stricter rules about function prototypes.

This problem was not noticed because the code happens to be compiled benignly on both x86 and x64 platforms. On x64 Linux, pointers and ints are not the same size, but the calling convention says that the first six integer and pointer arguments are passed in 64-bit registers — so the problem gets covered up by an implicit promotion to 64 bits. The ABI does not guarantee that the higher bits are zeroed in this case, but in this example they happen to be cleared by GCC and LLVM.

On x86 Linux, all arguments are passed in memory but since pointers and integers are the same size, the problem is again covered up. An ABI that represents a null pointer differently than a zero-valued long integer would expose this bug right away, as would an x64 program that passes at least six variable arguments. In any case, the fix is easy:

sqlite3_config(SQLITE_CONFIG_LOG, (LOGFUNC_t)0, pLog);

In summary, be careful with the default argument promotions that are done when calling variable argument functions.

Thanks to Pascal Cuoq who provided comments on this piece.

Also see this article by Rich Felker and this one by Jens Gustedt.

March 9, 2016

regehr

Compilers, Computer Science, Software Correctness

15 responses to “A Variable Argument Hazard”

Richard Kettlewell says:

March 9, 2016 at 4:37 am

It’s worse than that! A naive programmer may be tempted to use the macro ‘NULL’ in this context, and believe that they’ve solved the problem. But that’s no good, since NULL is defined as ‘a implementation-defined null pointer constant’, which doesn’t actually have to be a null pointer (of any type) – it may be the integer constant 0.

Here’s a rather depressing instance of someone going through a program and *putting this bug in*:

https://anonscm.debian.org/cgit/dpkg/dpkg.git/commit/?id=4e5846ccd3dcc33504aba8ef35a8962bccfd562e
Jeff Walden says:

March 9, 2016 at 4:49 am

Mozilla had this as a security bug, once. Some code was attempting to use varargs with a null constant as terminating the arguments list, and at the time I believe our “null”-alike was int-sized. Can’t remember if this was a switch-to-nullptr issue or a 64-bit porting issue. I do distinctly remember the issue lying in SVG code, tho. 🙂

Incidentally, C++11 does have specific language for the size of nullptr, which might seem not-immediately-necessary until you remember pointers can have different sizes, and if nullptr inhabits its own type it clearly can’t be the size of all of them.
regehr says:

March 9, 2016 at 6:11 am

Richard, great example. Thanks Jeff! I’ve only seen a few platforms where different kinds of pointers have different sizes.
Daniel Stenberg says:

March 9, 2016 at 6:25 am

I fully acknowledge the problems vargs with numericals. We have them all the time in libcurl where we were stupid enough to design an API (many years ago) where the numerical is supposed to be a “long” and you set them in a vararg function: curl_easy_setopt(…, 14); which has the unfortunate downside that people need to spell out that it is a long, like in: curl_easy_setopt(…, 14L) (or a plain typecast) … or things go sideways really fast on platforms where long and int differ in size… And that’s just not how we want to pass in numbers.
Jed Davis says:

March 9, 2016 at 10:57 am

Depending on the x86_64 zero-extension can be dangerous. I ran into a bug that showed up in ASan builds, but not in non-ASan builds with the same Clang/LLVM, not because of anything about ASan itself, but because they added -fno-omit-frame-pointer. Which added more register pressure, which caused LLVM to move some spills inside a conditional, meaning *less* spilling on the common path, meaning a register with extra stuff in the high bits was passed as-is to buggy assembly instead of being zero-extended. (The assembly code had been buggy for 7 years without anyone noticing. There might be something to be said for a compiler plugin to randomize or otherwise poison ABI-undefined bits like that.) For extra fun, those high bits weren’t the result of anything obvious in the source, like a narrowing cast being optimized out: it was two 32-bit struct field accesses optimized into one 64-bit load.

Longer version: https://bugzilla.mozilla.org/show_bug.cgi?id=1215681#c17
Reid Kleckner says:

March 9, 2016 at 11:25 am

MSVC actually zero extends all literal ‘0’ arguments to varargs functions to fix this glitch. NULL is also defined to ‘0’, so even portable C code needs this fix.
We had to follow suit in clang to avoid miscompiling such code. Richard insisted we limit this to MSVC mode, but I’m not sure I agree with that.
Colin Watson says:

March 9, 2016 at 1:16 pm

Reminds me of https://ghc.haskell.org/trac/ghc/ticket/8965, which is by far the most time-consuming bug I’ve ever tracked down in something I was only arguably supposed to be working on in the first place.
Jeff Walden says:

March 9, 2016 at 3:50 pm

I tracked down the original Mozilla bug I mentioned earlier — it’s bug 547964, “Potentially dangerous word-size error in definition of ‘nsnull’”. The problem was a 64-bit issue, but it arose in the context of our possibly switching from #define nsnull 0 to NULL, a switch we never actually did. (Instead we did some sketchy #define nullptr ... stuff guarded by autoconf in new-enough compilers, with new-enough compile options.)

I’m compelled to note that you’ve absolutely “seen a few platforms where different kinds of pointers have different sizes”. Function pointers and pointers to data, certainly, are almost always the same size. But pointer-to-virtual member (or throw in multiple inheritance, or virtual inheritance, or other exotic characteristics) almost never that same size on any compiler, for any platform or architecture, even modern, super-common ones.
Andrew Certain says:

March 10, 2016 at 9:10 am

Richard, can you give me an example of code and a compiler that would actually trigger this bug using NULL? I understand that it could happen, but is there a current tool chain that behaves in this way?

Thanks.

Andrew
Richard Kettlewell says:

March 10, 2016 at 12:51 pm

Andrew, the platforms I can conveniently check almost all define NULL as (void *)0 or 0L, meaning the problem doesn’t arise in practice. The only counterexample I can find is an ancient AIX system, but the compiler stores a 64-bit 0 in that case, dodging the bullet.

This isn’t an exhaustive survey. My guess is that C library implementors have deliberately arranged for an uncasted NULL to yield a pointer type, or a 64-bit type, as a pragmatic measure against this class of issue.

(Empirically, clang-3.8 Linux does have the behavior John warns of: terminating an execl() argument list, long enough to get out of the register parameters, with a bare ‘0’ results in a 32-bit store. But NULL is ‘((void *)0)’ in Glibc.)
Michael Pyne says:

March 10, 2016 at 7:24 pm

We saw this years ago in KDE as well, when integrating media players into gstreamer’s C interface:

https://bugs.kde.org/show_bug.cgi?id=90869 (Amarok) and
http://marc.info/?l=kde-commits&m=113944788802256&w=2 (fixing a similar bug in the then-default media player).
regehr says:

March 11, 2016 at 1:47 am

Thanks for the stories / links, folks! I hadn’t known that this kind of error happened more broadly.

It sounds like a varargs sanitizer mode for Clang or GCC should be created.
Zack says:

March 11, 2016 at 5:22 am

Functions like execl exhibit undefined behavior if NULL is used to terminate the argument list, _even if_ NULL is defined as ((void*)0) or a pointer-sized integer zero. Only (char*)0 (or (char*)NULL, I suppose) satisfies the letter of the standard.

This is because the standard allows ABIs where the calling convention depends on the _exact type_ of the arguments, not just their sizes. Hypothetically speaking, void* and char* pointers might be passed in different register classes, or their NULLs might have different bit patterns.
Andrew McCreight says:

March 11, 2016 at 10:09 am

A more recent example of this in Firefox is this:
https://bugzilla.mozilla.org/show_bug.cgi?id=1194562
Some code for making a string for an error message was using var args, and somebody passed in the wrong kind of string, causing type confusion. I later fixed this issue by converting the function to use variadic templates, allowing compile-time checking of the number of arguments:
https://bugzilla.mozilla.org/show_bug.cgi?id=1195977
Richard Kettlewell says:

March 11, 2016 at 2:14 pm

Zack: the rule I see in C99 7.15.1.1 requires compatible types, not identical types, and explicitly permits void * and char * to be intermixed (among other things). Is there some other bit I’ve missed that restricts it further?