What’s the difference between an integer and a pointer?

(This piece is an alternate introduction and advertisement for a soon-to-be-published research paper.)

In an assembly language we typically don’t have to worry very much about the distinction between pointers and integers. Some instructions happen to generate addresses whereas others behave arithmetically, but underneath there’s a single data type: bitvectors. At the opposite end of the PL spectrum, a high-level language won’t offer opportunities for pointer/integer confusion because the abstractions are completely firewalled off from each other. Also, of course, a high-level language may choose not to expose anything that resembles a pointer.

The problematic case, then, is the performance-oriented systems programming language: C, C++, Rust, and a few others. The issue is that:

  1. When possible, they pretend to be high-level, in order to justify optimizations that would be unsound at the assembly level. For example, they assume that a pointer derived from one heap allocation cannot alias a pointer derived from a different heap allocation.
  2. When necessary, they allow the programmer to manipulate pointers in low-level ways, such as inspecting the representation of a pointer, creating a new pointer at some offset from an existing one, storing unrelated data in unused parts of a pointer, and even fabricating a pointer out of whole cloth.

These goals conflict: the better a language is for justifying high-level optimizations, the worse it seems to be for implementing an OS kernel, and vice versa. There are a few ways we might deal with this:

  • We could require the high-level language to call out to low-level code written in a different language, as Java does with its JNI.
  • We could create a special low-level mode for a safe language where it can break the normal rules, as Rust does with its unsafe code.
  • We could go ahead and freely mix high-level and low-level code, hoping that the compiler can sort it all out, as C and C++ do.

Let’s look at an example:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi + diff);
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);
}

When run (on my Mac) it gives:

$ clang-6.0 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7fcb0ec00300, *x = 7
y = 0x7fcb0ec00360, *y = 5
p = 0x7fcb0ec00300, *p = 7
$ gcc-8 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7f93b6400300, *x = 7
y = 0x7f93b6400360, *y = 5
p = 0x7f93b6400300, *p = 7

What happened here? First, we grabbed a couple of heap cells and initialized their contents. Second, we used integer math to fabricate a pointer p that has the same value as x, and stored through that pointer. Third, we asked the compiler what contents it thinks are in the locations pointed to by x, y, and p. Both GCC and Clang/LLVM are of the opinion that the store through p changed the value referred to by x. This is probably the result that we had hoped for, but let’s look into it a bit more deeply.

First, does this C program execute undefined behavior? It does not! Computing out-of-bounds pointers is UB but we are free to do whatever we like with integer-typed values that were derived from pointers. Second, does the C standard mandate the behavior that we saw here? It does not! In fact it gives only very sparse guidance about the behavior of integers derived from pointers and vice versa. You can read about that here.

Now let’s modify the program a tiny bit:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi - 96); // <--- CHANGED!
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);
}

The only difference is at the line marked CHANGED: instead of adding diff to yi to compute p, we added the observed value of the difference, -96, as a hard-coded constant. Now we’ll run it:

$ clang-6.0 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7ff21ac00300, *x = 7
y = 0x7ff21ac00360, *y = 5
p = 0x7ff21ac00300, *p = 7
$ gcc-8 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7fce5e400300, *x = 3
y = 0x7fce5e400360, *y = 5
p = 0x7fce5e400300, *p = 7

Clang behaves the same as before, but GCC no longer believes that storing through p results in x being overwritten, despite the fact that x and p have the same value. What is going on is that as far as GCC is concerned, since we computed the address of x through guesswork instead of through an actual observation, our guess is effectively illegitimate. (If this guessing game is too simple for you, there are entertaining variations that avoid the need for information from a previous run of the program, such as guessing addresses using an exhaustive loop or making huge allocations so that the resulting addresses are constrained by the pigeonhole principle.) Here again we are operating well outside of the C standard and are rather in a zone where compiler writers need to make decisions that balance optimization power against developers’ expectations. I like this example because it is weird and perhaps surprising.

What should we take away from this? First, it is a big mistake to try to understand pointers in a programming language as if they follow the same rules as pointer-sized integer values. Even people who have come to terms with UB and its consequences are often surprised by this. Second, these issues are not specific to C and C++, but rather are going to create problems in any language that wants highly optimizing compilers but also permits low-level pointer manipulation.

So what are the actual rules that Clang and GCC are following when compiling these examples? The short answer is that it’s complicated and not well documented. Some useful discussion can be found in this piece about pointer provenance. For a longer answer that is specific to LLVM, see this paper that will get presented at OOPSLA in November. Also, one of my coauthors wrote a blog post about this material.


23 responses to “What’s the difference between an integer and a pointer?”

  1. The Standard does not require that implementations define intptr_t nor uintptr_t. It also does not require that implementations define integer-to-pointer conversions in any meaningful fashion unless they define those types. Conforming implementations are certainly allowed to document some integer-to-pointer conversions as yielding trap representations, and nothing would prohibit an implementation that doesn’t define intptr_t nor uintptr_t from documenting them all as doing so, nor from saying that no pointers that aren’t trap representations can be represented with any integer type. On such an implementation, the issues you mention simply would not exist.

    Consequently, since it would be rare for programs to convert pointers to integer type except when they needed to do “weird” things with them, I would think the simplest approach would be to specify that a conforming implementation may define types intptr_t and uintptr_t only if it treats the result of integer-to-pointer conversions as pointers of “officially unknowable” provenance at the time of derivation, and allows them to be used to access the storage associated with any object until the next time that either:

    1. The storage is accessed in conflicting fashion via a pointer not derived from that cast integer.

    2. Some other pointer is derived, not from that cast integer, which will be used to either access the storage in conflicting fashion or derive another pointer that will be used to do so, etc.

    3. Execution enters a function or reaches the start of a loop iteration where one of the above occurs.

    Implementations which would be incapable of supporting such semantics for the uintptr_t and intptr_t types would be free to refrain from defining the types. If there is significant need for such types in code that doesn’t need any of the semantics, implementations that can’t support the semantics could define the types if user code defines a macro “__STDC_WAIVE_UINTPTR_GUARANTEES” before including a header that would define them.

  2. Hi Jon,

    Maybe the Ada way was right: a pointer is a special ‘access’ type that allows (nullable or non-nullable) references to another type. Pointer arithmetic doesn’t exist, and address arithmetic (with the specific System.Address type and its operators) is frowned upon. This allows for safer optimizations. I think the Spark2014 people are on their way to adding pointer safety (a la Rust) to Spark and Ada2020. It’s easier to do that in Ada since pointers are already a specific typed abstraction. Let me know if you’re interested, or just contact Yannick Moy from AdaCore!

    John, at a glance I can’t tell if that would meet all the requirements; this stuff is very hard to think about. Did you read our full paper on this topic?

    Lionel, I agree that putting more structure into the pointer system is a very good idea, C and C++ are rather extreme design points.

    I’ve read the piece on pointer provenance, and the authors seem to think that compilers should be allowed to track provenance through integers. While there might conceivably be some situations where it’s necessary to convert pointers through integer types without doing anything “interesting” with them, I think they’d be sufficiently rare in performance-critical code that having such casts serve as a localized optimization barrier would seldom cost much, if anything. Further, if for some reason it’s necessary to allow optimizations across pointer-uintptr_t-pointer conversions, such need could be accommodated by invoking the compiler with `-D__STDC_WAIVE_UINTPTR_GUARANTEES` or equivalent.

    Adding the requirements I described would mean that some compilers, with their present optimizers, would not be able to define “uintptr_t” for programs that didn’t define __STDC_WAIVE_UINTPTR_GUARANTEES. If the goal of the Standard is to avoid suggesting that compilers should do anything they don’t already feel like doing, those requirements wouldn’t fit such a goal. If, however, the goal is to minimize the need for programmers to use behaviors not described by the Standard, there needs to be a way of formally erasing pointers’ provenance, and using pointer-int-pointer conversions for that purpose seems the most logical approach.

  5. Recently I stumbled on this GCC bug, and found the discussion within alarming:


    One of the developers is arguing that provenance does in fact transfer to integers (see comment #24), which would make your example here unsound. (For what it’s worth, I personally think that’s completely wrong; there’s no way to justify it by the current language standard, and your example here is fine.)

  6. Under the present Standard, a conforming compiler could at its leisure treat as UB almost all forms of code that would convert a pointer to a uintptr_t, convert that back to a pointer, and use it to access storage. For a typical address, there will exist another address, which I’ll call the “dual” of the first, such that the two addresses compare equal, but can only be used to access disjoint regions of storage. The Standard guarantees that a round-trip pointer-uintptr_t-pointer conversion will yield a pointer that compares equal to the original, but in cases where two possible addresses would meet that criterion the Standard would allow a conforming but capricious implementation to select whichever would yield UB if code makes any attempt to dereference the resulting pointer.

    Consider “int arr[2][1],*p=arr[0]+1,*q=arr[1], *p1 = (int*)(uintptr_t)p, *q1 = (int*)(uintptr_t)q;”. The addresses held by “p” and “q” are duals of each other, since they compare equal, but a compiler would be entitled to regard arr[1][0] as unreachable from p, and arr[0][1] as unreachable from q. Because p and q compare equal, the Standard doesn’t specify whether (int*)(uintptr_t)p would yield p or q, and likewise (int*)(uintptr_t)q, and there would be no way for code to use either pointer to access any storage without invoking UB.

    The only “benefit” I can see to having the Standard’s wording be so loose is that almost anything stronger would make uintptr_t unsupportable for compilers whose back end is incapable of handling pointer laundering. Rather than having the Standard bend over backward to avoid suggesting that compilers be capable of recognizing actions that would allow pointer to be used in “interesting” ways, at least within certain windows of opportunity, I would suggest that it would be much more helpful to recognize a category of implementations that simply cannot support low-level programming, and a separate category of implementations that can. Since freestanding implementations which can’t support low-level programming are pretty much useless [the Standard doesn’t define any means via which such programs could do anything], perhaps the Standard could expand the present two categories into three:

    – High-level-only Hosted implementations could omit support for pointer-related constructs that are hard to support, since the Standard Library would allow many programs to perform their required task without need for low-level constructs. Such implementations probably shouldn’t define intptr_t and uintptr_t.

    – Low-level-implementations could omit Standard Library provided they support the kinds of low-level programming techniques that embedded-software programs and operating systems use to perform I/O and memory management without the Standard Library, including some which are not presently mandated by the Standard, but are consistently handled by quality implementations designed to be suitable for low-level programming.

    – Full-feature implementations would be those that meet all the requirements for High-level-only and Low-level implementations.

    Splitting things in this fashion would expand the range of optimizations that could be performed legitimately and non-controversially by high-level-only implementations, while increasing the range of tasks that low-level implementations could perform without relying upon any behaviors not fully specified by the Standard, and could also improve the efficiency with which compilers could process low-level code.

  7. Looking at the paper [don’t know how I missed the link before], I’m not quite sure why wildcard provenances would need to pose a major problem if one recognizes that the purpose of provenance is to establish sequencing relationships, and treats the act of creating a pointer or reference as imposing sequencing restrictions in the context where it is created, lesser sequencing restrictions in calling contexts, and none among operations that occur entirely within individual nested contexts (loops or function calls).

    If a loop in function F calls functions A, B, C, D, and E, and function C does something “tricky”, then everything that occurs before a call to C would be sequenced either before the call, or before the first tricky operation (compiler’s choice), and everything that occurs after the call to C would be sequenced either after the call or after the last tricky operation (again, compiler’s choice), but the call to C would not affect the reordering or interleaving of operations within the union of A and B, and would not affect reordering of operations within D, nor those within E. The act of calling D or E could be viewed as tricky, depending upon whether it makes use of any pointer that C might have created in tricky fashion.

    To be sure, this model would end up unnecessarily blocking some potentially-useful optimizations, but a loop like the one in function F generally wouldn’t be a particularly hot one. Foregoing optimizations in F while retaining the ability to optimize A, B, D, and E would be much less costly than having to forego all optimizations entirely in order to obtain the required semantics.
    It might be good to migrate toward a behavior model with a means of officially indicating when code is done using an artificially-created pointer, as well as a means of inviting an implementation to assume that all such transitions are marked. The practice of artificially creating pointers without including all such markings (including the “everything marked” indication) could be deprecated, but properly deprecating a construct requires recognizing its present legitimacy and providing an alternative–neither of which the C or C++ Standards have yet done.

  8. One difficulty I’ve had while reading the paper is that I’m not quite clear whether your objective is to make an implementation suitable for a wide range of low-level programming tasks without having to completely disable optimizations, or whether the goal is merely to do the minimum necessary to conform to the Standard. I’m not quite clear why “pointer guessing” should be regarded as an issue in either case.

    Some execution environments may give information about pointers beyond what the C Standard would require of implementations, and beyond anything the compiler would have any way of knowing [e.g. because the programmer designed the environment in question]. Some tasks may be impractical without making use of such information. An implementation which does not allow a programmer to exploit such information in ways the compiler doesn’t understand may be a great implementation for some purposes, but would be unsuitable for the aforementioned tasks. An implementation intended to be suitable for a wide range of tasks in a variety of environments should thus allow for the possibility that a programmer may “guess” at pointer values in ways that are defined by the environment, but which the implementation could not particularly anticipate. In general, the consequences of this on optimization should be no worse than those of calling “whateverType *unknownFunction(uint64_t argument)”, which is already a required ability for many tasks; treating int-to-pointer conversions as a call to such a function would avoid the need to have implementations care whether a programmer might “guess” at a pointer.

    On the flip side, if the goal is to do merely what the Standard requires, I don’t see why special consideration need be given to “guessed” pointers, since a conforming implementation that wasn’t intended to usefully support low-level programming could already treat most uses of pointers formed from integers as UB. In your “Appendix A”, for example, I’m not sure why you have two “if (n)” blocks, but the only thing the Standard says about “(int*)pi” is that it must yield a pointer that compares equal to whatever pointer was used to form “pi”. It does not require that the expression yield a pointer that is actually usable for any purpose other than equality comparison. If an implementation happens to yield useful pointers except in cases where programmers “guess”, and yields useless pointers in those cases, such behavior would be well in line with what the Standard would allow.

  9. Hello John Payson,

    The benefit of defining language semantics in a formal way is that it helps check the correctness of optimizations in a way that everyone agrees on.

    For example, hoisting a pointer-to-integer cast from a loop:

    // before:
    for (..) {
      if (pass_this_iter()) continue;
      intptr_t x = (intptr_t)p; // p is never changed in this loop
    }

    // after hoisting:
    intptr_t x = (intptr_t)p;
    for (..) {
      if (pass_this_iter()) continue;
    }
    This optimization seems legal, and would be important for performance because it can reduce the number of operations (LLVM’s `ptrtoint` can drop high bits, so hoisting avoids repeating that work). If a pointer-to-integer cast raises UB in certain cases, however, we cannot show the correctness of this optimization.

    Let’s assume that, for certain user input, pass_this_iter() always returns true. Before this transformation the pointer-to-integer cast never happens, but after it the cast happens at least once. This means that a compiler may transform a valid program into a program with UB.

    In order to reactivate this optimization, the programmer should somehow ‘guarantee’ that pass_this_iter() returns false at least once, so that the cast is reached (something like #pragma will_return_false?). Writing such guarantees for all existing function calls would be a daunting job, however.

    That is why a pointer-to-integer cast in our memory model always returns a valid integer. Constraining reordering of pointer-to-integer casts (and integer-to-pointer casts as well) would unavoidably block optimization, which is undesirable.

    In our experience, if correctness of an optimization cannot be formally verified, there always exists a corner case where miscompilation happens. That’s how previous work (https://blog.regehr.org/archives/1496 ) found compiler bugs as well.

    > Consider “int arr[2][1],*p=arr[0]+1,*q=arr[1], *p1 = (int*)(uintptr_t)p, *q1 = (int*)(uintptr_t)q;”. The addresses held by “p” and “q” are duals of each other, since they compare equal, …

    In this case, int arr[2][1] actually allocates a single object, not 2 objects. arr[0] and arr[1] point into the same object, so they are not duals; they are equal. With provenance info, the values of arr[1] and arr[0] will both be something like (l, 0x4422), where l is the provenance.

    > In your “Appendix A”, for example, I’m not sure why you have two “if (n)” blocks, …

    The execution of the first if(n) helps guarantee `pi == yi`.

  10. The reason for the incorrect output is that the arguments to printf were precomputed and “optimized” (to avoid pipeline hazards, I suppose?). In the second code snippet, the compiler couldn’t figure out that (int *)(yi - 96) actually points to the same storage as x, and thus decided to forward the value stored through x in the first two lines to the printf() argument. However, the proposed solution to this problem, a new memory model, seems a bit of an overkill! It would have been nice to see some motivation in the paper as to why a new memory model is the best solution to this problem.

    Also, has the implementation of this model (part of the OOPSLA paper) been tested on systems with different endianness? How will this memory model affect bit-addressable systems, Harvard architectures, some older non-ARM non-x86 embedded platforms like PIC? These are all areas where C is heavily used, and from my limited experience, is one of the reasons for having a lot of UB in the standards.

  11. Hi Anon, the “memory model” is just shorthand for “the set of rules that tells us which optimizations are allowed and which ones are not allowed.” So this doesn’t have to be a heavyweight thing, and I don’t agree that changing the memory model is overkill.

    My guess is that our proposal is orthogonal to endianness and the other concerns you mention, but I’m not completely sure about this.

  12. Greetings Juneyoung Lee.

    How often, and for what purposes, are integer-to-pointer casts used within performance-critical code? Treating integer-to-pointer casts as calls to functions that return a pointer via unknown means would be an easy way of yielding guaranteed-correct semantics, albeit at some performance cost. Unless that performance cost would actually pose a problem, there’s no need to eliminate it. If the cost would be unacceptable in some situations where a programmer would be unable to move the casts outside the loop (e.g. because they occur within a template) adding a new syntax could allow such situations to be handled more efficiently than complicating the semantics of integer-to-pointer conversions. In any case, worrying about the cost without identifying if and when it actually matters smacks of premature optimization.

    Further, I think it’s important to recognize that many actions that create pointers have two parts: the determination of the physical address, and the logical creation of the reference contained in the pointer. Given a sequence: “int *p1 = (int*)(FOO+i*16); int *p2 = (int*)(FOO+j*16); *p1 = 1; *p2 = 2; if (someFlag) *p1 = 1;” it may be reasonable for a compiler to treat operations on *p1 and *p2 as independent because p2 is created within p1’s lifetime, but p1 is used between the creation and use of p2. This in turn would allow a compiler to eliminate one of the writes to p1 along with the conditional test on someFlag. If, however, the code had been “{int *p1 = FOO+i*16; *p1 = 1;} {int *p2 = (int*)(FOO+j*16); *p2 = 2;} if (someFlag) {int *p3 = FOO+i*16; *p3 = 3;}”, such behavior would not be reasonable. Here, the lifetimes of the pointers are disjoint, meaning they won’t alias even if they happen to identify the same storage (at different times). What would be needed to efficiently process the above would be for the assignment to “p3” to be accomplished via a statement meaning, essentially, “regard some reference as having been derived from some unknown object in a fashion that happens to yield a pointer to the same *address* as some earlier operation, but should not be presumed to identify the same *object*.”

    Personally, I think the right approach for C++ would be to add an “access window” type which behaves like a pointer, but with the semantics that (1) everything that is sequenced before the creation of the access window will be sequenced before all operations using it, (2) all pointers derived from the access window will become invalid upon its destruction, and (3) everything that is sequenced after the lifetime of the access window will be sequenced after all actions using it or any pointers derived therefrom. That would allow code which needs particular semantics to specify what it needs, without hamstringing the compiler supplying semantics that programs don’t need.

    Until such a thing exists, though, programmers can’t be expected to use it.

  13. Dear Anon,

    I believe suggesting a correct memory model is a good step forward, not only theoretically but also practically.
    Compiler people would like to write an optimizer that replaces a snippet of code (containing loads, stores, pointer-integer casts, etc.) with a faster, automatically generated one. How can we guarantee that the generated one is correct? One possible way is to write a tool that checks, for all inputs, that the two code snippets behave identically. However, the checker itself should be correct as well, and having a formal memory model has a role here (the checker should answer correctly w.r.t. the memory model). Sometimes checking the correctness of optimizations is quite tricky without a formal model. You may refer to this link as well: http://lists.llvm.org/pipermail/llvm-dev/2017-July/115497.html

    Regarding other architectures – we believe that our memory model supports architectures which are supported by LLVM’s original memory model.
    About bit-addressable architectures: I believe this is not required by the C standard (3.5.2 says that it need not be possible to express the address of each individual bit of an object), but it would be great if our model could be extended to bit-addressable architectures.

  14. Is gcc really sane here?

    Isn’t implementation defined behavior supposed to be documented and consistent?

    The behavior here is anything but consistent.

    And how can you get gcc to work on anything like integer to pointer conversion at all, if it fails so unpredictably?

  15. Dear John,

    I had to adapt your source code to be able to compile it with gcc; I attached my version. Unfortunately I was unable to reproduce your issue; I would have loved to see it with my own eyes and learn from it. My CPU is Intel x86_64.

    $ cat p.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    int main(void) {
    int *x = malloc(sizeof(int*));
    *x = 3;
    int *y = malloc(sizeof(int*));
    *y = 5;
    uintptr_t xi = (uintptr_t)x;
    uintptr_t yi = (uintptr_t)y;
    uintptr_t diff = xi - yi;
    printf("diff = %ld\n", (long)diff);
    int *p = (int *)(yi - 32); // <--- CHANGED!
    *p = 7;
    printf("x = %p, *x = %d\n", x, *x);
    printf("y = %p, *y = %d\n", y, *y);
    printf("p = %p, *p = %d\n", p, *p);
    return 0;
    }

    $ ./a.out
    diff = -32
    x = 0x55975f7c0010, *x = 7
    y = 0x55975f7c0030, *y = 5
    p = 0x55975f7c0010, *p = 7

    $ gcc --version
    gcc 6.3.0 20170516

  16. Dear John Payson,

    > How often, and for what purposes, are integer-to-pointer casts used within performance-critical code?

    There are algorithms which heavily use pointer-integer casts (XOR lists, hashing pointers, etc.) but more frequently they are used for alignment. perlbench of SPEC CPU2017 heavily uses pointer-integer casts for this purpose. Setting alignment is important for vectorized operations as well, as they trap if an unaligned pointer is given.

    Allowing pointer-integer casts to raise UB will block optimizations which insert such casts as well. For example:

    int **p = ..;
    *p = x;
    ix = *(intptr_t *)p; // this can be optimized to 'ix = (intptr_t)x'

    This store-load pair can be optimized to a cast, but it is incorrect if the inserted cast raises UB.

    But the more important thing is that allowing casts to raise UB is not enough to resolve existing issues. Two different valid pointers may have the same integer representation (e.g. ‘int x[4]; int y[4]; &x[4] == &y[0]’), and this poses another problem: how can we know whether the integer came from &x[4] or from &y[0]?
    If (void *)(intptr_t)(&x[4]) nondeterministically returns &y[0] or &x[4], it violates C and C++, which say that `(void*)(intptr_t)p == p` should be true.
    If we define pointer comparison to compare the underlying integer representation only, it cannot explain compiler optimizations which optimize (p == q) to ‘false’ when the pointers have different origins; this is important for C++ programs because they frequently compare the ‘this’ pointer with other pointers.
    To resolve this issue, there has been a proposal to let integers carry provenance as well (the n2263 link), and this again disables many integer optimizations. (There is a benchmark in the LLVM Nightly Tests which slows down 2000x after integer optimizations are naively disabled.)

    > adding a new syntax could allow such situations to be handled more efficiently than complicating the semantics of integer-to-pointer conversions.

    Yep, I agree that if a programmer can write such signs explicitly, it would bring us to a simpler semantics. 🙂

    > Further, I think it’s important to recognize many actions that create pointers have two parts: the determination of the physical address, and the logical creation of the reference contained in the pointer.

    I also believe that separating those two is a good idea. Actually, in our memory model there are two kinds of pointers: logical pointers and physical pointers. A logical pointer is a pointer with provenance, and a physical pointer is one without. In the case of a physical pointer, the reference is determined when the pointer is dereferenced, so it’s kind of a lazy evaluation. (The determined reference does not last, however.)

  17. GCC’s online doc says,
    > When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined. That is, one may not use integer arithmetic to avoid the undefined behavior of pointer arithmetic as proscribed in C99 and C11 6.5.6/8.

    I am not sure whether this applies to calculating an integer that “happens to be” the cast pointer.

    I think it should. Otherwise, what can I do with this cast pointer? Can I copy it? Can I multiply it by one? Can I use a tagging bit on it? Can I send it to my mail box?


  18. I’m not sure why integer-to-pointer casts would be needed for alignment. If one knows that a “uint32_t*” is 32-bit aligned, but wants it 16-byte aligned, converting it to an integer type would reveal whether one needed to index it by 0, 1, 2, or 3 to achieve the required alignment, without having to do anything that would obscure the provenance of the pointer. If the original might not be 32-bit aligned, conversion to “char*” and back to “uint32_t*” would be required, but there would still be no need to convert a uintptr_t to a pointer. On segmented-memory platforms (e.g. large-model 8086), this approach could be faster than doing math on a uintptr_t, since there would be no need to do math on the segment part of the address.

    BTW, the only reason to use “uintptr_t” even in these cases is that the Standard doesn’t define the behavior of converting a pointer to any other integer type. If it were to say that converting a pointer to any integer type yields a “mostly”-Implementation-Defined value (allowing an implementation-defined subset of bits to take on Unspecified values), then on implementations that define such conversions in typical fashion, conversion to `unsigned` would be sufficient even if `uintptr_t` were a larger type.

    While requiring pointers with the same integer representation to compare equal would preclude some attempts to optimize comparisons as “impossible” based on provenance, I’m still not clear why efficiency-minded code *should* be performing integer-to-pointer conversions in the first place. Perhaps it might be useful for a compiler to include an option to improve the performance of code that uses unnecessary integer-to-pointer conversions at the expense of making the implementation unsuitable for many kinds of low-level programming, but in the long run I’d think optimization prospects for the code would be improved more by replacing integer-to-pointer conversions with pointer arithmetic.

    I don’t doubt that blocking certain optimizations would impose a many-orders-of-magnitude slowdown in some benchmarks, but fail to see why that should be considered any sort of problem. One can easily construct benchmarks where a guarantee that would normally cost nothing makes the difference between something being evaluated as a constant versus not, and consequently imposes a thousand-fold, million-fold, or even billion-fold speed penalty, but such benchmarks might have no relation whatsoever to real-world scenarios.

  19. ntysdd:

    You can compare such pointers for equality with each other or with other pointers, and you can use them in any other manner an implementation chooses to support. If an implementation accurately describes the purposes for which it is intended to be suitable, and supports pointer usages appropriate for such purposes, those should fulfill the needs of programmers who make reasonable efforts to select implementations appropriate for their tasks.

    One of my biggest gripes is with compilers that fail to make clear what purposes they are and are not intended to be suitable for, thus leading programmers to use implementations inappropriately. If compiler writers would acknowledge that enabling certain optimizations will render their implementations unsuitable for various purposes, rather than trying to claim that programs that are incompatible with such optimizations are “broken”, that would solve a lot of problems.

  20. How much gain can we get from provenance based alias analysis?

    If it only boosts benchmarks and not so many real applications, then I am quite against it.

    C has become so strange — even an integer (well not a mathematical integer) can have a provenance.

  21. What’s the difference between an integer and a pointer?

    Ironically, the answer is nothing.

    If an integer can have a provenance, then how is it different from a pointer?
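    A minimal sketch of what “an integer with provenance” can mean in practice (the store is left commented out, since under a provenance-based semantics it is the problematic step):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int x = 1, y = 2;
        uintptr_t ix = (uintptr_t)&x;
        uintptr_t iy = (uintptr_t)&y;

        /* ix and iy are plain integers, yet in a provenance-aware
           semantics each one "remembers" the object it came from.
           If x and y happen to be adjacent, ix + sizeof(int) and iy
           can be equal as numbers while carrying different
           provenances: */
        if (ix + sizeof(int) == iy) {
            /* A provenance-based optimizer may assume this store
               cannot modify y even though the addresses match, so
               enabling it could still leave a later read of y
               seeing 2: */
            /* *(int *)(ix + sizeof(int)) = 11; */
        }
        printf("%d %d\n", x, y);   /* prints "1 2" with the store
                                      disabled */
        return 0;
    }
    ```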

  22. Hi

    The C standard has a list of undefined behaviours (UB). That is useful, but my impression is that it is not complete. Anything that is not defined in some way in the standard is also … not defined i.e. UB. (Here I consider stuff that is implementation-defined or unspecified to be defined “in some way”). Or should such things rather be considered unspecified? As I understand it, the issue addressed in this blog post relies on behaviour not defined in the C standard. As such, the result does not surprise me.

    I have not seen much attention given to behaviour that is not defined by the C standard, yet not listed as UB by the standard itself. I am curious to know why that would be.

    A list of additional, known-non-defined behaviours could be useful e.g. for comparing static analysers and educating C programmers.


  23. Michael Tempest:

    The authors of the Standard were not trying to create a new language. Instead, they were trying to describe a language that was already in wide use. An interesting feature of this language was that implementations would often define the behaviors of many actions based upon the target platforms and application field.

    Most situations where a program invokes Undefined Behavior fit one of two similar patterns:

    1. One part of the Standard defines the behavior of some actions, and another part says that an overlapping category of actions invokes Undefined Behavior.

    2. One part of the Standard specifies the behavior of some actions under certain conditions, and says they invoke Undefined Behavior when those conditions don’t apply, but the same behavioral description could be meaningfully applied on the target platform under circumstances beyond those listed.

    Many implementations can be configured to very usefully extend the language by handling the situations above according to the given behavioral definition absent a compelling reason to do otherwise, or, as the Standard would describe it, by “behaving during translation or program execution in a documented manner characteristic of the environment”. For freestanding implementations, applying at least some such extensions may be the difference between being able to do almost nothing and being able to do just about anything.

    I’m not sure what “education” programmers need about UB, beyond recognizing that some implementations are incapable of usefully processing low-level code unless optimizations are completely disabled. The vast majority of programs for freestanding implementations require abilities the Standard doesn’t define, and the notion that programmers should limit themselves to things the Standard defines is fundamentally contrary to the Spirit of C described in the published Rationale: [among other things] “Don’t prevent the programmer from doing what needs to be done.”