If I choose a size S, can you tell me how many valid C programs exist that are no larger than that size? I’m actually interested in the answer — it’ll help me make a point in a paper I’m writing. Shockingly, the Internet (or at least, the part of it that I looked at based on a few searches) does not provide a good answer.

Let’s start with a few premises:

- Since it would be exceedingly difficult to construct the exact answer, we’re looking for a respectably tight lower bound.
- S is measured in bytes.
- Since it seems obvious that there’s an exponential number of C programs, our answer will be of the form N
^{S}.

But how to compute N? My approach (and I’m not claiming it’s the best one) is to consider only programs of this form:

int main() { int id1; int id2; .... }

The C99 standard guarantees that the first 63 characters of each identifier are significant, so (to drop into Perl regex mode for a second) each identifier is characterized as: [a-zA-Z_][a-zA-Z0-9_]{62}. The number of different identifiers of this form is 53*63^{62}, but we spend 69 bytes in order to generate that many programs, in addition to 15 bytes of overhead to declare main(). The 69^{th} root of the large number is between 43 and 44, so I claim that 43^{(N-15)} is a not-incredibly-bad lower bound on the number of C programs of size S. Eventually we’ll hit various other implementation limits and be forced to declare additional functions, but I suspect this can be done without invalidating the lower bound.

Of course, the programs I’ve enumerated are not very interesting. Perhaps it would have been better to ask:

- How many non-alpha-equivalent C programs are there?

or even:

- How many functionally inequivalent C programs are there?

But these sound hard and don’t serve my narrow purpose of making a very blunt point.

Anyway, I’d be interested to hear people’s thoughts on this, and in particular on the 43^{S} (claimed) lower bound.

I’m tired and it is late, but S to the N for a fixed N is exponential?

It sounds like a polynomial upper bound to me.

Don’t mind me if I’m writing something ridiculous…

Bleh… of course I had flipped these around, thanks Daniel, fixed now. The argument, I hope, stands…

int main() {“somerandomgarbage”;} is a valid C program and the garbage string can be anything without worrrying about limitations on identifiers or wasting space on redundantly saying “int”. So that’s X^(S-15) where X is the number of distinct characters that are valid in a string…minus a bit for invalid escape sequences which we can ignore by just not allowing \. Of course, that gets us into character set issues :/ Assumig 7 bit ASCII we’re at roughly 100 or so charcters that are allowed in a string without escaping.

That said, random strings probably aren’t very interesting to a fuzz tester like Csmith

@James: the (C99) standard only guarantees that an implementation be able to translate one program containing at least 4095 characters in a string literal, which puts a much lower upper bound on programs of this form.

@regehr: and it also guarantees only one program with 511 identifiers with block scope, so a single main() with that many local variables need not be translatable. So you need to start declaring additional functions much sooner. On the plus side, since there is no implementation limit for the number of internal identifiers other than initial characters, you can have that many static functions.

To be precise, what we’re talking about here are not valid C programs but /strictly conforming/ programs, which are not allowed to exceed any minimum implementation limit. Implementations are of course free to accept correct programs that go beyond these limits.

Oops, I fail at reading comprehension. ðŸ™‚ Obviously X^4080 is a bigger number than 43^4080, since X >= 93 (if I read 5.2.1p3 correctly and subtract two characters for \ and “).

I think James Iry is on the right track. To even make all the programs functionally different, do something like hash the contents of the string(s) and return some bits of the hash in the exit status.

unsigned h;void H(char*s){while(*s)h+=*s++; /*any function on s that modifies h*/}int main(int c,char**v){

H(“…”); // respecting 4095 characters in logical source line limit

H(“…”);

return (h >>c)&1;}

Now almost all of the program is inside the string literals, so we get nearly as many valid-programs-of-length-N as James’ system, but furthermore the result of the program depends on the contents of all the string literals as well (since any bit of the hash can be selected as the program’s output). If your N is really big then you can save a little more with some additional macro trickery, like

#define J );H(

and then write

H(“…”

J”…”

J”…”);

which saves 3 characters each time at a price of 15 characters; this is applicable only to programs over length somwhat over 20kB.

By my count the smallest program of this form (with #define J and one call H(“”)) is 108 bytes; after that, every 4095 characters add 4092 characters from a set of X >= 93, making the approximate number of programs in my form 93^((N-108)^(4092/4095)).

A related question is how many different C programs have ever been written. This is probably dominated by student programs.

100 universities teaching 100 students per year with each student writing 10 programs is 100,000 per year. Of course the intent is that all the programs for one question be functionally the same ðŸ˜‰

If we limit ourselves to production code then small programs will still dominate (see fig 107.1 in http://www.knosof.co.uk/cbook/cbook1_2.pdf) and discussion of the various versions of Linux are down at the noise level.

If we have say 100,000 people writing C and they each write 10 small programs a year we have 1 million programs per year (I would expect some of these to be functionality similar).

In the context of the lambda calculus (a somewhat simpler setting), your final questions about alpha-equivalence have been explored. See, for example, Jue Wang’s work [1,2] and “Testing an optimising compiler by generating random lambda terms” by Palka et al [3]

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.95.2624

[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89.4380

[3] http://dl.acm.org/citation.cfm?id=1982615

Thanks folks! I agree that James is on the right track.

A super easy way to render the string-containing programs functionally inequivalent is to print the strings or (if you prefer) stuff them into a volatile variable byte-by-byte.

Derek I have seen as many as 70% of my students submit functionally inequivalent solutions to a homework problem :).

Sam, thanks, the AST ’11 paper looks great, I had totally missed it.

I’d start by sampling random strings of different lengths and attempting to parse them as expressions. It should give you a rough estimate of what fraction of the “random bytes” space is C.

I rather suspect that you won’t get many data points without a LOT of compute time, but some statistical confidence bounds could be extracted even from null results.

Also, I rather suspect that token counts would be a better measure than bytes.

BCS, that’s an interesting approach. I wonder how many samples would be required in order to get a tighter lower bound than Jeff’s? For realistic program sizes I kind of suspect the universe would end first :).

The number of distinct programs of a given length n is not computable. For, suppose it were computable by some Turing machine M(n). Then one could determine the busy-beaver number as follows. Dove-tail the executions of all C programs of length n simultaneously. For n sufficiently large, one of these diverges; the others terminate in finite time. As soon as M(n)-1 executions halt, declare the largest such one to be the busy beaver.

Hi David, even the validity of C programs is not decidable:

main () {

might_halt();

return 1/0;

}

Making this topic particularly thorny are the nasty termination games played by both language standards and real compilers http://blog.regehr.org/archives/161

What range of values of S are you interested in. If the number is small and the identifier names long, then there can only be so many variations. Once S starts getting big and the identifiers shorter the variations would grow rapidly. Ok, that was stating the obvious.

David Harris: the upper bound on the number of C program of length B bytes is the number of byte strings of length B: 2^(8*B)

regehr : I’d be slightly surprised if *any* non trivial programs could be generated. Hello worlds is ~22 tokens long -> ~10^30 options . Thus the limiting to expressions and extrapolating from there.

A bit of a brain dump here: At a guess, the results will look something like:

\prod_{n=1}^{B} F(n)

where:

F(~20) ~= 20

\lim_{n \rightarrow \propto}F(x)~=5

Both 20 and 5 are WAGs but you might be able to get a better approximation for the tail number by computing the average bits generated per token by a Markov chain.

A hint at a lower (rather than upper) bound on the number of reasonable programs of a certain size: comparing the size of linux-3.2.7.tar and httpd-4.2.1.tar to their compressed forms (xz -e9, the best compressor I have at hand), there is ~1.1 bit/character of (compressible) entropy in (this) human-written C code.

If valid is defined as “it compiles” rather than “it terminates”, have you thought about recursively counting the possibilities for each character according to the c grammar?

Wikipedia says this (http://en.wikipedia.org/wiki/Deterministic_finite_automaton#Accept_and_Generate_modes):

The generating mode is similar except that rather than validating an input string its goal is to produce a list of all the strings in the language. Instead of following a single transition out of each state, it follows all of them. In practice this can be accomplished by massive parallelism (having the program branch into two or more processes each time it is faced with a decision) or through recursion. As before, the computation begins at the start state and then proceeds to follow each available transition, keeping track of which branches it took. Every time the automaton finds itself in an accept state it knows that the sequence of branches it took forms a valid string in the language and it adds that string to the list that it is generating. If the language this automaton describes is infinite (ie contains an infinite number or strings, such as “all the binary string with an even number of 0s) then the computation will never halt. Given that regular languages are, in general, infinite, automata in the generating mode tends to be more of a theoretical construct[citation needed].

—–

C is an infinite language, but if you terminate the generator at a finite length of program then the DFA in generate mode will terminate.

Hi Matt, the C grammar cannot be described by a DFA; the grammar is at least LL(k). With semantic checks, the C language is not (nor are most languages) context-free.

I would argue that trying to count the bytes directly is quite difficult, or that there may be a conceptually easier way.

Try to look at how many valid ASTs of C programs there are, and what their size implication (in bytes) is..

Depending on the point you’re trying to make, it might be more compelling to try this experiment: take some small existing program, and see how many different programs could be generated by making individually very simple mutations that retain the properties the compiler cares aboutâ€”like changing a plus to a minus. This is already 2^N in the number of expressions, or something like it; more to the point, these variations are very likely to produce functionally different programs. That is, they matter.

Once you get N past a couple thousand, 1.03^N and 255.99^N both just mean “too many to worry about”; it is therefore much more important that there are, say, 1.03^N functionally different variations of some known useful program of length N, than that there are, say, 255.99^N possible C “programs” (in some totally non-human-relevant sense) of length N.