I love a good bit-level puzzle. Today’s is one I learned about from a comment in a Google+ discussion:

Given two words a and b, return the high bits that they have in common, with the highest bit where they differ set, and all remaining bits clear. So 10101 and 10011 yields 10100. Then do the same thing, only flipped, so we keep low-order identical bits and mask out the high ones.

The problem comes from a trie implementation. Here’s code that does it the slow way:

uint64_t high_common_bits_64_slow (uint64_t a, uint64_t b) {
  uint64_t mask = 0x8000000000000000LLU;
  uint64_t output = 0;
  int i;
  for (i = 63; i >= 0; i--) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask >>= 1;
  }
 out:
  return output;
}

uint64_t low_common_bits_64_slow (uint64_t a, uint64_t b) {
  uint64_t mask = 1;
  uint64_t output = 0;
  int i;
  for (i = 0; i < 64; i++) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask <<= 1;
  }
 out:
  return output;
}

The problem is to do much better than these. To keep things simple, let’s say we’re interested primarily in speed on newish x86-64 implementations. Choose any programming language, or just write assembly. In a few days I’ll post my solutions in addition to any good ones that show up here before then. Beware WordPress’s mangling of source code in the comments — feel free to just mail me.

I’ll just add that I was bummed not to be able to create branch-free x64 code for these. I’m sure it’s possible.

You would love the first lab assignment from the _Computer Systems: A Programmer’s Perspective_ book. Having TA’d the course for several years now, I can say the bit-twiddling assignments almost always result in some very interesting solutions from the students:

http://csapp.cs.cmu.edu/public/labs.html

Link to the original? 🙂 I like finding people to follow who ask questions like this.

We’ll see how WordPress takes this, but here’s a solution (I think; I haven’t checked all the corner cases):

uint64_t high_common(uint64_t a, uint64_t b) {

uint64_t x = (a ^ b);

x |= x >> 1;

x |= x >> 2;

x |= x >> 4;

x |= x >> 8;

x |= x >> 16;

x |= x >> 32;

return (a & ~x) | ((x >> 1) + 1);

}

Fairly long dependency chain, but all simple ops and entirely branch free. I think there might be a faster way to implement the middle section (“set all bits under any currently set bits….”) but I’d have to check the assembly spec.

Also, is this useful for something? Seems like a fairly odd spec to just cut from whole cloth.

Hi Lars, I was an instructor for that course 3 or 4 times and always loved that lab. But too many of the problems have solutions that are easily available on the Internet.

Is there a way to link to a G+ comment? I don’t see a way to do that. But anyway, it was posted by Jan-Willem Maessen.

Lars, yeah, I remember liking that lab in undergrad (HMC).

Hi Andrew, your version fails on my tester 🙂

Offhand, what I’m thinking of is decomposing this into known bit-twiddling problems:

unsigned diffIndex=findHighestBitSet(a^b);

// upperMask is a precomputed array of the masks

// needed to mask off the top bits. There’s probably a

// more efficient way to do it – too lazy to dig up

// “Hacker’s Delight”. The ‘1<<diffIndex'

// shift could also be replaced with a table lookup if

// more efficient.

return (upperMask[diffIndex]&a)|(1ULL<<diffIndex);

That's without any testing, but I think that would work, and it should certainly be faster than the naive version.

@Andrew

> is this useful for something?

It looks like exactly what a particular big-endian (resp. little-endian) implementation of Patricia trees might need.

One day, I will have the time to measure which is faster between the two sequences below.

x |= x >> 1;

x |= x >> 2;

x |= x >> 4;

…

or

y = x >> 1;

z = x >> 2;

x |= y;

x |= z;

y = x >> 3;

z = x >> 6;

x |= y;

x |= z;

y = x >> 9;

z = x >> 18;

x |= y;

x |= z;

…

The latter is longer but has shorter dependency chains.

Hi John,

If you replace Andrew’s last line with “return (a & ~x) | (x & ~(x >> 1));”, does it pass?

Wow, that’s good serendipity; I’ve been thinking of this exact same operation for the last week or so. I use this exact operation in a spatial data structure built around z-order space-filling curves.

(To answer Andrew’s question): If you think of a binary number as a sequence of binary tree decisions on the full range of possible numbers, this bit operation gives you the most specific common ancestor on this tree.

With z-order curves you get a cheap way to compute fairly tight bounding boxes between points in n dimensions.

uint64_t high(uint64_t a, uint64_t b) {

uint64_t db = a ^ b, hdb;

db |= (db >> 1);

db |= (db >> 2);

db |= (db >> 4);

db |= (db >> 8);

db |= (db >> 16);

db |= (db >> 32);

return ((a & b) | (db >> 1)) + 1;

}

uint64_t low(uint64_t a, uint64_t b) {

uint64_t x = (a ^ b) & -(a ^ b);

return (a & (x - 1)) | x;

}

Eddie, your low code looks good but the high one fails in (I think) the same way as Andrew’s.

Tony, it works!

If your CPU has a fast count-leading-zeros instruction:

high_common(uint64_t a, uint64_t b)

{

uint64_t x = a ^ b;

int n = 63 - clz(x);

return (a >> n | 1) << n;

}

Fixing this for a == b is left as an exercise.

A version with gcc builtins.

uint64_t high(uint64_t a, uint64_t b) {

uint64_t bit = 1ULL << 63-__builtin_clzll(a^b);

return a == b ? a : (a & ~(bit-1)) | bit;

}

uint64_t low(uint64_t a, uint64_t b) {

uint64_t bit = 1ULL << __builtin_ctzll(a^b);

return a == b ? a : (a & bit-1) | bit;

}

Sadly this is not branchless on x86 because bsf/bsr are undefined for 0. The upcoming BMI instructions lzcnt and tzcnt solve this issue.

BTW, with GCC __builtin_clz() provides the count-leading-zeros instruction if available.

Darn, I thought I was going to be the only one clever enough to remember about clz.

Anyway, here’s my solution. I think I’m depending on undefined behaviour with shifting UINT32_MAX, so that probably counts as cheating. 😛

uint32_t puzzle(uint32_t a, uint32_t b)

{

uint32_t and = a&b;

uint32_t xor = a^b;

uint32_t leading_zeros = __builtin_clz(xor);

uint32_t and_mask = UINT32_MAX << (8*sizeof(and_mask) - leading_zeros);

uint32_t result = and & and_mask | 1 << (8*sizeof(result) - leading_zeros - 1);

return result;

}

Is the spec defined when a == b? I.e., should high(a,b) == low(a,b) == a when a == b?

Ryan, any unsigned value x may be safely shifted by n in either direction provided 0 <= n < CHAR_BIT * sizeof(x).

Another variant for high_common() would be Eddie’s low() combined with a bit-reversal instruction if the CPU has one. ARM does, don’t know about x86. I don’t know of any CPU that has bit-reverse but not CLZ though, so this is probably not very useful.

Eddie, that’s what I was assuming.

Carlos, is it correct to just return the argument when a==b? It probably doesn’t matter in practice since you’re not going to call the function unless you have two different tree nodes.

Regehr,

Yeah, I noticed the all-same behavior while walking to work. Serves me right for not doing better testing. (How are you testing these, by the way? Hand-chosen interesting test cases, or one of the tools you’ve mentioned for generating failures? I ran mine through a fuzzer, but that’s not likely to generate any number of “interesting” cases.)

Andrew, my tests are made by:

– pick a random 64-bit int and call it a

– initialize b to a, then flip 0..63 random bits

I’m not sure if this is good, but clearly just making a and b separately random is wrong.

Benjamin, these are basically my solutions. Your branch is the same one I failed to eliminate.

Carlos, hope we can help make your code faster :).

Regehr, yeah, that looks better than my method (pick a,b iid uniform.) I’d be curious what would happen if you threw a SAT solver at these functions (searching for a setting of bits where they differ). I’ve heard of good results from (similar) techniques from, for one, the code sketching people at MIT, but I don’t have their infrastructure handy, so…

Andrew, bit puzzles like this pretty much scream for being attacked by SAT or SMT.

About a dozen times I’ve started on this kind of project before realizing (1) it’s really hard to do a good job, and maybe impossible if there are loops and (2) lots of smart people already work on this kind of thing.

For the lower bit function, you don’t need to bother with the tree:

uint64_t d= a ^ b;

uint64_t low_d= (d & -d);

uint64_t low_mask_out_lower= low_d | (~-low_d);

uint64_t answer_low= ((a & b) | low_d) & low_mask_out_lower;

An (untested) asm idea based on the bsf instruction, without branches/loops:

Assume a in eax and b in ebx.

andl %ebx,%eax # put a&b in eax

bsfl %eax,%ecx # index of least significant set bit into ecx

shrl %cl,%eax # shift a&b right by index (cl == low 8 bits of ecx)

shll %cl,%eax # shift left again, so lower bits now cleared

Result in eax now

I’m not sure if the bsf instruction is more efficient than a table-based findHighestBitSet, though.

Correction, because I misunderstood the puzzle:

Assume a in eax and b in ebx.

xorl %ebx,%eax # put a^b in eax

bsfl %eax,%ecx # index of least significant set bit into ecx

decl %ecx # index--

shrl %cl,%eax # shift a^b right by index (cl == low 8 bits of ecx)

andl $1,%eax # isolate the differing bit

shll %cl,%eax # shift left again, so lower bits now cleared

Result in eax now

Here are two branchless versions that I think work even for a==b. high2() assumes that __builtin_clzl(0) returns some arbitrary integer value (but does not trap or otherwise explode).

uint64_t high(uint64_t a, uint64_t b) {

uint64_t db = a ^ b;

db |= (db >> 1);

db |= (db >> 2);

db |= (db >> 4);

db |= (db >> 8);

db |= (db >> 16);

db |= (db >> 32);

return ((a & b) | (db >> 1)) + (db > 0);

}

uint64_t high2(uint64_t a, uint64_t b) {

uint64_t x = a ^ b, nz = x > 0;

uint64_t db = (uint64_t) nz << (63 – (__builtin_clzl(x) & 63));

return (a & ~(db – nz)) | db;

}

qznc, bsf is fast on Intel Pentium-2 and later (except Atom), and AMD K10 and later. On Intel Pentium, Intel Atom, and AMD <= K8 it is very slow, perhaps microcoded.

A bit off-topic, but has anyone found a way to post code (fixed-width font) on Google+?