A Bit Puzzle

I love a good bit-level puzzle. Today’s is one I learned about from a comment in a Google+ discussion:

Given two words a and b, return the high bits that they have in common, with the highest bit where they differ set, and all remaining bits clear. So 10101 and 10011 yields 10100. Then do the same thing, only flipped, so we keep low-order identical bits and mask out the high ones.

The problem comes from trie implementation. Here’s code that does it the slow way:

uint64_t high_common_bits_64_slow (uint64_t a, uint64_t b)
{
  uint64_t mask = 0x8000000000000000LLU;
  uint64_t output = 0;
  int i;
  for (i=63; i>=0; i--) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask >>= 1;
  }
 out:
  return output;
}

uint64_t low_common_bits_64_slow (uint64_t a, uint64_t b)
{
  uint64_t mask = 1;
  uint64_t output = 0;
  int i;
  for (i=0; i<64; i++) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask <<= 1;
  }
 out:
  return output;
}

The problem is to do much better than these. To keep things simple, let’s say we’re interested primarily in speed on newish x86-64 implementations. Choose any programming language, or just write assembly. In a few days I’ll post my solutions in addition to any good ones that show up here before then. Beware WordPress’s mangling of source code in the comments — feel free to just mail me.

December 19, 2011

regehr

Computer Science

34 responses to “A Bit Puzzle”

regehr says:

December 19, 2011 at 9:53 am

I’ll just add that I was bummed not to be able to create branch-free x64 code for these. I’m sure it’s possible.
Lars Bergstrom says:

December 19, 2011 at 10:10 am

You would love the first lab assignment from the _Computer Systems: A Programmer’s Perspective_ book. Having TA’d the course several years now, the bit-twiddling assignments almost always result in some very interesting solutions from the students:

http://csapp.cs.cmu.edu/public/labs.html
Andrew Hunter says:

December 19, 2011 at 10:11 am

Link to the original? 🙂 I like finding people to follow who ask questions like this.

We’ll see how WordPress takes this, but here’s a solution (I think, I haven’t checked all corner cases.);
uint64_t high_common(uint64_t a, uint64_t b) {
uint64_t x = (a ^ b);
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x |= x >> 32;
return (a & ~x) | ((x >> 1) + 1);
}

Fairly long dependency chain, but all simple ops and entirely branch free. I think there might be a faster way to implement the middle section (“set all bits under any currently set bits….”) but I’d have to check the assembly spec.

Also, is this useful for something? Seems like a fairly odd spec to just cut from whole cloth.
regehr says:

December 19, 2011 at 10:11 am

Hi Lars, I was an instructor for that course 3 or 4 times and always loved that lab. But too many of the problems have solutions that are easily available on the Internet.
regehr says:

December 19, 2011 at 10:18 am

Is there a way to link to a G+ comment? I don’t see a way to do that. But anyway, it was posted by Jan-Willem Maessen.
Andrew Hunter says:

December 19, 2011 at 10:18 am

Lars, yeah, I remember liking that lab in undergrad (HMC).
regehr says:

December 19, 2011 at 10:20 am

Hi Andrew, your version fails on my tester 🙂
Mike G. says:

December 19, 2011 at 10:34 am

Offhand, what I’m thinking of is decomposing this into known bit-twiddling problems:
unsigned diffIndex=findHighestBitSet(a^b);
// upperMask is a precomputed array of the masks
// needed to mask off the top bits. There’s probably a
// more efficient way to do it – too lazy to dig up ]
// “Hacker’s Delight”. The ‘1<<diffIndex'
// shift could also be replaced with a table lookup if
// more efficient.
return (upperMask[diffIndex]&a)|(1<<diffIndex);

That's without any testing, but I think that would work, and it should certainly be faster than the naive version.
Pascal Cuoq says:

December 19, 2011 at 10:37 am

@Andrew

> is this useful for something?

It looks like exactly what a particular big-endian (resp. little-endian) implementation of Patricia trees might need.

One day, I will have the time to measure which is faster between the two sequences below.

x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
…

or

y = x >> 1;
z = x >> 2;
x |= y;
x |= z;
y = x >> 3;
z = x >> 6;
x |= y;
x |= z;
y = x >> 9;
z = x >> 18;
x |= y;
x |= z;
…

The latter is longer but has shorter dependency chains.
Tony Garnock-Jones says:

December 19, 2011 at 10:43 am

Hi John,

If you replace Andrew’s last line with “return (a & ~x) | (x & ~(x >> 1));”, does it pass?
Carlos says:

December 19, 2011 at 10:44 am

Wow, that’s good serendipity; I’ve been thinking of this exact same operation for the last week or so. I use this exact operation in a spatial data structure built around z-order space-filling curves.

(To answer Andrew’s question): If you think of a binary number as a sequence of binary tree decisions on the full range of possible numbers, this bit operation gives you the most specific common ancestor on this tree.

With z-order curves you get a cheap way to compute fairly tight bounding boxes between points in n dimensions.
Eddie Kohler says:

December 19, 2011 at 10:46 am

uint64_t high(uint64_t a, uint64_t b) {
uint64_t db = a ^ b, hdb;
db |= (db >> 1);
db |= (db >> 2);
db |= (db >> 4);
db |= (db >> 8);
db |= (db >> 16);
db |= (db >> 32);
return ((a & b) | (db >> 1)) + 1;
}

uint64_t low(uint64_t a, uint64_t b) {
uint64_t x = (a ^ b) & -(a ^ b);
return (a & (x – 1)) | x;
}
regehr says:

December 19, 2011 at 11:01 am

Eddie, your low code looks good but the high one fails in (I think) the same way as Andrew’s.
regehr says:

December 19, 2011 at 11:03 am

Tony– it works!
Mans says:

December 19, 2011 at 11:07 am

If your CPU has a fast count-leading-zeros instruction:

high_common(uint64_t a, uint64_t b)
{
uint64_t x = a ^ b;
int n = 63 – clz(x);
return (a >> n | 1) << n;
}

Fixing this for a == b is left as an exercise.
Benjamin Kramer says:

December 19, 2011 at 11:08 am

A version with gcc builtins.

uint64_t high(uint64_t a, uint64_t b) {
uint64_t bit = 1ULL << 63-__builtin_clzll(a^b);
return a == b ? a : (a & ~(bit-1)) | bit;
}

uint64_t low(uint64_t a, uint64_t b) {
uint64_t bit = 1ULL << __builtin_ctzll(a^b);
return a == b ? a : (a & bit-1) | bit;
}

Sadly this is not branchless on x86 because bsf/bsr are undefined for 0. The upcoming BMI instructions lzcnt and tzcnt solve this ussue
Mans says:

December 19, 2011 at 11:11 am

BTW, with GCC __builtin_clz() provides the count-leading-zeros instruction if available.
Ryan Fox says:

December 19, 2011 at 11:23 am

Darn, I thought I was going to be the only one clever enough to remember about clz.

Anyway, here’s my solution. I think I’m depending on undefined behaviour with shifting UINT32_MAX, so that probably counts as cheating. 😛

uint32_t puzzle(uint32_t a, uint32_t b)
{
uint32_t and = a&b;
uint32_t xor = a^b;

uint32_t leading_zeros = __builtin_clz(xor);

uint32_t and_mask = UINT32_MAX << (8*sizeof(and_mask) – leading_zeros);

uint32_t result = and & and_mask | 1 << (8*sizeof(result) – leading_zeros – 1);

return result;
}
Eddie Kohler says:

December 19, 2011 at 11:23 am

Is the spec defined when a == b? I.e. high(a,b) == low(a,b) == a == b?
Mans says:

December 19, 2011 at 11:30 am

Ryan, any unsigned value x may be safely shifted by n in either direction provided 0 <= n < CHAR_BIT * sizeof(x).
Mans says:

December 19, 2011 at 11:32 am

Another variant for high_common() would be Eddie’s low() combined with a bit-reversal instruction if the CPU has one. ARM does, don’t know about x86. I don’t know of any CPU that has bit-reverse but not CLZ though, so this is probably not very useful.
regehr says:

December 19, 2011 at 11:49 am

Eddie, that’s what I was assuming.

Carlos, is it correct to just return the argument when a==b? It probably doesn’t matter in practice since you’re not going to call the function unless you have two different tree nodes.
Andrew Hunter says:

December 19, 2011 at 12:02 pm

Regehr,

Yeah, I noticed the all-same behavior while walking to work. Serves me right for not doing better testing. (How are you testing these, by the way? Hand-chosen interesting test cases, or one of the tools you’ve mentioned for generating failures? I ran mine through a fuzzer, but that’s not likely to generate any number of “interesting” cases.)
regehr says:

December 19, 2011 at 12:07 pm

Andrew, my tests are made by:

– pick a random 64-bit int and call it a
– initialize b to a, then flip 0..63 random bits

I’m not sure if this is good, but clearly just making a and b separately random is wrong.
regehr says:

December 19, 2011 at 12:08 pm

Benjamin, these are basically my solutions. Your branch is the same one I failed to eliminate.
regehr says:

December 19, 2011 at 12:10 pm

Carlos, hope we can help make your code faster :).
Andrew Hunter says:

December 19, 2011 at 12:13 pm

Regehr, yeah, that looks better than my method (pick a,b iid uniform.) I’d be curious what would happen if you threw a SAT solver at these functions (searching for a setting of bits where they differ). I’ve heard of good results from (similar) techniques from, for one, the code sketching people at MIT, but I don’t have their infrastructure handy, so…
regehr says:

December 19, 2011 at 12:16 pm

Andrew, bit puzzles like this pretty much scream for being attacked by SAT or SMT.

About a dozen times I’ve started on this kind of project before realizing (1) it’s really hard to do a good job, and maybe impossible if there are loops and (2) lots of smart people already work on this kind of thing.
MSN says:

December 19, 2011 at 3:00 pm

For the lower bit function, you don’t need to bother with the tree:

uint64_t d= a ^ b;
uint64_t low_d= (d & -d);
uint64_t low_mask_out_lower= low_d | (~-low_d);
uint64_t answer_low= ((a & b) | low_d) & low_mask_out_lower;
qznc says:

December 20, 2011 at 3:06 am

An (untestet) asm idea based on the bsf instruction without branches/loops:

Assume a in eax and b in ebx.
andl %eax,%ebx # put a&b in eax
bsfl %ecx,%eax # index of least significant bit in ecx
shr %eax, %cl # shift a&b right by index (cl == lower 8bit of ecx)
shl %eax, %cl # shift left again, so lower bits now cleared
Result in eax now

I’m not sure, if the bsf instruction is more efficient than a table-based findHighestBitSet, though.
qznc says:

December 20, 2011 at 3:27 am

Correction, because i misunderstood the puzzle:

Assume a in eax and b in ebx.
xorl %eax,%ebx # put a^b in eax
bsfl %ecx,%eax # index of least significant bit in ecx
dec %ecx # index–
shr %eax, %cl # shift a^b right by index (cl == lower 8bit of ecx)
and %eax, 1 # set differing bit
shl %eax, %cl # shift left again, so lower bits now cleared
Result in eax now
Eddie Kohler says:

December 20, 2011 at 6:00 am

Here are two branchless versions that I think work even for a==b. high2() assumes that __builtin_clzl(0) returns some arbitrary integer value (but does not trap or otherwise explode).

uint64_t high(uint64_t a, uint64_t b) {
uint64_t db = a ^ b;
db |= (db >> 1);
db |= (db >> 2);
db |= (db >> 4);
db |= (db >> 8);
db |= (db >> 16);
db |= (db >> 32);
return ((a & b) | (db >> 1)) + (db > 0);
}

uint64_t high2(uint64_t a, uint64_t b) {
uint64_t x = a ^ b, nz = x > 0;
uint64_t db = (uint64_t) nz << (63 – (__builtin_clzl(x) & 63));
return (a & ~(db – nz)) | db;
}
Mans says:

December 20, 2011 at 7:35 am

qznc, bsf is fast on Intel Pentium-2 and later (except Atom), and AMD K10 and later. On Intel Pentium, Intel Atom, and AMD <= K8 it is very slow, perhaps microcoded.
Kristina says:

December 20, 2011 at 3:18 pm

A bit off-topic, but has anyone found a way to post code (fixed-width font) on Google+?