## A Bit Puzzle

I love a good bit-level puzzle. Today’s is one I learned about from a comment in a Google+ discussion:

Given two words a and b, return the high bits that they have in common, with the highest bit where they differ set, and all remaining bits clear. So 10101 and 10011 yields 10100. Then do the same thing, only flipped, so we keep low-order identical bits and mask out the high ones.

The problem comes from trie implementation. Here’s code that does it the slow way:

```uint64_t high_common_bits_64_slow (uint64_t a, uint64_t b)
{
uint64_t output = 0;
int i;
for (i=63; i>=0; i--) {
} else {
goto out;
}
}
out:
return output;
}

uint64_t low_common_bits_64_slow (uint64_t a, uint64_t b)
{
uint64_t output = 0;
int i;
for (i=0; i<64; i++) {
} else {
goto out;
}
}
out:
return output;
}

```

The problem is to do much better than these. To keep things simple, let’s say we’re interested primarily in speed on newish x86-64 implementations. Choose any programming language, or just write assembly. In a few days I’ll post my solutions in addition to any good ones that show up here before then. Beware WordPress’s mangling of source code in the comments — feel free to just mail me.

1. | December 19, 2011 at 9:53 am | Permalink

I’ll just add that I was bummed not to be able to create branch-free x64 code for these. I’m sure it’s possible.

2. | December 19, 2011 at 10:10 am | Permalink

You would love the first lab assignment from the _Computer Systems: A Programmer’s Perspective_ book. Having TA’d the course several years now, the bit-twiddling assignments almost always result in some very interesting solutions from the students:

http://csapp.cs.cmu.edu/public/labs.html

3. Andrew Hunter | December 19, 2011 at 10:11 am | Permalink

We’ll see how WordPress takes this, but here’s a solution (I think, I haven’t checked all corner cases.);
uint64_t high_common(uint64_t a, uint64_t b) {
uint64_t x = (a ^ b);
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x |= x >> 32;
return (a & ~x) | ((x >> 1) + 1);
}

Fairly long dependency chain, but all simple ops and entirely branch free. I think there might be a faster way to implement the middle section (“set all bits under any currently set bits….”) but I’d have to check the assembly spec.

Also, is this useful for something? Seems like a fairly odd spec to just cut from whole cloth.

4. | December 19, 2011 at 10:11 am | Permalink

Hi Lars, I was an instructor for that course 3 or 4 times and always loved that lab. But too many of the problems have solutions that are easily available on the Internet.

5. | December 19, 2011 at 10:18 am | Permalink

Is there a way to link to a G+ comment? I don’t see a way to do that. But anyway, it was posted by Jan-Willem Maessen.

6. Andrew Hunter | December 19, 2011 at 10:18 am | Permalink

Lars, yeah, I remember liking that lab in undergrad (HMC).

7. | December 19, 2011 at 10:20 am | Permalink

Hi Andrew, your version fails on my tester

8. Mike G. | December 19, 2011 at 10:34 am | Permalink

Offhand, what I’m thinking of is decomposing this into known bit-twiddling problems:
unsigned diffIndex=findHighestBitSet(a^b);
// needed to mask off the top bits. There’s probably a
// more efficient way to do it – too lazy to dig up ]
// “Hacker’s Delight”. The ’1<<diffIndex'
// shift could also be replaced with a table lookup if
// more efficient.

That's without any testing, but I think that would work, and it should certainly be faster than the naive version.

9. Pascal Cuoq | December 19, 2011 at 10:37 am | Permalink

@Andrew

> is this useful for something?

It looks like exactly what a particular big-endian (resp. little-endian) implementation of Patricia trees might need.

One day, I will have the time to measure which is faster between the two sequences below.

x |= x >> 1;
x |= x >> 2;
x |= x >> 4;

or

y = x >> 1;
z = x >> 2;
x |= y;
x |= z;
y = x >> 3;
z = x >> 6;
x |= y;
x |= z;
y = x >> 9;
z = x >> 18;
x |= y;
x |= z;

The latter is longer but has shorter dependency chains.

10. | December 19, 2011 at 10:43 am | Permalink

Hi John,

If you replace Andrew’s last line with “return (a & ~x) | (x & ~(x >> 1));”, does it pass?

11. | December 19, 2011 at 10:44 am | Permalink

Wow, that’s good serendipity; I’ve been thinking of this exact same operation for the last week or so. I use this exact operation in a spatial data structure built around z-order space-filling curves.

(To answer Andrew’s question): If you think of a binary number as a sequence of binary tree decisions on the full range of possible numbers, this bit operation gives you the most specific common ancestor on this tree.

With z-order curves you get a cheap way to compute fairly tight bounding boxes between points in n dimensions.

12. | December 19, 2011 at 10:46 am | Permalink

uint64_t high(uint64_t a, uint64_t b) {
uint64_t db = a ^ b, hdb;
db |= (db >> 1);
db |= (db >> 2);
db |= (db >> 4);
db |= (db >> 8);
db |= (db >> 16);
db |= (db >> 32);
return ((a & b) | (db >> 1)) + 1;
}

uint64_t low(uint64_t a, uint64_t b) {
uint64_t x = (a ^ b) & -(a ^ b);
return (a & (x – 1)) | x;
}

13. | December 19, 2011 at 11:01 am | Permalink

Eddie, your low code looks good but the high one fails in (I think) the same way as Andrew’s.

14. | December 19, 2011 at 11:03 am | Permalink

Tony– it works!

15. | December 19, 2011 at 11:07 am | Permalink

high_common(uint64_t a, uint64_t b)
{
uint64_t x = a ^ b;
int n = 63 – clz(x);
return (a >> n | 1) << n;
}

Fixing this for a == b is left as an exercise.

16. Benjamin Kramer | December 19, 2011 at 11:08 am | Permalink

A version with gcc builtins.

uint64_t high(uint64_t a, uint64_t b) {
uint64_t bit = 1ULL << 63-__builtin_clzll(a^b);
return a == b ? a : (a & ~(bit-1)) | bit;
}

uint64_t low(uint64_t a, uint64_t b) {
uint64_t bit = 1ULL << __builtin_ctzll(a^b);
return a == b ? a : (a & bit-1) | bit;
}

Sadly this is not branchless on x86 because bsf/bsr are undefined for 0. The upcoming BMI instructions lzcnt and tzcnt solve this ussue

17. | December 19, 2011 at 11:11 am | Permalink

BTW, with GCC __builtin_clz() provides the count-leading-zeros instruction if available.

18. Ryan Fox | December 19, 2011 at 11:23 am | Permalink

Darn, I thought I was going to be the only one clever enough to remember about clz.

Anyway, here’s my solution. I think I’m depending on undefined behaviour with shifting UINT32_MAX, so that probably counts as cheating.

uint32_t puzzle(uint32_t a, uint32_t b)
{
uint32_t and = a&b;
uint32_t xor = a^b;

uint32_t result = and & and_mask | 1 << (8*sizeof(result) – leading_zeros – 1);

return result;
}

19. | December 19, 2011 at 11:23 am | Permalink

Is the spec defined when a == b? I.e. high(a,b) == low(a,b) == a == b?

20. | December 19, 2011 at 11:30 am | Permalink

Ryan, any unsigned value x may be safely shifted by n in either direction provided 0 <= n < CHAR_BIT * sizeof(x).

21. | December 19, 2011 at 11:32 am | Permalink

Another variant for high_common() would be Eddie’s low() combined with a bit-reversal instruction if the CPU has one. ARM does, don’t know about x86. I don’t know of any CPU that has bit-reverse but not CLZ though, so this is probably not very useful.

22. | December 19, 2011 at 11:49 am | Permalink

Eddie, that’s what I was assuming.

Carlos, is it correct to just return the argument when a==b? It probably doesn’t matter in practice since you’re not going to call the function unless you have two different tree nodes.

23. Andrew Hunter | December 19, 2011 at 12:02 pm | Permalink

Regehr,

Yeah, I noticed the all-same behavior while walking to work. Serves me right for not doing better testing. (How are you testing these, by the way? Hand-chosen interesting test cases, or one of the tools you’ve mentioned for generating failures? I ran mine through a fuzzer, but that’s not likely to generate any number of “interesting” cases.)

24. | December 19, 2011 at 12:07 pm | Permalink

Andrew, my tests are made by:

- pick a random 64-bit int and call it a
- initialize b to a, then flip 0..63 random bits

I’m not sure if this is good, but clearly just making a and b separately random is wrong.

25. | December 19, 2011 at 12:08 pm | Permalink

Benjamin, these are basically my solutions. Your branch is the same one I failed to eliminate.

26. | December 19, 2011 at 12:10 pm | Permalink

Carlos, hope we can help make your code faster .

27. Andrew Hunter | December 19, 2011 at 12:13 pm | Permalink

Regehr, yeah, that looks better than my method (pick a,b iid uniform.) I’d be curious what would happen if you threw a SAT solver at these functions (searching for a setting of bits where they differ). I’ve heard of good results from (similar) techniques from, for one, the code sketching people at MIT, but I don’t have their infrastructure handy, so…

28. | December 19, 2011 at 12:16 pm | Permalink

Andrew, bit puzzles like this pretty much scream for being attacked by SAT or SMT.

About a dozen times I’ve started on this kind of project before realizing (1) it’s really hard to do a good job, and maybe impossible if there are loops and (2) lots of smart people already work on this kind of thing.

29. MSN | December 19, 2011 at 3:00 pm | Permalink

For the lower bit function, you don’t need to bother with the tree:

uint64_t d= a ^ b;
uint64_t low_d= (d & -d);

30. | December 20, 2011 at 3:06 am | Permalink

An (untestet) asm idea based on the bsf instruction without branches/loops:

Assume a in eax and b in ebx.
andl %eax,%ebx # put a&b in eax
bsfl %ecx,%eax # index of least significant bit in ecx
shr %eax, %cl # shift a&b right by index (cl == lower 8bit of ecx)
shl %eax, %cl # shift left again, so lower bits now cleared
Result in eax now

I’m not sure, if the bsf instruction is more efficient than a table-based findHighestBitSet, though.

31. | December 20, 2011 at 3:27 am | Permalink

Correction, because i misunderstood the puzzle:

Assume a in eax and b in ebx.
xorl %eax,%ebx # put a^b in eax
bsfl %ecx,%eax # index of least significant bit in ecx
dec %ecx # index–
shr %eax, %cl # shift a^b right by index (cl == lower 8bit of ecx)
and %eax, 1 # set differing bit
shl %eax, %cl # shift left again, so lower bits now cleared
Result in eax now

32. | December 20, 2011 at 6:00 am | Permalink

Here are two branchless versions that I think work even for a==b. high2() assumes that __builtin_clzl(0) returns some arbitrary integer value (but does not trap or otherwise explode).

uint64_t high(uint64_t a, uint64_t b) {
uint64_t db = a ^ b;
db |= (db >> 1);
db |= (db >> 2);
db |= (db >> 4);
db |= (db >> 8);
db |= (db >> 16);
db |= (db >> 32);
return ((a & b) | (db >> 1)) + (db > 0);
}

uint64_t high2(uint64_t a, uint64_t b) {
uint64_t x = a ^ b, nz = x > 0;
uint64_t db = (uint64_t) nz << (63 – (__builtin_clzl(x) & 63));
return (a & ~(db – nz)) | db;
}

33. | December 20, 2011 at 7:35 am | Permalink

qznc, bsf is fast on Intel Pentium-2 and later (except Atom), and AMD K10 and later. On Intel Pentium, Intel Atom, and AMD <= K8 it is very slow, perhaps microcoded.

34. | December 20, 2011 at 3:18 pm | Permalink

A bit off-topic, but has anyone found a way to post code (fixed-width font) on Google+?