Welcome toVigges Developer Community-Open, Learning,Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.1k views
in Technique[技术] by (71.8m points)

algorithm - About the branchless binary search

I asked a question about reducing the miss prediction.

Jerry Coffin give me an impressive answer.

About reducing the branch miss prediciton

The binary search is branchless, but when I use it in my set intersection algorithm, I found it much slower than the original binary search. What's the reason?

Update:

I use the following event to test number of branch miss prediction of i7 processor: BR_MISS_PRED_RETIRED. I found the branchless version is about half of the branch miss than the original one.

For cache miss: I use LLC_MISSES to test the number of last level cache misses, also half.

But the time is about 2.5 times than the original one.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The problem with the conditional move (branchless) search occurs when then arrays are large and the memory access time is large relative to a branch misprediction.

A conditional move search is something like:

int needle; // value we are searching for
int *base = ...; // base pointer
int n; // number of elements in the current region
while (n > 1) {
  int middle = n/2;
  base += (needle < *base[middle]) ? 0 : middle;
  n -= middle;
}

Note that we conditionally update base without using a branch (at least assuming the compiler doesn't decide to implement the ternary operator as a branch). The problem is that the value of base in each iteration is data-dependent on the result of the comparison in the previous iteration, and so accesses to memory occur one at a time, serialized via a the data dependency.

For a search over large array, this removes the possibility of memory-level parallelism and your search takes something like log2(N) * average_access_time. A branch-based search has no such data dependency: it has only a speculated control dependency between iterations: the CPU picks a direction and goes with it. If it guesses right, you'll be loading the result from the current iteration and the next at the same time! It doesn't end there: the speculation continues and you might have have a dozen loads in flight at once.

Of course, the CPU doesn't always guess right! In the worst case, if the branches are totally unpredictable (your data and needle value don't have kind of bias), it will be wrong half the time. Still, that means that on average it will sustain 0.5 + 0.25 + 0.125 + ... = ~1 additional accesses in flight beyond the current one. This isn't just theoretical: try a binary search over random data and you'll probably see the 2x speedup for branch-based over the branchless search, due to double the parallelism.

For many data sets the branch direction isn't entirely random, so you can see more than 2x speedup, as in your case.

The situation is reversed for small arrays that fit in cache. The branchless search still has this same "serial dependency" problem, but the load latency is small: a handful of cycles. The branch-based search, on the other hand suffers constant mispredictions, which cost on the order of ~20 cycles, so branchless usually ends up faster in this case.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to Vigges Developer Community for programmer and developer-Open, Learning and Share
...