How We Beat C++ STL Binary Search (realm.io)
95 points by _bkgg on Aug 11, 2015 | 43 comments



For 8192 elements I also get the result that their search function is faster than the STL. But 8192 elements is a tiny array to search; for larger arrays I get the opposite - the STL is faster. Here is the output for 8 * 1024 * 1024 elements (~8.4 million).

gcc 5.2.0 on Windows x64 (i5-3210M)

    stl      : 881 milliseconds
    version 1: 880 milliseconds
    version 2: 1607 milliseconds
    version 3: 1260 milliseconds
    version 4: 1271 milliseconds
gcc 5.2.0 on Linux x86_64 (i7-4770S)

    stl      : 629.231 milliseconds
    version 1: 629.436 milliseconds
    version 2: 897.143 milliseconds
    version 3: 862.827 milliseconds
    version 4: 863.22 milliseconds
clang 3.6.2 on Linux armv7h (CuBox-i, Cortex-A9)

    stl      : 3380.29 milliseconds
    version 1: 3428.9 milliseconds
    version 2: 3433.65 milliseconds
    version 3: 3391.86 milliseconds
    version 4: 3376.91 milliseconds
Oh, and Visual Studio's cl.exe doesn't have the "/O3" flag they say they are using.


This is fascinating. Do you have any idea what might cause this?

A shot in the dark: Their benchmarking uses a very simple "random" generation to choose the index to search for, which is actually just a linear scan (modulo the size of the array). Could it be that with the larger array, the generated sequence of test indices happens to work nicely with the CPU's branch prediction? After all, your results show a performance drop-off for the versions that use conditional moves, and the behavior depends suspiciously on the CPU microarchitecture.


There is no mention of the fact that the STL implementation is very generic: it only assumes operator++ and operator* on the iterator, and operator< on the value.

The "optimized" versions here all make more assumptions on the iterator, starting from operator+(int) in Version 1, so it no longer works on iterators with just "forward_iterator_tag". Further versions even restrict vector sizes (albeit to a very high number) and assign -1 to an unsigned integer (size_t). So this is something you can use in your project if you need the performance, but can't put it into GCC.


Just a nitpick: Assigning a negative value to an unsigned integer is perfectly safe and well defined (i.e. integer overflow for unsigned integers is well defined in the C++ spec).

Using -1 for size_t is sometimes even preferred for assigning its maximum value, since there is no cross-platform #define for it (most platforms use either SIZE_T_MAX or SIZE_MAX).

In fact LLVM's libc++ uses it for its std::numeric_limits implementation and for std::string's npos.
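
For what it's worth, a minimal sketch of the equivalence being discussed (C++11; the names here are mine, not from the article):

    #include <cassert>
    #include <cstddef>
    #include <cstdint>   // SIZE_MAX
    #include <limits>

    int main()
    {
        size_t a = -1;                                  // -1 converts modulo 2^N: well defined
        size_t b = std::numeric_limits<size_t>::max();  // the "textual" spelling
        size_t c = SIZE_MAX;                            // the <cstdint> macro
        assert(a == b && b == c);
    }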


I believe there is a standard cross-platform maximum value, and you mention it in your last sentence: std::numeric_limits!


Uhm... Using "-1" to assign the maximum value is in fact as standard and cross-platform as using std::numeric_limits, which was basically my entire point.

The difference between those is only "textual solution" VS "mathematical solution" and thus a matter of personal preference.


~0u ?


operator+(int) is fine. The libstdc++ code is using std::advance, which will compile down to operator+ for random access iterators (and will be O(n) for forward iterators, but performance is already going to be awful there). The vector issue requires care but is fixable (I'd probably test the total length, and branch to a more basic implementation for big ranges). Assigning -1 really is just unfixable.
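
A minimal sketch of that dispatch (my own example, not the libstdc++ source):

    #include <iterator>
    #include <list>
    #include <vector>

    int main()
    {
        std::vector<int> v(100);
        std::list<int>   l(100);

        auto vi = v.begin();
        auto li = l.begin();

        std::advance(vi, 50);  // random access iterator: effectively vi += 50, O(1)
        std::advance(li, 50);  // forward/bidirectional iterator: 50 increments, O(n)
    }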


Is there any technical reason why the STL implementation can't include optimized specializations when more specific operators are available?


No technical reason; just lack of effort! In fact, a well-optimized implementation of the STL should have a large number of machine-word-specific optimizations (a la std::vector<bool>). However, none that I know of do, other than light unrolling to promote vectorization. About the closest you can find is well-optimized versions of valarray.


It does. That's what std::advance is. For a random access iterator it will use operator+.


Genericity usually comes at the cost of performance. But it seems that we've come to terms with a handful of basic datatypes in our databases, and so I think it makes a great deal of sense to specialize our fancy, theoretically fast, generic algorithms for this handful of datatypes. Also, I don't see how you can "beat" someone in terms of genericity? But I agree, the title could be a bit less abusive of the STL "brand": it could've used the word "performance" and downplayed generics.

Funnily enough, the HTML title behind the link is indeed "Performance Engineering at Realm: How we optimized binary search".


Loosely related: STL binary search suffers from the same weakness as several other comparator-based STL algorithms for certain classes of expensive comparators: https://www.reddit.com/r/cpp/comments/36sqtq/more_efficient_...


If you take version 4, stop the first while loop once the size is at most 32, and then do linear search from there, it's faster (on my machine). For example, version 4 is 102ms and this tweak puts it at 83ms. (On clang x64.)
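
Roughly this shape, I assume (a sketch of the idea with my own names, not the actual tweaked code):

    #include <cstddef>
    #include <vector>

    // Binary search until at most 32 elements remain, then finish linearly.
    template <class T>
    size_t hybrid_upper_bound(const std::vector<T>& vec, const T& value)
    {
        size_t low = 0;
        size_t size = vec.size();
        // Invariant: the answer lies in [low, low + size].
        while (size > 32) {
            size_t half = size / 2;
            size_t probe = low + half;
            if (vec[probe] <= value) {
                low = probe + 1;
                size -= half + 1;
            } else {
                size = half;
            }
        }
        // Linear finish over the remaining (at most 32) elements.
        while (size > 0 && vec[low] <= value) {
            ++low;
            --size;
        }
        return low;
    }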


Isn't version 2 wrong, because

    size_t probe = (low + high) / 2;
may overflow?


Not if low and high are both unsigned (size_t is unsigned). Even if it overflows, the result will be correct. See: http://googleresearch.blogspot.com/2006/06/extra-extra-read-...


For 64-bit size_t it's pretty academic, but no, not correct.

Suppose low is M-3 and high is M-1 (where M is 2^[#bits]); then mid should be M-2. But (M-3 + M-1) is, modulo M, equal to M-4, and half that is M/2 - 2, which is a lot less than the correct M-2. (Pretend it's 2-bit unsigned ints, so M is 4.)

The correct 'mid' computation is low + (high - low)/2.
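
A small concrete check of that wrap with 8-bit values, M = 256 (the cast only emulates the size_t wraparound, since uint8_t operands promote to int in C++):

    #include <cassert>
    #include <cstdint>

    int main()
    {
        uint8_t low = 253, high = 255;              // the true mid is 254
        uint8_t bad  = (uint8_t)(low + high) / 2;   // (253 + 255) mod 256 = 252, half is 126
        uint8_t good = low + (high - low) / 2;      // 253 + 2/2 = 254
        assert(bad == 126 && good == 254);
    }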


That blog is wrong. The program won't be undefined (unsigned wrap is defined), but the offset will not be correct.


So the article isn't wrong, but it's misleading you into thinking this should be correct for the wrong reasons.

They are using signed integers as their indices, which means that the sign bit is always 0. Thus the addition after casting to unsigned will never overflow, and you can divide by two (shift by 1) and then recast to a signed integer, no harm no foul.
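
Concretely, the trick being described looks like this in C (a sketch of my understanding, not a quote from the blog):

    /* low and high are nonnegative ints, so their sum fits in unsigned int
       (this assumes UINT_MAX > INT_MAX, which is the usual case; see the caveat below). */
    int mid = (int)(((unsigned)low + (unsigned)high) >> 1);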


Actually it is misleading of them (and you) to assume that in C, an unsigned int can represent values larger than the largest signed int value.

In other words, C allows that UINT_MAX == INT_MAX, in which case you will overflow.

If they made that assumption, they should explicitly mention it, but they didn't.

> Update 17 Feb 2008:... ...Now that we've made this change, we know that the program is correct;)

It seems the article is aware of the irony. Another update would be in order.


Sure, and if you expect your software to run on such a platform with any degree of confidence then you're right to consider it. Even better, tell us about a conforming implementation that you've used in production recently that has UINT_MAX == INT_MAX.

Also I'm not trying to mislead people into thinking that it's a good way to implement this. The confounding bit from the article is that they started in Java and ended up in C. If you were indexing with signed ints in C, C++, or any language that has unsigned integers, then you already have a bug with or without the bad midpoint calculation.


You got it backwards there. Only if you know your implementation and plan to code only for it, can you even start to consider bending the C Standard, and not the other way around.


They didn't say unsigned overflow was undefined, but you're right that the offset won't be correct. It seemed intuitively wrong to me, but I figured Google Research is probably a reliable source!


That's not true. Unsigned wrap is well-defined and wraps around to 0. Signed wrap is undefined.


It is not true that unsigned wrap is defined? Did you even read my comment? Here is the relevant part: "unsigned wrap is defined".


True, but only if the list has more than 2^31 entries on a 32-bit machine.

In practice that's not possible if it contains 32-bit integers, because that would take up an 8 GB linear address space, which doesn't exist on a 32-bit machine.

An academically correct version would be `size_t probe = low + ((high - low) / 2);` but that would be much slower.

You could instead just add an initial debug-mode-only assert on the list length, for correctness.


Sun had to put out a JDK bug fix for this nine years ago ( http://googleresearch.blogspot.com/2006/06/extra-extra-read-... ).



My version:

  template <class T> INLINE size_t fast_upper_bound5(const vector<T>& vec, T value)
  {
      size_t index = 0;
      size_t size = vec.size();
      while (size > 0) {
          size /= 2;
          size_t probe = index + size;
          if (vec[probe] <= value)
              index = probe + 1;
      }
      return index;
  }
Slightly faster with gcc:

  $ g++-mp-4.9 -std=c++11 -O3 -DNDEBUG -o blog blog.cpp 
  $ ./blog 
  size = 8192:
      stl      : 144.883 milliseconds
      version 1: 145.406 milliseconds
      version 2: 129.713 milliseconds
      version 3: 109.231 milliseconds
      version 4: 103.578 milliseconds
      version 5: 102.282 milliseconds
But it's worse with clang, apparently because of the described problem of not using the cmov instruction.

  $ clang++-mp-3.6 -std=c++11 -O3 -DNDEBUG -o blog blog.cpp
  $ ./blog 
  size = 8192:
      stl      : 147.466 milliseconds
      version 1: 145.978 milliseconds
      version 2: 145.547 milliseconds
      version 3: 113.546 milliseconds
      version 4: 106.968 milliseconds
      version 5: 144.231 milliseconds


It should be noted that their version gives different results from std::upper_bound if your range contains duplicates.


This is nice, but what are the use cases for such an optimization? My understanding is that such an optimization will not matter anyway compared to all the I/O and other expensive operations. And if a program is really too slow, then the true optimization is to scale with threads or distributed computing.


Great link.

The most educational bit for me was the careful structuring and eventual elimination of the if/else to shake out a conditional move rather than an unpredictable branch.

Modern optimizers and CPU scheduling engines are so powerful, a lot of received wisdom on how to code for speed is outdated; manual loop unrolling, for example, is rarely very beneficial. It's nice to see that there's still some room for craftsmanship in the most critical of paths. Structuring loops for autovectorization is another useful habit to get into.


Wouldn't a conditional move transfer the cost from a branch mispredict to a cache miss?

I wonder how realistic their tests were, and if the same results would be achieved with some cache thrashing.


If you're going to miss when you go back to the array, you're going to pay that cost regardless. But without the mispredict, you'll be able to issue the load that much sooner.

I'm not familiar enough with Intel's architecture to know one way or the other, but it wouldn't surprise me if not mispredicting the next memory access saves you more than just some pipeline flushing: The CPU could speculatively issue the load you don't actually need, wasting resources that could be used to service the correct addresses.

Or did you mean something else?


Well a branch will fetch the incorrect cache line if it mispredicts, but a conditional move will fetch both cache lines every time.


The conditional move is the statement `low = v >= value ? low : other_low;` which either assigns `other_low` to `low` or does nothing.

And other_low is a variable, and not an arbitrary element of the list (which is a big difference with respect to cache), and it can also be seen from the assembly that it's stored in a register. So there is no "both cache lines" to fetch anything from.
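
For concreteness, the loop shape I believe is being described (a sketch built around the quoted line, not the article's exact code):

    #include <cstddef>
    #include <vector>

    // lower_bound-style search where the per-iteration update is the quoted
    // conditional move: other_low is plain register arithmetic, and the only
    // memory access is the single load into v.
    template <class T>
    size_t cmov_lower_bound(const std::vector<T>& vec, const T& value)
    {
        size_t low = 0;
        size_t size = vec.size();
        while (size > 0) {
            size_t half = size / 2;
            size_t other_low = low + (size - half);  // candidate "low" if we move right
            T v = vec[low + half];                   // the one load per iteration
            size = half;
            low = v >= value ? low : other_low;      // register select -> cmov, no branch
        }
        return low;
    }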


The conditional move still depends on both values taken from the array, via other variables.


I think it's at the very least been hinted at below (and there are other good points), but note that their implementation makes copies. If the copy ctor of type T is the least bit expensive, the STL (the libstdc++ implementation) for this same benchmark is far faster than theirs. Try it. :)


One of the problems with the extreme complexity of submitting code to GCC, and FSF projects in general (having to complete copyright assignments, which are handled very slowly), is that this is unfortunately unlikely to end up in libstdc++ (although I would be happy to see it there).


I think the fact that this violates the standard would be the greatest difficulty.


[deleted]


Because binary search on a sorted array is nearly always faster than std::map due to cache effects.


This question isn't even relevant to the article, which discusses search, not insert.


Paul has been doing great work here: http://repnop.org/pd/slides/bsearch.pdf



