How We Beat C++ STL Binary Search (realm.io)
95 points by _bkgg on Aug 11, 2015 | 43 comments



For 8192 elements I also get the result that their search function is faster than the STL. But 8192 elements is a tiny array to search; for larger arrays I get the opposite - the STL is faster. Here is the output for 8 * 1024 * 1024 elements (~8.4 million).

gcc 5.2.0 on Windows x64 (i5-3210M)

    stl      : 881 milliseconds
    version 1: 880 milliseconds
    version 2: 1607 milliseconds
    version 3: 1260 milliseconds
    version 4: 1271 milliseconds
gcc 5.2.0 on Linux x86_64 (i7-4770S)

    stl      : 629.231 milliseconds
    version 1: 629.436 milliseconds
    version 2: 897.143 milliseconds
    version 3: 862.827 milliseconds
    version 4: 863.22 milliseconds
clang 3.6.2 on Linux armv7h (CuBox-i, Cortex-A9)

    stl      : 3380.29 milliseconds
    version 1: 3428.9 milliseconds
    version 2: 3433.65 milliseconds
    version 3: 3391.86 milliseconds
    version 4: 3376.91 milliseconds
Oh, and Visual Studio's cl.exe doesn't have the "/O3" flag they say they are using.


This is fascinating. Do you have any idea what might cause this?

A shot in the dark: Their benchmarking uses a very simple "random" generation to choose the index to search for, which is actually just a linear scan (modulo the size of the array). Could it be that with the larger array, the generated sequence of test indices happens to work nicely with the CPU's branch prediction? After all, your results show a performance drop-off for the versions that use conditional moves, and the behavior depends suspiciously on the CPU microarchitecture.


There is no mention of the fact that the STL implementation is very generic: it only assumes operator++ and operator* on the iterator, and operator< on the value.

The "optimized" versions here all make more assumptions on the iterator, starting from operator+(int) in Version 1, so it no longer works on iterators with just "forward_iterator_tag". Further versions even restrict vector sizes (albeit to a very high number) and assign -1 to an unsigned integer (size_t). So this is something you can use in your project if you need the performance, but can't put it into GCC.


Just a nitpick: Assigning a negative value to an unsigned integer is perfectly safe and well defined (i.e. integer overflow for unsigned integers is well defined in the C++ spec).

Using -1 for size_t is sometimes even preferred for assigning its maximum value, since there is no cross-platform #define for it (most platforms use either SIZE_T_MAX or SIZE_MAX).

In fact LLVM's libc++ uses it for its std::numeric_limits implementation and for std::string's npos.
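
For what it's worth, a minimal sketch of the equivalence being discussed (C++11; the names here are mine, not from the article):

    #include <cassert>
    #include <cstddef>
    #include <cstdint>   // SIZE_MAX
    #include <limits>

    int main()
    {
        size_t a = -1;                                  // -1 converts modulo 2^N: well defined
        size_t b = std::numeric_limits<size_t>::max();  // the "textual" spelling
        size_t c = SIZE_MAX;                            // the <cstdint> macro
        assert(a == b && b == c);
    }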


I believe there is a standard cross-platform maximum value, and you mention it in your last sentence: std::numeric_limits!


Uhm... Using "-1" to assign the maximum value is in fact as standard and cross-platform as using std::numeric_limits, which was basically my entire point.

The difference between those is only "textual solution" VS "mathematical solution" and thus a matter of personal preference.


~0u ?


operator+(int) is fine. The libstdc++ code is using std::advance, which will compile down to operator+ for random access iterators (and will be O(n) for forward iterators, but performance is already going to be awful there). The vector issue requires care but is fixable (I'd probably test the total length, and branch to a more basic implementation for big ranges). Assigning -1 really is just unfixable.
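
A minimal sketch of that dispatch (my own example, not the libstdc++ source):

    #include <iterator>
    #include <list>
    #include <vector>

    int main()
    {
        std::vector<int> v(100);
        std::list<int>   l(100);

        auto vi = v.begin();
        auto li = l.begin();

        std::advance(vi, 50);  // random access iterator: effectively vi += 50, O(1)
        std::advance(li, 50);  // forward/bidirectional iterator: 50 increments, O(n)
    }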


Is there any technical reason why the STL implementation can't include optimized specializations when more specific operators are available?


No technical reason; just lack of effort! In fact, a well-optimized implementation of the STL should have a large number of machine-word-specific optimizations (a la std::vector<bool>). However, none that I know of do, other than light unrolling to promote vectorization. About the closest you can find is well-optimized versions of valarray.


It does. That's what std::advance is. For a random access iterator it will use operator+.


Genericity usually comes at the cost of performance. But it seems that we've come to terms with a handful of basic datatypes in our databases, and so I think it makes a great deal of sense to specialize our fancy, theoretically fast, generic algorithms for this handful of datatypes. Also, I don't see how you can "beat" someone in terms of genericity? But I agree, the title could be a bit less abusive of the STL "brand": it could've used the word "performance" and downplayed generics.

Funnily enough, the HTML title behind the link is indeed "Performance Engineering at Realm: How we optimized binary search".


Loosely related: STL binary search suffers from the same weakness as several other comparator-based STL algorithms for certain classes of expensive comparators: https://www.reddit.com/r/cpp/comments/36sqtq/more_efficient_...


If you take version 4, stop the first while loop once the size is at most 32, and then do linear search from there, it's faster (on my machine). For example, version 4 is 102ms and this tweak puts it at 83ms. (On clang x64.)
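
Roughly this shape, I assume (a sketch of the idea with my own names, not the actual tweaked code):

    #include <cstddef>
    #include <vector>

    // Binary search until at most 32 elements remain, then finish linearly.
    template <class T>
    size_t hybrid_upper_bound(const std::vector<T>& vec, const T& value)
    {
        size_t low = 0;
        size_t size = vec.size();
        // Invariant: the answer lies in [low, low + size].
        while (size > 32) {
            size_t half = size / 2;
            size_t probe = low + half;
            if (vec[probe] <= value) {
                low = probe + 1;
                size -= half + 1;
            } else {
                size = half;
            }
        }
        // Linear finish over the remaining (at most 32) elements.
        while (size > 0 && vec[low] <= value) {
            ++low;
            --size;
        }
        return low;
    }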


Isn't version 2 wrong, because

    size_t probe = (low + high) / 2;
may overflow?


Not if low and high are both unsigned (size_t is unsigned). Even if it overflows, the result will be correct. See: http://googleresearch.blogspot.com/2006/06/extra-extra-read-...


For 64-bit size_t it's pretty academic, but no, not correct.

Suppose low is M-3 and high is M-1 (where M is 2^[#bits]); then mid should be M-2. But (M-3 + M-1) is, modulo M, equal to M-4, and half that is M/2 - 2, which is a lot less than the correct M-2. (Pretend it's 2-bit unsigned ints, so M is 4.)

The correct 'mid' computation is low + (high - low)/2.
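
A small concrete check of that wrap with 8-bit values, M = 256 (the cast only emulates the size_t wraparound, since uint8_t operands promote to int in C++):

    #include <cassert>
    #include <cstdint>

    int main()
    {
        uint8_t low = 253, high = 255;              // the true mid is 254
        uint8_t bad  = (uint8_t)(low + high) / 2;   // (253 + 255) mod 256 = 252, half is 126
        uint8_t good = low + (high - low) / 2;      // 253 + 2/2 = 254
        assert(bad == 126 && good == 254);
    }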


That blog is wrong. The program won't be undefined (unsigned wrap is defined), but the offset will not be correct.


So the article isn't wrong, but it's misleading you into thinking this should be correct for the wrong reasons.

They are using signed integers as their indices, which means that the sign bit is always 0. Thus the addition after casting to unsigned will never overflow, and you can divide by two (shift by 1) and then recast to a signed integer, no harm no foul.
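
Concretely, the trick being described looks like this in C (a sketch of my understanding, not a quote from the blog):

    /* low and high are nonnegative ints, so their sum fits in unsigned int
       (this assumes UINT_MAX > INT_MAX, which is the usual case; see the caveat below). */
    int mid = (int)(((unsigned)low + (unsigned)high) >> 1);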


Actually it is misleading of them (and you) to assume that in C, an unsigned int can represent values larger than the largest signed int value.

In other words, C allows that UINT_MAX == INT_MAX, in which case you will overflow.

If they made that assumption, they should explicitly mention it, but they didn't.

> Update 17 Feb 2008:... ...Now that we've made this change, we know that the program is correct;)

It seems the article is aware of the irony. Another update would be in order.


Sure, and if you expect your software to run on such a platform with any degree of confidence then you're right to consider it. Even better, tell us about a conforming implementation that you've used in production recently that has UINT_MAX == INT_MAX.

Also I'm not trying to mislead people into thinking that it's a good way to implement this. The confounding bit from the article is that they started in Java and ended up in C. If you were indexing with signed ints in C, C++, or any language that has unsigned integers, then you already have a bug with or without the bad midpoint calculation.


You got it backwards there. Only if you know your implementation and plan to code only for it, can you even start to consider bending the C Standard, and not the other way around.


They didn't say unsigned overflow was undefined, but you're right that the offset won't be correct. It seemed intuitively wrong to me, but I figured Google Research is probably a reliable source!


That's not true. Unsigned wrap is well-defined and wraps around to 0. Signed wrap is undefined.


It is not true that unsigned wrap is defined? Did you even read my comment? Here is the relevant part: "unsigned wrap is defined".


True, but only if the list has more than 2^31 entries on a 32-bit machine.

In practice that's not possible if it contains 32-bit integers, because that would take up an 8 GB linear address space, which doesn't exist on a 32-bit machine.

An academically correct version would be `size_t probe = low + ((high - low) / 2);` but that would be much slower.

You could instead just add an initial debug-mode-only assert on the list length, for correctness.


Sun had to put out a JDK bug fix for this nine years ago ( http://googleresearch.blogspot.com/2006/06/extra-extra-read-... ).



My version:

  template <class T> INLINE size_t fast_upper_bound5(const vector<T>& vec, T value)
  {
      size_t index = 0;
      size_t size = vec.size();
      while (size > 0) {
          size /= 2;
          size_t probe = index + size;
          if (vec[probe] <= value)
              index = probe + 1;
      }
      return index;
  }
Slightly faster with gcc:

  $ g++-mp-4.9 -std=c++11 -O3 -DNDEBUG -o blog blog.cpp 
  $ ./blog 
  size = 8192:
      stl      : 144.883 milliseconds
      version 1: 145.406 milliseconds
      version 2: 129.713 milliseconds
      version 3: 109.231 milliseconds
      version 4: 103.578 milliseconds
      version 5: 102.282 milliseconds
But it's worse with clang, apparently because of the described problem of not using the cmov instruction.

  $ clang++-mp-3.6 -std=c++11 -O3 -DNDEBUG -o blog blog.cpp
  $ ./blog 
  size = 8192:
      stl      : 147.466 milliseconds
      version 1: 145.978 milliseconds
      version 2: 145.547 milliseconds
      version 3: 113.546 milliseconds
      version 4: 106.968 milliseconds
      version 5: 144.231 milliseconds


It should be noted that their version gives different results from std::upper_bound if your range contains duplicates.


This is nice, but what are the use cases for such an optimization? My understanding is that such an optimization will not matter anyway compared to all the I/O and other expensive operations. And if a program is really too slow, then the true optimization is to scale with threads or distributed computing.


Great link.

The most educational bit for me was the careful structuring and eventual elimination of the if/else to shake out a conditional move rather than an unpredictable branch.

Modern optimizers and CPU scheduling engines are so powerful, a lot of received wisdom on how to code for speed is outdated; manual loop unrolling, for example, is rarely very beneficial. It's nice to see that there's still some room for craftsmanship in the most critical of paths. Structuring loops for autovectorization is another useful habit to get into.


Wouldn't a conditional move transfer the cost from a branch mispredict to a cache miss?

I wonder how realistic their tests were, and if the same results would be achieved with some cache thrashing.


If you're going to miss when you go back to the array, you're going to pay that cost regardless. But without the mispredict, you'll be able to issue the load that much sooner.

I'm not familiar enough with Intel's architecture to know one way or the other, but it wouldn't surprise me if not mispredicting the next memory access saves you more than just some pipeline flushing: The CPU could speculatively issue the load you don't actually need, wasting resources that could be used to service the correct addresses.

Or did you mean something else?


Well a branch will fetch the incorrect cache line if it mispredicts, but a conditional move will fetch both cache lines every time.


The conditional move is the statement `low = v >= value ? low : other_low;` which either assigns `other_low` to `low` or does nothing.

And other_low is a variable, and not an arbitrary element of the list (which is a big difference with respect to cache), and it can also be seen from the assembly that it's stored in a register. So there is no "both cache lines" to fetch anything from.
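
For concreteness, the loop shape I believe is being described (a sketch built around the quoted line, not the article's exact code):

    #include <cstddef>
    #include <vector>

    // lower_bound-style search where the per-iteration update is the quoted
    // conditional move: other_low is plain register arithmetic, and the only
    // memory access is the single load into v.
    template <class T>
    size_t cmov_lower_bound(const std::vector<T>& vec, const T& value)
    {
        size_t low = 0;
        size_t size = vec.size();
        while (size > 0) {
            size_t half = size / 2;
            size_t other_low = low + (size - half);  // candidate "low" if we move right
            T v = vec[low + half];                   // the one load per iteration
            size = half;
            low = v >= value ? low : other_low;      // register select -> cmov, no branch
        }
        return low;
    }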


The conditional move still depends on both values taken from the array, via other variables.


I think it's at the very least been hinted at below (and there are other good points), but note that their implementation makes copies. If the copy ctor of type T is the least bit expensive, the STL (the libstdc++ implementation) for this same benchmark is far faster than theirs. Try it. :)


One of the problems with the extreme complexity of submitting code to GCC, and FSF projects in general (having to complete copyright assignments, which are handled very slowly), is that this is unfortunately unlikely to end up in libstdc++ (although I would be happy to see it there).


I think the fact that this violates the standard would be the greatest difficulty.


[deleted]


Because binary search on a sorted array is nearly always faster than std::map due to cache effects.


This question isn't even relevant to the article, which discusses search, not insert.


Paul has been doing great work here: http://repnop.org/pd/slides/bsearch.pdf



