For 8192 elements I also see their search function beating the STL, but 8192 elements is a tiny array to search; for larger arrays I get the opposite result and the STL is faster. Here is the output for 8 * 1024 * 1024 elements (~8.4 million).
gcc 5.2.0 on Windows x64 (i5-3210M)
stl : 881 milliseconds
version 1: 880 milliseconds
version 2: 1607 milliseconds
version 3: 1260 milliseconds
version 4: 1271 milliseconds
gcc 5.2.0 on Linux x86_64 (i7-4770S)
stl : 629.231 milliseconds
version 1: 629.436 milliseconds
version 2: 897.143 milliseconds
version 3: 862.827 milliseconds
version 4: 863.22 milliseconds
clang 3.6.2 on Linux armv7h (CuBox-i, Cortex-A9)
stl : 3380.29 milliseconds
version 1: 3428.9 milliseconds
version 2: 3433.65 milliseconds
version 3: 3391.86 milliseconds
version 4: 3376.91 milliseconds
Oh, and Visual Studio's cl.exe doesn't have the "/O3" argument they say they are using.
This is fascinating. Do you have any idea what might cause this?
A shot in the dark: their benchmark uses a very simple "random" generation to choose the index to search for, which is actually just a linear scan (modulo the size of the array). Could it be that with the larger array, the generated sequence of test indices happens to play nicely with the CPU's branch predictor? After all, your results show a performance drop-off for the versions that use conditional moves, and the behavior depends suspiciously on the CPU microarchitecture.
There is no mention of the fact that the STL implementation is very generic: it only assumes operator++ and operator* on the iterator, and operator< on the value.
The "optimized" versions here all make more assumptions on the iterator, starting from operator+(int) in Version 1, so it no longer works on iterators with just "forward_iterator_tag". Further versions even restrict vector sizes (albeit to a very high number) and assign -1 to an unsigned integer (size_t). So this is something you can use in your project if you need the performance, but can't put it into GCC.
Just a nitpick: Assigning a negative value to an unsigned integer is perfectly safe and well defined (i.e. integer overflow for unsigned integers is well defined in the C++ spec).
Using -1 is sometimes even the preferred way to assign size_t its maximum value, since there is no cross-platform #define for it (most platforms use either SIZE_T_MAX or SIZE_MAX).
In fact, llvm's libc++ uses it for its std::numeric_limits implementation and for std::string's npos.
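A quick illustration of what I mean (just a throwaway demo, not libc++'s actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>

int main() {
    // Converting -1 to an unsigned type is defined as arithmetic modulo 2^N,
    // so it always yields the largest representable value of that type.
    std::size_t npos = static_cast<std::size_t>(-1);

    std::cout << (npos == std::numeric_limits<std::size_t>::max()) << '\n'; // prints 1
    std::cout << (npos == SIZE_MAX) << '\n';                                // prints 1
}
```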
Uhm... Using "-1" to assign the maximum value is in fact as standard and cross-platform as using std::numeric_limits, which was basically my entire point.
The difference between those is only "textual solution" VS "mathematical solution" and thus a matter of personal preference.
operator+(int) is fine. The libstdc++ code is using std::advance, which will compile down to operator+ for random access iterators (and will be O(n) for forward iterators, but performance is already going to be awful). The vector issue requires care but is fixable (I'd probably test the total length, and branch to a more basic implementation for big ranges). The assigning -1 really is just unfixable.
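For reference, the generic shape is roughly this (a sketch of the classic lower_bound, not the actual libstdc++ source):

```cpp
#include <iterator>

// Sketch of a generic lower_bound: only ++, * on the iterator and < on the
// value are required.  std::distance/std::advance are O(1) for random access
// iterators and O(n) for plain forward iterators.
template <typename ForwardIt, typename T>
ForwardIt generic_lower_bound(ForwardIt first, ForwardIt last, const T& value) {
    auto count = std::distance(first, last);
    while (count > 0) {
        auto step = count / 2;
        ForwardIt mid = first;
        std::advance(mid, step);   // becomes mid = first + step for random access
        if (*mid < value) {
            first = ++mid;         // discard the left half, including mid
            count -= step + 1;
        } else {
            count = step;          // keep searching the left half
        }
    }
    return first;
}
```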
No technical reason; just lack of effort! In fact, a well-optimized implementation of the STL should have a large number of machine-word-specific optimizations (a la std::vector<bool>). However, none that I know of do, other than light unrolling to promote vectorization. About the closest you can find is well-optimized versions of valarray.
Genericity usually comes at the cost of performance. But it seems that we've come to terms with a handful of basic datatypes in our databases, and so I think it makes a great deal of sense to specialize our fancy, theoretically fast, generic algorithms for this handful of datatypes. Also, I don't see how you can "beat" someone in terms of genericity? But I agree, the title could be a bit less abusive of the STL "brand": it could've used the word "performance" and downplayed generics.
Funnily enough, the HTML title behind the link is indeed "Performance Engineering at Realm: How we optimized binary search".
If you take version 4, stop the first while loop once the size is at most 32, and then do linear search from there, it's faster (on my machine). For example, version 4 is 102ms and this tweak puts it at 83ms. (On clang x64.)
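Roughly what I mean, as a sketch (not the article's exact version 4, and the helper name is mine; the cutoff of 32 is the one mentioned above):

```cpp
#include <cstddef>
#include <vector>

// Hybrid lower bound: binary-search with branch-light steps until the
// remaining range is small, then finish with a predictable linear scan.
std::size_t hybrid_lower_bound(const std::vector<int>& v, int value) {
    std::size_t low = 0;
    std::size_t size = v.size();

    // Binary phase: narrow the range while it is still large.
    while (size > 32) {
        std::size_t half = size / 2;
        std::size_t other_half = size - half;
        std::size_t probe = low + half;
        std::size_t other_low = low + other_half;
        int probed = v[probe];
        size = half;
        low = (probed >= value) ? low : other_low;  // typically compiles to a cmov
    }

    // Linear phase: scan the remaining (at most 32) candidates.
    std::size_t end = low + size;
    while (low < end && v[low] < value) ++low;
    return low;
}
```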
For 64-bit size_t it's pretty academic, but no, not correct.
Suppose low is M-3 and high is M-1 (where M is 2^[#bits]); then mid should be M-2. But (M-3) + (M-1) is, modulo M, equal to M-4, and half of that is M/2 - 2, which is a lot less than the correct M-2. (Pretend it's a 2-bit unsigned int, so M is 4.)
The correct 'mid' computation is low + (high - low)/2.
So the article isn't wrong, but it's misleading you into thinking this should be correct for the wrong reasons.
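The same wrap-around is easy to see with a small unsigned type standing in for size_t (a toy demo, not code from the article):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // Use uint8_t so M = 2^8 = 256 and the wrap-around is easy to observe.
    std::uint8_t low = 253, high = 255;  // M-3 and M-1; correct mid is 254 (M-2)

    std::uint8_t bad  = static_cast<std::uint8_t>(low + high) / 2;  // (508 mod 256)/2 = 126, i.e. M/2 - 2
    std::uint8_t good = low + (high - low) / 2;                     // 254

    std::cout << int(bad) << ' ' << int(good) << '\n';  // prints "126 254"
}
```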
They are using signed integers as their indices, which means that the sign bit is always 0. Thus the addition after casting to unsigned will never overflow, and you can divide by two (shift right by 1) and then recast to a signed integer, no harm no foul.
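In C-ish terms the trick looks roughly like this (a sketch; it assumes the usual case where unsigned can hold twice INT_MAX, which is exactly the caveat raised below):

```cpp
#include <cassert>

// Midpoint of two non-negative signed indices: add in unsigned arithmetic
// (with both operands <= INT_MAX the sum fits in unsigned on common
// platforms), shift right by one, and cast back to signed.
int mid_via_unsigned(int low, int high) {
    assert(low >= 0 && high >= 0);
    unsigned sum = static_cast<unsigned>(low) + static_cast<unsigned>(high);
    return static_cast<int>(sum >> 1);
}
```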
Sure, and if you expect your software to run on such a platform with any degree of confidence, then you're right to consider it. Even better, tell us about a conforming implementation that you've used in production recently that has UINT_MAX == INT_MAX.
Also, I'm not trying to mislead people into thinking that it's a good way to implement this. The confounding bit from the article is that they started in Java and ended up in C. If you were indexing with signed ints in C, C++, or any language that has unsigned integers, then you already have a bug with or without the bad mean check.
You got it backwards there. Only if you know your implementation and plan to code only for it, can you even start to consider bending the C Standard, and not the other way around.
They didn't say unsigned overflow was undefined, but you're right that the offset won't be correct. It seemed intuitively wrong to me, but I figured Google research is probably a reliable source!
True, but only if the list has more than 2^31 entries on a 32-bit machine.
In practice that's not possible if it contains 32-bit integers, because that would take up an 8 GB linear address space, which doesn't exist on a 32-bit machine.
An academically correct version would be `size_t probe = low + ((high - low) / 2);` but that would be much slower.
You could instead just add an initial debug-mode-only assert on the list length, for correctness.
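Something like this, as a sketch (the assert and the function name are mine, not the article's):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

std::size_t lower_bound_index(const std::vector<int>& v, int value) {
    // Debug-only guard: with at most SIZE_MAX / 2 elements, low + high can
    // never wrap, so the cheap (low + high) / 2 below stays correct.
    assert(v.size() <= SIZE_MAX / 2);

    std::size_t low = 0, high = v.size();
    while (low < high) {
        std::size_t mid = (low + high) / 2;
        if (v[mid] < value) low = mid + 1;
        else high = mid;
    }
    return low;
}
```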
This is nice, but what are the use cases for such an optimization? My understanding is that such an optimization will not matter anyway compared to all the I/O and other expensive operations. And if a program is really too slow, then the true optimization is to scale with threads or distributed computing.
The most educational bit for me was the careful structuring and eventual elimination of the if/else to shake out a conditional move rather than an unpredictable branch.
Modern optimizers and CPU scheduling engines are so powerful, a lot of received wisdom on how to code for speed is outdated; manual loop unrolling, for example, is rarely very beneficial. It's nice to see that there's still some room for craftsmanship in the most critical of paths. Structuring loops for autovectorization is another useful habit to get into.
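On that last habit, a tiny generic example (nothing to do with the article's code): keeping the loop body free of early exits and data-dependent control flow is usually what lets the compiler vectorize it.

```cpp
#include <cstddef>
#include <vector>

// Counts elements below a threshold.  The ternary turns into a compare plus
// masked add, so gcc/clang can typically auto-vectorize this loop; an
// equivalent loop with an early break or branchy control flow usually cannot.
std::size_t count_below(const std::vector<int>& v, int threshold) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        count += (v[i] < threshold) ? 1 : 0;
    return count;
}
```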
If you're going to miss when you go back to the array, you're going to pay that cost regardless. But without the mispredict, you'll be able to issue the load that much sooner.
I'm not familiar enough with Intel's architecture to know one way or the other, but it wouldn't surprise me if not mispredicting the next memory access saves you more than just some pipeline flushing: The CPU could speculatively issue the load you don't actually need, wasting resources that could be used to service the correct addresses.
The conditional move is the statement `low = v >= value ? low : other_low;` which either assigns `other_low` to `low` or does nothing.
And other_low is a variable, not an arbitrary element of the list (which is a big difference with respect to cache), and it can also be seen from the assembly that it's stored in a register. So there is no "both cache lines" to fetch anything from.
I think it's at the very least been hinted at below (and there are other good points), but note that their implementation makes copies. If the copy ctor of type T is the least bit expensive, the STL (implementation in glibc) for this same benchmark is far faster than theirs. Try it. :)
One of the problems with the complexity of submitting code to GCC, and FSF projects in general (having to complete copyright assignments, which are handled very slowly), is that this is unfortunately unlikely to end up in libstdc++ (although I would be happy to see it there).