A big extra cost of virtual functions at the CPU level that isn't mentioned in the article: they effectively make the branch target dependent on a pointer chase. Put another way:
1) The virtual function address lookup requires a load from an address which was itself just loaded: first the vtable pointer out of the object, then the function pointer out of the vtable. If neither location is cached, this has the unavoidable latency of two serialized uncached memory accesses. Even at best, it costs two dependent L1 accesses, which is about 8-16 cycles on modern architectures.
2) The function call itself is dependent on the final address loaded above; none of it can proceed until the branch target is known. If the predictor has a cached target, all is good and the core speculatively executes a large number of instructions along the predicted path. Best case, the core may still stall speculative execution shortly after, due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses complete.
In any case, nearly all of this is dwarfed by the cost to the compiled code itself: in most cases you can't inline, so simple transformations which could eliminate the function call altogether can't happen.
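To make the pointer chase concrete, here's a rough, hand-rolled sketch of the same double indirection using an explicit function-pointer table (the Obj/VTable/tick names are invented, and real vtable layout is ABI-specific - this is just to show the two dependent loads feeding the indirect call):

    #include <cstdio>

    struct Obj;                                        // forward declaration so TickFn can mention it
    using TickFn = void (*)(Obj*, unsigned);

    struct VTable { TickFn tick; };                    // stand-in for the compiler-generated vtable
    struct Obj    { const VTable* vptr; int total; };  // stand-in for the hidden vptr in each object

    void tick_impl(Obj* o, unsigned j) { o->total += static_cast<int>(j); }
    const VTable vt = { &tick_impl };

    void call_tick(Obj* obj, unsigned j) {
        const VTable* vtable = obj->vptr;   // load 1: fetch the vtable pointer out of the object
        TickFn target = vtable->tick;       // load 2: fetch the method pointer, dependent on load 1
        target(obj, j);                     // indirect call: can't resolve until both loads complete
    }

    int main() {
        Obj o{ &vt, 0 };
        call_tick(&o, 3);
        std::printf("%d\n", o.total);       // prints 3
    }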
> Best case, the core may still stall speculative execution shortly after, due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses complete.
You seem very familiar with these issues, but this doesn't sound right to me. Maybe I'm not understanding your terminology, but don't all modern processors support speculative execution? All instructions (including dependent) are executed, but the results are held in the Reorder Buffer until the branch choice is confirmed. If this is still a large issue, why don't Eli's measurements show it to be?
If the branch target is an address loaded from memory, and there is no cached result for the branch instruction, then there's no way it can predict which instruction to execute next. The target could be anywhere in valid memory.
The reason the measurements don't show it is the micro-benchmark will be predicting very well. In fact it's quite difficult to defeat prediction even for giant codebases, and you probably have bigger issues with L1 thrashing at that point. The more subtle problem is even with prediction, there's a (quite high) limit to the number of unretired speculated instructions. Again, a micro-benchmark won't show that up - you'd need a large function in the inner loop.
I'm making it sound like there's no cost to virtual functions in real applications, but it's there, usually measurable, and every little bit adds up. If anything, I think a better reason not to simply spray "virtual" everywhere is that it demonstrates the author didn't understand the data structures they created.
On the other hand, having no cached result for the branch probably correlates strongly with not having the target in I-cache, which means you may be stalling out anyway. It also implies that the branch is not in the middle of a tight loop.
Regarding the size of the ROB, I was wondering about the size a while ago and found an interesting post from someone who measured it for modern Intel processors: Ivy Bridge (168), Sandy Bridge (168), Lynnfield (128), Northwood (126), Yorkfield (96), Palermo (72), and Coppermine (40). http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/
I agree with you on the virtual part. I'm actually a C programmer more interested in how to implement efficient dispatch for interpreters. Eli (the original author) has some good posts on that as well: http://eli.thegreenplace.net/2012/07/12/computed-goto-for-ef...
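For anyone curious, here's a minimal sketch of the computed-goto dispatch style that post describes, using the GCC/Clang "labels as values" extension (the two-opcode toy VM is invented, just to show the shape of threaded dispatch):

    #include <cstdio>

    // Threaded dispatch: each handler jumps directly to the next opcode's
    // handler through a table of label addresses, instead of bouncing back
    // through a central switch on every instruction.
    int run(const unsigned char* pc) {
        static void* dispatch[] = { &&op_inc, &&op_halt };   // opcode 0 = inc, 1 = halt
        int acc = 0;
        goto *dispatch[*pc];
    op_inc:
        ++acc;
        goto *dispatch[*++pc];
    op_halt:
        return acc;
    }

    int main() {
        const unsigned char program[] = { 0, 0, 0, 1 };      // inc, inc, inc, halt
        std::printf("%d\n", run(program));                   // prints 3
    }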
Can profile-guided optimization realise that a certain virtual function almost always resolves to a specific implementation, and emit a conditional check so it can inline or otherwise optimize that common case?
I'm not overly experienced with complicated OO systems, but sometimes it seems the OO is just an abstraction for convenience, and at runtime execution always takes a particular path.
My understanding is that good virtual machines basically do this sort of profiling and optimization at runtime and JIT compile specializations as necessary.
Does anybody know why JIT isn't done in classically AOT compilers? Is JIT overhead generally higher than cost savings of the optimizations?
> Does anybody know why JIT isn't done in classically AOT compilers?
One (admittedly incomplete) answer is that AOT compilers try to replicate many of the wins that JIT compilers get from runtime specialization by including a profile-guided optimization pass instead, which specializes ahead of time, using data logged from what you hope is a representative example of runtime.
Good JIT compilers can do things like optimizing fast paths, discovering latent static classes in highly dynamic languages, etc. These kinds of optimizations can also be done AOT, if you have good profile data and suitable analysis & optimization passes.
The pros/cons of each approach are not entirely resolved, and you will find varying opinions. Part of the problem with making a direct comparison is that there are large infrastructural inconveniences with switching from one approach to the other. A good JIT is a quite pervasive beast, not something you can just tack on as a nice-to-have. PGO is somewhat infrastructurally easier to add to an existing AOT compiler. Therefore, if you can do most of what JIT does via PGO, you would prefer to do that, were you the maintainer of an existing AOT compiler. Whether you really can is afaik a bit of an open question.
I think something that's often overlooked in this discussion is the difference in language semantics. We're not just comparing AOT with JIT (or why not JIT an AOT-compiled app...); we're almost always also comparing C++ to the JVM/CLR worlds.
And then the point is that most optimizations a JIT can do that an AOT compiler cannot are particularly important where the language semantics are "too" flexible. If your code has lots and lots of virtual calls, or lots of exceptions with unpredictable code flow - well, sure, it's really important to elide that flexibility where it's not actually used. That's kind of like how JS VMs nowadays speculatively type their untyped objects - it's a huge win, and not possible statically.
But the point is - these optimizations are critical because those languages don't allow (or don't encourage) code to disable the dynamic features. In C++ this can be helpful too; but how often is dynamic devirtualization really going to matter? I mean, you can statically devirtualize certain instances (e.g. whole-program optimization reveals only two implementations and replaces a virtual call with an if), but the real "code could be any subtype but actually isn't" scenario just isn't one that comes up often.
The consequence is that C++ gets most of the benefits of a JIT simply because the JIT is solving problems C++ compilers don't need to solve. The cost is that the compiler wastes inordinate amounts of time compiling your entire program as optimally as it can, even though it only has a few hotspots.
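To illustrate the "replace a virtual call with an if" transformation mentioned above, here's a hedged, hand-written sketch (Shape/Square/Disc are invented, and a real compiler would compare vptrs rather than use dynamic_cast - the point is just the shape of the result):

    #include <cstdio>

    struct Shape  { virtual ~Shape() = default; virtual int area() const = 0; };
    struct Square : Shape { int s; explicit Square(int s) : s(s) {} int area() const override { return s * s; } };
    struct Disc   : Shape { int r; explicit Disc(int r) : r(r) {} int area() const override { return 3 * r * r; } };

    // What a devirtualizing pass might conceptually emit once whole-program
    // analysis proves only these two implementations exist: a type test plus
    // direct (and therefore inlinable) calls, with the virtual call as fallback.
    int area_devirt(const Shape* s) {
        if (auto* sq = dynamic_cast<const Square*>(s)) return sq->Square::area();  // direct, inlinable
        if (auto* d  = dynamic_cast<const Disc*>(s))   return d->Disc::area();     // direct, inlinable
        return s->area();                                                          // generic virtual dispatch
    }

    int main() {
        Square sq{4};
        Disc d{2};
        std::printf("%d %d\n", area_devirt(&sq), area_devirt(&d));  // prints 16 12
    }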
I'm not an expert on the topic, but JIT is not allowed in certain places - game consoles, iOS, etc. - to the point where even simple trampolines (ffcall, libffi) are not allowed either. As such, AOT helps where JIT can't be used, but it doesn't support absolutely everything (C# generics when it comes to types that are not yet loaded, I guess?).
With C# generics on non-struct types, the emitted code is the same regardless of the type. On struct-types, it's necessarily different because there's no object header and the size of the objects can differ.
But, I can't think of a way to get an unknown struct type into a generic function without boxing (which makes it no longer a struct type). So for an AOT compiler it shouldn't be an issue. And there's probably some clever way to emit generic code for structs, too.
But generics have their own implementation difficulties and Mono didn't support them for a while in AOT.
I worked on a serious x86 clone once - we took a lot of real-world traces and ran them through our various microarchitectures to see how they would fly. Dynamic C++ dispatch was interesting: normally you expect something like
mov r1, n(bp) ; get vtable
mov r2, n(r1) ; get method pointer
call (r2) ; call
that's a really bad pipe break: a double indirect load and a call - but branch prediction may be your friend ...
However, some of the code we saw (I think it came from a Borland compiler) did this:
mov r1, n(bp) ; get vtable
push n(r1) ; push method pointer
ret ; call
an extra memory write/read but always caught in L1, and on the register-poor x86 it saves a register, right? ... but on most CPUs of the time you're screwed for branch prediction - CPUs had a return cache, a cheap way to predict the branch target of a return - by doing a return without a call you've popped the return cache, leaving it in a bad state - EVERY return in an enclosing method is going to mispredict as well - the code will run, but slowly
Depends on the CPU - but it's a relatively trivial thing to build (especially because, unlike other caches, it's a stack). On x86, return is nominally ALWAYS a bad pipe bubble: a pop followed by an indirect jump - the pop gets resolved at the end of its micro-op, while the jump wants to be resolved early on so decoding of the next instruction can start
In the end it can't hurt to generate a (possibly bad) jump prediction off of the return cache - it's no worse than being idle. The effect of messing with the cache, though, is that it can end up always failing, so you get no advantage from it at all
(I should add - it's an x86, you're really register poor - sometimes you do have to do stuff like that - but if you have a register "mov reg, a;jmp (reg)" is better than "push a;ret")
for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
        obj->tick(j);
    }
}
I wouldn't go quite so far as to say that benchmarks with tight inner loops like this are completely useless, but they are nearly so.
The author is clearly aware that the real world of performance is much bigger & more complex than his simple Petri dish. Credit to him for mentioning that. It's also really refreshing to see him analysing the optimised assembly.
The trouble with this approach is that it's tempting to draw simple conclusions. In this case, you might be tempted to conclude "CRTP is always faster than virtual dispatch", when the truth is likely to be much more situation-dependent.
I have seen a biggish project go through a lot of effort to switch to CRTP, only to see a negligible performance impact.
Agreed, for almost all code it doesn't matter, but for the remaining small fraction it's worth thinking about these things. It sounds pretty insane to go with a blanket approach of removing virtual calls throughout an entire codebase without understanding which ones are the problematic ones. Especially since some ways of solving the problem could potentially lead to other problems like increased compiled code size.
I've seen plenty of software (especially systems software) that does spend much of its time in tight inner loops. Pulling out all the optimization stops there can give measurable gains. I've personally seen measurable gains on real applications from tricks like reordering branches so that the more predictable branches go first.
Sure, it's a waste to optimize code that doesn't significantly contribute to execution time. And there are lots of cases that are I/O bound, memory bound, cache bound, cross-thread communication bound etc. But if you're doing actual calculations - so not lots of communication like I/O or threading, and not delegating the crunching to a library - and your calculations are not trivial bit-pushing (e.g. not just streaming with minor changes), then it's a good bet that any virtual function calls in that kind of code will be problematic; getting rid of the inner-loop dynamic dispatches will almost certainly help.
So it's situational, but IME it's pretty predictable where you'll see this kind of optimization help. By all means profile and use whatever tools are at hand to help you along, and don't apply the optimization blindly - but despite the whole "black art" label optimization sometimes gets, this kind of thing really is pretty straightforward.
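As a tiny, hedged illustration of the branch-reordering trick mentioned above (the threshold and filter are invented): since && short-circuits left to right, putting the cheap, well-predicted test first means the expensive, poorly-predicted one only runs on the few elements that get past it.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Hypothetical predicate: comparatively expensive and rarely true.
    static bool passes_expensive_filter(uint32_t v) { return v % 9973 == 0; }

    static uint64_t count_matches(const uint32_t* v, size_t n) {
        uint64_t count = 0;
        for (size_t i = 0; i < n; ++i) {
            // Cheap, highly predictable range check first; the rare filter
            // (and its potential mispredicts) is only reached when it passes.
            if (v[i] >= (1u << 20) && passes_expensive_filter(v[i]))
                ++count;
        }
        return count;
    }

    int main() {
        const uint32_t data[] = { 5, 42, 9973u * 200, 3u << 20 };
        std::printf("%llu\n", (unsigned long long)count_matches(data, 4));  // prints 1
    }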
Yes, of course. But the real challenge is to determine that it's your virtual indirection that's causing the problem. Once you know what's causing the problem, fixing it is (relatively) easy.
The danger is that benchmarks like this encourage naïve programmers to use complex constructs as a matter of course, when simpler would usually be better. "Premature optimisation is the root of all evil" and all that...
It's not necessarily done with explicit 'virtual' calls in C++, but speed of dispatch is very important to interpreters for dynamic languages. Here's an improvement to Python where the same general type of fixes make a significant difference: http://bugs.python.org/issue4753
Also, CRTP prevents you from storing all derived objects in a single container, since the underlying types are now heterogeneous. There are also more restrictions present in terms of slicing, casting, among other things that render CRTP a poor choice in many situations.
In the end, it's just another tool which is the right one in particular circumstances, and the wrong one in all others.
"If anything doesn’t feel right, or just to make (3) more careful, use low-level counters to make sure that the amount of instructions executed and other such details makes sense given (2)."
This is explicit support for confirmation bias.
See Feynman's discussion of measuring the charge of the electron in Cargo Cult Science:
"Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of—this history—because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong—and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that..."
And as an alternative, would you suggest laboriously using low level counters to verify that every measurement you think is correct is indeed correct? Given finite resources, what's a better approach than concentrating on the apparent anomalous measurements? I'm not sure I see the parallel.
When you think you can use CRTP instead of virtual dispatch in your program, you didn't need virtual dispatch to begin with... you needed a generic algorithm to operate over your object classes. That's exactly what run_crtp() is; the CRTPInterface class is completely redundant, except that it provides some degree of compile-time concept checking (which we'll hopefully get in C++17).
Virtual dispatch is useful for type erasure, when using abstract types from plugins, DLLs or generally "somebody elses code". IMHO, the valid use cases within a standalone program are actually fairly small.
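For readers who haven't seen it, here's a bare-bones sketch of the CRTP pattern being discussed (names invented, not the article's actual code): the "interface" is a template parameterized on the derived class, so the call resolves at compile time and can be inlined - and the "generic algorithm" point above is just the run() template.

    #include <cstdio>

    // Curiously Recurring Template Pattern: the base knows the concrete
    // derived type at compile time, so tick() is a plain, inlinable call
    // with no vtable involved.
    template <typename Derived>
    struct TickerBase {
        void tick(unsigned j) { static_cast<Derived*>(this)->impl_tick(j); }
    };

    struct Counter : TickerBase<Counter> {
        unsigned total = 0;
        void impl_tick(unsigned j) { total += j; }
    };

    // The generic algorithm: works for any T with a tick(), resolved statically.
    template <typename T>
    void run(T& obj, unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            for (unsigned j = 0; j < i; ++j)
                obj.tick(j);
    }

    int main() {
        Counter c;
        run(c, 5);
        std::printf("%u\n", c.total);  // prints 10
    }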
I've done benchmarks on this fairly recently, and with the functions actually doing a lot of work (ray intersection for a raytracer), I saw practically no difference between CRTP and Virtual Functions:
So he found that dynamic dispatch was a lot more expensive. Fair enough and not very surprising. But let's quantify it a bit in absolute terms. The dynamic version of the code took 1.25s to run, during which time it performed approximately 8 x 10^8 virtual function calls. That translates to a cost per call of 1.5 nanoseconds.
From which my takeaway would be: In inner-loopy code for which an extra nanosecond or so per call is critical, you should avoid virtual function calls. For anything else, don't worry about it.
1.5 nanoseconds per call in the best case. In some huge monstrosity where you've got to go chase down object headers not in the cache, things may be quite different.
Instead of devirtualization, a simpler optimization, which would additionally also help in the dynamic case, is plain loop hoisting of the method pointer fetch: look the method pointer up once before the loop instead of re-fetching it through the vtable on every call, which would avoid two indirections per inner iteration (see the sketch below). Actually, I'm almost sure that is what LuaJIT would do, and many other high-level language implementations could perform this optimization as well. However, maybe C is too low-level to be able to do that automatically, and I don't know about C++.
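A hedged sketch of what that hoisting could look like if done by hand (reusing the same kind of hand-rolled function-pointer table as the sketch near the top of the thread, since ISO C++ doesn't let you hoist a real virtual lookup yourself; a JIT, or a compiler that can prove the dynamic type is loop-invariant, could do the equivalent on the real vtable):

    #include <cstdio>

    struct Obj;
    using TickFn = void (*)(Obj*, unsigned);

    struct VTable { TickFn tick; };
    struct Obj    { const VTable* vptr; unsigned total; };

    void tick_impl(Obj* o, unsigned j) { o->total += j; }
    const VTable vt = { &tick_impl };

    // Hoisted version: the two dependent loads (object -> vtable -> method
    // pointer) happen once, outside the loops; the inner loop only pays for
    // the (well-predicted) indirect call itself.
    unsigned run_hoisted(Obj* obj, unsigned N) {
        TickFn tick = obj->vptr->tick;   // fetched once instead of on every call
        for (unsigned i = 0; i < N; ++i)
            for (unsigned j = 0; j < i; ++j)
                tick(obj, j);
        return obj->total;
    }

    int main() {
        Obj o{ &vt, 0 };
        std::printf("%u\n", run_hoisted(&o, 5));  // prints 10
    }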
That would save the indirection, but I hope the article shows that by far the biggest cost comes from the lack of inlining. The latter would not be solved by your function pointer.
This is a nice article and props for including and dissecting generated assembler!
A key thing here is that inlining is what enables zero-cost abstractions in C++. A virtual call is slower than a regular call, but the main problem is that it builds a barrier that effectively stops inlining.
It'll be interesting to see how devirtualization in GCC will do for real world programs.
Observation: our technologies have grown so intricate that analysing things we once had a direct hand in designing plays out much like analysing some mysterious natural phenomenon.
It's interesting to see a breakdown of this - especially using modern compilers on the Intel platform.
Did you try the Intel compiler? For raw low-level optimisation it sometimes massively outperforms the MS, GCC or Clang versions...
I'd imagine these problems are worse on ARM chips, and dynamic dispatch is even less effective there - certainly on PPC architectures I've seen much worse performance than on similarly powered Intels in precisely this situation. The caches are smaller and slower...
I'm not 100% sure, but I think I've seen virtual calls 'devirtualised' by the MS compiler a couple of years ago... I might be thinking of something else though, it was a while back now. I was unpicking some CRTP mess in something that /was not performance critical in any way/...
You may be thinking of this: IIRC the standard recommends that compilers omit dynamic dispatch when the dynamic type is known at compile time - this essentially boils down to the case where a virtual method call follows creation of the object with 'new' or as an automatic variable. In my experience, this is commonly implemented correctly in compilers.
The other case where the dynamic type is known is in the constructor itself of course.
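A quick hedged example of that case (Base/Derived invented): when the compiler can see the object's construction, it knows the dynamic type and can dispatch the call statically, even inline it.

    #include <cstdio>

    struct Base    { virtual ~Base() = default; virtual int f() const { return 1; } };
    struct Derived : Base { int f() const override { return 2; } };

    int known_type() {
        Derived d;     // dynamic type is known to be Derived right here...
        return d.f();  // ...so this can be a direct (inlinable) call to Derived::f
    }

    int unknown_type(const Base& b) {
        return b.f();  // here the compiler generally has to emit a real virtual call
    }

    int main() {
        Derived d;
        std::printf("%d %d\n", known_type(), unknown_type(d));  // prints 2 2
    }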
I'd like to see a comparison of calling a dynamically linked function call vs a non-dynamically linked virtual call.
Dynamic linking has more indirection than you might expect because the function addresses can't always just be put at the call site during the library load (the places where you would want to write the address can be in code that is read-only mmapped to aid in sharing memory between processes and to avoid loading unused stuff from disk).
In an ideal world the OS could still replace the call sites with straight calls to the loaded library, circumventing a jump table altogether. I don't remember what this is called, maybe something like a thunk, but I've seen it happen in the debugger where the first call causes a fault which rewrites the call site with the target address and subsequent calls are straight to the lib. This can work even if the chunk of code containing the call sites is shared and readonly, as long as the OS can override that.
The calls into jump tables are generally static, so the jump table itself can be prefetched. The jump table code is then a regular function pointer call, which is also monomorphic and so can be reliably predicted. I'd expect the impact to be small compared to a regular monomorphic function pointer call.
This _could_ be another case of premature optimization, as GCC 4.9+ can automagically devirtualize non-overridden virtual functions. ICC has been able to do that for years.
That's not the way the phrase 'premature optimization' is usually used. Usually, it means spending time optimizing something that is not a limiting factor, or that otherwise will not make a difference in the final result. Keeping your code simple in the hope that eventually it will become fast is something else, probably falling closer to 'Sufficiently Smart Compiler' http://c2.com/cgi/wiki?SufficientlySmartCompiler.
The Java Hotspot VM can still optimize for this case, if the virtual call leads to only a few classes most of the time. Several virtual methods can be inlined, but of course there's still an extra step compared to static dispatch: the classes of the current object and the inlined methods have to be compared. If no matching method is inlined, control needs to be passed back to the VM.
Interesting read, I didn't know that making fields final in Java does nothing for performance. Any idea what those "Generic Popular Frameworks" are? I put my money on Hibernate.
This is less of a problem for systems with JIT compilation (including Java, C#, etc). They can recompile the code at runtime, which allows some nice tricks for virtual calls. They can turn a virtual call into a regular call with inline caching (http://en.wikipedia.org/wiki/Inline_caching), or can even compile a specialized version of the code for a given type and inline the entire virtual function.
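For a rough feel of inline caching, here's a hand-written sketch in C++ (names invented; in reality this is something the JIT bakes into each call site rather than something you'd write): the call site remembers the last-seen dynamic type and its target, checks that guard, and only falls back to full dynamic dispatch on a miss.

    #include <cstdio>
    #include <typeinfo>

    struct Animal { virtual ~Animal() = default; virtual int noise() const = 0; };
    struct Dog : Animal { int noise() const override { return 1; } };
    struct Cat : Animal { int noise() const override { return 2; } };

    // A hand-rolled "monomorphic inline cache" for one call site: while the
    // guard keeps matching the cached type, call the known target directly
    // (which a JIT could inline); on a miss, fall back and re-seed the cache.
    int cached_noise(const Animal& a) {
        static const std::type_info* cached_type = nullptr;
        static int (*cached_target)(const Animal&) = nullptr;

        if (cached_type && typeid(a) == *cached_type)
            return cached_target(a);                      // guard hit: direct call, no vtable walk

        if (typeid(a) == typeid(Dog)) {                   // miss: re-seed for the type we just saw
            cached_type   = &typeid(Dog);
            cached_target = +[](const Animal& x) { return static_cast<const Dog&>(x).Dog::noise(); };
        } else if (typeid(a) == typeid(Cat)) {
            cached_type   = &typeid(Cat);
            cached_target = +[](const Animal& x) { return static_cast<const Cat&>(x).Cat::noise(); };
        }
        return a.noise();                                 // slow path: normal virtual dispatch
    }

    int main() {
        Dog d; Cat c;
        std::printf("%d %d %d\n", cached_noise(d), cached_noise(d), cached_noise(c));  // prints 1 1 2
    }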
The .NET and Java VMs are in fact able to do some inlining on virtual/interface calls and do other sorts of smart dispatch. So the cost of a virtual method in .NET and Java is not necessarily equivalent to the cost in C++.
I tried a simple example[1] in C#, .NET 4.5, both 32 and 64-bit, just looping and calling an Add method. Adding the keyword virtual increased runtimes by > 200%. JVMs might do this, but the CLR's codegen doesn't.
An old blog post by one of the CLR engineers[1] states:
"We don't inline across virtual calls. The reason for not doing this is that we don't know the final target of the call. We could potentially do better here (for example, if 99% of calls end up in the same target, you can generate code that does a check on the method table of the object the virtual call is going to execute on, if it's not the 99% case, you do a call, else you just execute the inlined code), but unlike the J language, most of the calls in the primary languages we support, are not virtual, so we're not forced to be so aggressive about optimizing this case."
I guess things haven't changed. My testing with the CLR indicates that for best performance, you should make sure your IL is already inlined. The CLR does much better with huge function bodies.
The CLR's inlining for virtual calls is constrained specifically to interfaces, not to all uses of 'virtual', IIRC. (Interfaces are still very much virtual calls, they just don't use the 'virtual' keyword.)
See [1] for an example where the CLR fully inlines a virtual call (through an interface, specifically)
The call is most definitely virtual (or dynamic if you prefer that term), not statically-dispatched. It just happens to be performed through an interface. I suspect the CLR optimizes this because interfaces are incredibly common (IEnumerable, etc.)
I get even worse results using an interface for the Add function. That is, if I do "new X() as InterfaceType" then call the function, the performance is 5x worse than if I don't cast to the interface. This is in a tight loop doing an add.
Do you have any actual examples of making an interface call that gets inlined? This post[1], dated 2004 (later than the MSDN article you referenced) from Eric Gunnerson says:
"all the compiler knows is that it has an IProcessor reference, which could be pointing to any instance of an type that implementes IProcessor. There is therefore no way for the JIT to inline this call - the fact that there is a level of indirection in interfaces prevents it. That's the source of the slowdown."
He goes on to say that Sun does do something since Java makes everything virtual, and the CLR could do it in theory, but doesn't.
I skimmed through the linked article you provided but didn't find any mention of inlining interface method calls. On the excellent performance of virtual/interface calls, it says:
"the combination of caching the virtual method and interface method dispatch mechanisms (the method table and interface map pointers and entries) and spectacularly provident branch prediction enables the processor to do an unrealistically effective job"
I've seen interface calls be inlined in action on the modern CLR when looking at disassembly. I don't understand why you would have expected the interface call to be faster than a normal non-virtual call? The interface call always needs a type check before the inlined call body in case of polymorphism; it can't be as fast as a normal call.
Or are you saying the interface call is 5x slower than a virtual call? That definitely isn't right.
I expect an inlined interface or virtual call to be the same as an inlined non-virtual call. But since the CLR (4.5, Windows 7 x64, using 32 or 64-bit codegen) won't emit an inlined virtual/interface call for int Add(int, int) -- it's slower.
In my simple program doing a loop, calling an Add function on an interface, it is definitely making a function call each time. It unrolls 4 times, and loads the function pointer once per iteration - I'd have thought it would only load it once overall. The loop is 89 bytes. There is no conditional inside the loop to check for the type.[1]
If I change it to not use the interface (don't cast to the interface type), it's unrolled and inlined. Loop is 34 bytes.[2]
It's the same on 32-bit, except there's no unrolling. The non-virtual loop body is 2 instructions (inc, add). The interface has a push, 3 movs and a call. The virtual one requires two extra movs (to load the function pointer - with an interface the address is embedded as a literal).
Shrug. Maybe it still doesn't work with value types? I started it without VS then broke in with the debugger to get the disassembly.
The loop is doing "y = x.Add(y, i)" where y is a local.
Edit: Aha! Using an interface method (not virtual) and strings, I was able to get inlining. I guess the CLR is still weak in dealing with value types.
1: Start of the loop using an interface:
lea r8d,[rdi-1]
mov rbx,qword ptr [FFEEFE60h]
mov edx,eax
mov rcx,rsi ; rsi is the object pointer
lea r11,[FFEEFE60h] ; I am embarrassed to admit I don't know what r11 is doing
call rbx ; just does lea eax[rdx+r8], ret
; similarly 3 more times then loop
2: Without using the interface, the loop body:
lea eax,[r8-1] ; r8's the counter
add ecx,eax
lea edx,[rcx+r8]
lea ecx,[rdx+rax]
lea eax,[r8+2]
add ecx,eax
; then loop
Nice work digging in and figuring out how to trigger it. I fiddled around some earlier and wasn't able to reproduce the behavior I saw before, so I gave up. :) You are correct that the CLR does a poor job optimizing value types, and I probably made the same mistake (i.e. used a struct)
But in cases of needless virtual calls (doesn't Java default to virtual for some strange reason?) it may be a quick and easy win.
Additionally, it's not always so easy to drop to a low-level language. If your architecture is enormous and complicated, it might be totally unfeasible to change languages for hot parts.
In Java, all methods are virtual. You can often achieve a similar effect to non-virtual methods by declaring them final to prevent them from being overridden in subclasses, but the same rules about which method is called apply. The reason is to simplify the language (in comparison to C++) - the rules about which method gets called are much simpler and easier to remember.
Not having to think about the question "should I make this method virtual?" makes the Java language simpler, yes.
In the vast majority of cases where a virtual function call is actually monomorphic or bimorphic at runtime, the JVM JIT can observe that and potentially inline the method (with an if statement in the bimorphic case). It puts guards around the inlined method and deoptimizes in the event that a newly loaded class renders the optimization incorrect.
It just flips it for "should I make this method final?" So at best it's a wash. In the majority of cases, a function shouldn't be virtual so users need to mark it final to accurately represent their design.
Well, the answer to "should I make x final" is always "yes" in Java, so you don't really have to think too hard about that either. :-) `final` really should have been the default setting for all methods, classes, and local variables, but unfortunately inheritance was still in vogue when Java was created, and the benefits of immutable values weren't as well understood or appreciated either (or maybe they just figured the C & C++ programmers they were trying to win over would hate it).