Agreed, for almost all code it doesn't matter, but for the remaining small fraction it's worth thinking about these things. It sounds pretty insane to go with a blanket approach of removing virtual calls throughout an entire codebase without understanding which ones are actually problematic. Especially since some ways of solving the problem can introduce other problems, like increased compiled code size.
I've seen plenty of software (especially systems software) that does spend much of its time in tight inner loops. Pulling out all the optimization stops there can give measurable gains. I've personally seen measurable gains on real applications from tricks like reordering branches so that the more predictable branches go first.
Sure, it's a waste to optimize code that doesn't significantly contribute to execution time. And there are lots of cases that are I/O bound, memory bound, cache bound, cross-thread-communication bound, etc. But if you're doing actual calculations - so not lots of communication like I/O or threading, not delegating the crunching to a library - and your calculations are not trivial bit-pushing (e.g. not just streaming data through with minor changes), then it's a good bet that any virtual function calls in that kind of code will be problematic; getting rid of the inner-loop dynamic dispatches will almost certainly help.
So it's situational, but IME it's pretty predictable where you'll see this kind of optimization help. By all means profile and use whatever tools you have at hand to help you along, and don't apply the optimization blindly - but despite the "black art" label optimization sometimes gets, this kind of thing really is pretty straightforward.