
Ok, I spent quite a bit of time looking at performance counters, trying to understand what the M1's branch predictor was doing.

The branch predictor is really accurate with a common dispatcher: it predicts those indirect branches correctly 99.25% of the time. Switching to threaded jumps improves this slightly to 99.75%, but not because the indirect branches are at different addresses. The improvement in accuracy is entirely due to the unconditional branch back to the dispatcher that was removed as a side effect. With that gone, the branch predictor can track branch histories that span roughly twice as many VM instructions.
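
For reference, here's a minimal sketch of the two dispatch styles being compared (GCC/Clang computed gotos; the toy opcodes and handlers are made up, not taken from the interpreter discussed above):

    #include <stdio.h>

    /* Toy 3-opcode bytecode, just enough to show the two dispatch shapes. */
    enum { OP_INC, OP_DEC, OP_HALT };

    /* Common dispatcher: one shared indirect branch (the switch), and every
     * handler ends by jumping back to it. */
    static int run_switch(const unsigned char *pc) {
        int acc = 0;
        for (;;) {
            switch (*pc++) {
            case OP_INC:  acc++; break;
            case OP_DEC:  acc--; break;
            case OP_HALT: return acc;
            }
        }
    }

    /* Threaded jumps: each handler ends with its own indirect branch, so the
     * jump back to a shared dispatcher disappears. */
    static int run_threaded(const unsigned char *pc) {
        static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        goto *handlers[*pc++];
    op_inc:  acc++; goto *handlers[*pc++];
    op_dec:  acc--; goto *handlers[*pc++];
    op_halt: return acc;
    }

    int main(void) {
        const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
        printf("%d %d\n", run_switch(prog), run_threaded(prog));
        return 0;
    }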

My modified version with a dummy branch in the threaded dispatcher negates this longer history and (according to the performance counters) ends up with exactly the same 99.25% of branches correctly predicted as the common dispatcher, yet it's still significantly faster than the common dispatcher, only 20% slower than plain threaded jumps.
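
The comment above doesn't show the exact dummy-branch sequence; one way to force such a branch (assuming AArch64 and GCC/Clang inline asm) is an unconditional jump to the very next instruction in each handler's tail. It does no useful work, but it occupies a slot in the branch history much like the removed jump back to the dispatcher did, while the per-handler indirect jumps stay distinct:

    /* Hypothetical variant of run_threaded() above with a dummy taken branch
     * in each handler's tail. "b 1f" just branches to the next instruction. */
    static int run_threaded_dummy(const unsigned char *pc) {
        static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        goto *handlers[*pc++];
    op_inc:  acc++; __asm__ volatile("b 1f\n1:"); goto *handlers[*pc++];
    op_dec:  acc--; __asm__ volatile("b 1f\n1:"); goto *handlers[*pc++];
    op_halt: return acc;
    }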

-------------------

Why are threaded jumps faster on the M1 if they aren't increasing branch prediction accuracy?

Well, the M1 essentially has two branch predictors[1]. The faster one can return a prediction in one cycle, but it doesn't take branch history into account at all, and it's almost always wrong for these unpredictable indirect branches. The slower predictor does take branch history into account, but takes three cycles to produce the correct result.

Which means there is a short pipeline stall when the second prediction comes in. This stall doesn't show up as a BRANCH_INDIR_MISPRED_NONSPEC, because the branch was correctly predicted. Instead, it seems to show up in the FETCH_RESTART counter.

So while threaded jumps don't improve overall branch prediction accuracy (beyond the gain from the longer history), they do slightly improve the accuracy of the faster branch predictor. With a common dispatcher it predicts wrong almost 100% of the time, but with threaded code its accuracy improves to 40%.

[1] Or at least that's a decent mental model. I suspect it's actually a single ITTAGE branch predictor that returns the zero-history result in one cycle.




Very cool, thanks for digging into this!


The other thing I noticed while playing around: performance absolutely falls off a cliff if your hot loop starts missing in the L1i cache.

This blog post [1] from CloudFlare has a few interesting hints about the M1's branch predictor. First, in their testing, to get the one-cycle predictions at all, your hot code needs to fit in just 4KB.

Second, if the branch target isn't in the L1 cache, you don't get a prediction at all. The branch target prediction probably points directly at the cache line + way, so even if a cache line moves to a different way, the prediction will still fail.

Which means I'm not sure this optimisation is worthwhile. It works for fibonacci and mandelbrot because they have reasonably tight loops, and adding a dispatcher to each instruction handler doesn't push the hot code over the limit. But when interpreting more generic code, you might be better off trying to minimise cache usage.
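
If you want a rough sense of whether your handlers stay within that kind of budget, one crude (and compiler-dependent) trick is to measure the byte distance between the first and last handler labels using GCC's labels-as-values extension. It only means anything if the compiler emits the handlers contiguously and in source order, so treat the number as a hint, not a measurement:

    /* Hypothetical threaded interpreter that also reports a rough estimate of
     * its handlers' code footprint: the byte distance between the first and
     * last handler labels. Assumes the compiler keeps them contiguous. */
    static int run_measured(const unsigned char *pc, long *footprint) {
        static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        if (footprint)
            *footprint = (long)((char *)&&op_halt - (char *)&&op_inc);
        goto *handlers[*pc++];
    op_inc:  acc++; goto *handlers[*pc++];
    op_dec:  acc--; goto *handlers[*pc++];
    op_halt: return acc;
    }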

[1] https://blog.cloudflare.com/branch-predictor



