
Ok, I spent quite a bit of time looking at performance counters, trying to understand what the M1's branch predictor was doing.

The branch predictor is really accurate with a common dispatcher: it predicts those indirect branches correctly 99.25% of the time. Switching to threaded jumps improves this slightly to 99.75%, but not because the indirect branches are at different addresses. The improvement in accuracy is entirely due to the unconditional branch back to the dispatcher that was removed as a side effect. With that gone, the branch predictor can track branch histories that span roughly twice as many VM instructions.
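
For reference, here's a minimal sketch of the two dispatch styles being compared (GCC/Clang computed gotos; the toy opcodes and handlers are made up, not taken from the interpreter discussed above):

    #include <stdio.h>

    /* Toy 3-opcode bytecode, just enough to show the two dispatch shapes. */
    enum { OP_INC, OP_DEC, OP_HALT };

    /* Common dispatcher: one shared indirect branch (the switch), and every
     * handler ends by jumping back to it. */
    static int run_switch(const unsigned char *pc) {
        int acc = 0;
        for (;;) {
            switch (*pc++) {
            case OP_INC:  acc++; break;
            case OP_DEC:  acc--; break;
            case OP_HALT: return acc;
            }
        }
    }

    /* Threaded jumps: each handler ends with its own indirect branch, so the
     * jump back to a shared dispatcher disappears. */
    static int run_threaded(const unsigned char *pc) {
        static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        goto *handlers[*pc++];
    op_inc:  acc++; goto *handlers[*pc++];
    op_dec:  acc--; goto *handlers[*pc++];
    op_halt: return acc;
    }

    int main(void) {
        const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
        printf("%d %d\n", run_switch(prog), run_threaded(prog));
        return 0;
    }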

My modified version with a dummy branch in the threaded dispatcher negates this longer history and (according to the performance counters) ends up with exactly the same 99.25% of branches correctly predicted as the common dispatcher, yet it's still significantly faster than the common dispatcher, only 20% slower than plain threaded jumps.
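
The comment above doesn't show the exact dummy-branch sequence; one way to force such a branch (assuming AArch64 and GCC/Clang inline asm) is an unconditional jump to the very next instruction in each handler's tail. It does no useful work, but it occupies a slot in the branch history much like the removed jump back to the dispatcher did, while the per-handler indirect jumps stay distinct:

    /* Hypothetical variant of run_threaded() above with a dummy taken branch
     * in each handler's tail. "b 1f" just branches to the next instruction. */
    static int run_threaded_dummy(const unsigned char *pc) {
        static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        goto *handlers[*pc++];
    op_inc:  acc++; __asm__ volatile("b 1f\n1:"); goto *handlers[*pc++];
    op_dec:  acc--; __asm__ volatile("b 1f\n1:"); goto *handlers[*pc++];
    op_halt: return acc;
    }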

-------------------

Why are threaded jumps faster on the M1 if they aren't increasing branch prediction accuracy?

Well, the M1 essentially has two branch predictors[1]. The faster one can return a prediction in one cycle, but it doesn't take branch history into account at all, and it's almost always wrong for these unpredictable indirect branches. The slower predictor does take branch history into account, but takes three cycles to produce the correct result.

Which means there is a short pipeline stall when the second prediction comes in. This stall doesn't show up as a BRANCH_INDIR_MISPRED_NONSPEC, because the branch was correctly predicted. Instead, it seems to show up in the FETCH_RESTART counter.

So while threaded jumps don't improve overall branch prediction accuracy (beyond the gain from the longer history), they do slightly improve the accuracy of the faster branch predictor. With a common dispatcher it predicts wrong almost 100% of the time, but with threaded code its accuracy improves to 40%.

[1] Or at least that's a decent mental model. I suspect it's actually a single ITTAGE branch predictor that returns the zero-history result in one cycle.




Very cool, thanks for digging into this!


The other thing I noticed while playing around: performance absolutely falls off a cliff if your hot loop starts missing in the L1i cache.

This blog post [1] from CloudFlare has a few interesting hints about the M1's branch predictor. First, in their testing, to get the one-cycle predictions at all, your hot code needs to fit in just 4KB.

Second, if the branch target isn't in the L1 cache, you don't get a prediction at all. The branch target prediction probably points directly at the cache line + way, so even if a cache line moves to a different way, the prediction will still fail.

Which means I'm not sure this optimisation is worthwhile. It works for fibonacci and mandelbrot because they have reasonably tight loops, and adding a dispatcher to each instruction handler doesn't push the hot code over the limit. But when interpreting more generic code, you might be better off trying to minimise cache usage.
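
If you want a rough sense of whether your handlers stay within that kind of budget, one crude (and compiler-dependent) trick is to measure the byte distance between the first and last handler labels using GCC's labels-as-values extension. It only means anything if the compiler emits the handlers contiguously and in source order, so treat the number as a hint, not a measurement:

    /* Hypothetical threaded interpreter that also reports a rough estimate of
     * its handlers' code footprint: the byte distance between the first and
     * last handler labels. Assumes the compiler keeps them contiguous. */
    static int run_measured(const unsigned char *pc, long *footprint) {
        static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
        int acc = 0;
        if (footprint)
            *footprint = (long)((char *)&&op_halt - (char *)&&op_inc);
        goto *handlers[*pc++];
    op_inc:  acc++; goto *handlers[*pc++];
    op_dec:  acc--; goto *handlers[*pc++];
    op_halt: return acc;
    }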

[1] https://blog.cloudflare.com/branch-predictor



