
IPC is going up, but slowly compared to previous generations. If AMD can sustain the kind of generation-on-generation increases it achieved between Zen 2 and Zen 3 I will be tremendously impressed, but as it stands my brand new 5600X has barely 50% more single-core performance than the seven-year-old 4670K I replaced.



I have a hunch that it would be much greater than 50% if the software were recompiled to take advantage of the newer instruction sets. Then again, you wouldn't be comparing exact binary copies of the software, which some people might say invalidates it as a benchmark/comparison.

I remember many, many years ago my PC couldn't run the latest Adobe software because I was missing SSE2 instructions on my then-current CPU. I believe there was some hidden compatibility mode, but the software ran extremely poorly compared to the previous version, which didn't require SSE2.


There have been almost no broad new "performance-oriented" instruction sets (e.g. SIMD/vector extensions) introduced between the Haswell era and the Zen 3 era for compilers to target. At least not any instructions that the vast majority of software is going to see magical huge improvements from and that are widespread enough to justify targeting (though specific software may benefit greatly from some specific tuning; pext/pdep and AVX-512, for instance).
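To make the pext/pdep point concrete, here's a hedged sketch (the function name and build line are mine, not anything from this thread) of the kind of code that only wins on some of these chips:

    /* Gather the bits of x selected by mask into the low bits of the result.
     * Built with `gcc -O2 -mbmi2 -c pext_example.c`, this compiles to a single
     * PEXT instruction. On Haswell and later Intel parts and on Zen 3 that
     * instruction is fast; on Zen 1/Zen 2 it exists but is microcoded and far
     * slower, which is exactly the "specific tuning" caveat above. */
    #include <stdint.h>
    #include <immintrin.h>

    uint64_t compress_bits(uint64_t x, uint64_t mask) {
        return _pext_u64(x, mask);
    }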

The microarchitectures have improved significantly, though, which does matter. For instance, Haswell-era AVX2 implementations were significantly poorer than the modern ones in, say, Tiger Lake or Zen 3. The newer ones have completely different power usage and per-core performance characteristics for AVX code; even where you could run AVX2 on the older processors, it might not actually have been a good idea if the cumulative slowdowns it caused impacted the whole system (because the chips had to downclock the whole package so they wouldn't brown out). So it's not just a matter of instruction sets, but of their individual performance characteristics.

And it is not just CPUs that have improved. If anything, the biggest improvements have been in storage devices across the stack, which now have significantly better performance and parallelism, and the bandwidth has improved too (many more PCIe lanes). I can read gigabytes per second from a single NVMe drive, at millions of IOPS, which is vastly better than you could get 7 years ago on a consumer-level budget. Modern machines do not just crunch scalar code in isolation, and neither did older ones; we could just arbitrage CPU cycles more often than we can now, in an era where a lot of the performance "cliffs" have been dealt with. Isolating things to just look at how fast the CPU can retire instructions is a good metric for CPU designers, but it's a very incomplete view of the system as a whole from an application developer's perspective.


I'm not deep in the weeds of high-performance compilers, but just because ISA evolution hasn't happened doesn't mean compilers can't evolve to use the silicon better.

For decades there was always an "Intel advantage" in compilers (admittedly Intel invested more in compilers than AMD did, but Intel was also sneaky about trying to nerf AMD in its compilers), but with AMD being such a clear leader for so many years now, I would hope at least GCC has started supporting AMD flavors of compilation better.

Anyone know if this has happened with GCC and AMD silicon? Or at least is there a better body of knowledge of what GCC flags help AMD more?


Yes, I consider this to fall under the umbrella of the general microarchitectural improvements I mentioned. GCC and LLVM are regularly updated with microarchitectural scheduling models to better emit code that matches the underlying architecture, and have featured these for at least 5-7 years; there can be a big difference between, say, Skylake and Zen 2, so targeting things appropriately is a good idea. You can use the `-march` flag for your compiler to target specific architectures, for instance -march=tigerlake or -march=znver3.
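To illustrate (a sketch of my own, not the parent's code; the file name and kernel are placeholders): with plain `gcc -O3 loop.c`, x86-64 codegen sticks to the baseline SSE2 instructions, while a -march flag lets the compiler use the newer vector extensions and the scheduling model of that specific core.

    /* loop.c -- hypothetical example.
     *   gcc -O3 -march=znver3    -c loop.c    # Zen 3: AVX2 + FMA, Zen 3 scheduling
     *   gcc -O3 -march=tigerlake -c loop.c    # Tiger Lake: AVX-512 available
     */
    void saxpy(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];   /* auto-vectorized according to the -march target */
    }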

But in general I think it's a bit of a red herring for the thrust of my original post; first off, you always target a benchmark at a hypothesis, you don't run one in isolation for no reason. My hypothesis when I ran my own, for instance, was "general execution of bog-standard scalar code is only up by about 50-60%", and using the exact same binary instructions was the baseline criterion for that; it was not "does targeting a specific microarchitectural scheduling model yield specific gains?" If you want to test the second one, you need to run another benchmark.

There are too many factors for any particular machine for any such post to be comprehensive, as I'm sure you're aware. I'm just speaking in loose generalities.


> You can use the `-march` flag for your compiler to target specific architectures, for instance -march=tigerlake or -march=znver3

Note that -march will let the compiler use instructions that might be unavailable on CPUs other than the target. -mtune (which is implied by -march) is the flag that sets the cost tables used by instruction selection, cache-line sizes, etc.
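If you want the newer instructions without giving up compatibility, one option (a sketch of GCC/Clang function multi-versioning, not something the parent mentioned; it needs an ifunc-capable libc such as glibc) is to let the loader pick a clone at startup:

    /* Hypothetical example: build a baseline and an AVX2 variant of one hot
     * function; the dynamic loader selects the best one for the running CPU,
     * so the binary still works on machines without AVX2. */
    __attribute__((target_clones("default", "avx2")))
    void scale(float *restrict y, const float *restrict x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i];   /* AVX2 clone used automatically where supported */
    }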


I recently watched the CppCon 2019 talk by Matt Godbolt, "Compiler Explorer: Behind The Scenes"[1], and a cool feature he presented is the integrated LLVM Machine Code Analyzer (llvm-mca) tool. If you look at the "timeline" and "resource" views of how a Zen 3 executes a typical assembly snippet, it is absolutely mind-blowing. That beast of a CPU has so many resources it's just crazy.

[1] https://www.youtube.com/watch?v=kIoZDUd5DKw
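For anyone who wants to poke at this outside Compiler Explorer, roughly the same workflow on the command line looks like the sketch below (the file names and kernel are mine; it assumes an LLVM recent enough to know the znver3 CPU model):

    /* kernel.c -- trivial loop to feed to llvm-mca.
     *   clang -O2 -S -march=znver3 kernel.c -o kernel.s
     *   llvm-mca -mcpu=znver3 -timeline kernel.s
     * The default llvm-mca output includes the resource-pressure summary;
     * -timeline adds the per-instruction dispatch/execute/retire view that
     * the Compiler Explorer integration visualizes. */
    float dot(const float *a, const float *b, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += a[i] * b[i];
        return acc;
    }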


-march=native


I doubt that very much. I have an i7 4790 and a Ryzen 7 3700X. In my testing, single-core speed is nearly 2.5x in favor of the Ryzen. What were you using as benchmarks?

My test was a single-threaded software rasterizer with real-time vertex lighting, which I compiled in Visual C++ 2008 around 2010.


Basic single-core, scalar-only workloads of my own corroborate the grandparent, as do most of the other benchmarks I've seen: my own 5600X is "only" about 50-60% better than my old Haswell i5-4950 (from Q2 '14) on this kind of code.

But scalar speed isn't everything, because you're often not bound solely by retirement in isolation (the system, in aggregate, is an open one, not a closed one). Fatter caches and extraordinarily improved storage devices with lots of parallelism (even on a single core you can fire off a ton of asynchronous I/O at the device) make a huge difference here, even for single-core workloads, because you can actually keep the core fed well enough to do work. So the cumulative improvement is even better in practice.
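For the "ton of asynchronous I/O from a single core" part, a rough sketch with io_uring via liburing (my own illustration, not the commenter's code; the file name is a placeholder and error handling is omitted):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <liburing.h>

    #define QUEUE_DEPTH 64
    #define BLOCK 4096

    int main(void) {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        int fd = open("data.bin", O_RDONLY);
        io_uring_queue_init(QUEUE_DEPTH, &ring, 0);

        /* Queue 64 reads before the first one has even completed. */
        for (int i = 0; i < QUEUE_DEPTH; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            void *buf = malloc(BLOCK);
            io_uring_prep_read(sqe, fd, buf, BLOCK, (off_t)i * BLOCK);
            io_uring_sqe_set_data(sqe, buf);
        }
        io_uring_submit(&ring);   /* one syscall submits the whole batch */

        for (int i = 0; i < QUEUE_DEPTH; i++) {   /* reap completions */
            io_uring_wait_cqe(&ring, &cqe);
            free(io_uring_cqe_get_data(cqe));
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }

Build with -luring; the point is just that a single core can keep a modern NVMe queue full, which older systems had no comparable way to exploit.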


Now I'm curious: is this testing against some requirements for an application? Are there any applications that would benefit solely from scalar performance, as opposed to scalar performance plus cache size and memory speed?


Need to point out in particular that the best Haswell i5 was FASTER than the best i7 for single-core workloads, so this might be a factor in the OP's post.

And that is also the reason I use such a processor in my own computer: it was already "outdated" when I bought it, but since one of the things I like to do is play simulation-style games that rely heavily on single-core performance, I chose the fastest single-core CPU I could find without bankrupting myself (the i5 4690K, which with some squeezing can even be pushed past 4 GHz; it is a beastly CPU, this one).


Exactly. The only thing I used my system for that was really begging for more CPU was console emulation, and that depends more on single core performance than anything else.


Haha, I'm in almost exactly the same boat; I replaced a Haswell i5 with a 5800X. I definitely went overkill with my system and still haven't gotten a GPU upgrade yet due to cost/laziness.


Indeed. I feel like I could've gotten most of the speedup by wiping Windows and reinstalling, but I had to make a clean break from Windows 7 or I was never going to stop using it. Hopefully GPU prices come down soon.


Even crazier, my i5 750 (from 2009) was overclocked to 3.5 GHz, and per core, Ryzen 3 isn't even twice as fast.

We're near the end of per-core IPC scaling. And it was never that good in the first place: the fastest Ryzen's IPC is only 3-4x that of a Pentium 3. Most of our speed increases came from frequency.

IMO we need to get off the silicon substrate so we can scale frequency again.

I wonder if the end of scaling will push everyone toward faster languages like Rust. You can't just wait two years for your code's performance to double anymore. Will this eventually kill slow languages? I think so.


Hardware speed increases began to slow down 20 years ago, and people are still using software that is 50x slower than the fastest possible implementation. If this alone were going to kill slow languages, one would suppose it would have already done so.

The arguably more significant shift is not about hardware; it's that newer languages like Rust are, hopefully, making that performance cost less in terms of safety and development time, which is a more recent development.


I agree partially. But the excuses for using dog-slow languages, at least where I've worked, revolved around "it will be twice as fast in two years anyway".

It's clear that's no longer true, which leaves fewer excuses for using, say, Ruby over Java.



