An x86 processor can detect when some of its pipelines produce a serious error and replay the affected steps until things go right. This is the first line of recovery (it's also why temperature spikes start to happen when a CPU reaches its overclocking limits: it starts to make mistakes, and this replay mechanism kicks in).
Also, x86 has something called the "Machine Check Architecture", which constantly monitors the CPU and the system and raises "Machine Check Exceptions" when something goes very wrong.
These exceptions are divided into "recoverable" and "unrecoverable" ones. An unrecoverable exception generally triggers a kernel panic; recoverable ones are logged to the system logs.
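If you want to see this machinery from user space on Linux, the usual route is dmesg/mcelog or rasdaemon, but you can also poke the machine-check MSRs directly. A minimal sketch, assuming Intel's documented MSR layout and the Linux `msr` module loaded (run as root):

```cpp
// Minimal sketch: poll the machine-check bank status registers on Linux via
// the msr driver (modprobe msr; run as root). MSR numbers come from the
// Intel SDM: IA32_MCG_CAP = 0x179, IA32_MCi_STATUS = 0x401 + 4*i.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

static bool rdmsr(int fd, uint32_t reg, uint64_t *val) {
    return pread(fd, val, sizeof *val, reg) == (ssize_t)sizeof *val;
}

int main() {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t cap = 0;
    if (!rdmsr(fd, 0x179, &cap)) { perror("rdmsr IA32_MCG_CAP"); return 1; }
    unsigned banks = cap & 0xff;  // bits 7:0 = number of reporting banks

    for (unsigned i = 0; i < banks; ++i) {
        uint64_t status = 0;
        if (!rdmsr(fd, 0x401 + 4 * i, &status)) continue;
        if (status >> 63)         // bit 63 (VAL): bank holds a logged error
            printf("bank %u: status = %#llx\n", i, (unsigned long long)status);
    }
    close(fd);
    return 0;
}
```

In practice you'd watch rasdaemon or the kernel's reports instead; the point is just that these counters are plain MSRs the hardware exposes.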
Moreover, a CPU can lose (fry) part of its caches (e.g., half of L1), and it'll boot with whatever is available and report only what it can access and address. In some extreme cases it loses its FPU or vector units, and instead of getting upset, it tries to do the operations at the microcode level or with whatever units are left. This manifests as extremely low LINPACK numbers. We had a couple of these specimens; I didn't run accuracy tests on them, and LINPACK didn't complain about the results, but the performance was very low compared to normal processors.
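You don't need full LINPACK to spot such a specimen; any crude FLOP-counting kernel run across supposedly identical nodes will do. A minimal sketch (the naive kernel and the 512 size are arbitrary choices here, not any particular benchmark):

```cpp
// Rough sanity check, not LINPACK: time a naive O(n^3) matmul and compare
// GFLOP/s across supposedly identical machines. A chip doing FP in microcode
// or with degraded vector units stands out by an order of magnitude, even
// with a crude kernel like this.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 512;
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (int j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double flops = 2.0 * n * n * n;  // one multiply + one add per inner step
    printf("%.2f GFLOP/s (checksum %g)\n", flops / secs / 1e9, c[0]);
    return 0;
}
```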
Throttling is the normal defense against poor cooling. The mechanisms above try to keep the processor operational in a limp mode, so you can diagnose the problem and migrate somehow.
Actually, it has accumulated over the years. First came an interest in the hardware itself and following the overclocking scene (and doing my own experiments); then my job as an HPC administrator allowed me to touch a lot of systems. Trying to drive them to maximum performance without damaging them meant seeing lots of edge cases over the years.
On top of that, I was always interested in high performance / efficient computing and did my M.Sc. and Ph.D. in related subjects.
It's not impossible to gather this knowledge, but it's a lot of rabbit holes, and they're sometimes a bit hard to find.
Thanks. Do you think the M.Sc. and Ph.D. helped a lot? I don't have any experience in this field, and I feel it's probably one of the domains where people HAVE to rely mostly on vendor manuals and very low-level debugging messages. Maybe on the same level as AAA game engine optimization?
Yes, they helped a lot, but mostly because I was already interested in high-performance programming and was looking to improve myself on those fronts. Also, I started my job right after my B.Sc., so there was a positive feedback loop between my work and my research (I fed my work with my research and fed my research with the know-how from my job, which was both encouraged and required by the place I work).
You need to know a lot of things to do this. Actually, it's half dark art and half science. Vendors do not always tell the full story about their hardware (Intel's infamous AVX frequencies, and their compiler's shenanigans when it detects an AMD CPU), and you need to be able to bend the compiler to your will to build the binary the way you want. Lastly, of course, you have to know what you're doing in your code and understand how it translates to assembly and what your target CPU does with all of that.
To understand those kinds of details, we have valgrind/callgrind, perf, software traces, some vendor-specific low-level tools that show what the processor is doing, and pure timing-related logging.
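The last item is less glamorous than it sounds. A minimal sketch of scope-based timing logging in C++ (the `ScopeTimer` name and output format are made up for illustration, not from any particular tool):

```cpp
// "Pure timing-related logging" in its simplest form: a RAII timer that
// prints how long a region took when the scope ends.
#include <chrono>
#include <cstdio>

struct ScopeTimer {
    const char *label;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    explicit ScopeTimer(const char *l) : label(l) {}
    ~ScopeTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        fprintf(stderr, "[timing] %s: %lld us\n", label, (long long)us);
    }
};

int main() {
    ScopeTimer whole("whole run");
    {
        ScopeTimer hot("hot loop");     // nested regions nest naturally
        volatile double sum = 0;
        for (int i = 0; i < 10'000'000; ++i) sum += i * 0.5;
    }
    return 0;
}
```

Crude, but it's often the first thing you reach for before firing up perf, because it measures exactly the region you care about with near-zero setup.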
Game engines are a different beast; I do scientific software, but a friend of mine was doing game engines. Highly optimized graphics drivers are black boxes, and that's a whole different game: not very well documented, riddled with trade secrets, and full of undocumented behaviors the drivers use to optimize things. Plus, you have to use the driver in very complex and sometimes ugly ways to make it perform.
While this is hard to start and looks like a big mountain, all of this gets way easier once you develop a "feeling for the machine". It's similar to how mechanics listen to an engine and say "it's spark plug 3, most probably". You can feel how a program runs and where it chokes just by observing how it runs.
This is why C/C++ is also used in a lot of low-level contexts. Yes, it allows you to do some very dangerous things, but if you need to be fast, and you can prove mathematically that the dangerous thing can't happen, you can unlock some more potential from your system. The people doing this are very few, and the people who do it recklessly (or just code carelessly) give C/C++ a bad name.
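A small example of the "prove it can't happen, then drop the guard rails" pattern: the non-standard but widely supported `__restrict` qualifier promises the compiler that two pointers never alias. If you can actually prove that from how the buffers are allocated, the compiler is free to vectorize aggressively; if the promise is ever wrong, behavior is undefined.

```cpp
// Sketch of a provably-safe "dangerous" optimization: __restrict (a GCC/
// Clang/MSVC extension) asserts dst and src never overlap. No runtime
// aliasing or bounds checks; the caller has proven both hold.
#include <cstddef>

void saxpy(float *__restrict dst, const float *__restrict src,
           float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += a * src[i];
}
```

Compile with and without the qualifier and compare the generated assembly to see what the promise buys you.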
It's not impossible. If Carmack, Unreal, scientific software companies, Pixar, Blender and more are able to do it, you can do it, too.
> To understand those kinds of details, we have valgrind/callgrind, perf, software traces, some vendor-specific low-level tools that show what the processor is doing, and pure timing-related logging.
Working as a data warehouse engineer, I'm not exposed to these kinds of things. Our upstream team, the streaming guys, does have a bit of exposure to Spark-related performance work.
> It's similar to how mechanics listen to an engine and say "it's spark plug 3, most probably". You can feel how a program runs and where it chokes just by observing how it runs.
> It's not impossible. If Carmack, Unreal, scientific software companies, Pixar, Blender and more are able to do it, you can do it, too.
I kinda feel that I have to switch jobs to learn these kinds of things. I do have some personal projects, but they don't need that kind of attention. I'll see what I can do. I've always wanted to move away from data engineering anyway.
I'm not sure if that's what they meant, but generally, CPUs will throttle or shut down if they detect overtemperature, hopefully before they start encountering errors that lead to wrong calculation results or crashes.
I'm very curious, can you elaborate on that?