The benefit is obvious in bookkeeping space overhead for the shared cache as th...

creshal · on Sept 10, 2015

Don't we already hit that on GPUs? Or do they only have shared caches?

btown · on Sept 10, 2015

Intel's Xeon Phi architecture has 60+ cores with multiple threads each, with cache coherence (albeit using a more standard directory): http://htor.inf.ethz.ch/publications/img/ramos-hoefler-cc-mo... - so we're getting there.

ajross · on Sept 10, 2015

Yes, though IIRC they had to modify the long-standing IA memory ordering rules in order to make that happen. Xeon Phi requires manual fence management in most cases, I believe.

seanmcdirmid · on Sept 10, 2015

Note that those particular cores are also doing SIMT, as GPUs do, so the problem is different from cache coherence on general-purpose cores.

varelse · on Sept 10, 2015

GPUs aren't cache-coherent yet. That said, they have fantastic atomic ops performance relative to CPUs. I'd guess that the lack of performance benefit so far is because it's possible to write many algorithms with the assumption that there is no cache coherency.

robmccoll · on Sept 10, 2015

Their atomic operations used to be extremely costly from a utilization perspective. They would shut down all other threads in a warp while the thread performing the atomic ran alone. Is that still the case?

varelse · on Sept 10, 2015

Fixed as of Maxwell to the best of my knowledge. But even then, I found them more efficient for reduction operations in global memory than any other method (using fixed-point math for places where a deterministic sum was required).
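
To illustrate why fixed-point helps with determinism, here is a minimal sketch in plain Python (not GPU code). Floating-point addition is not associative, so the order in which atomic adds land can change the result; integer addition is, so accumulating scaled integers gives the same sum in any order. The names `SCALE`, `to_fixed`, `fixed_sum`, and `float_sum` are made up for this example, and the 32 fractional bits are an arbitrary assumption. On a GPU the accumulator would be a fixed-width integer updated with an atomic add; Python's unbounded integers stand in for that here.

```python
import random

# Hypothetical scale factor: 32 fractional bits (an assumption for this sketch).
SCALE = 1 << 32


def to_fixed(x: float) -> int:
    """Convert a float to a scaled integer; the rounding happens once per value."""
    return round(x * SCALE)


def fixed_sum(values):
    """Accumulate in integers. Integer addition is associative and commutative,
    so the total does not depend on the order of the additions."""
    total = 0
    for v in values:
        total += to_fixed(v)
    return total / SCALE


def float_sum(values):
    """Accumulate in floating point. The result can depend on the order."""
    total = 0.0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    data = [0.1, 0.2, 0.3]
    # Floating point: the two orders give different results.
    print(float_sum(data) == float_sum(data[::-1]))
    # Fixed point: the two orders give the same result.
    print(fixed_sum(data) == fixed_sum(data[::-1]))

    # Simulate atomics landing in an arbitrary order.
    shuffled = data[:]
    random.shuffle(shuffled)
    print(fixed_sum(shuffled) == fixed_sum(data))
```

The trade-off is range and precision: the scale has to be chosen so the values of interest neither overflow the accumulator nor lose too much to rounding.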

seanmcdirmid · on Sept 10, 2015

I'm not really a GPU person, but my understanding is that GPUs have very simple memory models: memory reads and writes are scheduled more manually. That is part of their appeal and why they can run so fast, but it is also why they aren't as general as CPUs.

But my knowledge is a bit out of date, since GPUs are actually including caches now, and what they can do seems to evolve rapidly.

ajross · on Sept 10, 2015

Yeah, the traditional GPU usage is a streaming thing. Textures are read-only, and vertex and fragment data is either output-only or input-only depending on shader type. The only R/W memory requiring synchronization is the framebuffer, which is handled by specialized hardware anyway.

robmccoll · on Sept 10, 2015

Kind of: you get the experience of coherent access from a software perspective.

    http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf

seanmcdirmid · on Sept 10, 2015

Just a formatting note: if you could remove the verbatim space before the URL, HN would recognize it as one.

MaulingMonkey · on Sept 10, 2015

http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_...