The obvious benefit is lower bookkeeping space overhead for the shared cache as the number of cores rises. This could be very useful if and when we ever hit 256 cores per CPU.
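To put rough numbers on the bookkeeping point, here is a back-of-the-envelope sketch. It compares a full-map directory (one sharer bit per core per cache line) against a timestamp-style scheme. The timestamp layout is my own assumption, not something stated in this thread: two timestamps per line plus an owner id, with a hypothetical 20-bit timestamp width. The 64-byte line size is also an assumption.

```python
LINE_BITS = 64 * 8  # assumed 64-byte cache line


def full_map_directory_bits(cores: int) -> int:
    """Per-line sharer bit-vector: one bit per core."""
    return cores


def timestamp_scheme_bits(cores: int, ts_bits: int = 20) -> int:
    """Hypothetical layout: two timestamps plus a ceil(log2(cores)) owner id."""
    owner_id_bits = (cores - 1).bit_length()
    return 2 * ts_bits + owner_id_bits


for cores in (16, 64, 256):
    directory = full_map_directory_bits(cores)
    stamped = timestamp_scheme_bits(cores)
    print(
        f"{cores:4d} cores: directory {directory:3d} bits/line "
        f"({directory / LINE_BITS:.1%}), timestamps {stamped:3d} bits/line "
        f"({stamped / LINE_BITS:.1%})"
    )
```

Under these assumptions the bit-vector grows linearly with core count while the timestamp metadata grows only with the owner id width, which is why the gap matters more at 256 cores than at 16.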
The technique reminds me of Jefferson's virtual time (and Time Warp), though that work comes from a distributed simulation context. Virtualizing time to manage the coherence of reads and writes is a very good idea.
Yes, though IIRC they had to modify the long-standing IA memory ordering rules in order to make that happen. Xeon Phi requires manual fence management in most cases, I believe.
GPUs aren't cache-coherent yet. That said, they have fantastic atomic ops performance relative to CPUs. I'd guess that the lack of performance benefit so far is because it's possible to write many algorithms with the assumption that there is no cache coherency.
Their atomic operations used to be extremely costly from a utilization perspective. They would shut down all other threads in a warp while the thread performing the atomic ran alone. Is that still the case?
Fixed as of Maxwell to the best of my knowledge. But even then, I found them more efficient for reduction operations in global memory than any other method (using fixed-point math for places where a deterministic sum was required).
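For anyone wondering why fixed-point helps with determinism: atomic adds land in whatever order the hardware schedules them, and floating-point addition is not associative, so the float total can change from run to run. Integer addition is order-independent, so accumulating scaled integers gives the same answer every time. A minimal CPU-side sketch of the idea in Python (not GPU code; the 20-bit scale and helper names are hypothetical, picked only for illustration):

```python
import random

SCALE = 1 << 20  # hypothetical fixed-point scale: 20 fractional bits


def to_fixed(x: float) -> int:
    """Quantize a float to a scaled integer."""
    return round(x * SCALE)


def float_sum(values):
    """Accumulate in floating point, in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


def fixed_sum(values):
    """Accumulate scaled integers, then convert back once at the end."""
    total = 0
    for v in values:
        total += to_fixed(v)
    return total / SCALE


rng = random.Random(0)
data = [rng.uniform(-1e3, 1e3) for _ in range(10_000)]
shuffled = data[:]
rng.shuffle(shuffled)  # stand-in for an arbitrary atomic-add ordering

print("float sums equal:", float_sum(data) == float_sum(shuffled))
print("fixed sums equal:", fixed_sum(data) == fixed_sum(shuffled))
```

The trade-off is quantization error and a limited range: the scale has to be chosen so that the accumulated integer cannot overflow the width of the atomic being used.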
I'm not really a GPU person, but my understanding is that GPUs have very simple memory models in which reads and writes are scheduled more manually. That is part of their appeal and why they can run so fast, but it is also why they aren't as general as CPUs.
But my knowledge is a bit out of date, since GPUs are actually including caches now, and what they can do seems to evolve rapidly.
Yeah, traditional GPU usage is a streaming thing. Textures are read-only, and vertex and fragment data are either output-only or input-only depending on the shader type. The only R/W memory requiring synchronization is the framebuffer, which is handled by specialized hardware anyway.