Current PCIe devices work without cache coherency to host memory. Drivers follow a handful of patterns for communicating with the device via PCI BARs and DMA. Cache coherency is not required.
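To make "a handful of patterns" concrete, here is a minimal sketch of the usual descriptor-ring-plus-doorbell pattern; the register offset, descriptor layout, and function name are made up for illustration, not taken from any real driver:

```c
/* Hypothetical illustration of a common PCIe driver pattern: the driver
 * fills a DMA descriptor in ordinary host memory, then rings a doorbell
 * register in a PCI BAR so the device fetches it.  Offsets and the
 * descriptor layout are invented for this sketch. */
#include <stdint.h>

struct dma_desc {
    uint64_t addr;   /* bus address of the data buffer */
    uint32_t len;    /* length in bytes */
    uint32_t flags;  /* e.g. a "valid" bit the device checks */
};

#define DOORBELL_OFFSET 0x20  /* hypothetical doorbell register in BAR0 */

static void submit(volatile uint8_t *bar0, struct dma_desc *ring,
                   uint32_t tail, uint64_t buf_bus_addr, uint32_t len)
{
    /* 1. Fill the descriptor in host memory (the device reads it by DMA). */
    ring[tail].addr  = buf_bus_addr;
    ring[tail].len   = len;
    ring[tail].flags = 1u;  /* mark valid */

    /* 2. Order the descriptor write before the doorbell write.
     *    A real driver would use its kernel's write barrier (e.g. wmb()). */
    __sync_synchronize();

    /* 3. MMIO write to the doorbell tells the device there is new work.
     *    None of this requires cache coherency between device and host caches. */
    *(volatile uint32_t *)(bar0 + DOORBELL_OFFSET) = tail;
}
```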
If we're not talking about devices but disaggregated memory, then an application running on a single host doesn't need cache coherency on the fabric because the host already ensures cache coherency between the CPUs. (Is that correct?)
The only case I see where cache coherency might be interesting is for multiple nodes running applications on disaggregated memory. Kind of like a single-system-image cluster, but that's not a software architecture that has been successful in recent years.
This is something that has puzzled me about CXL. Would be great to understand what use case cache coherency serves. Thanks!
Cache coherence across accelerators lets you pretend you have one large unified memory, similar to what Apple does. It means you can easily share pointers between CPU and GPU without worrying about how to move the data around, and still be efficient. This is irrelevant for AI, since AI does well with primitive DMA. You absolutely need CXL to have a viable OpenCL 2.2 implementation.
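For a concrete picture of that pointer sharing (my sketch, not from the comment above): OpenCL 2.x fine-grained SVM lets the CPU and GPU dereference the same pointer. It assumes a device that actually reports fine-grained buffer SVM support, and error handling is stripped:

```c
/* Sketch: sharing one pointer between CPU and GPU via OpenCL 2.x
 * fine-grained SVM.  Error handling omitted; assumes the device
 * supports CL_DEVICE_SVM_FINE_GRAIN_BUFFER. */
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void inc(__global int *p) {"
    "  p[get_global_id(0)] += 1;"
    "}";

int main(void)
{
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "inc", &err);

    size_t n = 1024;
    /* One allocation, one pointer, visible to both CPU and GPU. */
    int *p = clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                        n * sizeof(int), 0);
    for (size_t i = 0; i < n; i++) p[i] = (int)i;   /* CPU writes directly */

    clSetKernelArgSVMPointer(k, 0, p);              /* pass the raw pointer */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(q);

    printf("p[10] = %d\n", p[10]);                  /* CPU reads the GPU's result */
    clSVMFree(ctx, p);
    return 0;
}
```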
CXL is probably a non-starter for disaggregated memory because it blocks the CPU on remote reads/writes.
To understand how expensive that is, consider that NVMe durable writes to an enterprise SSD are generally faster than a CXL write to remote RAM, yet applications still insist on using asynchronous I/O to talk to the SSD. CPU-level access to CXL is fundamentally synchronous (because of the x86, ARM, and RISC-V instruction sets).
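To make the sync-vs-async contrast concrete, here is a rough liburing sketch (the file name, sizes, and the "remote" pointer are placeholders of mine): the SSD write is queued and reaped later, while a load from CXL-attached memory is just a pointer dereference the core has to wait out.

```c
/* Minimal contrast between async block I/O and a synchronous load.
 * Sketch only: uses liburing; names and sizes are placeholders. */
#include <liburing.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0644);
    char *buf = aligned_alloc(4096, 4096);
    memset(buf, 'x', 4096);

    /* Asynchronous path: queue the write, submit, and go do other work.
     * The CPU is not stalled while the SSD completes it. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    /* ... other useful work can run here ... */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);       /* reap the completion later */
    io_uring_cqe_seen(&ring, cqe);

    /* Synchronous path: a load from (hypothetically) CXL-attached memory
     * is just a dereference; the ISA gives no way to say "start this load
     * and notify me later", so the core waits for it. */
    volatile long *remote = (long *)buf;  /* stand-in for a remote mapping */
    long v = *remote;
    (void)v;

    io_uring_queue_exit(&ring);
    return 0;
}
```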
I'm not sure how you managed to get that much wrong. Most applications perform synchronous filesystem I/O. Async filesystem I/O is a rare curiosity implemented in a handful of high-end databases.
You also seem completely unaware of the superscalar and out-of-order nature of modern processors, or why hyperthreading/SMT is even a thing. A modern processor effectively treats the instruction sequence as a dependency graph, finds subgraphs that can be processed in parallel, and that includes memory accesses.
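To illustrate that overlap with a toy example (mine, not from the thread): independent loads can all be in flight at once on an out-of-order core, while dependent pointer-chasing loads serialize and expose the full miss latency each time.

```c
/* Toy illustration of the "instruction graph" point: the four loads in
 * independent_sum() have no dependencies between them, so an out-of-order
 * core can have all four memory accesses in flight at once.  The loads in
 * chase() each depend on the previous result, so they serialize. */

long independent_sum(const long *a, const long *b,
                     const long *c, const long *d)
{
    /* Four independent loads: overlapped by the hardware. */
    return *a + *b + *c + *d;
}

struct node { struct node *next; long value; };

long chase(const struct node *n, int hops)
{
    /* Dependent loads: each ->next must complete before the next begins. */
    long sum = 0;
    for (int i = 0; i < hops; i++) {
        sum += n->value;
        n = n->next;
    }
    return sum;
}
```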
Then finally, you didn't consider how frequent remote memory accesses actually are, or how expensive a remote memory access is without CXL.
Since CXL is cache coherent, you only need to load from remote memory when the data is not in local DRAM or your local cache. This means that remote memory accesses are relatively rare. In the case that you do nothing but remote memory accesses, you have to consider your competition as a baseline. The competition used to be RDMA over InfiniBand or Ethernet, and the total latency for those is significantly higher. This means that if remote memory access is important to you, CXL is faster than the alternatives. You seem to be thinking of a hypothetical situation in which no one uses remote memory accesses at all. That is not a fair comparison.
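Back-of-envelope version of that argument, with latencies and a hit rate that are purely illustrative assumptions of mine (not measurements): the average cost of a CXL-backed access stays close to local latency as long as most accesses hit locally.

```c
/* Back-of-envelope expected-latency model for cache-coherent far memory.
 * All latencies and the hit rate are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double local_ns  = 100.0;   /* assumed local DRAM access */
    double remote_ns = 400.0;   /* assumed CXL-attached access */
    double hit_rate  = 0.95;    /* assumed fraction served locally */

    double avg = hit_rate * local_ns + (1.0 - hit_rate) * remote_ns;
    printf("average access latency: %.1f ns\n", avg);  /* 115 ns with these numbers */
    return 0;
}
```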
The reason why CXL is unsuitable for AI is already mentioned in the article. AI doesn't care about cache coherence, only bandwidth, but NVLink is even worse at the thing you're concerned about.
That doesn't match my intuition about CXL memory. Its latency, AFAIK, is more similar to accessing NUMA memory on another socket (and applications are OK with blocking on that) than to an SSD. So quite a bit faster.
The devices they simulated aren’t disaggregated. They measured accessing a device that’s a single PCIe hop away from the CPU.
They don’t provide absolute numbers, so it’s hard to compare with other technologies, but note that you can (in theory, and in prototypes) dispatch and complete multiple NVMe I/Os per cache miss, thanks to amortization. CXL blocks the CPU for the equivalent of multiple cache misses per I/O.
The numbers I have seen are CXL latency around 250 ns, while NVMe latency is 5-10 us (Intel Optane DC series as a local drive).
You mention amortization for NVMe, but when operating at high queue depths the per-request latency is larger than at low queue depths.
Fast local NVMe drives aren't an alternative to disaggregated memory. There is no fabric with local NVMe drives. It would be necessary to compare against NVMe over Fabrics, which is quite a bit slower (look at flash SAN appliance datasheets, probably around 100 us latency).
Both the comparison itself and the numbers you mentioned are not what I expected. Can you elaborate and share links to numbers?
I don't understand this. How is that any different from a standard cache miss that can be masked using e.g. out-of-order execution? Also, you are ignoring issues such as read and write amplification for fine-grained memory accesses.
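For the amplification point, a rough comparison using the usual sizes (64-byte cache line, 4 KiB block; the 8-byte payload is just an example I picked):

```c
/* Rough illustration of read amplification for fine-grained access:
 * a load-based (cache-line) path versus a block I/O path. */
#include <stdio.h>

int main(void)
{
    unsigned payload    = 8;      /* bytes the application actually wants */
    unsigned cache_line = 64;     /* what a load-based (CXL-style) access moves */
    unsigned nvme_block = 4096;   /* what a typical block read moves */

    printf("cache-line path amplification: %ux\n", cache_line / payload); /* 8x  */
    printf("block I/O path amplification:  %ux\n", nvme_block / payload); /* 512x */
    return 0;
}
```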