Current PCIe devices work without cache coherency to host memory. Drivers follow a handful of patterns for communicating with the device via PCI BARs and DMA. Cache coherency is not required.
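To make "a handful of patterns" concrete, here is a minimal sketch of the usual descriptor-ring-plus-doorbell pattern; the register offset, descriptor layout, and function name are made up for illustration, not taken from any real driver:

```c
/* Hypothetical illustration of a common PCIe driver pattern: the driver
 * fills a DMA descriptor in ordinary host memory, then rings a doorbell
 * register in a PCI BAR so the device fetches it.  Offsets and the
 * descriptor layout are invented for this sketch. */
#include <stdint.h>

struct dma_desc {
    uint64_t addr;   /* bus address of the data buffer */
    uint32_t len;    /* length in bytes */
    uint32_t flags;  /* e.g. a "valid" bit the device checks */
};

#define DOORBELL_OFFSET 0x20  /* hypothetical doorbell register in BAR0 */

static void submit(volatile uint8_t *bar0, struct dma_desc *ring,
                   uint32_t tail, uint64_t buf_bus_addr, uint32_t len)
{
    /* 1. Fill the descriptor in host memory (the device reads it by DMA). */
    ring[tail].addr  = buf_bus_addr;
    ring[tail].len   = len;
    ring[tail].flags = 1u;  /* mark valid */

    /* 2. Order the descriptor write before the doorbell write.
     *    A real driver would use its kernel's write barrier (e.g. wmb()). */
    __sync_synchronize();

    /* 3. MMIO write to the doorbell tells the device there is new work.
     *    None of this requires cache coherency between device and host caches. */
    *(volatile uint32_t *)(bar0 + DOORBELL_OFFSET) = tail;
}
```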
If we're not talking about devices but disaggregated memory, then an application running on a single host doesn't need cache coherency on the fabric because the host already ensures cache coherency between the CPUs. (Is that correct?)
The only case I see where cache coherency might be interesting is for multiple nodes running applications on disaggregated memory. Kind of like a single-system-image cluster, but that's not a software architecture that has been successful in recent years.
This is something that has puzzled me about CXL. Would be great to understand what use case cache coherency serves. Thanks!
Cache coherence across accelerators lets you pretend you have one large unified memory, similar to what Apple does. It means you can easily share pointers between CPU and GPU without worrying about how to move the data around, and still be efficient. This is irrelevant for AI, since AI does well with primitive DMA. You absolutely need CXL to have a viable OpenCL 2.2 implementation.
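For a concrete picture of that pointer sharing (my sketch, not from the comment above): OpenCL 2.x fine-grained SVM lets the CPU and GPU dereference the same pointer. It assumes a device that actually reports fine-grained buffer SVM support, and error handling is stripped:

```c
/* Sketch: sharing one pointer between CPU and GPU via OpenCL 2.x
 * fine-grained SVM.  Error handling omitted; assumes the device
 * supports CL_DEVICE_SVM_FINE_GRAIN_BUFFER. */
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void inc(__global int *p) {"
    "  p[get_global_id(0)] += 1;"
    "}";

int main(void)
{
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "inc", &err);

    size_t n = 1024;
    /* One allocation, one pointer, visible to both CPU and GPU. */
    int *p = clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                        n * sizeof(int), 0);
    for (size_t i = 0; i < n; i++) p[i] = (int)i;   /* CPU writes directly */

    clSetKernelArgSVMPointer(k, 0, p);              /* pass the raw pointer */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(q);

    printf("p[10] = %d\n", p[10]);                  /* CPU reads the GPU's result */
    clSVMFree(ctx, p);
    return 0;
}
```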
CXL is probably a non-starter for disaggregated memory because it blocks the CPU on remote reads/writes.
To understand how expensive that is, consider that NVMe durable writes to an enterprise SSD are generally faster than a CXL write to remote RAM, yet applications still insist on using asynchronous I/O to talk to the SSD. CPU-level access to CXL is fundamentally synchronous (because of the x86, ARM, and RISC-V instruction sets).
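To make the sync-vs-async contrast concrete, here is a rough liburing sketch (the file name, sizes, and the "remote" pointer are placeholders of mine): the SSD write is queued and reaped later, while a load from CXL-attached memory is just a pointer dereference the core has to wait out.

```c
/* Minimal contrast between async block I/O and a synchronous load.
 * Sketch only: uses liburing; names and sizes are placeholders. */
#include <liburing.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0644);
    char *buf = aligned_alloc(4096, 4096);
    memset(buf, 'x', 4096);

    /* Asynchronous path: queue the write, submit, and go do other work.
     * The CPU is not stalled while the SSD completes it. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    /* ... other useful work can run here ... */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);       /* reap the completion later */
    io_uring_cqe_seen(&ring, cqe);

    /* Synchronous path: a load from (hypothetically) CXL-attached memory
     * is just a dereference; the ISA gives no way to say "start this load
     * and notify me later", so the core waits for it. */
    volatile long *remote = (long *)buf;  /* stand-in for a remote mapping */
    long v = *remote;
    (void)v;

    io_uring_queue_exit(&ring);
    return 0;
}
```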
I'm not sure how you managed to get that much wrong. Most applications perform synchronous filesystem I/O. Async filesystem I/O is a rare curiosity implemented in a handful of high-end databases.
You also seem completely unaware of the superscalar and out-of-order nature of modern processors, or why hyperthreading/SMT is even a thing. A modern processor effectively treats the instruction sequence as a dependency graph, finds subgraphs that can be processed in parallel, and that includes memory accesses.
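To illustrate that overlap with a toy example (mine, not from the thread): independent loads can all be in flight at once on an out-of-order core, while dependent pointer-chasing loads serialize and expose the full miss latency each time.

```c
/* Toy illustration of the "instruction graph" point: the four loads in
 * independent_sum() have no dependencies between them, so an out-of-order
 * core can have all four memory accesses in flight at once.  The loads in
 * chase() each depend on the previous result, so they serialize. */

long independent_sum(const long *a, const long *b,
                     const long *c, const long *d)
{
    /* Four independent loads: overlapped by the hardware. */
    return *a + *b + *c + *d;
}

struct node { struct node *next; long value; };

long chase(const struct node *n, int hops)
{
    /* Dependent loads: each ->next must complete before the next begins. */
    long sum = 0;
    for (int i = 0; i < hops; i++) {
        sum += n->value;
        n = n->next;
    }
    return sum;
}
```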
Then finally, you didn't consider how frequent remote memory accesses actually are, or how expensive a remote memory access is without CXL.
Since CXL is cache coherent, you only need to load from remote memory when the data is not in local DRAM or your local cache. This means that remote memory accesses are relatively rare. In the case that you do nothing but remote memory accesses, you have to consider your competition as a baseline. The competition used to be RDMA over InfiniBand or Ethernet, and the total latency for those is significantly higher. This means that if remote memory access is important to you, CXL is faster than the alternatives. You seem to be thinking of a hypothetical situation in which no one uses remote memory accesses at all. That is not a fair comparison.
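Back-of-envelope version of that argument, with latencies and a hit rate that are purely illustrative assumptions of mine (not measurements): the average cost of a CXL-backed access stays close to local latency as long as most accesses hit locally.

```c
/* Back-of-envelope expected-latency model for cache-coherent far memory.
 * All latencies and the hit rate are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double local_ns  = 100.0;   /* assumed local DRAM access */
    double remote_ns = 400.0;   /* assumed CXL-attached access */
    double hit_rate  = 0.95;    /* assumed fraction served locally */

    double avg = hit_rate * local_ns + (1.0 - hit_rate) * remote_ns;
    printf("average access latency: %.1f ns\n", avg);  /* 115 ns with these numbers */
    return 0;
}
```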
The reason why CXL is unsuitable for AI is already mentioned in the article. AI doesn't care about cache coherence, only bandwidth, but NVLink is even worse at the thing you're concerned about.
That doesn't match my intuition about CXL memory. Its latency, AFAIK, is more similar to accessing NUMA memory on another socket (and applications are OK with blocking on that) than to an SSD. So quite a bit faster.
The devices they simulated aren’t disaggregated. They measured accessing a device that’s a single PCIe hop away from the CPU.
They don’t provide absolute numbers, so it’s hard to compare with other technologies, but note that you can (in theory, and in prototypes) dispatch and complete multiple NVMe I/Os per cache miss, thanks to amortization. CXL blocks the CPU for the equivalent of multiple cache misses per I/O.
The numbers I have seen are CXL latency around 250 ns, while NVMe latency is 5-10 us (Intel Optane DC series as a local drive).
You mention amortization for NVMe, but when operating at high queue depths the per-request latency is larger than at low queue depths.
Fast local NVMe drives aren't an alternative to disaggregated memory. There is no fabric with local NVMe drives. It would be necessary to compare against NVMe over Fabrics, which is quite a bit slower (look at flash SAN appliance datasheets, probably around 100 us latency).
Both the comparison itself and the numbers you mentioned are not what I expected. Can you elaborate and share links to numbers?
I don't understand this. How is that any different from a standard cache miss that can be masked using e.g. out-of-order execution? Also, you are ignoring issues such as read and write amplification for fine-grained memory accesses.
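For the amplification point, a rough comparison using the usual sizes (64-byte cache line, 4 KiB block; the 8-byte payload is just an example I picked):

```c
/* Rough illustration of read amplification for fine-grained access:
 * a load-based (cache-line) path versus a block I/O path. */
#include <stdio.h>

int main(void)
{
    unsigned payload    = 8;      /* bytes the application actually wants */
    unsigned cache_line = 64;     /* what a load-based (CXL-style) access moves */
    unsigned nvme_block = 4096;   /* what a typical block read moves */

    printf("cache-line path amplification: %ux\n", cache_line / payload); /* 8x  */
    printf("block I/O path amplification:  %ux\n", nvme_block / payload); /* 512x */
    return 0;
}
```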