CXL Is Dead in the AI Era (semianalysis.com)
22 points by wmf 10 months ago | 17 comments



I'm a seasoned machine learning engineer who has never heard of CXL, so it's worth spelling it out: https://en.wikipedia.org/wiki/Compute_Express_Link


And people, take note: if you write an essay that uses a TLA three times in the first paragraph, it's worth expanding the Three Letter Acronym at its first occurrence.


You’re not the only one who hasn’t heard of it. Usually, any time I see a tech-related acronym on HN I at least have a vague idea of what it relates to from incidental exposure, but not this time.

I searched HN for the acronym and there are two posts with more than a few comments, with the most popular one posted five months ago. “Promised as the messiah” my behind.


HN doesn't really deal with hardware much so it's not surprising that it hasn't been discussed. In the server world CXL has been a big topic.


You'd pretty much need to be a chip designer to have heard of it. For bandwidth, most ML engineers really just care about the bandwidth between nodes in their cluster (e.g. Mellanox, or whichever networking solution they use).


Setting aside whether CXL is unsuitable for AI workloads, it's unclear why memory disaggregation wouldn't benefit from it. I don't have access to the second half of the article; can anyone give the gist of the argument?


Hoping this is shortsighted.

It's an interesting & valuable comparison: low-bandwidth, low-latency PCIe serdes vs high-bandwidth, high-latency Ethernet-like serdes.

But I'm not sure I'm on board with the half-conclusion we get from this half-paywalled analysis. AI is seemingly insatiable, sure, & there's a relentless push to higher bandwidth, yes. NVLink seems to be kicking ass & PCIe is absolutely struggling to keep any kind of pace, but it still seems wild to me to write off CXL at such an early stage.

CXL has some really real things going for it. It's a lighter-weight protocol than PCIe that reuses the same serdes many systems already have, so it's a natural, low-cost extension of what we already run. There's 512GBps of PCIe on a modern Epyc (128 lanes of PCIe 5.0); the proposal of CXL is that simplifying the protocol lets us do more with those links easily. GPUs could use more efficient protocols to talk to the CPU, with better coherency, which might let the CPU and main memory work much better in tandem; not unified memory, but no longer such far-apart memories. Large memory or storage pools that nodes can attach to on demand offer some interesting ways to disaggregate storage, which mirrors the huge success we've seen with modern disaggregated databases being able to scale.

CXL 3.1 also introduces host-to-host capabilities. That makes CXL a potential competitor to Ethernet at some levels of scale. Ethernet is getting up to 1.6Tbps/200GBps, but via enormous NICs and switches. For sure the CXL switches will also be huge, but there's no NIC. Imagine an Epyc with PCIe 6.0 speeds (8Tbps/1TBps) or 7.0 (2TBps) that doesn't need a NIC to talk to the fabric, that has much, much lower latency than Ethernet, that is an RDMA beast. It's insufficient as a full-speed AI fabric that can satisfy AI's needs, but it's a huge speed bump built right into the core. It's hard for me to reject our ability to find good uses for this in the future just because it's not how AI works today.

The article very briefly mentions Infinity Fabric XGMI, which is what AMD is floating to compete with NVLink. The article talks about this like it's the future, like it's what will crush CXL. But the same PCIe switch chips that run Infinity Fabric XGMI will also run CXL. Maybe CXL goes nowhere and IF-XGMI rises, but I tend to think there'll be more sympathetic growth & development among these standards. https://www.servethehome.com/next-gen-broadcom-pcie-switches...


What is the killer app for cache coherency?

Current PCIe devices work without cache coherency to host memory. Drivers follow a handful of patterns for communicating with the device via PCI BARs and DMA. Cache coherency is not required.
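
For what it's worth, the non-coherent pattern looks roughly like this; the register offset and descriptor layout are made up for illustration, not any real device:

    /* Minimal sketch of the usual non-coherent driver pattern: fill a DMA
       descriptor in host memory, then ring a doorbell through a BAR.
       Register offset and descriptor layout are hypothetical. */
    #include <stdint.h>

    struct dma_desc {
        uint64_t host_addr;   /* IOVA/physical address of the buffer */
        uint32_t len;
        uint32_t flags;       /* bit 0: descriptor valid */
    };

    #define DOORBELL_OFFSET 0x10

    static void mmio_write32(volatile void *bar, uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)((volatile char *)bar + off) = val;
    }

    void submit_io(volatile void *bar, struct dma_desc *ring, uint32_t tail,
                   uint64_t buf_iova, uint32_t len)
    {
        ring[tail].host_addr = buf_iova;  /* plain stores to host memory */
        ring[tail].len       = len;
        ring[tail].flags     = 1;
        __sync_synchronize();             /* order descriptor writes before the doorbell */
        mmio_write32(bar, DOORBELL_OFFSET, tail);  /* device DMAs the buffer later */
    }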

If we're not talking about devices but disaggregated memory, then an application running on a single host doesn't need cache coherency on the fabric because the host already ensures cache coherency between the CPUs. (Is that correct?)

The only case I see where cache coherency might be interesting is for multiple nodes running applications on disaggregated memory. Kind of like a single system image cluster, but it's not a software architecture that has been successful in recent years.

This is something that has puzzled me about CXL. Would be great to understand what use case cache coherency serves. Thanks!


Cache coherence across accelerators lets you pretend you have one large unified memory, similar to what Apple does. It means you can easily share pointers between CPU and GPU with zero regard for how to move the data around and still be efficient. This is irrelevant for AI, since AI does well with primitive DMA. You absolutely need CXL to have a viable OpenCL 2.2 implementation.
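
Concretely, the pointer sharing looks something like OpenCL 2.x fine-grained SVM. A rough sketch (error handling and the usual context/queue/kernel setup are omitted and assumed to exist):

    /* Rough sketch: OpenCL 2.x fine-grained SVM, i.e. the kind of pointer
       sharing that hardware coherency makes cheap. Assumes ctx, queue and
       kernel were created elsewhere; error handling omitted. */
    #include <CL/cl.h>

    void run(cl_context ctx, cl_command_queue queue, cl_kernel kernel, size_t n)
    {
        /* One allocation visible to both CPU and GPU, no explicit copies. */
        float *data = clSVMAlloc(ctx,
                                 CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                 n * sizeof(float), 0);

        for (size_t i = 0; i < n; i++)
            data[i] = (float)i;                    /* CPU writes through the pointer */

        clSetKernelArgSVMPointer(kernel, 0, data); /* hand the raw pointer to the GPU */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
        clFinish(queue);

        float first = data[0];                     /* CPU reads results, no map/unmap */
        (void)first;
        clSVMFree(ctx, data);
    }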

Here is a relevant paper: https://dl.acm.org/doi/abs/10.1145/3529336.3530817


CXL is probably a non-starter for disaggregated memory because it blocks the CPU on remote reads/writes.

To understand how expensive that is, consider that NVMe durable writes to an enterprise SSD are generally faster than a CXL write to remote RAM, yet applications still insist on using asynchronous I/O to talk to the SSD. CPU-level access to CXL is fundamentally synchronous (because of the x86, ARM, and RISC-V instruction sets).
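
To make the synchronous vs asynchronous distinction concrete, here's a sketch (not a benchmark; it assumes liburing, an already-initialized ring, and an open block device):

    /* Sketch, not a benchmark: a blocking load vs an asynchronous NVMe read.
       Assumes liburing, an open block-device fd, a 4 KiB buffer, and a ring
       already set up with io_uring_queue_init(). */
    #include <liburing.h>

    /* CXL-attached memory: just a load. If it misses, this instruction and
       everything depending on it wait for the fabric. */
    long read_remote(volatile long *cxl_ptr) { return *cxl_ptr; }

    /* NVMe: submit now, reap the completion whenever convenient. */
    int read_nvme_async(struct io_uring *ring, int fd, void *buf)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, fd, buf, 4096, 0);
        io_uring_submit(ring);                /* returns immediately */

        /* ... do other useful work here ... */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);
        int res = cqe->res;
        io_uring_cqe_seen(ring, cqe);
        return res;
    }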


I'm not sure how you managed to get that much wrong. Most applications perform synchronous filesystem I/O. Async filesystem I/O is a rare curiosity implemented in a handful of high-end databases.

You also seem completely unaware of the superscalar and out-of-order nature of modern processors, or why hyperthreading/SMT is even a thing. The way a processor works nowadays is more akin to treating the instruction sequence as a graph, finding subgraphs that can be processed in parallel, and that includes memory accesses.
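
A toy illustration of what that means for memory accesses: the independent loads below can all miss in parallel on an out-of-order core, while the pointer chase pays one full miss latency per hop:

    /* Toy illustration of memory-level parallelism on an out-of-order core.
       The misses in sum_independent() can overlap; the pointer chase in
       walk_dependent() pays the full miss latency on every hop. */
    #include <stddef.h>

    long sum_independent(const long *a, const size_t *idx, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[idx[i]];      /* independent loads: many misses in flight */
        return s;
    }

    struct node { struct node *next; long value; };

    long walk_dependent(const struct node *p)
    {
        long s = 0;
        while (p) {              /* each load depends on the previous one */
            s += p->value;
            p = p->next;
        }
        return s;
    }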

Finally, you didn't consider how frequent remote memory accesses are, or how expensive a remote memory access is without CXL.

Since CXL is cache coherent, you only need to load from remote memory when the data is not in local DRAM or your local cache. This means remote memory accesses are relatively rare. In the case where you do nothing but remote memory accesses, you have to consider your competition as the baseline: that used to be RDMA over InfiniBand or Ethernet, and the total latency for those is significantly higher. So if remote memory access matters to you, CXL is faster than the alternatives. You seem to be imagining a hypothetical situation in which no one uses remote memory access at all; that is not a fair comparison.
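
Back-of-the-envelope version of that point, with placeholder latencies rather than measurements:

    /* Illustrative only: expected access time when most accesses hit local
       memory. The latencies are round placeholder numbers, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        double local_ns  = 100.0;   /* local DRAM */
        double cxl_ns    = 300.0;   /* CXL-attached memory */
        double rdma_ns   = 2000.0;  /* RDMA to another node */
        double miss_rate = 0.02;    /* fraction of accesses that go remote */

        double with_cxl  = (1 - miss_rate) * local_ns + miss_rate * cxl_ns;
        double with_rdma = (1 - miss_rate) * local_ns + miss_rate * rdma_ns;

        printf("expected latency: CXL %.0f ns, RDMA %.0f ns\n", with_cxl, with_rdma);
        return 0;
    }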

The reason why CXL is unsuitable for AI is already mentioned in the article: AI doesn't care about cache coherence, only bandwidth, and NVLink is even worse at what you are concerned about.


That doesn't match my intuition about CXL memory. Its latency AFAIK is more similar to accessing NUMA memory on another socket (and applications are OK with blocking on that) than to an SSD, so quite a bit faster.

Actual measurements seem to support this: https://arxiv.org/pdf/2303.15375.pdf (see Figure 3)


The devices they simulated aren’t disaggregated. They measured accessing a device that’s a single PCIe hop away from the CPU.

They don’t provide absolute numbers, so it’s hard to compare with other technologies, but note that you can (in theory, and in prototypes) dispatch and complete multiple NVMe I/Os per cache miss, thanks to amortization. CXL blocks the CPU for the equivalent of multiple cache misses per I/O.


The numbers I have seen are CXL latency around 250 ns, while NVMe latency is 5-10 us (Intel Optane DC series for a fast local SSD).

You mention amortization for NVMe, but when operating at high queue depths the per-request latency is larger than at low queue depths.
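
Little's law makes that explicit; with illustrative numbers for a device that saturates at 1M IOPS:

    /* Little's law sketch: once a device saturates, per-request latency is
       roughly queue depth / throughput. Numbers are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double min_lat_us = 10.0;   /* single-request latency */
        double max_iops   = 1.0e6;  /* saturation throughput */

        for (int qd = 1; qd <= 256; qd *= 4) {
            double lat_us = qd / max_iops * 1e6;
            if (lat_us < min_lat_us)
                lat_us = min_lat_us;
            printf("QD %3d -> ~%.0f us per request\n", qd, lat_us);
        }
        return 0;
    }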

Fast local NVMe drives aren't an alternative to disaggregated memory: there is no fabric with local NVMe drives. It would be necessary to compare against NVMe over Fabrics, which is quite a bit slower (look at flash SAN appliance datasheets; probably around 100 us latency).

Both the comparison itself and the numbers you mentioned are not what I expected. Can you elaborate and share links to numbers?


I thought the paper wasn't a simulation? They refer to the CXL memories as "hard IP" and "soft IP", so two ASICs and an FPGA.

You could also imagine a DMA engine within the CXL ASIC to assist with paging things in and out of main memory in an asynchronous fashion.


I don't understand this. How is that any different from a standard cache miss that can be masked using e.g. out-of-order execution? Also, you are ignoring issues such as read and write amplification for fine-grained memory accesses.


This is a feature large CSPs were looking for to improve their unit economics. There are a lot of issues that the author points out, but first and foremost it's not a clear TCO win.



