Apple is using LPDDR5 for M3. The bandwidth doesn't come from unified memory - it comes from using many channels. You could get the same bandwidth or more with normal DDR5 modules if you could use 8 or more channels, but in the PC space you don't usually see more than 2 or 4 channels (only common for servers).
Unrelated, but "unified memory" is a strange buzzword for Apple to use. Their memory is no different from other computers'. In fact, every computer without a discrete GPU uses a unified memory model these days!
On PC desktops I always recommend getting a mid-range tower server precisely for that reason. My oldest one is about 8 years old and only now is it showing signs of age (as in no longer being faster than the average laptop).
The new idea is having 512-bit-wide memory instead of the PC's 128-bit limitation. Normal CPU cores running normal code are not particularly bandwidth-limited. However, APUs/iGPUs are severely bandwidth-limited, hence the huge number of slow iGPUs that are fine for browsing but terrible for anything more intensive.
So Apple manages decent GPU performance, a tiny package, and great battery life. It's much harder on the PC side because every laptop/desktop chip from Intel and AMD uses a 128-bit memory bus. You have to take a huge step up in price, power, and size with something like a Threadripper, Xeon, or Epyc to get more than a 128-bit-wide memory bus, none of which are available in a laptop or Mac mini-sized SFF.
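To put rough numbers on that, here's a back-of-envelope sketch (DDR5-5600 for a typical desktop, LPDDR5X-8533 for the M4 Max, and 21Gbps GDDR6X for an RTX 4090 are representative transfer rates, not exact specs for every SKU):

    # Peak bandwidth ~= bus_width_bits / 8 * transfers_per_second
    def peak_gb_per_s(bus_width_bits, mega_transfers_per_s):
        return bus_width_bits / 8 * mega_transfers_per_s / 1000

    print(peak_gb_per_s(128, 5600))    # 128-bit DDR5 desktop:      ~89.6 GB/s
    print(peak_gb_per_s(512, 8533))    # 512-bit LPDDR5X (M4 Max):  ~546 GB/s
    print(peak_gb_per_s(384, 21000))   # 384-bit GDDR6X (RTX 4090): ~1008 GB/s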
> The new idea is having 512 bit wide memory instead of PC limitation of 128 bit wide.
It's not really a new idea, just unusual in computers. The custom SoCs that AMD makes for PlayStation and Xbox have wide (up to 384-bit) unified memory buses, very similar to what Apple is doing, with the main distinction being Apple's use of low-power LPDDR instead of the faster but more power-hungry GDDR used in the consoles.
Yeah, a lot of it is just market forces. I guess going to four channels is costly for the desktop PC space, and that's why it didn't happen, and laptops just kind of followed suit. But now that Apple is putting pressure on the market, perhaps we'll finally see quad channel becoming the norm in desktop PCs? Would be nice...
Memory interface width of modern CPUs is 64-bit (DDR4) and 32+32 (DDR5).
No CPU uses a single 128-bit memory channel, as it would result in overfetching data, i.e., 128B per access, or two cache lines.
AFAIK Apple uses 128B cache lines, so they can do a much better design and customization of the memory subsystem, as they do not have to use DIMMs -- they simply mount the DRAM on the package, hence the memory interface is whatever they want.
> Memory interface width of modern CPUs is 64-bit (DDR4) and 32+32 (DDR5).
Sure, per channel. PCs have 2x64 bit or 4x32 bit memory channels.
Not sure I get your point. Yes, PCs have 64 bit cache lines and Apple uses 128. I wouldn't expect any noticeable difference because of this. Generally a cache miss is sent to a single memory channel and results in a wait of 50-100ns, then you get 4 or 8 bytes per cycle at whatever memory clock speed you have. So Apple gets twice the bytes per cache-line miss, but the value of those extra bytes is low in most cases.
Other, bigger differences are that Apple has a larger page size (16KB vs 4KB) and that ARM supports a looser memory model, which makes it easier to reach a large fraction of peak memory bandwidth.
However, I don't see any relationship between Apple and PCs as far as DIMMs go. Both Apple and PCs can (and do) solder DRAM chips directly to the motherboard, normally on thin/light laptops. The big difference between Apple and PC is that Apple supports 128-, 256-, and 512-bit-wide memory on laptops and 1024-bit on the Studio (a bit bigger than most SFFs). To get more than 128 bits with a PC means no laptops and no SFFs -- generally large workstations with Xeons, Threadrippers, or Epycs, with substantial airflow and power requirements.
FYI cache lines are 64 bytes, not bits. So Apple is using 128 bytes.
Also important to consider that the RTX 4090 has a relatively tiny 384-bit memory bus. Smaller than the M1 Max's 512-bit bus. But the RTX 4090 has 1 TB/s bandwidth and significantly more compute power available to make use of that bandwidth.
The M4 Max is definitely not a 4090 killer; it does not match it in any way. It can, however, work on larger models than the 4090 and have a battery that lasts all day.
My memory is a bit fuzzy, but I believe the M3 Max did decently on some games compared to the laptop Nvidia 4070 (which is not the same as the desktop 4070). But it highly depended on whether the game was x86-64 (requiring emulation) and whether it was DX11 or Apple-native. I believe Apple claims improvements in Metal (Apple's GPU library) and that the M4 GPUs have better FP for ray tracing, but no significant changes in rasterized performance.
I look forward to the 3rd-party benchmarks for LLMs and gaming on the M4 Max.
Eh… not quite. Maybe on an Instinct. Unified memory means the CPU and GPU can do zero-copy, using the same memory buffer.
Many integrated graphics segregate the memory into CPU owned and GPU owned, so that even if data is on the same DIMM, a copy still needs to be performed for one side to use what the other side already has.
This means that the drivers, etc., all have to understand the unified memory model. It's not just hardware sharing DIMMs.
Yes, you could buy a brand-new (announced weeks ago) AMD Turin: 12 channels of DDR5-6000, $11,048 and 320 watts (for the CPU), and get 576GB/sec peak.
Or you could buy an M4 Max laptop for $4k, get 10+ hour battery life, have it fit in a thin/light laptop, and still get 546GB/sec. However, those are peak numbers. Apple uses longer cache lines (double), larger page sizes (quadruple), and a looser memory model. Generally I'd expect nearly every memory bandwidth measure to win on Apple over AMD's Turin.
AnandTech did bandwidth benchmarks for the M1 Max and was only able to utilize about half of it from the CPU, and the GPU used even less in 3D workloads because it wasn't bandwidth limited. It's not all about bandwidth. https://www.anandtech.com/show/17024/apple-m1-max-performanc...
Indeed. RIP AnandTech. I've seen bandwidth tests since then that showed similar results for newer generations, but not the M4. Not sure if the common LLM tools on Mac can use the CPU (vector instructions), AMX, and Neural Engine in parallel to make use of the full bandwidth.
You lose out on things like expandability (more storage, more PCIe lanes) and repairability though. You are also (on M4 for probably a few years) compelled to use macOS, for better or worse.
There are, in my experience, professionals who want to use the best tools someone else builds for them, and professionals who want to keep iterating on their tools to make them the best they can be. It's the difference between, say, a violin and a Eurorack. Neither's better or worse, they're just different kinds of tools.
I was sorely tempted by the Mac Studio, but ended up with a 96GB RAM Ryzen 7900 (12-core) + Radeon 7800 XT (16GB VRAM). It was a fraction of the price and easy to add storage to. The M2 Mac Studio was tempting, but wasn't refreshed for the M3 generation. It really bothered me that the storage was A) expensive, B) proprietary, C) tightly controlled, and D) you can't boot without internal storage.
Even moving storage between Apple studios can be iffy. Would I be able to replace the storage if it died in 5 years? Or expand it?
As tempting as the size, efficiency, and bandwidth were, I just couldn't justify top $ without knowing how long it would be useful. Sad they didn't just add two NVMe ports or make some kind of raw storage available (NVMe flash, but without the smarts).
> Even moving storage between Apple studios can be iffy.
This was really driven home to me by my recent purchase of an Optane 905p, a drive that is both very fast and has an MTBF measured in the hundreds of years. Short of a power surge or (in California) an earthquake, it's not going to die in my lifetime -- why should I not keep using it for a long time?
Many kinds of professionals are completely fine with having their Optanes and what not only be plugged in externally, though, even though it may mean their boot drive will likely die at some point. That's completely okay I think.
I doubt you'll get 10+ hours on battery if you utilize it at max. I don't even know if it can really sustain the maximum load for more than a couple of minutes because of thermal or some other limits.
The 14" MBP has a 72 watt-hour battery and the 16" has a 100 watt-hour battery.
At full tilt an M3 Max will consume 50 to 75 watts, meaning you get 1 to 2 hours of runtime at best.
That's the thing I find funny about the Apple Silicon MBP craze: sure, they are efficient, but if you use the thing as a workstation, battery life is still not good enough to really use it unplugged.
Most claiming insane battery life are using the thing effectively as an information appliance or a media machine. At this game the PC laptops might not be as efficient but the runtime is not THAT different provided the same battery capacity.
FWIW I ran a quick test of gemma.cpp on M3 Pro with 8 threads. Similar PaliGemma inference speed to an older AMD (Rome or Milan) with 8 threads. But the AMD has more cores than that, and more headroom :)
"Unified memory" doesn't really imply anything about the memory being located on-package, just that it's a shared pool that the CPU, GPU, etc. all have fast access to.
Also, DRAM is never on-die. On-package, yes, for Apple's SoCs and various other products throughout the industry, but DRAM manufacturing happens in entirely different fabs than those used for logic chips.
It's mostly an IBM thing. In the consumer space, it's been in game consoles with IBM-fabbed chips. Intel's use of eDRAM was on a separate die (there was a lot that was odd about those parts).
Yeah memory bandwidth is one of the really unfortunate things about the consumer stuff. Even the 9950x/7950x, which are comfortably workstation-level in terms of compute, are bound by their 2 channel limits. The other day I was pricing out a basic Threadripper setup with a 7960x (not just for this reason but also for more PCIe lanes), and it would cost around $3000 -- somewhat out of my budget.
This is one of the reasons the "3D vcache" stuff with the giant L3 cache is so effective.
For comparison, a Threadripper Pro 5000 workstation with 8x DDR4 3200 has 204.8GB/s of memory bandwidth.
The Threadripper Pro 7000 with DDR5-5200 can achieve 325GB/s.
And no, manaskarekar, the M4 Max does 546 GB/s (gigabytes per second), not Gbps (gigabits, which would be 8x less!).
Thanks for the numbers. Someone here on hackernews got me convinced that a Threadripper would be a better investment for inference than a MacBook Pro with a M3 Max.
> So for example if you have a server with 16 DDR5 DIMMs (sticks) it equates to 1,024 GB/s of total bandwidth.
Not quite, as it depends on the number of channels, not the number of DIMMs. An extreme example: put all 16 DIMMs on a single channel and you will get the performance of a single channel.
If you're referring to the line you quoted, then no, it's not wrong. Each DIMM is perfectly capable of 64GiB/s, just as the article says. Where it might be confusing is that this article seems to only be concerning itself with the DIMM itself and not with the memory controller on the other end. As the other reply said, the actual bandwidth available also depends on the number of memory channels provided by the CPU, where each channel provides one DIMM worth of bandwidth.
This means that in practice, consumer x86 CPUs have only 128GiB/s of DDR5 memory bandwidth available (regardless of the number of DIMM slots in the system), because the vast majority of them only offer two memory channels. Server CPUs can offer 4, 8, 12, or even more channels, but you can't just install 16 DIMMs and expect to get 1024GiB/s of bandwidth, unless you've verified that your CPU has 16 memory channels.
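A minimal sketch of that point, with DDR5-6000 and the channel counts below picked just for illustration: bandwidth follows the controller's channel count, and adding DIMMs to existing channels only adds capacity.

    # Peak bandwidth scales with memory channels, not with DIMM count.
    CHANNEL_BITS = 64      # one DDR5 DIMM presents 2x32-bit subchannels = 64 bits
    MT_PER_S = 6000        # illustrative DDR5-6000

    def peak_gb_per_s(channels):
        return channels * CHANNEL_BITS / 8 * MT_PER_S / 1000

    for channels in (2, 8, 12):   # desktop, Threadripper Pro/Xeon, Epyc "Turin"
        print(channels, "channels:", peak_gb_per_s(channels), "GB/s")
    # 16 DIMMs hanging off only 2 channels still top out at the 2-channel figure.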
It's not the memory being unified that makes it fast, it's the combination of the memory bus being extremely wide and the memory being extremely close to the processor. It's the same principle that discrete GPUs or server CPUs with onboard HBM memory use to make their non-unified memory go ultra fast.
No, unified memory usually means the CPU and GPU (and miscellaneous things like the NPU) all use the same physical pool of RAM and moving data between them is essentially zero-cost. That's in contrast to the usual PC setup where the CPU has its own pool of RAM, which is unified with the iGPU if it has one, but the discrete GPU has its own independent pool of VRAM and moving data between the two pools is a relatively slow operation.
An RTX4090 or H100 has memory extremely close to the processor but I don't think you would call it unified memory.
I don't quite understand one of the finer points of this, under-caffeinated :) - if GPU memory is extremely close to the CPU memory, what sort of memory would not be extremely close to the CPU?
I think you misunderstood what I meant by "processor", the memory on a discrete GPU is very close to the GPUs processor die, but very far away from the CPU. The GPU may be able to read and write its own memory at 1TB/sec but the CPU trying to read or write that same memory will be limited by the PCIe bus, which is glacially slow by comparison, usually somewhere around 16-32GB/sec.
A huge part of optimizing code for discrete GPUs is making sure that data is streamed into GPU memory before the GPU actually needs it, because pushing or pulling data over PCIe on-demand decimates performance.
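A minimal PyTorch sketch of that pattern, assuming a CUDA machine (the 4096x4096 batches and the compute() stand-in are made up for illustration): pinned host buffers plus a separate copy stream let the next batch cross PCIe while the GPU is still computing on the current one, instead of pulling data on demand.

    import torch

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()

    # Page-locked (pinned) host memory is required for truly async DMA copies.
    batches = [torch.randn(4096, 4096).pin_memory() for _ in range(8)]

    def compute(x):
        return x @ x  # stand-in for real GPU work

    current = batches[0].to(device, non_blocking=True)
    for i in range(len(batches)):
        nxt = None
        if i + 1 < len(batches):
            with torch.cuda.stream(copy_stream):              # prefetch on a side stream
                nxt = batches[i + 1].to(device, non_blocking=True)
        out = compute(current)                                # default stream does the work
        torch.cuda.current_stream().wait_stream(copy_stream)  # make sure the copy landed
        if nxt is not None:
            nxt.record_stream(torch.cuda.current_stream())    # keep the caching allocator honest
        current = nxt
    torch.cuda.synchronize()
    print(out.sum().item())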
> CPU trying to read or write that same memory will be limited by the PCIe bus, which is glacially slow by comparison, usually somewhere around 16-32GB/sec.
If you’re forking out for H100s, you’ll usually be putting them on a bus with much higher throughput, 200GB/s or more.
I thought it meant that both the GPU and the CPU can access it. In most systems, GPU memory cannot be accessed by the CPU (without going through the GPU); and vice versa.
CPUs access GPU memory via MMIO (though usually only a small portion), and GPUs can in principle access main memory via DMA. Meaning, both can share an address space and access each other’s memory. However, that wouldn’t be called Unified Memory, because it’s still mediated by an external bus (PCIe) and thus relatively slower.
This is still half the speed of a consumer NVidia card, but the large amounts of memory is great, if you don't mind running things more slowly and with fewer libraries.
Was this example intended to describe any particular device? Because I'm not aware of anything that operates at 8800 MT/s, especially not with 64-bit channels.
That seems unlikely given the mismatched memory speed (see the parent comment) and the fact that Apple uses LPDDR which is typically 16 bits per channel. 8800MT/s seems to be a number pulled out of thin air or bad arithmetic.
Heh, OK, maybe slightly different. But Apple's spec claims 546GB/sec, which works out to 512 bits (64 bytes) * 8533. I didn't think the point was 8533 vs 8800.
I believe I saw somewhere that the actual chips used are LPDDR5X-8533.
Effectively the parent's formula describes the M4 Max, give or take 5%.
Fewer libraries? Any that a normal LLM user would care about? PyTorch, Ollama, and others seem to have the normal use cases covered. Whenever I hear about a new LLM, it seems like the next post is some Mac user reporting the tokens/sec. Often about 5 tokens/sec for 70B models, which seems reasonable for a single user.
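Those ~5 tokens/sec reports are roughly consistent with a bandwidth-bound estimate, by the way (a rough sketch; the 4-bit quantization and the fraction of peak bandwidth actually achieved are assumptions):

    # Single-user decoding streams essentially all the weights once per token,
    # so tokens/sec is bounded by memory bandwidth / model size in bytes.
    def tok_per_s_ceiling(bandwidth_gb_s, params_billion, bits_per_weight):
        model_gb = params_billion * bits_per_weight / 8
        return bandwidth_gb_s / model_gb

    print(tok_per_s_ceiling(546, 70, 4))   # M4 Max peak:       ~15.6 tok/s ceiling
    print(tok_per_s_ceiling(273, 70, 4))   # at ~half the peak:  ~7.8 tok/s
    # Observed ~5 tok/s sits below these ceilings, as you'd expect.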
Is there a normal LLM user yet? Most people would want their options to be as wide as possible. The big ones usually get covered (eventually), and there are distinct good libraries emerging for Mac only (sigh), but last I checked the experience of running every kit (stable diffusion, server-class, etc) involved overhead for the Mac world.
A 24GB model is fast and ranks 3.
A 70B model is slow and ranks 8.
A top-tier hosted model is fast and ranks 100.
Past what specialized models can do, it's about a mixture/agentic approach and next level, nuclear power scale. Having a computer with lots of relatively fast RAM is not magic.
Thanks, but just to put things into perspective, this calculation has counted 8 channels which is 4 DIMMs and that's mostly desktops (not dismissing desktops, just highlighting that it's a different beast).
Desktops are two channels of 64 bits, or with DDR5 now four (sub)channels of 32 bits; either way, mainstream desktop platforms have had a total bus width of 128 bits for decades. 8x64 bit channels is only available from server platforms. (Some high-end GPUs have used 512-bit bus widths, and Apple's Max level of processors, but those are with memory types where the individual channels are typically 16 bits.)
The vast majority of x86 laptops and desktops are 128 bits wide: often 2x64-bit channels until recently, now 4x32-bit with DDR5 in the last year or so. There are some benefits to 4 channels over 2, but generally you are still limited to 128 bits unless you buy a Xeon, Epyc, or Threadripper (or Intel equivalent), which are expensive, hot, and don't fit in SFFs or laptops.
So basically the PC world is crazy behind the 256-, 512-, and 1024-bit-wide memory buses Apple has offered since the M1 arrived.
> This is still half the speed of a consumer NVidia card, but the large amounts of memory is great, if you don't mind running things more slowly and with fewer libraries.
But it has more than 2x longer battery life and a better keyboard than a GPU card ;)
I don't think so? That PDF I linked is from 2015, way before Apple put focus on it through their M-series chips... And the Wikipedia article on "Glossary of computer graphics" has had an entry for unified memory since 2016: https://en.wikipedia.org/w/index.php?title=Glossary_of_compu...
For Apple to have come up with using the term "unified memory" to describe this kind of architecture, they would've needed to come up with it at least before 2016, meaning A9 chip or earlier. I have paid some attention to Apple's SoC launches through the years and can't recall them touting it as a feature in marketing materials before the M1. Do you have something which shows them using the term before 2016?
To be clear, it wouldn't surprise me if it had been used by others before Intel did in 2015 as well, but it's a starting point: if Apple hadn't used the term before then, we know for sure that they didn't come up with it, while if Apple did use it to describe the A9 or earlier, we'll have to go digging for older documents to determine whether Apple came up with it.
There are actual differences but they're mostly up to the drivers. "Shared" memory typically means it's the same DRAM but part of it is carved out and can only be used by the GPU. "Unified" means the GPU/CPU can freely allocate individual pages as needed.
We run our LLM workloads on a M2 Ultra because of this. 2x the VRAM; one-time cost at $5350 was the same as, at the time, 1 month of 80GB VRAM GPU in GCP. Works well for us.
Can you elaborate: are those workflows queued, or can they serve multiple users in parallel?
I think it's super interesting to learn about real-life workflows and the performance of different LLMs and hardware, in case you can direct me to other resources.
Thanks!
About 10-20% of my company's GPU usage is inference dev. Yes, a horribly inefficient use of resources. We could upgrade the 100-ish devs who do this dev work to M4 MBPs and free up GPU resources.
At some point there should be an upgrade to the M2 Ultra. It might be an M4 Ultra, it might be this year or next year. It might even be after the M5 comes out. Or it could be skipped in favour of the M5 Ultra. If anyone here knows they are definitely under NDA.
They aren't going to be using fp32 for inferencing, so those FP numbers are meaningless.
Memory and memory bandwidth matter most for inference. 819.2 GB/s for the M2 Ultra is less than half that of an A100, but having 192GB of RAM instead of 80GB means they can run inference on models that would require THREE of those A100s, and the only real cost is that it takes longer for the AI to respond.
Three A100s at $5,300/mo each for the past 2 years is over $380,000. Considering it worked for them, I'd consider it a massive success.
From another perspective though, they could have bought 72 of those Ultra machines for that much money and had most devs on their own private instance.
The simple fact is that Nvidia GPUs are massively overpriced. Nvidia should worry a LOT that Apple's private AI cloud is going to eat their lunch.
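For anyone who wants to plug in their own prices, the arithmetic behind the rental-vs-buy comparison is just this (numbers as quoted upthread):

    # Cloud rental vs. buying M2 Ultras outright, using the figures quoted above.
    a100_per_month = 5300            # one 80GB A100 in GCP, $/month
    ultra_price = 5350               # 192GB M2 Ultra, one-time

    rental_2yr = 3 * a100_per_month * 24
    print(rental_2yr)                # 381600 -> "over $380,000"
    print(rental_2yr / ultra_price)  # ~71 -> roughly the 72 machines mentioned above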
High availability story for AI workloads will be a problem for another decade. From what I can see the current pressing problem is to get stuff working quickly and iterate quickly.
I'm curious about getting one of these to run LLM models locally, but I don't understand the cost benefit very well. Even 128GB can't run, like, a state of the art Claude 3.5 or GPT 4o model right? Conversely, even 16GB can (I think?) run a smaller, quantized Llama model. What's the sweet spot for running a capable model locally (and likely future local-scale models)?
You'll be able to run 72B models w/ large context, lightly quantized with decent'ish performance, like 20-25 tok/sec. The best of the bunch are maybe 90% of a Claude 3.5.
If you need to do some work offline, or for some reason the place you work blocks access to cloud providers, it's not a bad way to go, really. Note that if you're on battery, heavy LLM use can kill your battery in an hour.
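A rough footprint estimate for why a lightly quantized 72B model plus a large context fits comfortably in 128GB (the ~5 bits/weight, 80 layers, 8 KV heads, and 128 head dimension below are assumptions for a typical GQA-style 72B model, not exact figures for any particular one):

    # Approximate memory needed: quantized weights + KV cache for the context.
    def weights_gb(params_billion, bits_per_weight):
        return params_billion * bits_per_weight / 8

    def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    w = weights_gb(72, 5)                    # ~45 GB of weights
    kv = kv_cache_gb(80, 8, 128, 32_768)     # ~10.7 GB for a 32k context
    print(w, kv, w + kv)                     # ~56 GB total -> plenty of headroom in 128GB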
Having 128GB is really nice if you want to regularly run different full OSes as VMs simultaneously (and if those OSes might in turn have memory-intensive workloads running on them).
At least in the recent past, a hindrance was that macOS limited how much of that unified memory could be assigned as VRAM. Those who wanted to exceed the limit had to tinker with kernel settings.
I wonder if that has changed or is about to change as Apple pivots their devices to better serve AI workflows as well.
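For reference, the knob people usually point at is an iogpu wired-limit sysctl; treat the exact name and the roughly-75% default as community-reported assumptions rather than documented behavior, and note the setting reverts on reboot. A tiny sketch that just prints the command for a hypothetical 128GB machine:

    # Hypothetical example: raise the GPU "wired" memory ceiling on Apple Silicon.
    # The sysctl name below (iogpu.wired_limit_mb) is the one commonly reported
    # for recent macOS releases; verify on your own system before relying on it.
    vram_gb = 112                      # leave ~16GB for the OS and other processes
    print(f"sudo sysctl iogpu.wired_limit_mb={vram_gb * 1024}")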
You'd probably save money just paying for a VPS. And you wouldn't cook your personal laptop as fast. Not that people nowadays keep their electronics for long enough for that to matter :/
This is definitely tempting me to upgrade my M1 macbook pro. I think I have 400GB/s of memory bandwidth. I am wondering what the specific number "over half a terabyte" means.
I am always wondering if one shouldn't be doing the resource intensive LLM stuff in the cloud. I don't know enough to know the advantages of doing it locally.