Apple is using LPDDR5 for M3. The bandwidth doesn't come from unified memory - it comes from using many channels. You could get the same bandwidth or more with normal DDR5 modules if you could use 8 or more channels, but in the PC space you don't usually see more than 2 or 4 channels (only common for servers).
Unrelated, but "unified memory" is a strange buzzword for Apple to use. Their memory is no different from other computers'. In fact, every computer without a discrete GPU uses a unified memory model these days!
On PC desktops I always recommend getting a mid-range tower server precisely for that reason. My oldest one is about 8 years old and only now is it showing signs of age (as in no longer being faster than the average laptop).
The new idea is having 512-bit-wide memory instead of the PC's 128-bit limitation. Normal CPU cores running normal code are not particularly bandwidth-limited. However, APUs/iGPUs are severely bandwidth-limited, hence the huge number of slow iGPUs that are fine for browsing but terrible for anything more intensive.
So Apple manages decent GPU performance, a tiny package, and great battery life. It's much harder on the PC side because every laptop/desktop chip from Intel and AMD uses a 128-bit memory bus. You have to take a huge step up in price, power, and size with something like a Threadripper, Xeon, or Epyc to get more than a 128-bit-wide memory bus, none of which are available in a laptop or Mac mini-sized SFF.
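To put rough numbers on that, here's a back-of-envelope sketch (DDR5-5600 for a typical desktop, LPDDR5X-8533 for the M4 Max, and 21Gbps GDDR6X for an RTX 4090 are representative transfer rates, not exact specs for every SKU):

    # Peak bandwidth ~= bus_width_bits / 8 * transfers_per_second
    def peak_gb_per_s(bus_width_bits, mega_transfers_per_s):
        return bus_width_bits / 8 * mega_transfers_per_s / 1000

    print(peak_gb_per_s(128, 5600))    # 128-bit DDR5 desktop:      ~89.6 GB/s
    print(peak_gb_per_s(512, 8533))    # 512-bit LPDDR5X (M4 Max):  ~546 GB/s
    print(peak_gb_per_s(384, 21000))   # 384-bit GDDR6X (RTX 4090): ~1008 GB/s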
> The new idea is having 512 bit wide memory instead of PC limitation of 128 bit wide.
It's not really a new idea, just unusual in computers. The custom SoCs that AMD makes for PlayStation and Xbox have wide (up to 384-bit) unified memory buses, very similar to what Apple is doing, with the main distinction being Apple's use of low-power LPDDR instead of the faster but more power-hungry GDDR used in the consoles.
Yeah, a lot of it is just market forces. I guess going to four channels is costly for the desktop PC space, and that's why it didn't happen, and laptops just kind of followed suit. But now that Apple is putting pressure on the market, perhaps we'll finally see quad channel becoming the norm in desktop PCs? Would be nice...
Memory interface width of modern CPUs is 64-bit (DDR4) and 32+32 (DDR5).
No CPU uses a single 128-bit memory channel, as it would result in overfetching data, i.e., 128B per access, or two cache lines.
AFAIK Apple uses 128B cache lines, so they can do a much better design and customization of the memory subsystem, as they do not have to use DIMMs -- they simply mount the DRAM on the package, hence the memory interface is whatever they want.
> Memory interface width of modern CPUs is 64-bit (DDR4) and 32+32 (DDR5).
Sure, per channel. PCs have 2x64 bit or 4x32 bit memory channels.
Not sure I get your point. Yes, PCs have 64 bit cache lines and Apple uses 128. I wouldn't expect any noticeable difference because of this. Generally a cache miss is sent to a single memory channel and results in a wait of 50-100ns, then you get 4 or 8 bytes per cycle at whatever memory clock speed you have. So Apple gets twice the bytes per cache-line miss, but the value of those extra bytes is low in most cases.
Other, bigger differences are that Apple has a larger page size (16KB vs 4KB) and that ARM supports a looser memory model, which makes it easier to reach a large fraction of peak memory bandwidth.
However, I don't see any relationship between Apple and PCs as far as DIMMs go. Both Apple and PCs can (and do) solder DRAM chips directly to the motherboard, normally on thin/light laptops. The big difference between Apple and PC is that Apple supports 128-, 256-, and 512-bit-wide memory on laptops and 1024-bit on the Studio (a bit bigger than most SFFs). To get more than 128 bits with a PC means no laptops and no SFFs -- generally large workstations with Xeons, Threadrippers, or Epycs, with substantial airflow and power requirements.
FYI cache lines are 64 bytes, not bits. So Apple is using 128 bytes.
Also important to consider that the RTX 4090 has a relatively tiny 384-bit memory bus. Smaller than the M1 Max's 512-bit bus. But the RTX 4090 has 1 TB/s bandwidth and significantly more compute power available to make use of that bandwidth.
The M4 Max is definitely not a 4090 killer; it does not match it in any way. It can, however, work on larger models than the 4090 and have a battery that lasts all day.
My memory is a bit fuzzy, but I believe the M3 Max did decently on some games compared to the laptop Nvidia 4070 (which is not the same as the desktop 4070). But it highly depended on whether the game was x86-64 (requiring emulation) and whether it was DX11 or Apple-native. I believe Apple claims improvements in Metal (Apple's GPU library) and that the M4 GPUs have better FP for ray tracing, but no significant changes in rasterized performance.
I look forward to the 3rd-party benchmarks for LLMs and gaming on the M4 Max.
Eh… not quite. Maybe on an Instinct. Unified memory means the CPU and GPU can do zero-copy, using the same memory buffer.
Many integrated graphics segregate the memory into CPU owned and GPU owned, so that even if data is on the same DIMM, a copy still needs to be performed for one side to use what the other side already has.
This means that the drivers, etc., all have to understand the unified memory model. It's not just hardware sharing DIMMs.
Yes, you could buy a brand-new (announced weeks ago) AMD Turin: 12 channels of DDR5-6000, $11,048 and 320 watts (for the CPU), and get 576GB/sec peak.
Or you could buy an M4 Max laptop for $4k, get 10+ hour battery life, have it fit in a thin/light laptop, and still get 546GB/sec. However, those are peak numbers. Apple uses longer cache lines (double), larger page sizes (quadruple), and a looser memory model. Generally I'd expect nearly every memory bandwidth measure to win on Apple over AMD's Turin.
AnandTech did bandwidth benchmarks for the M1 Max and was only able to utilize about half of it from the CPU, and the GPU used even less in 3D workloads because it wasn't bandwidth limited. It's not all about bandwidth. https://www.anandtech.com/show/17024/apple-m1-max-performanc...
Indeed. RIP AnandTech. I've seen bandwidth tests since then that showed similar results for newer generations, but not the M4. Not sure if the common LLM tools on Mac can use the CPU (vector instructions), AMX, and Neural Engine in parallel to make use of the full bandwidth.
You lose out on things like expandability (more storage, more PCIe lanes) and repairability though. You are also (on M4 for probably a few years) compelled to use macOS, for better or worse.
There are, in my experience, professionals who want to use the best tools someone else builds for them, and professionals who want to keep iterating on their tools to make them the best they can be. It's the difference between, say, a violin and a Eurorack. Neither's better or worse, they're just different kinds of tools.
I was sorely tempted by the Mac Studio, but ended up with a 96GB RAM Ryzen 7900 (12-core) + Radeon 7800 XT (16GB VRAM). It was a fraction of the price and easy to add storage to. The M2 Mac Studio was tempting, but wasn't refreshed for the M3 generation. It really bothered me that the storage was A) expensive, B) proprietary, C) tightly controlled, and D) you can't boot without internal storage.
Even moving storage between Apple studios can be iffy. Would I be able to replace the storage if it died in 5 years? Or expand it?
As tempting as the size, efficiency, and bandwidth were, I just couldn't justify top $ without knowing how long it would be useful. Sad they didn't just add two NVMe ports or make some kind of raw storage available (NVMe flash, but without the smarts).
> Even moving storage between Apple studios can be iffy.
This was really driven home to me by my recent purchase of an Optane 905p, a drive that is both very fast and has an MTBF measured in the hundreds of years. Short of a power surge or (in California) an earthquake, it's not going to die in my lifetime -- why should I not keep using it for a long time?
Many kinds of professionals are completely fine with having their Optanes and what not only be plugged in externally, though, even though it may mean their boot drive will likely die at some point. That's completely okay I think.
I doubt you'll get 10+ hours on battery if you utilize it at max. I don't even know if it can really sustain the maximum load for more than a couple of minutes because of thermal or some other limits.
The 14" MBP has a 72 watt-hour battery and the 16" has a 100 watt-hour battery.
At full tilt an M3 Max will consume 50 to 75 watts, meaning you get 1 to 2 hours of runtime at best.
That's the thing I find funny about the Apple Silicon MBP craze: sure, they are efficient, but if you use the thing as a workstation, battery life is still not good enough to really use it unplugged.
Most claiming insane battery life are using the thing effectively as an information appliance or a media machine. At this game the PC laptops might not be as efficient but the runtime is not THAT different provided the same battery capacity.
FWIW I ran a quick test of gemma.cpp on M3 Pro with 8 threads. Similar PaliGemma inference speed to an older AMD (Rome or Milan) with 8 threads. But the AMD has more cores than that, and more headroom :)
"Unified memory" doesn't really imply anything about the memory being located on-package, just that it's a shared pool that the CPU, GPU, etc. all have fast access to.
Also, DRAM is never on-die. On-package, yes, for Apple's SoCs and various other products throughout the industry, but DRAM manufacturing happens in entirely different fabs than those used for logic chips.
It's mostly an IBM thing. In the consumer space, it's been in game consoles with IBM-fabbed chips. Intel's use of eDRAM was on a separate die (there was a lot that was odd about those parts).
Yeah memory bandwidth is one of the really unfortunate things about the consumer stuff. Even the 9950x/7950x, which are comfortably workstation-level in terms of compute, are bound by their 2 channel limits. The other day I was pricing out a basic Threadripper setup with a 7960x (not just for this reason but also for more PCIe lanes), and it would cost around $3000 -- somewhat out of my budget.
This is one of the reasons the "3D vcache" stuff with the giant L3 cache is so effective.
For comparison, a Threadripper Pro 5000 workstation with 8x DDR4 3200 has 204.8GB/s of memory bandwidth.
The Threadripper Pro 7000 with DDR5-5200 can achieve 325GB/s.
And no, manaskarekar, the M4 Max does 546 GB/s (gigabytes per second), not Gbps (gigabits, which would be 8x less!).
Thanks for the numbers. Someone here on hackernews got me convinced that a Threadripper would be a better investment for inference than a MacBook Pro with a M3 Max.
> So for example if you have a server with 16 DDR5 DIMMs (sticks) it equates to 1,024 GB/s of total bandwidth.
Not quite, as it depends on the number of channels, not the number of DIMMs. An extreme example: put all 16 DIMMs on a single channel and you will get the performance of a single channel.
If you're referring to the line you quoted, then no, it's not wrong. Each DIMM is perfectly capable of 64GiB/s, just as the article says. Where it might be confusing is that this article seems to only be concerning itself with the DIMM itself and not with the memory controller on the other end. As the other reply said, the actual bandwidth available also depends on the number of memory channels provided by the CPU, where each channel provides one DIMM worth of bandwidth.
This means that in practice, consumer x86 CPUs have only 128GiB/s of DDR5 memory bandwidth available (regardless of the number of DIMM slots in the system), because the vast majority of them only offer two memory channels. Server CPUs can offer 4, 8, 12, or even more channels, but you can't just install 16 DIMMs and expect to get 1024GiB/s of bandwidth, unless you've verified that your CPU has 16 memory channels.
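A minimal sketch of that point, with DDR5-6000 and the channel counts below picked just for illustration: bandwidth follows the controller's channel count, and adding DIMMs to existing channels only adds capacity.

    # Peak bandwidth scales with memory channels, not with DIMM count.
    CHANNEL_BITS = 64      # one DDR5 DIMM presents 2x32-bit subchannels = 64 bits
    MT_PER_S = 6000        # illustrative DDR5-6000

    def peak_gb_per_s(channels):
        return channels * CHANNEL_BITS / 8 * MT_PER_S / 1000

    for channels in (2, 8, 12):   # desktop, Threadripper Pro/Xeon, Epyc "Turin"
        print(channels, "channels:", peak_gb_per_s(channels), "GB/s")
    # 16 DIMMs hanging off only 2 channels still top out at the 2-channel figure.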
It's not the memory being unified that makes it fast, it's the combination of the memory bus being extremely wide and the memory being extremely close to the processor. It's the same principle that discrete GPUs or server CPUs with onboard HBM memory use to make their non-unified memory go ultra fast.
No, unified memory usually means the CPU and GPU (and miscellaneous things like the NPU) all use the same physical pool of RAM and moving data between them is essentially zero-cost. That's in contrast to the usual PC setup where the CPU has its own pool of RAM, which is unified with the iGPU if it has one, but the discrete GPU has its own independent pool of VRAM and moving data between the two pools is a relatively slow operation.
An RTX4090 or H100 has memory extremely close to the processor but I don't think you would call it unified memory.
I don't quite understand one of the finer points of this, under-caffeinated :) - if GPU memory is extremely close to the CPU memory, what sort of memory would not be extremely close to the CPU?
I think you misunderstood what I meant by "processor", the memory on a discrete GPU is very close to the GPUs processor die, but very far away from the CPU. The GPU may be able to read and write its own memory at 1TB/sec but the CPU trying to read or write that same memory will be limited by the PCIe bus, which is glacially slow by comparison, usually somewhere around 16-32GB/sec.
A huge part of optimizing code for discrete GPUs is making sure that data is streamed into GPU memory before the GPU actually needs it, because pushing or pulling data over PCIe on-demand decimates performance.
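A minimal PyTorch sketch of that pattern, assuming a CUDA machine (the 4096x4096 batches and the compute() stand-in are made up for illustration): pinned host buffers plus a separate copy stream let the next batch cross PCIe while the GPU is still computing on the current one, instead of pulling data on demand.

    import torch

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()

    # Page-locked (pinned) host memory is required for truly async DMA copies.
    batches = [torch.randn(4096, 4096).pin_memory() for _ in range(8)]

    def compute(x):
        return x @ x  # stand-in for real GPU work

    current = batches[0].to(device, non_blocking=True)
    for i in range(len(batches)):
        nxt = None
        if i + 1 < len(batches):
            with torch.cuda.stream(copy_stream):              # prefetch on a side stream
                nxt = batches[i + 1].to(device, non_blocking=True)
        out = compute(current)                                # default stream does the work
        torch.cuda.current_stream().wait_stream(copy_stream)  # make sure the copy landed
        if nxt is not None:
            nxt.record_stream(torch.cuda.current_stream())    # keep the caching allocator honest
        current = nxt
    torch.cuda.synchronize()
    print(out.sum().item())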
> CPU trying to read or write that same memory will be limited by the PCIe bus, which is glacially slow by comparison, usually somewhere around 16-32GB/sec.
If you’re forking out for H100s, you’ll usually be putting them on a bus with much higher throughput, 200GB/s or more.
I thought it meant that both the GPU and the CPU can access it. In most systems, GPU memory cannot be accessed by the CPU (without going through the GPU); and vice versa.
CPUs access GPU memory via MMIO (though usually only a small portion), and GPUs can in principle access main memory via DMA. Meaning, both can share an address space and access each other’s memory. However, that wouldn’t be called Unified Memory, because it’s still mediated by an external bus (PCIe) and thus relatively slower.
This is still half the speed of a consumer NVidia card, but the large amounts of memory is great, if you don't mind running things more slowly and with fewer libraries.
Was this example intended to describe any particular device? Because I'm not aware of anything that operates at 8800 MT/s, especially not with 64-bit channels.
That seems unlikely given the mismatched memory speed (see the parent comment) and the fact that Apple uses LPDDR which is typically 16 bits per channel. 8800MT/s seems to be a number pulled out of thin air or bad arithmetic.
Heh, OK, maybe slightly different. But Apple's spec claims 546GB/sec, which works out to 512 bits (64 bytes) * 8533. I didn't think the point was 8533 vs 8800.
I believe I saw somewhere that the actual chips used are LPDDR5X-8533.
Effectively the parent's formula describes the M4 Max, give or take 5%.
Fewer libraries? Any that a normal LLM user would care about? PyTorch, Ollama, and others seem to have the normal use cases covered. Whenever I hear about a new LLM, it seems like the next post is some Mac user reporting the tokens/sec. Often about 5 tokens/sec for 70B models, which seems reasonable for a single user.
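Those ~5 tokens/sec reports are roughly consistent with a bandwidth-bound estimate, by the way (a rough sketch; the 4-bit quantization and the fraction of peak bandwidth actually achieved are assumptions):

    # Single-user decoding streams essentially all the weights once per token,
    # so tokens/sec is bounded by memory bandwidth / model size in bytes.
    def tok_per_s_ceiling(bandwidth_gb_s, params_billion, bits_per_weight):
        model_gb = params_billion * bits_per_weight / 8
        return bandwidth_gb_s / model_gb

    print(tok_per_s_ceiling(546, 70, 4))   # M4 Max peak:       ~15.6 tok/s ceiling
    print(tok_per_s_ceiling(273, 70, 4))   # at ~half the peak:  ~7.8 tok/s
    # Observed ~5 tok/s sits below these ceilings, as you'd expect.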
Is there a normal LLM user yet? Most people would want their options to be as wide as possible. The big ones usually get covered (eventually), and there are distinct good libraries emerging for Mac only (sigh), but last I checked the experience of running every kit (stable diffusion, server-class, etc) involved overhead for the Mac world.
A 24GB model is fast and ranks 3.
A 70B model is slow and ranks 8.
A top-tier hosted model is fast and ranks 100.
Past what specialized models can do, it's about a mixture/agentic approach and next level, nuclear power scale. Having a computer with lots of relatively fast RAM is not magic.
Thanks, but just to put things into perspective, this calculation has counted 8 channels which is 4 DIMMs and that's mostly desktops (not dismissing desktops, just highlighting that it's a different beast).
Desktops are two channels of 64 bits, or with DDR5 now four (sub)channels of 32 bits; either way, mainstream desktop platforms have had a total bus width of 128 bits for decades. 8x64 bit channels is only available from server platforms. (Some high-end GPUs have used 512-bit bus widths, and Apple's Max level of processors, but those are with memory types where the individual channels are typically 16 bits.)
The vast majority of x86 laptops and desktops are 128 bits wide: often 2x64-bit channels until recently, now 4x32-bit with DDR5 in the last year or so. There are some benefits to 4 channels over 2, but generally you are still limited to 128 bits unless you buy a Xeon, Epyc, or Threadripper (or Intel equivalent), which are expensive, hot, and don't fit in SFFs or laptops.
So basically the PC world is crazy behind the 256-, 512-, and 1024-bit-wide memory buses Apple has offered since the M1 arrived.
> This is still half the speed of a consumer NVidia card, but the large amounts of memory is great, if you don't mind running things more slowly and with fewer libraries.
But it has more than 2x longer battery life and a better keyboard than a GPU card ;)
I don't think so? That PDF I linked is from 2015, way before Apple put focus on it through their M-series chips... And the Wikipedia article on "Glossary of computer graphics" has had an entry for unified memory since 2016: https://en.wikipedia.org/w/index.php?title=Glossary_of_compu...
For Apple to have come up with using the term "unified memory" to describe this kind of architecture, they would've needed to come up with it at least before 2016, meaning A9 chip or earlier. I have paid some attention to Apple's SoC launches through the years and can't recall them touting it as a feature in marketing materials before the M1. Do you have something which shows them using the term before 2016?
To be clear, it wouldn't surprise me if it had been used by others before Intel did in 2015 as well, but it's a starting point: if Apple hadn't used the term before then, we know for sure that they didn't come up with it, while if Apple did use it to describe the A9 or earlier, we'll have to go digging for older documents to determine whether Apple came up with it.
There are actual differences but they're mostly up to the drivers. "Shared" memory typically means it's the same DRAM but part of it is carved out and can only be used by the GPU. "Unified" means the GPU/CPU can freely allocate individual pages as needed.
We run our LLM workloads on a M2 Ultra because of this. 2x the VRAM; one-time cost at $5350 was the same as, at the time, 1 month of 80GB VRAM GPU in GCP. Works well for us.
Can you elaborate: are those workflows queued, or can they serve multiple users in parallel?
I think it's super interesting to learn about real-life workflows and the performance of different LLMs and hardware, in case you can direct me to other resources.
Thanks!
About 10-20% of my company's GPU usage is inference dev. Yes, a horribly inefficient use of resources. We could upgrade the 100-ish devs who do this dev work to M4 MBPs and free up GPU resources.
At some point there should be an upgrade to the M2 Ultra. It might be an M4 Ultra, it might be this year or next year. It might even be after the M5 comes out. Or it could be skipped in favour of the M5 Ultra. If anyone here knows they are definitely under NDA.
They aren't going to be using fp32 for inferencing, so those FP numbers are meaningless.
Memory and memory bandwidth matter most for inference. 819.2 GB/s for the M2 Ultra is less than half that of an A100, but having 192GB of RAM instead of 80GB means they can run inference on models that would require THREE of those A100s, and the only real cost is that it takes longer for the AI to respond.
Three A100s at $5,300/mo each for the past 2 years is over $380,000. Considering it worked for them, I'd consider it a massive success.
From another perspective though, they could have bought 72 of those Ultra machines for that much money and had most devs on their own private instance.
The simple fact is that Nvidia GPUs are massively overpriced. Nvidia should worry a LOT that Apple's private AI cloud is going to eat their lunch.
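For anyone who wants to plug in their own prices, the arithmetic behind the rental-vs-buy comparison is just this (numbers as quoted upthread):

    # Cloud rental vs. buying M2 Ultras outright, using the figures quoted above.
    a100_per_month = 5300            # one 80GB A100 in GCP, $/month
    ultra_price = 5350               # 192GB M2 Ultra, one-time

    rental_2yr = 3 * a100_per_month * 24
    print(rental_2yr)                # 381600 -> "over $380,000"
    print(rental_2yr / ultra_price)  # ~71 -> roughly the 72 machines mentioned above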
High availability story for AI workloads will be a problem for another decade. From what I can see the current pressing problem is to get stuff working quickly and iterate quickly.
I'm curious about getting one of these to run LLM models locally, but I don't understand the cost benefit very well. Even 128GB can't run, like, a state of the art Claude 3.5 or GPT 4o model right? Conversely, even 16GB can (I think?) run a smaller, quantized Llama model. What's the sweet spot for running a capable model locally (and likely future local-scale models)?
You'll be able to run 72B models w/ large context, lightly quantized with decent'ish performance, like 20-25 tok/sec. The best of the bunch are maybe 90% of a Claude 3.5.
If you need to do some work offline, or for some reason the place you work blocks access to cloud providers, it's not a bad way to go, really. Note that if you're on battery, heavy LLM use can kill your battery in an hour.
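A rough footprint estimate for why a lightly quantized 72B model plus a large context fits comfortably in 128GB (the ~5 bits/weight, 80 layers, 8 KV heads, and 128 head dimension below are assumptions for a typical GQA-style 72B model, not exact figures for any particular one):

    # Approximate memory needed: quantized weights + KV cache for the context.
    def weights_gb(params_billion, bits_per_weight):
        return params_billion * bits_per_weight / 8

    def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
        return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    w = weights_gb(72, 5)                    # ~45 GB of weights
    kv = kv_cache_gb(80, 8, 128, 32_768)     # ~10.7 GB for a 32k context
    print(w, kv, w + kv)                     # ~56 GB total -> plenty of headroom in 128GB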
Having 128GB is really nice if you want to regularly run different full OSes as VMs simultaneously (and if those OSes might in turn have memory-intensive workloads running on them).
At least in the recent past, a hindrance was that macOS limited how much of that unified memory could be assigned as VRAM. Those who wanted to exceed the limit had to tinker with kernel settings.
I wonder if that has changed or is about to change as Apple pivots their devices to better serve AI workflows as well.
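For reference, the knob people usually point at is an iogpu wired-limit sysctl; treat the exact name and the roughly-75% default as community-reported assumptions rather than documented behavior, and note the setting reverts on reboot. A tiny sketch that just prints the command for a hypothetical 128GB machine:

    # Hypothetical example: raise the GPU "wired" memory ceiling on Apple Silicon.
    # The sysctl name below (iogpu.wired_limit_mb) is the one commonly reported
    # for recent macOS releases; verify on your own system before relying on it.
    vram_gb = 112                      # leave ~16GB for the OS and other processes
    print(f"sudo sysctl iogpu.wired_limit_mb={vram_gb * 1024}")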
You'd probably save money just paying for a VPS. And you wouldn't cook your personal laptop as fast. Not that people nowadays keep their electronics for long enough for that to matter :/
This is definitely tempting me to upgrade my M1 macbook pro. I think I have 400GB/s of memory bandwidth. I am wondering what the specific number "over half a terabyte" means.
I am always wondering if one shouldn't be doing the resource intensive LLM stuff in the cloud. I don't know enough to know the advantages of doing it locally.