Inferencing does not require Nvidia GPUs at all, and it's almost criminal to be recommending dedicated GPUs with only 12GB of VRAM.
Buy a Mac Mini or MacBook Pro with RAM maxed out.
I just bought an M4 Mac Mini with 64GB for exactly this use case, for ~$2k. You can get 128GB in a MBP for ~$5k. These will run much larger (and more useful) models.
EDIT: Since the request was for < $1600, you can still get a 32GB Mac Mini for $1200 or a 24GB one for $800.
> it's almost criminal to be recommending dedicated GPUs with only 12GB of VRAM.
If you already own a PC, it makes a hell of a lot more sense to spend $900 on a 3090 than it does to spec out a Mac Mini with 24GB of RAM. Plus, the Nvidia setup can scale to as many GPUs as you own, which gives you upgrade options that Apple wouldn't be caught dead offering.
Oh, and native Linux support that doesn't suck balls is a plus. I haven't benchmarked a Mac since the M2 generation, but the figures I can find put the M4 Max's compute somewhere near the desktop 3060 Ti: https://browser.geekbench.com/opencl-benchmarks
A Mac Mini with 24GB is ~$800 at the cheapest configuration. I can respect wanting to do a single part upgrade, but if you're using these LLMs for serious work, the price/perf for inferencing is far in favor of using Macs at the moment.
You can easily use the Mac Mini as a hub for running the LLM while you do work on your main computer (and it won't eat up your system resources or turn your primary machine into a space heater).
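The client side of that is trivial. A minimal sketch, assuming the Mini runs an OpenAI-compatible server (llama.cpp's llama-server, Ollama, and LM Studio all expose one); the hostname, port, and model name here are placeholders:

    # Query an LLM served from the Mac Mini over the LAN; the heavy
    # lifting stays off your primary machine. Hostname, port, and model
    # are placeholders for whatever your server actually exposes.
    import json
    import urllib.request

    MINI = "http://mac-mini.local:8080/v1/chat/completions"
    payload = {
        "model": "llama-3.1-70b-instruct",  # whatever the Mini is serving
        "messages": [{"role": "user",
                      "content": "Give me one sentence on unified memory."}],
    }
    req = urllib.request.Request(MINI, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])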
I hope more non-Mac PCs come out with high-RAM SoCs; I'm personally not a huge Apple fan, but I use them begrudgingly.
Also, your $900 quote is for a used/refurbished GPU. I've had plenty of GPUs burn out on me in the old days; not sure how it is nowadays, but that's a lot to pay for a used part IMO.
If you're doing serious work, performance is more important than getting a good price/perf ratio, and a pair of 3090s is gonna be faster. That configuration is a bit more expensive, though, so it depends on your budget.
Whether performance or cost is more important depends on your use case. Some tasks that an LLM can do very well may not need to be done often, or even particularly quickly (as in my case).
e.g. LLM as one step of an ETL-style pipeline
Latency of the response really only matters if that response is user facing and is being actively awaited by the user
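Concretely, a sketch of such a step, assuming a local OpenAI-compatible endpoint; the model name, file paths, and output schema are just illustrative:

    # One ETL step: feed raw text files through a local LLM and emit
    # one JSON record per line. Nothing here is latency sensitive; the
    # queue can drain overnight.
    import json
    import pathlib
    import urllib.request

    ENDPOINT = "http://localhost:8080/v1/chat/completions"

    def extract(text):
        payload = {
            "model": "local-model",
            "messages": [{
                "role": "user",
                "content": "Return JSON with keys vendor, date, total for:\n" + text,
            }],
        }
        req = urllib.request.Request(ENDPOINT, data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]

    for doc in pathlib.Path("inbox").glob("*.txt"):
        print(json.dumps({"file": doc.name, "fields": extract(doc.read_text())}))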
> M4 Max's compute somewhere near the desktop 3060 Ti
The only advantage is the M4 Max's ability to have way more VRAM than a 3060 Ti. You won't find many M4 Maxes with just 8 or 16 GB of RAM, and I don't think you can do much except use really small models with a 3060 Ti.
It's a bit of a moot point when CUDA will run four 3060 Tis in parallel, with further options for paging out to system memory. Since most models (particularly bigger MoE ones) are sparsely activated, you can get quite a lot of mileage out of multiple PCIe slots fed with enough bandwidth.
There's no doubt in my mind that the PC is the better performer if raw power is your concern. It's far-and-away the better value if you don't need to buy new hardware and only need a GPU. $2,000 of Nvidia GPUs will buy you halfway to an enterprise cluster; $2,000 of Apple hardware will get you a laptop chip with fast unified memory.
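To make the multi-GPU point concrete, here's one way to do it with Hugging Face transformers; accelerate's device placement shards one model across every CUDA device it finds and spills the remainder to system RAM. The model name is just an example:

    # Shard one model across several CUDA GPUs, spilling overflow
    # layers to CPU RAM, via accelerate's automatic placement.
    # Requires: pip install torch transformers accelerate
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # sparse MoE, as an example
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",  # split across all visible GPUs, then CPU RAM
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))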
You need a lot of space for that, plus cooling and a good fuse that won't trip when you turn it on. I would totally just pay the money for an M4 Ultra Mac Studio with 128 GB of RAM (or an M4 Max with 64 GB). It is a much cleaner setup, especially if you aren't interested in image generation (which the Macs are not good at yet).
If I could spend $4k on a non-Apple turn key solution that I could reasonably manage in my house, I would totally consider it.
Well, that's your call. If you're the sort of person who's willing to spend $2,000 on an M4 Ultra (which doesn't quite exist yet, but we can pretend it does), then I honest to god do not understand why you'd refuse to spend that same money on a Jetson Orin with the same amount of memory, in a smaller footprint, with better performance and lower power consumption.
Unless you're specifically speccing out a computer for mobile use, the price premium you spend on a Mac isn't for better software or faster hardware. If you can tolerate Linux or Windows, I don't see why you'd even consider Mac hardware for your desktop. In the OP's position, suggesting Apple hardware literally makes no sense. They're not asking for the best hardware that runs MacOS, they're asking for the best hardware for AI.
> If I could spend $4k on a non-Apple turn key solution that I could reasonably manage in my house, I would totally consider it.
You can't pay Apple $4k for a turnkey solution, either. MacOS is borderline useless for headless inference; Vulkan compute and OpenCL are both MIA, package managers break on regular system updates and don't support rollback, LTS support barely exists, most coreutils are outdated and unmaintained, Asahi features things that MacOS doesn't support and vice-versa... you can't fool me into thinking that's a "turn key solution" any day of the week. If your car requires you to pick a package manager after you turn the engine over, then I really feel sorry for you. The state of MacOS for AI inference is truly no better than what Microsoft did with DirectML. By some accounts it's quite a bit worse.
An M4 Ultra with enough RAM will cost more than $2,000. An M2 Ultra Mac Studio with 64GB is $3,999, and you probably want more RAM than that to run the bigger models the Ultra can handle (it is basically 2x as powerful as the Max, with more memory bandwidth). An M2 Max with 64GB of RAM, which is more reasonable, will run you $2,499. I have no idea if those prices will hold when the M4 Mac Studios finally come out (an M4 Max MBP with 64 GB of RAM starts at $3,900 ATM).
> You can't pay Apple $4k for a turnkey solution, either.
I've seen/read plenty of success stories of Metal ports of models being used via LM Studio without much configuration/setup/hardware scavenging, so we can just disagree there.
> You need a lot of space for that, cooling, and a good fuse
Or live in Europe, where any wall socket can give you closer to 3kW. For crazier setups, like charging your EV, you can have three-phase plugs with ~22kW to play with. 1 m² of floor space isn't that substantial either, unless you already live in a closet in the middle of the most crowded city.
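For what it's worth, the arithmetic on a four-GPU rig (the board-power figures are ballpark spec-sheet numbers; transient spikes run higher):

    # Rough worst-case wall draw for a four-GPU rig.
    gpus = 4 * 350   # four 3090s at ~350 W board power each
    rest = 300       # CPU, drives, fans, PSU losses
    watts = gpus + rest
    print(f"~{watts} W peak")                                    # ~1700 W
    print(f"US 15 A / 120 V circuit: {watts / (15 * 120):.0%}")  # ~94% of capacity
    print(f"EU 16 A / 230 V circuit: {watts / (16 * 230):.0%}")  # ~46% of capacity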
Reasonable? $7,000 for a laptop is pretty up there.
[Edit: OK, I see I was adding cost by choosing a larger SSD, so $5,000 is more of a fair bottom price, with 1TB of storage.]
Responding specifically to this very specific claim: "Can get 128GB of ram for a reasonable price."
I'm open to your explanation of how this is reasonable (I mean, you didn't say cheap, to be fair). Maybe what you're saying is that 128GB of RAM on GPUs would cost way more (that's like six 4090s).
For anyone who wants to reply with other amounts of memory, that's not what I'm talking about here.
But on another point, do you think the RAM really buys you the equivalent of GPU memory? Is Apple's melding of CPU/GPU really that good?
I'm not just coming from a point of skepticism, I'm actually kind of hoping to be convinced you're right, so wanting to hear the argument in more detail.
It's reasonable in a "working professional who gets substantial value from" or "building an LLM driven startup project" kind of way.
It's not for the casual user, but for somebody who derives significant value from running it locally.
Personally, I use the Mac Mini as a hub for a project I'm working on, as it gives me full control and is simply much cheaper operationally. A one-time ~$2000 cost isn't so bad for replacing tasks that a human would have to do, e.g. in my case I'm parsing loosely organized financial documents where structured data isn't available.
I suspect the hardware costs will continue to decline rapidly as they have in the past though, so that $5k for 128GB will likely be $5k for 256GB in a year or two, and so on.
We're almost at the inflection point where really powerful models can be run locally for cheap.
For a coding setup, should I go with an M4 Pro Mac Mini with 64GB of RAM? Or is it better to go with an M4 Max (only available in the MBP right now, maybe in the Studio in a few months)? I'm not really interested in the 4090/3090 approach, but it is hard to make a decision on Apple hardware ATM.
I don't see prices falling much in the near term; a Mac Studio M2 Max or Ultra has been keeping its value surprisingly well as of late (mainly because of AI?). Just like the 3090s/4090s, which are also holding their value really well.
It's reasonable when the alternative is 2-4x4090 at $2.2K each (or 2xA6000 at 4.5K each) + server grade hardware to host them. Realistically, the vast majority of people should just buy a subscription or API access if they need to run grotesquely large models. While large LLMs (up to about 200B params) work on an MBP, they aren't super fast, and you do have to be plugged in - they chew through your battery like it's nothing. I know this because I have a 128GB M3 MBP.
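For rough sizing, the rule of thumb is weight memory ≈ parameter count × bits per weight / 8, before KV cache and runtime overhead. A quick sketch:

    # Back-of-the-envelope weight memory: billions of params * bits / 8
    # gives GB. Ignores KV cache and runtime overhead (add 10-30%).
    def weight_gb(params_b, bits):
        return params_b * bits / 8

    for params_b, bits in [(200, 4), (70, 8), (70, 4)]:
        print(f"{params_b}B @ {bits}-bit: ~{weight_gb(params_b, bits):.0f} GB")
    # 200B @ 4-bit: ~100 GB -> fits on a 128 GB machine, per the above
    # 70B  @ 8-bit: ~70 GB
    # 70B  @ 4-bit: ~35 GB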
How large of a model can you use with your 128GB M3? Anything you can tell would be great to hear. Number of parameters, quantization, which model, etc.
Thanks for the reply. Is that quantized? And what's the bit size of the floating point values in that model (apologies if I'm not asking the question correctly).
OP here. I almost got a decked-out Mac Studio before I returned it for an Asus ROG, as native Linux support, upgradability & CUDA support are much more important to me.
Meagre VRAM in these Nvidia consumer GPUs is indeed painful, but with the increasing performance of smaller and fine-tuned LLMs, I don't think 12GB, 14GB, or 16GB Nvidia GPUs, which offer much better performance than a Mac, can be easily dismissed.
A MacBook Pro has lower peak thermal output but proportionally lower performance. For a given task you'd be dissipating a similar total amount of heat; the MacBook Pro would just be spreading it out over a longer period of time.
Nvidia GPUs actually have similar efficiency, despite all of Apple's marketing. The difference is that the Nvidia GPUs have a much higher ceiling.