
I'm designing a new PC and I'd like to be able to run local models. It's not clear to me from posts online what the specs should be. Do I need 128GB of RAM? Or would a 16GB RTX 4060 be better? Or should I get a 4070 Ti? If anyone could point me toward some good guidelines I'd greatly appreciate it.



The 16GB 4070 Ti Super would be better, as it has a 256-bit bus compared to the 4060 Ti 16GB which, as has been mentioned, only has a 128-bit bus. The 4070 Ti Super also has more Tensor cores than the 4060 Ti.

Get as much VRAM as you can afford.

NVIDIA is also releasing new cards, the RTX 50 series, starting in late January 2025.


The answer is that it depends on which models you want to run. I'd get as much VRAM on your GPU as possible. Once that runs out, inference spills over into (much slower) system RAM.
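
With llama.cpp (directly or through its Python bindings) you choose that split explicitly, by how many layers go to the GPU. A minimal sketch, assuming the llama-cpp-python package; the model path, layer count, and context size are placeholders, not recommendations:

    # Sketch: split a GGUF model between VRAM and system RAM with llama-cpp-python.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/some-model.Q4_K_M.gguf",  # placeholder path to a GGUF file
        n_gpu_layers=33,  # layers kept in VRAM; remaining layers run from system RAM
        n_ctx=8192,       # context window size
    )

    out = llm("Why does VRAM matter for local LLMs?", max_tokens=64)
    print(out["choices"][0]["text"])

Setting n_gpu_layers=-1 offloads every layer (if it fits); lowering it is what lets a model bigger than your VRAM still run, just slower.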

Some good info here if you dig around:

https://www.reddit.com/r/LocalLLaMA/


You can run local models on a 10-year-old laptop. As always, the answer is "it depends".

The things you need: memory bandwidth, memory capacity, and compute. The more of each, the better. The 4060 generally has very poor bandwidth (worse than the 3060) due to its limited bus, but being able to offload more layers to the GPU is still generally better.

32GB systems can load an 8B model at fp16, a 12B at 8 bits, a 30B at 4 bits, or a 70B at 2 bits (roughly speaking). 64GB would be a good minimum if you want to use a 70B at 4 bits. Without significant GPU offloading it will be very slow, though.
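
Back-of-the-envelope: weight memory is roughly parameter count times bits per weight divided by 8, ignoring KV cache and runtime overhead. A quick sketch of that arithmetic (rough numbers only):

    # Rough weight-memory estimate: params * bits_per_weight / 8 (overhead ignored).
    def model_gb(params_billion: float, bits: float) -> float:
        return params_billion * bits / 8  # billions of params * bytes per param = GB

    for params, bits in [(8, 16), (12, 8), (30, 4), (70, 4), (70, 2)]:
        print(f"{params}B @ {bits}-bit ~= {model_gb(params, bits):.1f} GB")
    # 8B@16 ~= 16 GB, 12B@8 ~= 12 GB, 30B@4 ~= 15 GB, 70B@4 ~= 35 GB, 70B@2 ~= 17.5 GB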

If you want to process long contexts in a reasonable amount of time, it's best to run models with flash attention, which requires keeping the KV cache on the GPU. It also lets you use a 4-bit KV cache, which roughly quadruples the amount of context you can fit.
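
For scale, the KV cache is roughly 2 (K and V) x layers x KV heads x head dim x context length x bytes per element. A sketch using Llama-3-8B-like dimensions (32 layers, 8 KV heads via GQA, head dim 128; treat those as illustrative):

    # KV cache ~= 2 (K+V) * layers * kv_heads * head_dim * ctx_len * bytes_per_elem
    def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, bytes_per_elem):
        return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

    # Illustrative Llama-3-8B-like dims: 32 layers, 8 KV heads (GQA), head_dim 128.
    for name, nbytes in [("fp16", 2), ("q4", 0.5)]:
        print(f"{name}: {kv_cache_gb(32, 8, 128, 32768, nbytes):.1f} GB at 32k context")
    # fp16 ~= 4.3 GB, q4 ~= 1.1 GB: the same VRAM holds about 4x the context at 4 bits.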


GPU VRAM is the main bottleneck right now. Check out r/LocalLLaMA for benchmarks and calculators that estimate which models fit on which cards.


Options:

A) 128GB RAM with the fastest Intel/AMD CPU, no GPU: you can run big/good models, but very slow (about 0.5 to 3 tokens/second)

B) Fastest Mac with 128GB/192GB: you can run big/good models with moderate speed (like 5-10 tokens/second)

C) 16/32GB RAM + RTX 4090 with 24GB VRAM: you can run smaller (but still good) models very fast - completely in VRAM (20-30 tokens/second)
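
The gap between these options is mostly memory bandwidth: at batch size 1, decode speed is roughly bandwidth divided by the bytes read per token, which is about the model's size in memory. A rough sketch of that estimate; the bandwidth figures are ballpark assumptions, not exact specs:

    # Memory-bound decode estimate: tokens/s ~= memory bandwidth / model size in GB.
    # Bandwidth numbers are approximate and for illustration only.
    def tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
        return bandwidth_gb_s / model_size_gb

    MODEL_GB = 40  # e.g. a ~70B model around 4-bit quantization
    for name, bandwidth in [("dual-channel DDR5 CPU", 90),
                            ("Mac with unified memory", 400),
                            ("RTX 4090", 1000)]:
        print(f"{name}: ~{tokens_per_s(bandwidth, MODEL_GB):.0f} tok/s upper bound")
    # ~2, ~10, and ~25 tok/s respectively, which lines up with the options above,
    # except that a 40 GB model doesn't fit in the 4090's 24GB VRAM, so option C's
    # speed only applies to models small enough to fit entirely in VRAM.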


You can create a https://tinygrad.org/#tinybox clone with multiple GPUs.





