A lot of moderate power users are running a pair of undervolted used 3090s on a 1000-1200W PSU. 48 GB of VRAM lets you run 70B models at Q4 with 16k context.
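For a rough sense of why 48 GB is the magic number, here's some back-of-the-envelope math. The architecture numbers are assumptions based on a Llama-3-70B-ish model (80 layers, 8 KV heads via GQA, head dim 128, ~4.8 effective bits/weight for a Q4_K_M quant); other models and quants will differ:

    # Back-of-the-envelope VRAM estimate: 70B model at Q4, 16k context.
    # All architecture constants below are assumptions (Llama-3-70B-like).
    params = 70e9
    bits_per_weight = 4.8  # typical effective rate for a Q4_K_M quant
    weights_gb = params * bits_per_weight / 8 / 1e9

    n_layers, n_kv_heads, head_dim = 80, 8, 128
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * 2  # K+V, fp16
    ctx = 16384
    kv_gb = ctx * kv_bytes_per_token / 1e9

    print(f"weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB, "
          f"total ~{weights_gb + kv_gb:.1f} GB")  # ~42 + ~5.4 = ~47.4 GB

~42 GB of weights plus ~5 GB of fp16 KV cache lands just under 48 GB, which is why Q4 with 16k context is about the ceiling on two 3090s without also quantizing the KV cache.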

If you use speculative decoding (a small draft model proposes several tokens that the larger model then verifies in a single batched forward pass; I'm not sure on all the specifics), it seems you can get past 20 tokens per second. You can also fit 32B models like Qwen/Qwen Coder at Q6 with lots of context this way; with spec decoding those run closer to 40+ tok/s.
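The idea, roughly (this is a minimal greedy-acceptance sketch, not what any particular library actually ships; real implementations use rejection sampling to keep the target model's output distribution exact, and the two model callables here are hypothetical stand-ins):

    # Minimal greedy speculative decoding sketch. `draft_next` and
    # `target_logits` are hypothetical stand-ins for real model calls:
    # the draft model proposes tokens one at a time, and the target
    # model scores a whole candidate sequence in one forward pass.
    from typing import Callable, List

    def speculative_decode(
        prompt: List[int],
        draft_next: Callable[[List[int]], int],  # cheap model: next token id
        target_logits: Callable[[List[int]], List[List[float]]],  # big model
        n_draft: int = 5,
        n_new: int = 64,
    ) -> List[int]:
        tokens = list(prompt)
        while len(tokens) < len(prompt) + n_new:
            # 1. Draft model proposes n_draft tokens autoregressively (cheap).
            draft = []
            for _ in range(n_draft):
                draft.append(draft_next(tokens + draft))
            # 2. Target model scores prompt+draft in ONE forward pass
            #    (the win: verifying k tokens costs about one big-model step).
            logits = target_logits(tokens + draft)
            # 3. Accept the longest prefix where the target's greedy choice
            #    agrees with the draft; on the first mismatch, take the
            #    target's token instead and restart drafting from there.
            for tok in draft:
                pos = len(tokens) - 1  # logits here predict the next token
                best = max(range(len(logits[pos])), key=lambda t: logits[pos][t])
                if best == tok:
                    tokens.append(tok)   # draft token verified, keep it
                else:
                    tokens.append(best)  # mismatch: use target's token
                    break
            else:
                # All drafts accepted; target gives one bonus token for free.
                pos = len(tokens) - 1
                tokens.append(max(range(len(logits[pos])),
                                  key=lambda t: logits[pos][t]))
        return tokens[:len(prompt) + n_new]

The speedup comes from the big model evaluating all the draft tokens in parallel, so when the draft model agrees often enough you pay roughly one 70B forward pass for several output tokens. In llama.cpp this is exposed through the draft-model options (`--model-draft` and friends), if I remember right.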
