A lot of moderate power users are running a pair of undervolted used 3090s on a 1000-1200W PSU. 48 GB of VRAM lets you run 70B models at Q4 with 16k context.
If you use speculative decoding (a small draft model proposes tokens that the larger model then verifies, I'm not sure on the exact specifics) you can get past 20 tokens per second, it seems. You can also fit 32B models like Qwen/Qwen Coder at Q6 with lots of context this way, and with spec decoding you get closer to 40+ tok/s.
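For anyone curious what the trick actually looks like, here's a rough greedy-only Python sketch of the idea. Real engines use proper rejection sampling over probabilities and verify all the drafted tokens in a single batched pass on the big model, and the "models" below are made-up toy stand-ins, not any real inference API:

```python
# Minimal greedy-only sketch of speculative decoding: a cheap draft model
# proposes k tokens, the big target model verifies them, and agreeing tokens
# are accepted "for free". Toy stand-in models, not a real library.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # maps a context to its next token

def speculative_step(draft: Model, target: Model, context: List[Token], k: int = 4) -> List[Token]:
    """One round: draft proposes k tokens, target verifies, keep the agreed prefix."""
    # 1. The small draft model cheaply proposes k tokens in sequence.
    proposed: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The big target model checks each position. In a real engine this is one
    #    batched forward pass, which is where the speedup comes from: roughly one
    #    big-model pass per k drafted tokens instead of one pass per token.
    accepted: List[Token] = []
    ctx = list(context)
    for drafted in proposed:
        verified = target(ctx)
        if verified == drafted:
            accepted.append(drafted)   # draft and target agree: keep it for free
            ctx.append(drafted)
        else:
            accepted.append(verified)  # first disagreement: take the target's token
            break                      # later drafted tokens are now stale
    return accepted

if __name__ == "__main__":
    # Toy models: target counts up by 1; draft mostly agrees but slips on its 3rd guess.
    target = lambda ctx: ctx[-1] + 1
    draft = lambda ctx: ctx[-1] + (2 if len(ctx) == 3 else 1)
    print(speculative_step(draft, target, [0], k=4))  # prints [1, 2, 3]: two free tokens plus the corrected one
```

The key point is that output quality matches the big model (it has the final say on every token), so the speedup depends on how often the small model guesses right, which is why it works best when the draft model is from the same family as the big one.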