This assumes the only way to use LLMs effectively is a monolithic model that does everything from translation (any language to any language) to creative writing to coding to whatever else. And GPT-4 is supposedly itself a mixture of experts (maybe eight of them), not a true monolith.
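For what it's worth, the routing idea is simple enough to sketch. Here's a toy top-2 mixture-of-experts layer in PyTorch; the sizes and structure are illustrative guesses, nothing confirmed about how GPT-4 actually does it:

```python
# Toy top-2 mixture-of-experts layer: a router scores 8 expert FFNs per token
# and only the best two run. All sizes here are made up for illustration.
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                   # x: (n_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)             # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)   # keep the 2 best experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for i, expert in enumerate(self.experts):
                mask = chosen[:, slot] == i                  # tokens routed to expert i in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

print(ToyMoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```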
Finetuned models are considerably more efficient, at the cost of giving up general-purpose coverage in exchange for doing specific things well. And the disk space for a few dozen local finetunes (or even hundreds for a SaaS service) is peanuts compared to acquiring 80 GB of VRAM on a single device to run a monolithic model.
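To put rough numbers on the disk-vs-VRAM point, here's a back-of-the-envelope sketch. It assumes fully finetuned 7B specialists stored in fp16; every figure is a ballpark assumption, not a measurement of any particular model:

```python
# Ballpark disk cost of "a few dozen local finetunes" vs. the 80 GB of VRAM a
# monolithic model needs on a single card. fp16 weights assumed throughout.
params_per_specialist = 7e9           # assume 7B-parameter specialist finetunes
bytes_per_param_fp16 = 2

gb_per_finetune = params_per_specialist * bytes_per_param_fp16 / 1e9  # ~14 GB each
n_finetunes = 36                      # "a few dozen"
total_disk_gb = n_finetunes * gb_per_finetune                         # ~500 GB of cheap SSD

print(f"{n_finetunes} specialist finetunes: ~{total_disk_gb:.0f} GB of disk")
print("vs. an 80 GB-VRAM accelerator just to hold one monolithic model")
```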
Sutskever says there's a "phase transition" at around 9 billion parameters, after which LLMs begin to become really useful. I don't know much here, but wouldn't the monomodels become overfit if there isn't enough data to support 9+ billion parameters?