> these things will get bigger and better much faster than we can learn to discern
I would like to ask “Why?”
Clearly, these models are just one more instance of “an NN can learn to map anything from one domain to another”, and with enough training (or overfitting) they can approximate reality to a high degree.
But, why would it get better to any significant extent?
Because we can collect an infinite amount of video? Because we can train models to the point where they become generative video compression algorithms that have seen it all?
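To pin down what that “map anything from one domain to another” claim means, here's a toy sketch (an entirely made-up example, unrelated to any real generative model): a small MLP fitting an arbitrary 1-D function, which is the universal-approximation story in miniature.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# "Source domain": 1-D inputs. "Target domain": an arbitrary nonlinear function.
x = torch.linspace(-3, 3, 512).unsqueeze(1)
y = torch.sin(2 * x) + 0.3 * x ** 2

net = nn.Sequential(
    nn.Linear(1, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(3000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(f"final MSE: {loss.item():.5f}")  # converges to a close fit of the target
```

Approximating a fixed target given dense samples is the easy part; the question above is whether the same trick keeps improving with scale.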
> But, why would it get better to any significant extent?
Two years ago, the very best closed-source image model was unable to represent anything remotely realistic. Today, there are hundreds of open-source models that can generate images practically indistinguishable from reality (like Flux). Not only that, there's an entire ecosystem of tools and techniques around style transfer, facial reconstruction, pose control, etc. It's mind-blowing, and every week there's a new paper making it even better. Some of that could have been more training data. Most of it wasn't.
I guess it's fair to extrapolate that same trend to video, since that's the arc text, audio, and images have taken? No reason it would be different.
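On Flux specifically, the barrier to trying this at home is now a few lines. A minimal sketch using Hugging Face diffusers with the FLUX.1-schnell checkpoint; the model id and sampler settings follow the diffusers docs at the time of writing, so verify them against whatever version you have installed:

```python
import torch
from diffusers import FluxPipeline

# Sketch of image generation with an open-weights model (FLUX.1-schnell).
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trades speed for fitting on a consumer GPU

image = pipe(
    "a candid portrait, natural window light, 35mm",
    num_inference_steps=4,  # schnell is distilled for few-step sampling
    guidance_scale=0.0,     # schnell is trained to run without CFG
).images[0]
image.save("portrait.png")
```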
I get that. But, let’s say you have a glass: you fill it to one third, then to half, then to three quarters, then to full. Can you expect to fill it beyond full?
Not every process has an infinite ramp.
Frontier labs seem to have been throwing all the compute and data they can get their hands on at model training for at least the past two years.
Is that glass a third full or is it nearly full already?
Is the process of filling that particular glass linear or does the top 20% of the glass require X times as much water to fill as the bottom 20%?
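For reference, that “top of the glass” question is exactly what the scaling-law literature tries to measure. Kaplan et al. (2020) fit language-model loss to a power law in compute, roughly L(C) = (C_c / C)^0.05. A quick sketch, using constants close to their published fit purely to illustrate the shape of the curve:

```python
# Power-law compute scaling in the shape Kaplan et al. (2020) report:
# L(C) = (C_c / C) ** alpha_C. The constants below are close to their
# published fit (alpha_C ~ 0.050, C_c ~ 3.1e8 PF-days) but are used
# here only for illustration.
ALPHA_C = 0.050
C_C = 3.1e8  # PF-days

def loss(compute_pf_days: float) -> float:
    return (C_C / compute_pf_days) ** ALPHA_C

prev = None
for c in [1, 100, 10_000, 1_000_000]:
    cur = loss(c)
    note = "" if prev is None else f"  (improvement: {prev - cur:.2f})"
    print(f"{c:>9,} PF-days -> loss {cur:.2f}{note}")
    prev = cur
```

Each 100x of compute buys a smaller absolute drop in loss, which is the power-law version of “the top 20% of the glass needs more water”, though with no hard ceiling where the glass is simply full.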
I don’t see how that analogy makes sense. We’re not talking about containers of a known, fixed size here, nor about a single technique or method. Stuff like LLMs built on the Transformer architecture might have plateaued, for instance, but there are plenty of techniques _around_ those models that keep making them more capable (o1, etc.), and other architectures besides.