And another thing that irks me: none of these video generators get motion right...
Especially anything involving fluid/smoke dynamics, or fast, dynamic movements of humans and animals, suffers from the same weird motion artifacts. I can't describe it other than that the fluidity of the movements is completely off.
And since all the genAI video tools I've used suffer from the same problem, I wonder whether this is somehow inherent to the approach & somehow unsolvable with the current model architectures.
I think one of the biggest problems is that the models are trained on 2D sequences and don't have any understanding of what they're actually seeing. They see some structure of pixels shift in a frame and learn that certain 2D structures should shift within a frame over time. They don't actually understand that the images are 2D captures of an event that occurred in four dimensions, and that the thing being imaged is under the influence of unimaged forces.
I saw a Santa dancing video today and the suspension of disbelief was almost instantly dispelled when the cuffs of his jacket moved erratically. The GenAI was trying to get them to sway with arm movements but because it didn't understand why they would sway it just generated a statistical approximation of swaying.
GenAI also definitely doesn't understand 3D structure, which is easily demonstrated by completely incorrect morphological features. Even my dogs understand gravity: if I drop an object they're tracking (food), they know it should hit the ground. They also understand 3D space: if they stand on their back legs they can see over things or get a better perspective.
I've yet to see any GenAI that demonstrates even my dogs' level of understanding the physical world. This leaves their output in the uncanny valley.
As far as I can tell it's a problem with CGI in general. Whether you're using precise physics models or learned embeddings from watching videos, reproducing certain physical events is computationally very hard, whereas recording them just requires a camera (and of course setting up the physical world to produce what you're filming, or getting very lucky). The behind-the-scenes from House of the Dragon has a very good discussion of this from the art directors. After a decade and a half of specializing in it, they have yet to find any convincing way to create fire other than to actually create fire and film it.

This isn't a limitation of AI and it has nothing to do with intelligence. A human can't convincingly animate fire, either. It seems to me that discussions like this from the optimist side always miss this distinction, and it's part of why I think Ben Affleck was absolutely correct that AI can't replace filmmaking. Regardless of the underlying approach, computationally reproducing what the world gives you for free is simply very hard, maybe impossible. The best rendering systems out there come nowhere close to true photorealism over arbitrary scenarios and probably never will.
What's the point of poking holes in new technology and nitpicking like this? Are you blind to the immense breakthroughs being made today, and yet you focus on what irks you about some tiny detail that might go away after a couple of versions?
At this phase of the game a lot of people are pretty accustomed to the pace of technological innovation in this space, and I think it’s reasonable for people to have a sense of what will/won’t go away in a few versions. Some of Sora’s issues may just require more training; others are intrinsic to the approach and will not be solvable with the current method.
To that end, it is actually extremely important to nit-pick this stuff. For those of us using these tools, we need to be able to talk shop about which ones are keeping up, which ones work like shit in practice, and which ones work but only in certain situations, and which situations those are.
Neural networks use smooth manifolds as their underlying inductive bias, so in theory it should be possible to incorporate smooth kinematic and Hamiltonian constraints, but I am certain no one at OpenAI actually understands enough of the theory to figure out how to do that.
I don't work there, so I'm certain there is no one with enough knowledge to make it work with Hamiltonian constraints: the idea is very obvious, but they haven't done it because they don't have the wherewithal to do so. In other words, no one at OpenAI understands enough basic physics to incorporate conservation principles into the generative network so that objects with random masses don't appear and disappear on the "video" manifold as it evolves in time.
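For concreteness, here is a toy sketch of what "incorporating a Hamiltonian constraint" looks like in the small, along the lines of Hamiltonian Neural Networks (Greydanus et al., 2019): the network parameterizes a single scalar H(q, p), and the dynamics are read off Hamilton's equations, so the learned energy is conserved along trajectories by construction. The module, layer sizes, and training setup below are illustrative assumptions, not anything OpenAI has published.

```python
import torch
import torch.nn as nn

class HamiltonianNet(nn.Module):
    """Toy Hamiltonian Neural Network: learn a scalar H(q, p) and derive
    the dynamics from Hamilton's equations, dq/dt = dH/dp, dp/dt = -dH/dq."""
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim
        self.h = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, q: torch.Tensor, p: torch.Tensor):
        # Concatenate the state and track gradients so H can be differentiated.
        qp = torch.cat([q, p], dim=-1).requires_grad_(True)
        H = self.h(qp).sum()
        dH = torch.autograd.grad(H, qp, create_graph=True)[0]
        dHdq, dHdp = dH[..., : self.dim], dH[..., self.dim:]
        # Hamilton's equations: along these trajectories dH/dt = 0, so the
        # learned energy cannot drift, grow, or vanish out of nowhere.
        return dHdp, -dHdq

# Training would supervise the predicted derivatives against derivatives
# estimated from observed trajectories (e.g. finite differences):
#   loss = mse(dq_pred, dq_true) + mse(dp_pred, dp_true)
model = HamiltonianNet(dim=2)
q, p = torch.randn(8, 2), torch.randn(8, 2)
dq_pred, dp_pred = model(q, p)
```

The open question, and the part this thread is really arguing about, is getting from a clean low-dimensional (q, p) state to raw pixel-space video, where no such state is given to you.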
I've seen the quality of OpenAI engineers on Twitter and it's easy enough to extrapolate. Moreover, neural networks are not black boxes; you're just parroting whatever you've heard on social media. The underlying theory is very simple.
Seriously... The ability to identify what physics/math theories the AI should apply and being able to make the AI actually apply those are very different things. And you don't seem to understand that distinction.
Unless you have $500k to pay for the actual implementation of a Hamiltonian video generator, I don't think you're in a position to tell me what I know and don't know.
lolz, I doubt very much anyone would want to pay you $500k to perform magic. Basically, I think you are coming across as someone who is trying to sound clever rather than being clever.
My price is very cheap in terms of what it would enable and allow OpenAI to charge their customers. Hamiltonian video generation with conservation principles, where phantom masses don't appear and disappear out of nowhere, is a billion-dollar industry, so my asking price is basically giving away the entire industry for free.
Sure, but I imagine the reason you haven't started your own company to do it is you need 10s of millions in compute, so the price would be 500k + 10s of millions... Or you can't actually do it and are just talking shit on the internet.
I don't think there is any real value in making videos other than useless entertainment. The real inspired use of computation and AI is to cure cancer, that would be the right way to show the world that this technology is worthwhile and useful. The techniques involved would be the same because one would need to include real physical constraints like conservation of mass and energy instead of figuring out the best way to flash lights on the screen with no regard for any foundational physical principles.
Do you know anyone or any companies working on that?