
> but the result is not naively going to understand the level of reality the script is going for…

We can already get detailed style guidance into picture generation. Declaring you want Picasso cubist, Warner Brothers cartoon, or hyper-realistic works today. So do lighting instructions, color palettes, on and on.

These future models will not be large language models; they will be multi-modal. Large movie models, if you like. They will have tons of context about how scenes within movies cohere, just as LLMs do within documents today.


So, from "just hand off the movie script to an automated director/DP/editor" we're now rapidly approaching:

- you have to provide correct detailed instructions on lighting

- you have to provide correct detailed instructions on props

- you have to provide correct detailed instructions on clothing

- you have to provide correct detailed instructions on camera position and movement

- you have to provide correct detailed instructions on blocking

- you have to provide correct detailed instructions on editing

- you have to provide correct detailed instructions on music

- you have to provide correct detailed instructions on sound effects

- you have to provide correct detailed instructions on...

- ...

- repeat that for literally every single scene in the movie (up to 200 in extreme cases)

There's a reason I provided a few links for you to look at. I highly recommend the talk by Annie Atkins. Watch it, then open any movie script and try to find any of the things she talks about (you can find actual movie scripts here: https://imsdb.com).


There are two reasons to be hopeful about it, though. AI/LLMs are very good at filling in all those little details so humans can cherry-pick the parts that they like. I think that's where the real value is for the masses: once these models can generate coherent scenes, people can start using them to explore the creative space and figure out what they like. Sort of like SegmentAnything and masking in inpainting, but for the rest of the scene assembly. The other reason is that the models can probably be architected to figure out environmental/character/lighting/etc. embeddings and use those to build up other coherent scenes, the way we use language embeddings for semantic similarity.
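To make the embeddings point concrete, here's a minimal sketch of that mechanism using off-the-shelf text embeddings (sentence-transformers is a real library; treating short scene descriptions as stand-ins for learned environment/character/lighting embeddings is my assumption):

    # Sketch: scene descriptions that share an "environment vector"
    # should score high on cosine similarity, which is the kind of
    # signal a generator could reuse for cross-scene consistency.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    scenes = [
        "dim motel office, plywood paneling, gunmetal desk, humming AC",
        "the same motel office at night, desk lamp on, papers everywhere",
        "sunlit diner booth, chrome napkin holder, morning light",
    ]
    emb = model.encode(scenes, convert_to_tensor=True)

    # Pairwise cosine similarity: the two office scenes should score
    # much closer to each other than either does to the diner.
    print(util.cos_sim(emb, emb))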

That exploration loop is how I've been using the image generators: lots of experimentation and throwing out the stuff that doesn't work. Then once I've got enough good generated images collected out of the tons of garbage, I fine-tune a model and create a workflow that more consistently gives me those styles.
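The cherry-picking half of that workflow is nothing exotic, by the way. A minimal sketch, where the directory names and the keep() criterion are placeholders for my manual selection rather than any real tool's API:

    # Sketch: curate the keepers out of a pile of generations into a
    # fine-tuning set. In practice keep() is a human in the loop (or
    # an aesthetic scorer); a hard-coded list stands in for that here.
    import shutil
    from pathlib import Path

    raw = Path("generations")       # tons of garbage, a few keepers
    curated = Path("finetune_set")  # training images for the fine-tune
    curated.mkdir(exist_ok=True)

    def keep(img: Path) -> bool:
        return img.stem in {"0042", "0107", "0311"}  # hand-picked

    for img in raw.glob("*.png"):
        if keep(img):
            shutil.copy(img, curated / img.name)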

Now, the models and UX to do this at cinematic quality are probably 5-10 years away for video (and the studios are probably the only ones with the data to do it), but I'm relatively bullish on AI in cinema. I don't think AI will be doing everything end to end, but it might be a shortcut for people who can write a script and figure out the UX to execute the rest of the creative process by trial and error.


> AI/LLMs are very good at filling in all those little details so humans can cherry-pick the parts that they like.

Where did you find AI/ML that is good at filling in actual required and consistent details?

I beg of you to watch Annie Atkins' presentation I linked (https://www.youtube.com/watch?v=SzGvEYSzHf4) and tell me how much intervention AI/ML would need to create all that and stay consistent throughout the movie.

> once these models can generate coherent scenes, people can start using them to explore the creative space and figure out what they like.

Define "coherent scene" and "explore". A scene must be both coherent and consistent, and conform to the overall style of the movie and...

Even such a simple thing as shot/reverse shot requires about a million individual details and can be shot in a million different ways. Here's an exploration of just shot/reverse shot: https://www.youtube.com/watch?v=5UE3jz_O_EM

All those are coherent scenes, but the coherence comes from a million decisions: from lighting, camera position, lens choice, wardrobe, what surrounds the characters, what's happening in the background, makeup... There's no coherence without all these choices made beforehand.

Around the 4:00 mark: "Think about how well you know this woman just from her clothes and workspace". Now watch that scene, and then read its description in the script (https://imsdb.com/scripts/No-Country-for-Old-Men.html):

--- start quote ---

    Chigurh enters. Old plywood paneling, gunmetal desk, litter
    of papers. A window air-conditioner works hard.
    A fifty-year-old woman with a cast-iron hairdo sits behind
    the desk.

--- end quote ---

And right after that there's a section on the rhythm of editing. Another piece in the puzzle of coherence in a scene.

> Then once I've got enough good generated images collected out of the tons of garbage, I fine-tune a model and create a workflow that more consistently gives me those styles.

So, literally what I wrote here: https://news.ycombinator.com/item?id=42375280 :)


It’s the same thing with digital art: even with the most effortless kind (matte painting), there’s a plethora of decisions to make and techniques to use to get a coherent result. There’s a reason people go to school or train themselves for years to gain the needed expertise. If it were just data, someone would have written a guide that others could mindlessly follow.


Not sure why you jumped there. I was thinking more like ‘make it look like Blade Runner if Kurosawa directed it, with a score like Zimmer.’

You’re really failing to let go of the idea that you need to prescribe every little thing. Like Midjourney today, you’ll be able to give general guidance.

Now, I don’t expect we’ll get the best movies this way. But paint by numbers stuff like many movies already are? A Hallmark Channel weepy? I bet we will.


> Not sure why you jumped there.

No jump.

Your original claim: "Submit a whole script the way a writer delivers a movie to a director. The (automated) director/DP/editor could maintain internal visual coherence, while the script drives the story coherence."

Two comments later it's this: "We can already get detailed style guidance into picture generation. Declaring you want Picasso cubist, Warner Brothers cartoon, or hyper-realistic works today. So do lighting instructions, color palettes, on and on."

I just rewrote this with respect to movies.

> I was thinking more like ‘make it look like Blade Runner if Kurosawa directed it, with a score like Zimmer.’

Because, as we all know, every single movie by Kurosawa is the same, as is every single score by Hans Zimmer, so it's ridiculously easy to recreate any movie in that style, with that music.

> You’re really failing to let go of the idea that you need to prescribe every little thing. Like Midjourney today, you’ll be able to give general guidance.

Yes, and Midjourney today really sucks at:

- being consistent

- creating proper consistent details

A general prompt will give you a general result that is usually very far from what you actually have in mind.

And yes, you will have to prescribe a lot of small things if you want your movie to be consistent. And for your movie to make any sense.

Again, tell me how exactly your amazing magical AI director will know which wardrobe to choose, which camera angles to set up, which typography to use, and which sound effects to make, just from the script you hand in?

You can start with a very simple scene I referenced in my original reply: two people talking at a table in Whiplash.

> But paint by numbers stuff like many movies already are? A Hallmark Channel weepy? I bet we will.

Even those movies have more detail and more care in them than you can get out of AIs (now, or in the foreseeable future).


> Again, tell me how exactly your amazing magical AI director will know which wardrobe to choose, which camera angles to set up, which typography to use, and which sound effects to make, just from the script you hand in?

I think you're still assuming I always want to choose those things. That's why we're talking past each other. A good movie making model would choose for me unless I give explicit directions. Today we don't see long-range coherence in the results of movie (or game engine) models, but the range is increasing, and I'm willing to bet we will see movie-length coherence in the next decade or so.
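One way to picture "choose for me unless I give explicit directions": a request where every field is optional, and anything left unset is delegated to the model. This is a purely hypothetical interface, not any existing product's API:

    # Hypothetical: general guidance only; None means "model decides".
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SceneDirection:
        style: Optional[str] = None     # e.g. "Blade Runner by Kurosawa"
        score: Optional[str] = None     # e.g. "like Zimmer"
        lighting: Optional[str] = None
        wardrobe: Optional[str] = None
        camera: Optional[str] = None

    # The whole movie directed from two loose constraints:
    request = SceneDirection(style="Blade Runner if Kurosawa directed it",
                             score="like Zimmer")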

By the way, I also bet that if I pasted the exact No Country for Old Men scene description quoted up this thread into Midjourney today, it would produce at least some compelling images with decent choices of wardrobe, lighting, set dressing, camera angle, exposure, and so on. That's what these models do, because they're extrapolating and interpolating between the billion images they've seen that contained these human choices.

AFAIK Midjourney produces single images, so the relevant scope of consistency is inside a single image only, not between images. A movie model needs coherence across ~160,000 images (roughly a feature film's worth of frames: 110 minutes × 60 seconds × 24 fps ≈ 158,000), which is beyond the state of the art today, but I don't see why it's impossible or unreasonable in the long run.

> A general prompt will give you a general result that is usually very far from what you actually have in mind.

Which is only a problem if I have something in mind. Alternatively I can give no guidance, or loose guidance, make half a dozen variations, pick the one I like best. Maybe iterate a couple of times into that variation tree. Just like the image generators do.


This is such an incredibly confident comment. I'm in awe.



