Show HN: AI dub tool I made to watch foreign language videos with my 7-year-old (speakz.ai)
348 points by leobg on Feb 27, 2024 | 153 comments
Hey HN!

I love watching YouTube with my 7-year-old daughter. Unfortunately, the best stuff is often in English (we're German). So I made an AI tool that translates videos directly, using the original voices. All other sounds, as well as background music, are preserved, too.

Turns out that it works for many other language pairs, too. So far, it can create dubs in English, Mandarin Chinese, Spanish, Arabic, French, Russian, German, Italian, Korean, Polish and Dutch.

The main challenge in building this was to get the balance right between translating the original meaning and getting the timing right. Especially for language pairs like English -> German, where the target is often longer than the source ("bat" -> "Fle-der-maus", "speed" -> "Ge-schwin-dig-keit").

Let me know what you think! :)




I know Germany dubs most videos, but wouldn't a seven-year-old be able to read subtitles? It's a great way for her to learn English; it's how most Swedes learn it before starting school. I think there's a pretty strong correlation between countries' average English proficiency and how common dubbing is.

https://haonowshaokao.com/2013/05/18/does-dubbing-tv-harm-la...

Edit: I forgot to mention that the samples on the website are impressive and well made. How do you do the speaker diarization and voice cloning?


Yeah, this isn't really helpful for her to learn English. This is more for when we watch The Anatomy Lab, or BBC's "The Incredible Human Journey". She'll already be asking me a lot of questions about the content. So if I had to translate on top of that, it would be tedious.

Subtitles - those are actually being generated as well. I've generated SRT files during development. Color coded by speaker, and on a per-word basis, for me to get the timing right.

Basically, if you have a YouTube channel, you can take any video from your channel, run it through Speakz.ai, and you'll get 15+ additional audio tracks in different languages, plus 15+ subtitle files (SRT).

Voice cloning and speaker diarization was a bit of a challenge. On the one hand, I want to duplicate the voice that is being spoken right now. On the other hand, sometimes "right now" is just a short "Yeah" (like in the Elon interview) which doesn't give you a lot of "meat" to work with in terms of embedding the voice's characteristics.

Right now, I'm using a mix of signals:

- Is the utterance padded by pauses before/after?

- Is the utterance a complete sentence?

- Does the voice of the utterance sound significantly different from the voice of the previous utterance?
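Roughly, combining those signals might look like this toy sketch (the weights, thresholds, and embedding handling here are entirely made up for illustration, not the actual Speakz.ai implementation):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two voice embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def speaker_change_score(utterance, prev_utterance, pause_threshold=0.5):
    """Combine the three signals above into a speaker-change score in [0, 1]."""
    score = 0.0
    # Signal 1: is there a pause before the utterance?
    gap = utterance["start"] - prev_utterance["end"]
    if gap >= pause_threshold:
        score += 0.3
    # Signal 2: does the previous utterance read as a complete sentence?
    if utterance["text"].rstrip().endswith((".", "!", "?")):
        score += 0.2
    # Signal 3: embedding distance to the previous utterance's voice,
    # saturating at a distance of 0.6
    dist = cosine_distance(utterance["embedding"], prev_utterance["embedding"])
    score += 0.5 * min(dist / 0.6, 1.0)
    return score

prev = {"end": 10.0, "text": "Yeah.", "embedding": [1.0, 0.0, 0.0]}
cur = {"start": 11.0, "text": "Let me explain.", "embedding": [0.0, 1.0, 0.0]}
print(speaker_change_score(cur, prev))  # high score: likely a new speaker
```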

It's a deep, deep rabbit hole. I was tempted to go much deeper. But I thought I better check if anybody besides myself actually cares before I do that... :)


I think it’s a cultural difference. I’m also from a non-dubbing country (Netherlands) and I can’t stand dubbed content either. On the other hand people tell me they can’t stand subtitles because it “reveals” what they’re going to say before they say it.


"people tell me they can’t stand subtitles because it “reveals” what they’re going to say before they say it."

I love watching movies in the original language, and this is something I hate as well, though it can be avoided.

Some movies get it right, though: the timing, showing just the words that are spoken, and even different colors for different people speaking (very rare; I cannot even remember where I have seen it). That should be standard, but with most movies you're lucky if the subs even match the plot and don't reveal too much.


Some of the best subtitles I've ever seen were on Tom Scott's YouTube channel. They use different colours, indicators for jokes and sarcasm, while also staying relatively close to what's actually been said. They're better than many big-budget movies and TV shows I've seen.

He talked about subtitling at some point, and I was surprised how cheap subtitling services are. I think he went beyond the price he mentioned, but it really made me question why big, profitable YouTube channels aren't spending the small change to do at least native-language subtitles that Google can translate, instead of relying on YouTube's terrible algorithm.

That said, Whisper seems to generate quite good subtitles that take short pauses for timing into account, but they're obviously never going to be as good as a human who actually understands the context of what's being said.


Whisper can also generate timings at the word level, which you could use to make better-timed subtitles.


Yes. But Whisper's word-level timings are actually quite inaccurate out of the box. There are some Python libraries that mitigate that. I tested several of them. whisper-timestamped seems to be the best one. [0]

[0] https://github.com/linto-ai/whisper-timestamped
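For reference, grouping word-level timestamps into SRT cues can be as simple as this sketch (the `(text, start, end)` tuple layout is an assumption; whisper-timestamped's actual output is JSON with per-word dicts):

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, max_gap=0.6):
    """Group (text, start, end) word tuples into SRT cue blocks,
    starting a new cue whenever the pause between words exceeds max_gap."""
    cues, current = [], []
    for word in words:
        if current and word[1] - current[-1][2] > max_gap:
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)
    blocks = []
    for i, cue in enumerate(cues, 1):
        text = " ".join(w[0] for w in cue)
        blocks.append(f"{i}\n{srt_time(cue[0][1])} --> {srt_time(cue[-1][2])}\n{text}")
    return "\n\n".join(blocks)

words = [("Hello", 0.0, 0.4), ("there.", 0.45, 0.8),
         ("General", 2.0, 2.5), ("Kenobi.", 2.55, 3.0)]
print(words_to_srt(words))
```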


That's a great use case for LLMs, actually. Translate the sentence only up to what has been said so far. Basically, a balance between translating word-for-word (perfect timing, but terrible grammar) and translating the whole sentence and/or thought (perfect grammar and meaning, but potentially terrible timing).

With the SRT subtitle format, I think there's no reason one couldn't make groups of words appear as they are spoken.

Actually, I have to do the same thing when generating the dubbed voices. Otherwise it feels as though the AI voice is saying something different than the person in the video, especially when the AI finishes speaking and you still hear some of the last words from the original speaker.


Unfortunately not all languages follow the same sentence structure, so translating "up to what has been said so far" is not possible.

Assume two dramatic pauses in an English sentence, and observe the Turkish version: "I will... go to... the cinema" becomes "Ben... sinemaya... gidecegim" (I... to the cinema... go).

I am sure there are smarter examples.


>different colors for different persons speaking

BBC iPlayer does this for some content, I don't know if it's ever on movies though.


It is. The iPlayer subtitles for Citizen Kane use colour to distinguish speakers.


I'm Norwegian, and Norway used to be near-universally non-dubbing other than for TV for the very youngest children, and even then almost exclusively cartoons or stop motion etc. where it wasn't so jarring. But the target age of material being dubbed has crept up as it has become relatively-speaking cheaper to do compared to revenues generated in what is a tiny market.

The thing that annoys me the most about it is that it often alters the feel of the material. E.g. I watched Valiant (2005) with my son in Norwegian first, because he got it on DVD from his grandparents. He doesn't understand much Norwegian, but when he first got the DVD he was so little that it didn't matter. A few years later we watched the English language version.

It comes across as much darker in the English version. The voice acting is much more somber than the relatively cheerful way the Norwegian dub was done, and while it's still a comedy, in comparison it feels like the Norwegian version obscures a lot of the tension, and it makes it feel almost like a different movie.

I guess that could go both ways, but it does often feel like the people dubbing something are likely to have less time and opportunity to get direction on how to play the part, and you can often hear the consequences.


I think you get used to it. Like when I've read a punchline but don't "register" it until the proper thing happens on the screen.


I prefer subs over dubbing for foreign languages, but I cannot stand closed captions (for people who can't hear at all), because having my eye drawn to the bottom of the screen for a description of something I don't need to know about is horrible!


Sometimes it's hilarious when they're trying to describe the dramatic tension from sounds or music, and "reveal" all the cliches, though. "Music swells to a tear-jerking crescendo"


A 7yo can barely keep up with subtitles in their mother tongue, depending on the speed. And that's probably true for a p90 reader. For a p50 reader, there is no way they can follow subs while understanding what they say. Now, being a video, they might be able to interpolate from what they see, so it might be a nice challenge. But doing this with subtitles in a foreign language is only for a few privileged minds.

Source: father of an 8yo with VERY good reading skills (already reading books in two languages targeted at tweens).


Dubbing in Germany is horrible and pervasive. Even in the news and interviews. Subtitles are cheaper and better.

As others have said, it is better to expose kids (that can read) to the original language plus subtitles.

So in other words, your solution, while technically great, is pedagogically unwise. A typical geek approach to a problem ;)


The worst thing about dubbing is that it's more important for the translations to have roughly the same length and correspondence to the original mouth movements than to be accurate. So the original meaning is often altered, and you don't even know it, because of course you have no easy access to the original most of the time.

But unfortunately Germans are so used to dubbing that subtitles don't really stand a chance. There are a few cinemas here and there that show original-language movies with subtitles, and on TV there was one experiment that I'm aware of a few years ago (on Pro Sieben Maxx) to show TV series with subtitles, but it was cancelled after some time. AFAIK it's also more expensive to secure the rights to show English-language content compared to dubbed content.


> but wouldn't a seven year old be able to read subtitles?

No, they wouldn't.

I don't believe that most Swedes learn English by reading subtitles before starting school.

> I think there's a pretty strong correlation between countries' average English proficiency and how common dubbing is.

That I agree with.


> I don't believe that most swedes learn English by reading subtitles before starting school.

It's not about learning the language per se, it's about familiarizing yourself with the sound of the language, which then makes formal learning feel much more intuitive. English becomes an easy subject because you always feel a little ahead of the material. When faced with a "fill in the blank" type of question, you're able to answer it by what feels right, even when you can't quite explain why it feels right.

It's why the #1 rule of language learning at any stage in life is always going to be immersing yourself in the language you want to learn, and by far the most effective way to immerse yourself (short of moving to another country) is to consume content in your target language.


Most people in Belgium learn English through that before school.

Why wouldn't Swedes?


You are saying that _most_ kids in Belgium can read subtitles before they start school?

It took me several years of school before being able to read fast enough to follow along with subtitles, and the same goes for everyone I know.


They probably meant before they start learning English in primary school, not before they start school.

This used to be the case in the Netherlands too; I picked up a significant body of English from British TV series watched with subtitles as kid. Nowadays this advantage will probably be missed by most children, because the streaming services offer a lot of dubbed content, and you get to pick what you watch unless someone guides you. Subtitles can be avoided for longer.


I think you're a native English speaker and have the associated bias concerning how it works in practice?

Kids first learn their native language (reading, writing, and speaking) in school, and (mostly) only learn foreign languages years later.

Once they've learned to do it in their native language, they hear English spoken on TV with, e.g., Dutch subtitles and pick it up, sometimes before they have English lessons.

As such, most kids know a fair amount of English before they have it (= English) in school.

The Dutch subtitles aren't always a requirement, though. Kids will pick it up from some shows; Pokémon would be a good example, if spoken in English.


Are you saying that kids of age six can understand and speak English at a basic level - say, halfway to A1?

Or is it just a basic familiarity (like a couple of most common words) and awareness that English exists?

EDIT: I see from a reply below by Freak_NL that it probably means before the kids start learning English at school. That makes more sense, as they would be older at that point.


> as they would be older at that point.

I don't know about The Netherlands but here in Norway children start learning English as soon as they start school at the age of five or six. But quite likely many of them will have at least some English already because of English language television, computer games, etc.


My 6 year old has been watching 20 minutes of cartoons every night for the past two years. This is the only exposure to the English language that she has ever had.

She has learned to understand what is said in the cartoons. Of course she misses some things, but it's surprising how much she gets.

Like, when I ask her "what did Bluey just say?", she can explain it.

Children's brains are awesome.

But actually, grown-ups can also pick up quite a lot if they actually immerse themselves.


Bluey is an excellent cartoon to do that with. Kudos!


I just wish there was a way to buy the Australian original version as a download.


Young kids don't even need subtitles; their brains are wired to figure out spoken languages. After all, that's how we all learn our mother tongue initially. Last summer my then 3.5-year-old, to my huge surprise, started talking in (simple, but correct) English with some tourist kids she met in the park. We never spoke English at home with her before, so I presume she picked it up from YouTube and her older brother, but I had no idea she could form full sentences - including conditionals and past tense. And at first she was a bit slow to express herself, but after a few hours of play with those kids she sounded totally relaxed and fluent.


Subtitles in a foreign language? Probably not. Subtitles translated into their own language? I think it's probably an exaggeration to say people have learnt English before starting school, because that implies a lot about what "learning it" means. But picking up a number of words? Sure.


>> but wouldn't a seven year old be able to read subtitles?

> No, they wouldn't.

hard disagree


Reading speed at that age will vary greatly. Reading subtitles while also having to follow the picture takes away focus, and that makes it much harder for an inexperienced reader. My daughter, who picked up reading very naturally, would have been able to follow subtitles at age 7 without much trouble. My younger, 7-year-old son, on the other hand, who is more average in reading ability, wouldn't be able to keep up with subtitles yet. Average reading speeds at age 7 seem to be 60-100 words per minute, whereas subtitles are more in the 100-150 words per minute range. So for above-average readers it will be possible, but average readers won't be able to keep up consistently.


>I know Germany dub most video, but wouldn't a seven year old be able to read subtitles?

I gotta say... while it is sometimes a necessary evil, I would so rather not have to read subtitles. I often want to listen to a show so that I can also continue working on catching up on email, etc. I.e., I can't read two things at once, but I can listen to one thing and continue working on something else.


Impressively done! It sounds like you're:

1) doing voice recognition with timing clues, which Whisper and the like provide, breaking it up into sentence (or similar) units; you don't need to time-match individual words, but you need to time-match at a coarser grain.

2) using a translation engine that allows for multiple alternative translations

3) cloning the original voice, regardless of language

4) choosing the translation that has the best time match (possibly by syllable counting, or by actually rendering and timing the translations). If there isn't a close translation, maybe you're asking ChatGPT to forcibly rephrase?

5) Maybe some modest pitch-corrected rate control to pick out a path that gets you closest to the timing?

Did I get any of that right?
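For step 4, I imagine something like this naive sketch (the vowel-group syllable counter and the seconds-per-syllable speaking rate are pure guesses, just to show the idea):

```python
VOWELS = set("aeiouyäöüAEIOUYÄÖÜ")

def estimate_syllables(text):
    """Count runs of vowels as a rough proxy for syllable count."""
    count, prev_vowel = 0, False
    for ch in text:
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)

def pick_best_fit(candidates, target_seconds, sec_per_syllable=0.2):
    """Return the candidate translation whose estimated spoken duration
    best matches the original utterance's time slot."""
    def error(text):
        return abs(estimate_syllables(text) * sec_per_syllable - target_seconds)
    return min(candidates, key=error)

candidates = ["Geschwindigkeit ist entscheidend", "Tempo zählt"]
print(pick_best_fit(candidates, target_seconds=1.0))
```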


Very good!

Yes, that's basically how it works.

I don't do any pitch correction. But I do check the TTS output for length, and I re-generate if it doesn't match my time constraints.

I also have an arranger that tries to figure out when to play an utterance early (i.e. earlier than in the original) in order to make up for the translated version being longer.

I try to make the translations match the speaker's character, as well as the context. So ideally, Alex (Sample 2) will still say "Salut" even in German (instead of translating that greeting, too).

And I need to monitor for speaker changes. This is because I can't clone the voice unless I have a decent amount of sample data. If Elon just says "Yes", cloning the voice based on just that one syllable will make it sound like a robot. But I also can't just blindly grab any voice around it, since that might be somebody else's voice.
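The arranger logic might be sketched roughly like this (a toy model, not the actual implementation; the real collision handling is surely more involved):

```python
def arrange(utterances):
    """utterances: dicts with the original 'start' and 'end' times plus the
    dubbed clip's 'dub_duration'. If a dub runs longer than its slot, pull
    its start earlier by the overrun, but never collide with the previous
    dubbed utterance. Returns scheduled (start, end) pairs."""
    schedule = []
    prev_end = 0.0
    for u in utterances:
        overrun = u["dub_duration"] - (u["end"] - u["start"])
        start = max(u["start"] - max(overrun, 0.0), prev_end)
        end = start + u["dub_duration"]
        schedule.append((start, end))
        prev_end = end
    return schedule

utts = [
    {"start": 1.0, "end": 2.0, "dub_duration": 1.5},  # 0.5 s too long
    {"start": 3.0, "end": 4.0, "dub_duration": 1.0},  # fits exactly
]
print(arrange(utts))  # [(0.5, 2.0), (3.0, 4.0)]
```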


I think it's a speech-to-speech model. I know about SeamlessM4T: https://www.google.com/amp/s/about.fb.com/news/2023/08/seaml...

Interesting, but what inference engine supports it to run at decent speeds?


Ooh and you're probably doing a split into voice and non-voice tracks of the original, and keeping non-voice at original volume, but lowering the voice track.
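If so, the mixing step could be as simple as this sketch (assuming the stems have already been separated, e.g. with a source-separation tool like Demucs; the gain value is arbitrary, and plain lists stand in for audio sample buffers):

```python
def mix(voice, background, dub, voice_gain=0.25):
    """Duck the original voice stem, keep the background stem at full
    volume, and add the dubbed voice on top, sample by sample."""
    return [v * voice_gain + b + d for v, b, d in zip(voice, background, dub)]

voice = [1.0, 1.0]        # original voice stem
background = [0.25, 0.25] # music and effects stem
dub = [0.5, 0.5]          # generated dub track
print(mix(voice, background, dub))  # [1.0, 1.0]
```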


I also noticed that the third sample, with Chinese, sounds slightly sped up in the first English segment, so there may also be an element of postprocessing the dub (speeding it up/slowing it down).


Yes. Though I don't like this solution. It breaks the flow. And it also doesn't really fully solve the problem. Overruns still accumulate if they happen too frequently. One second here, one second there... the further you get into the video, the worse it gets.

I think it would be better to either slow down the underlying video or solve the overrun issue on the translation level. A good professional dubber will find translations that will even out in terms of timing. That's something an AI should be able to do better instead of worse.


The last sample from BBC is really hilarious when translated to Polish. Something definitely went wrong and the voice speaks like a drunkard.


I know this is HN so I don't want to distract from the technical achievement and how genuinely useful this can be.

I also don't want to tell you how to raise your kid. You do you, it's not my family. But I want to share how important it is to watch foreign spoken language movies and TV, especially as a kid, to be able to speak multiple languages later in life. You'll notice that in every country where TV and movies are regularly dubbed in the local language, the English levels go to shit. Dubbing is partially responsible for this because kids are not exposed to a different language on a regular basis.

I remember wanting to watch a dubbed movie with my mom as a kid, and she told me "We will watch the original instead, dubbed movies don't have a soul". It stuck with me. She was absolutely right. Today I am working on my sixth spoken language. Causation not guaranteed, merely implied.


>We will watch the original instead, dubbed movies don't have a soul

I disagree. It's all about the quality of the voice actors and the effort put into localisation.

Having grown up on Dutch dubs of many cartoons, I honestly find the Dutch voice actor of Spongebob better than the original. I'm missing the extra energy that the Dutch VA seems to have put into the voice when I hear the original, even if the original is very good. Though text on screen isn't translated, puns and references are, sometimes overhauled completely.

The talent pool for Dutch voice actors isn't as big as I would've liked (you often hear the same five VAs in every show on a given channel), but some of them really put in the work. Many of them only do kids TV and commercials (really freaked me out to hear Ash Ketchum try to sell me soap one day) and not every VA is as good/paid enough/gets decent scripts, but there are some real gems to be found in dubs.

Last year I found out how Ukrainian dubs work, and I was astounded by how weird the experience was. I'm used to dubs having only the voice track swapped out, but the Ukrainian shows seemed to just have the actors talk over the original show, like this AI tool does, and I honestly can't imagine ever getting into a show that's dubbed like that. I assume people get used to this, but I found it rather annoying.

Blanket statements like "dubs have no soul" serve nobody. There are good dubs and there are bad dubs, and the ratio will probably differ depending on the language you're talking about. Dismissing all dubs ignores the real heart and soul some dubbing teams have put into their works. That doesn't mean I disagree with the idea of exposing kids to more languages, but I wouldn't expect kids to learn much from just TV shows and movies in the first place.


Voices in dubbed movies don't have any depth, for instance.

That doesn't have anything to do with the quality of the voice actors. Everything sounds flat because that is just how they record it.

Dubbing is a useful convenience, an accessibility feature (even if it wasn't born that way). But they have way less soul.


I guess we just disagree, or maybe you're used to worse dubs than I am. There's nothing inherently flat about dubbing at all. In fact, in many (older) movies and shows, actors would dub over themselves to get better audio.


In a movie, if a character is far away, their voice comes from far away. Voice actors always have the mic in front of them, so their voices always come from the same place, not relative to the scene. That's what I meant.

I also think it is beautiful to hear the sound of the original language, particularly if it is one I am not used to. It's part of the charm.

I have grown up with dubs, though, so I understand you. But once one gets used to no dubs, there is no way back. It's like removing sugar from the coffee.


I think the main point is that small kids get the basic building blocks for learning languages from anything they hear (even if they don't understand it yet), so listening to as many languages as possible when they are young will make learning languages easier for them later in their lives.


I've heard this argument before, mostly from companies trying to sell language courses for kids. As far as I know it's true that kids pick up languages much more easily when they're young, but I'm not so sure those skills will stick if all they can converse with is the TV. That's quite different from having a speaking partner such as a teacher or a bilingual parent. I suspect this is why shows like Dora the Explorer are set up like an interactive game.

I myself have been exposed to subtitled English shows and movies all my life (not every show or movie was dubbed, and there were some German shows that made it through as well) but I don't think I actually started speaking any English until I needed it to interact with strangers in Runescape, while at the same time I stopped watching any dubbed shows. Almost all of the content I consumed became English language content.

Almost passively learning a language by enveloping oneself in it works (though actual study will help you advance quicker), but you need more than TV. I can't find the actual paper I read on this once (thanks, SEO spam!), but as I recall, the biggest advantage kids have is learning pronunciation without an accent; picking up vocabulary and grammar doesn't seem to be too affected by age.


Mostly I agree with this, but for animated works dubs can be an integral part of the product when done right, and some are even tweaked for different languages (although I strongly reject adjusting the actual cultural content for different locales). The dubs have to be made in concert with the original though. There is also a lot of plain crap out there.

But absolutely; for anything featuring live action, dubs just damage the original.

I watch a German man building his massive Lego city on YouTube (narrated and recorded quite professionally) with my five-year-old son for a few minutes before bed. He is now at the point where he is trying to give this weird language (to him) a place in his head. Some words are familiar (being Dutch), some are foreign, and you can see the feedback loop happening when words do land; he wants to know what that man is saying. I don't expect him to pick up any German at this point, but the basics of immersion in another language are there.


Yes, I agree with you. Actually, good-quality dubbed animated movies (= Disney) are what I often use to help learn a new language.


FYI, I agree with you in all points.

As I said in another comment, I wouldn't want to live in a world where everything was dubbed into my language.

Any translation takes something away from the original. And dubbing even more so.

I also believe that being exposed to a foreign language long before you ever make a conscious attempt to learn it is important. I wouldn't think I'd succeed in teaching my toddler to say "Daddy" if he hadn't been listening to the rest of us speaking for many months before.

I can see how this headline can make me seem like a buffoon of a dad. But I think I'm really not. :) When I watch The Anatomy Lab with my daughter, that's a time when I want our conversation to focus on how digestion works. Not on what the guy on the screen was saying just now. But of course there will also be times where I'll want our conversation to be about exactly that: what a foreign speaker just said, how those words come together, how they may have the same root as the words we use in German. Also, while AI has its place, I prefer to have these conversations with her myself.


I can't say I agree with dubbed movies having no soul. Greater accessibility to a wider audience is not something to deride, or hold in contempt.

That being said, I do agree that listening to other languages is a great thing. My father was a linguist, and when we would watch subtitled media, we'd play a game where we'd try and hear the cognates, pick out the most common words, figure out the basics of the grammar as we went along. It was a lot of fun!


One of my favorite movies was "Scent Of A Woman". But when I watched it in the German-dubbed version, I was appalled. It made the whole movie suddenly seem like a comedy. To me, the translation had killed its "soul", for lack of a better term.

I still want my kids to learn English. And ideally also one or two other languages, like Chinese.

As Nietzsche said:

"So you have mounted on horseback? And now ride briskly up to your goal? Well, my friend - but your lame foot is also riding with you!"


Counterpoint. My parents didn't let me watch dubbed shows, and didn't speak our native language at home because they wanted me to speak unaffected English.

I can't speak any other languages, but in school my English was insanely good. To the point of perfect scores in the college scholastic exams, and when I was in uni for engineering, I took on an English major for fun with essentially no impact on my workload.

You can generalize but you can also specialize.


I speak Russian, and I gotta say, the Lex sample is incredible. It sounds like real dubbing. Maybe not pro-level dubbing, but it's very, very good. The voices are also very close to Lex's and Elon's.

Congrats! Very well done.


Thank you! Yeah, I also had a big grin on my face, hearing Lex and Elon suddenly talk in another language. :)

Consistency isn't perfect yet, as we're basically building the voice from scratch for each utterance. On the one hand, you want that, because the utterance might be more upbeat, or lower pitch, than the speaker's "average" voice. On the other hand, it sometimes introduces variance that makes the listener's brain go, "Uh... is that another person speaking now?". If I had to dub 200 videos of a single YouTube channel, I would be able to fine-tune the voices of the main characters, and reserve the ad-hoc cloning for guest characters.


Please consider adding Simplified English as an output language option, preferably with a level, e.g., A2, B1, etc. This way, I can adjust the language complexity to my kids' level and then gradually remove the crutches as they improve in English.


Yes! I love this!

So you'd be translating English to Simplified English? Or are you talking from another source language?

I've already been playing with this concept w.r.t. books:

I take a non-fiction book. I'll have an LLM translate it with a specific audience in mind (say, a 7 year old girl with a certain background), explaining concepts and words that are likely unknown to that audience. And then converting the whole thing into an audiobook. Optional parental controls built in ("exclude violence", etc.). Nowhere near showtime, though.

Another thing I'd love to work on is filtering existing content. There are millions of videos on YouTube. Right now, finding quality stuff that's fun to watch with my kid depends a lot on dumb luck. But what if I could filter by topic (semantic whitelist/blacklist, i.e. not keyword dependent), personality traits (OCEAN, MBTI), values (e.g. "curiosity") and language (reading level, vocabulary, words per minute, etc.)? I'd love that.


I'd love what they suggested as well, for other languages. I'm working on improving my French (and occasionally German). I'm at a stage where I can follow along with some French shows reasonably well if they're not speaking too fast (one of the first French phrases my French teacher in school taught us, for a reason, was "plus lentement, s'il vous plaît" - "slower, please"), if they're not speaking in any particularly difficult accents, and if there's not too much slang. But it's limiting, and I'm often forced to keep English subtitles on as a consequence, which is sometimes too much of a crutch. It doesn't help that my hearing isn't what it was.

Being able to "step down" the difficulty so that I can either turn off subtitles entirely or rely on French subtitles, or even mix "difficult speech" with "simple subtitles" or vice versa, seems like it'd be very useful in getting over that hump faster.


This is really impressive! Can't wait to see this more fleshed out. I'd gladly pay for something like this (by the video, ideally).

Some page feedback though: It seems to me that the video just keeps playing, with no way to restart it or scrub through the timeline. Each time I click a language, it changes the spoken audio but just keeps playing where it left off. That makes it hard to compare the same passage across different languages.

Separately, I think there are also some errors in translation. For Sample 3 (about the vines), the original in Mandarin Chinese says something like "if this tree gets grabbed, the weed will climb up and wrap around it, and the tree won't be able to photosynthesize and will die". But the English mistranslation says "If it gets scared by people, it gets pulled off and messed with. It can't function. The evil effects? It just dies."

There are also timing issues where the translations don't match up with the original subtitles or dialogue, and certain parts of the original audio just seem to be altogether ignored and not translated.

Maybe displaying the translated subtitles, along with a way for users to report errors, would help...?


Thank you very much!

Yes. You cannot control the video playback on the demo page. I made it so because I wanted a way to showcase how you can switch between languages. You can go from Elon speaking English to German, Russian and Chinese, each with just one click. Activating the player controls would have made the UI more complex and distracting. And it would have also made it harder for me to sync the timing between languages.

Of course, the real output would be a proper player, with all of the controls. Or, for creators, raw files (video and/or audio, plus SRT subtitles).

I also noticed problems in the translation of the Chinese video. I put it up there anyway, because I figured most people coming to my site would be English speakers, and being able to understand a Chinese video might be another interesting aspect, in addition to the idea of being able to turn your own English content into languages you don't speak.

If this had been a pitch deck, I would have cherry-picked the samples. But I wanted to share where the project is right now and see if anyone was interested. Premature optimization is the root of all evil. I think Knuth said that. And it's a trap I regularly fall into. So I tried to be disciplined this time.

But if any Chinese YouTuber asked me to dub their work today, I'd make darn sure that the translations were close to perfect. Meaning I'd allow the system to make changes to the way things are phrased if that's necessary for timing or cultural context. But I wouldn't allow it to skip a thought from the original video, or say something different.

I've translated books by hand in the past. So this is something I care about. If the demo isn't perfect in this regard, it's because I didn't know if anyone was going to even look at my project. When I first posted this yesterday, my submissions didn't go beyond one comment for several hours. I already thought I had built another solution looking for a problem. :)

If you're seeing dropped phrases, that's most likely because my arranging function failed. Basically, the translation ran longer than the original. The algorithm tried to speed it up and fit it in. But it failed and dropped it. Better handling of these overruns is on my to-do list. Neither drops nor speedups should be tolerated.
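Roughly, the fitting decision looks something like this (a simplified sketch, not the actual code - the names and the speedup cap are made up):

```python
def fit_segment(tts_duration, slot_duration, max_speedup=1.25):
    """Decide how to fit a synthesized segment into its original time slot.

    Returns a playback-rate factor, or None when the overrun is too large
    to fix by speeding up alone - in which case the caller should
    re-translate with a shorter phrasing instead of dropping the segment.
    """
    if tts_duration <= slot_duration:
        return 1.0  # fits as-is
    speedup = tts_duration / slot_duration
    if speedup <= max_speedup:
        return speedup  # mild time-stretch is acceptable
    return None  # too long: signal the pipeline to rephrase
```

The key change from what's live now: a `None` should trigger a re-translation pass, never a drop.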

In terms of self-correction, I plan to feed the translated audio back into the transcription engine. Then, an LLM can compare the translation with the original transcript. If anything is missing, the pipeline will be forced to run again with slightly different parameters. There shouldn't be a human necessary in the loop. Translation is what Transformers are best at.
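The control loop I have in mind is simple (a sketch - `transcribe`, `compare` and `regenerate` are stand-ins for the real transcription engine, LLM judge, and pipeline, and the 0.95 threshold is just illustrative):

```python
def backtranslation_qa(original_transcript, dubbed_audio,
                       transcribe, compare, regenerate, max_retries=3):
    """Re-transcribe the dub and score it against the source transcript.

    `compare` returns a coverage score in [0, 1]; below the threshold,
    the pipeline regenerates with slightly different parameters.
    """
    for attempt in range(max_retries):
        back_translation = transcribe(dubbed_audio)
        score = compare(original_transcript, back_translation)
        if score >= 0.95:  # assumed acceptance threshold
            return dubbed_audio, score
        dubbed_audio = regenerate(attempt)  # retry with perturbed parameters
    return dubbed_audio, score  # best effort after max_retries
```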


Gotcha, thanks for the great walk-through and in-depth explanations! Excited to see how this thing progresses.

I'd totally pay to have something like this as a Chrome plugin for YouTube, for example.


I've only ever experienced Dutch dubs in kids' TV but I feel like these examples show that your Dutch model may need some work. I can't judge other languages well, but I found the Taiwanese documentary dub especially hard to follow. I wouldn't have expected Dutch to be in there for how little the language is spoken and how often Dutch speakers will understand English, though!

/offtopic It seems to do a pretty interesting thing where the first male voice has a bit of a Flemish/southern accent while the second male voice has an accent much closer to "Netherlands TV" Dutch. Reminded me a bit of the Lion King dub where the dub studio used Flemish voice actors to do the jungle animals (and Dutch voice actors for the savannah animals) to underline the "different world" Simba arrived in.


Yes, that issue is also present in the German translation.

I'm planning to monitor the output quality. Basically, feeding the translated audio back into the transcriber. Then compare it to the original transcript. Like a higher level loss function. I'll need this already because I don't speak all of these languages myself. But I can also use it to make the pipeline self-regulate and generate a new, better version if the last one scored too poorly.


Interesting, I can see how that approach would catch the weird voice lines.

Just the different ways the languages get picked up and processed by the AI system could be interesting. If you find anything cool, I'd love to read a blog post about it!


This is really amazing! Well done.

I already joined the beta but I want to point out another use case here as well:

In many countries (e.g. Greece, where I'm from) movies and TV shows never get dubbed. We rely on subtitles. This means that if you can't see well (disability or age-related eye problems) and your English is not excellent, then you are doomed to only watch locally produced movies & shows.

This can be a real life-changer.


Thank you!

With movies, I think I could get into legally challenging territory. I guess all AI apps are, in a way. But with movies, there's an entire industry behind enforcing copyright. So I must tread carefully on that front.

I made the jump from the courtroom into VS Code years ago. I really don't want to go backwards.


I honestly don't see how movies are different from any other content, e.g. YouTube videos. I am pretty sure MrBeast etc. have the same lawyers as any big studio.

Could this run locally? I would certainly pay for that and you're off the hook on how anyone uses it.


What does this use behind the scenes? This type of stuff can get pretty expensive if you're relying on elevenlabs or heygen.


Let's just say this is NOT a wrapper around Elevenlabs or Heygen. I've looked at commercial voice cloning before. But, as you said, the prices seemed ridiculous.

Before this, I made audiobooks for my daughter. Old, out-of-print books, turned into speech. If I remember correctly, with Elevenlabs a single book would have cost me > $100. At that price level, I can read the damn thing myself. What good is computer generated voice if it isn't at least 10x cheaper than doing it yourself?

I'm just one guy. With me, it's just my time, one or two commercial licenses, and other than that just the raw price of running those GPUs.


Yea my guess would be elevenlabs, which just recently announced this exact featureset.


Congratulations on this project! We spent 6 years developing the best solution for generating perfect subtitles automatically (https://www.Checksub.com). 2 years ago, with the arrival of new generations of AI, we decided to go a step further and add the possibility of automatically dubbing videos. But automatic dubbing requires manual adjustments for a comfortable result for the audience. For example, https://www.HeyGen.com generates a video automatically, but offers very few editing options. That's why we focus on two things:

1 - to provide the best possible automatic quality

2 - to offer an advanced editor that lets you fine-tune your dubbing without having to go back to editing software.

In any case, I'm delighted to see people working on this problem. I hope it will help develop this sector.


Great website! You're based in France? You should put a demo on your website. If there is one, I don't see it (or rather: hear it). If you're interested in collaborating, my email is in my bio.


Thanks for Checksub, another happy user here.

We take multilingual, AI-cloned audio from you guys (split from background noise, of course), after we've processed the subs professionally at our end, then we align everything in your tool and send it off to a third service for lip sync. The result has blown away a few clients now. The CEO speaking 5 languages with perfect lip sync in their own tone of voice is quite convincing.

Hopefully we can get it all in one tool soon.


Why doesn't youtube have something like this built in.


I guess it's computationally expensive? OP states on the website that their solution takes 1 hour of processing for a 30-minute video.

Now imagine offering this for all the YouTube videos available:

- either it's done on their servers (hard to believe due to high costs)

- or it's done client-side (which is also difficult, due to lack of processing power)


Well, I'm also shamefully unoptimized at the moment.

YouTube added auto-captions years ago. Long before there was Whisper, let alone things like Whisper.cpp. I imagine what I'm doing now is computationally no more expensive than what they did back then.


And automatically forced on every user depending on whatever their Google account is set to, just like the video titles which now get auto-translated without any way to turn this off. We're sliding back into a monolingual world.

No thanks.


As a native German, I also resent it when Google/Amazon/whoever tries to force a translation on me when I prefer the original language. So I wouldn't want to live in a world where everything would be dubbed into German for me. Not even if they used my tool :)

Regarding YouTube:

AFAIK, YouTube allows you to add multi-lingual voice tracks to your videos. Then, if the viewer has a preferred language set, the video will play in that language. Else in the language inferred by his browser/OS. But the user can also switch back to the original language, or any other language, right in the player.


You can't switch to original titles though, so I'm not really confident Google is going to be offering this option for long.


This is great. I tried to do a similar thing once, but my language is one of those that AI doesn't do well.

I think you can look into muting the original voice in the video, I remember I saw there is some AI/tech that can separate audio into voice and nonvoice.

Yandex browser does this in the browser, you open a YT video and it offers to translate, a few seconds later the voices are all in Russian, it's probably the most interesting production use of AI I have seen and for free. It's to Russian only unfortunately.


Why do you keep the original audio track on the dubbed version? I think it sounds pretty distracting, although you do want to keep sounds other than the original voice I guess


I dig it, it's like amateur underground overdubs of bootleg Hollywood movies. You don't want to conflate an amateur AI voice with the actual personality on screen. The separation is part of the experience.


I think these little bits of the original sound work really well, helping us hear the original as well and keeping it less uncanny.

What an amazing project!


Thank you.

Exactly. I had the OG voice removed at first. But I added it back in for exactly this reason. It also serves as a tool for AI accountability: It lets you "see" that the cloned voice is indeed saying the same thing as the original voice.

That being said, it would be trivial to turn the OG voice off for anyone who wants to.


News channels do that for interviews


This is really amazing. I'm very impressed and happy you released this. Can you share more details about your development rhythm for this helpful piece of software?


Cool project. What is the tech stack behind it? I can already guess a few: 11labs, OpenAI… dreamtalk. There are so many similar services to what you are doing. What sets you apart? You should partner with local media or outside. Good luck!

Check out https://www.flawlessai.com - they've been around since 2018. Interesting stuff. When I first saw it in 2021 I was blown away.


Talked about this idea last month since astrobiology still isn’t all dubbed to English. Thank you for actually making the tool, it’s awesome, huge Kudos.


Very cool, any chance of an on-device release so we can run this locally? Topaz AI has a pretty good model for this if you are looking to monetize.


Great tool!

Curious if you've made more progress on diarization than what's described in this article?

https://aipressroom.com/streamline-diarization-using-ai-as-a...


They use pyannote/speaker-diarization. I tried that, but it wasn't accurate enough for my purposes. Made a confusion matrix with voice samples from The Simpsons characters. It looked... well, confused.

Am using a mix now of speaker embeddings and other signals (end of sentence, pause before/after, etc.). As you can see in the demos, it already works well for interview situations. It's when there are 3+ speakers and they talk over each other that the system gets confused.
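A toy version of the matching step, for the curious (real embeddings come from a neural model, and the threshold numbers here are made up - this just illustrates the idea of combining embedding similarity with pause cues):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def assign_speaker(embedding, pause_before, speakers, sim_threshold=0.75):
    """Match a segment's voice embedding against known speaker centroids.

    A long pause before the segment hints at a turn change, so we demand
    slightly higher similarity to keep the same speaker. Unmatched
    segments open a new speaker cluster.
    """
    threshold = sim_threshold + (0.05 if pause_before > 1.0 else 0.0)
    best_id, best_sim = None, -1.0
    for spk_id, centroid in speakers.items():
        sim = cosine(embedding, centroid)
        if sim > best_sim:
            best_id, best_sim = spk_id, sim
    if best_sim >= threshold:
        return best_id
    new_id = len(speakers)
    speakers[new_id] = embedding  # start a new speaker cluster
    return new_id
```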


Very passable. Waiting for something local like this for foreign-language Plex and podcasts. As someone who views/listens to things at 2x/3x speed, 10-15 bucks an hour is cost-prohibitive.


Perhaps you could ask some of your favorite podcast hosts to make a deal with me. Running a training on their voice once and then just re-using that will be much cheaper. Also, customers who buy in bulk will help me focus on this full time. There is huge potential for making this faster and cheaper.

(Even using OpenAI is silly. Technically, I need neither GPT-4's knowledge nor its instruction tuning. Both unnecessarily add cost and latency. But it helped me get the demo out.)

Basically, the deal for Podcasters / YouTubers would be:

- Get all their episodes converted into 15+ languages

- Increase their reach today, while the novelty is high and the market is still uncrowded

- They get to tell their sponsors that they now have reach across the language boundary


I don't know the state of the podcast ecosystem, but I think you should reach out to listennotes.com, which is also a one-man operation that seems to elevate discovery and looks like it has reasonable reach among producers. Or go hit up some popular Western podcasts. You've definitely got something here, and the execution is good enough.


Amazing! I love this for dubbing, but was wondering if anyone knows of an AI powered subtitle generator for YouTube videos? I know YouTube has closed caption, but it’s terrible.


Speakz actually generates subtitles as a byproduct. The idea is that you put in a video, you select the target languages. And then you get out, for each target language, an audio track and an SRT subtitles file.
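The SRT side is the simple part. Something like this sketch (assuming per-segment start/end times from the aligner; the helper names are made up):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """Render a list of (start_sec, end_sec, text) tuples as SRT blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```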

Someone else here asked about generating only subtitles, with no audio, as a cheaper option. So I'll probably add that as an option.


I would love that to help learn another language!


https://fluen.ai gives very accurate subtitles (and subtitle translations).


What potential issues with copyright are there from offering (paid) access to this tool to run on sources including Youtube, and with the output containing the source audio?


You could have asked the same question when Google started building their index. Or OpenAI trained their models.

I'm in Germany. I'm a licensed lawyer. I see the dangers.

The safest path will be to simply offer the production of multi language translations to content owners themselves. Which is also going to be more efficient - translating the thing at the source, rather than having consumers each create their own translation.

But the original intent for this has been to have my computer translate a video I want to watch with my kids in my private home. Technically, it's not "my computer" in the sense of being just the device that's physically in my home. There's stuff that happens in the cloud. Technically, copies are being made. So one could argue the point.

For most people today, getting your content seen and consumed is the highest you can achieve. To sue someone from another country who cares enough to pay someone else to translate it for him would seem bonkers. But I'm sure there are lawyers who are desperate for work. Who cannot code, and can't be bothered to learn, but still want to do something "in AI". I'll at least give them a hard time. And dare they use ChatGPT hallucinated references on me! :)


I was more thinking Youtube and record labels might take issue with the service, e.g. how they go after stream ripping sites.

Having the creator put a code in their channel description to verify ownership could be a good approach. Thanks for sharing the project!


You're creating a derived work, so you would be violating someone's IP unless the licence is permissive, and making money off it. That usually attracts the attention of whoever owns the IP.


Who's "you"? I'm obviously not a lawyer, but instinct tells me that end users are the ones creating a derived work by uploading a video they may or may not have the right to distribute. Linked website is just a tool, perhaps a cloud-based one, but still just a tool.


Not just files a user uploads: "You can select videos from YouTube"


Thanks, I missed that. I can see how that would complicate things.


I suppose you're paying a lot for the voice cloning, so do consider that in many countries voice-overs are done with a single generic voice. Would you consider a lower price point service doing just that?

I'm doing something similar and using GPT-4 for translation. What's unique about it, is that you can specifically prompt it to avoid long translations by rephrasing things, so you can buy yourself some time for the "Fledermaus".
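Concretely, the trick is to put an explicit length budget in the prompt, derived from the time slot. A sketch (the characters-per-second rate and wording are just illustrative):

```python
def build_dub_prompt(source_text, slot_seconds, chars_per_second=15):
    """Build a translation prompt with a hard length budget, so the model
    rephrases rather than overrunning the available time slot."""
    budget = int(slot_seconds * chars_per_second)
    return (
        f"Translate the following into German for a voice-over.\n"
        f"Hard limit: {budget} characters. If a literal translation is too "
        f"long, rephrase while keeping every idea.\n\n{source_text}"
    )
```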


Using a single speaker makes sense when you're paying for human voice talents. But if you're using a computer to generate the voice, why not generate a voice that sounds like the original speaker? Much more fun to hear Elon speak Chinese :)


To be blunt, because of the price. I’m running the whole pipeline of my toy project for much less than $5 per hour. If the voice cloning is the long pole in the tent, I’d just consider dropping it.

Moreover, it’s a cultural custom. I’m from Poland and here the voice-over narrator is supposed to be generic and bland, so your brain learns to tune him out and take the emotional cues from the original voice.


Yes, in Germany we have generic voice-over narrators, too. In documentaries, etc. They usually match the gender of the speaker, but that's it.

Personally, I read most of my books with the iOS app Voice Dream Reader. That app still uses old TTS voices. They sounded great 3 years ago, but now sound robotic when compared to Elevenlabs or WaveNet. But, as you say, you learn to tune out the voice. I can read entire novels like this, and I still "hear" the different voices and personalities. It just all happens in my imagination.

How much I'd need to charge to make the project worthwhile depends on many factors. And I didn't want to name a price now and then backpedal a month from now and say it'll actually cost more.

My pipeline right now is super unoptimized, to the point of being embarrassing. This can all be made to run much faster and cheaper.

I agree with you that if the voice cloning part of my pipeline causes a significant chunk of the cost the end user pays for the service, I should then offer the option of using a "bland" voice instead for a lower price.


You deserve a lot of credit for doing this... often when I am watching German content that has English subtitles, I wonder if the subtitles are already being produced by AI... I sometimes find the subtitles more confusing (even though they are in my native language) than the German, as though they almost had to have been automatically produced rather than by an actual translator using contextual clues, etc.


I do the opposite and seek out videos in other languages for my kids to watch. Now they're learning German, Spanish, Chinese, and Japanese.


It's a lovely parenting story. Let me tell you, there is also a huge opportunity for the opposite use case. My elderly parents speak only one (non-English) language. I would love to have a (cheaper) way to provide my parents with translated videos with the addition of (translated) subtitles. Subs are important because the elderly can have hearing issues. Great work - inclusivity is love.


I second this! I was recently looking into a way to build something like this for my grandfather, but wasn't even sure where to start from the hardware side.

I wanted to have hardware plug into TV receiver, generate subtitles for live TV program and then play it back on TV. Delay would likely be less than a minute but even a few minutes is not a problem really.

Many people with a hearing problem would benefit from this and with AI getting so good at Speech-to-text, this can be done for quite a large population.

If anyone has a recommendation on where to start with this, I'd appreciate it! Was thinking of using Whisper for subtitle generation, but not sure about hardware that can take in, and output HDMI and run this software


I keep thinking about something similar. Also hardware. Also for my grandparents.

My grandma is 95. Her vision is bad. Even using the phone (I'm talking old school landline) is getting hit and miss, because she can't see the buttons.

Years ago, I set her up with an Echo Show. That works well enough for her to say "Call Leo". But Alexa is dumb. Sometimes, she'll mishear something and start playing music. Or start a monthly subscription... :)

So what I'd like:

- box

- screen

- far-field mic array

- AI backend

You could do a number of things with it:

- manage a grocery shopping list (AI will notice duplicates and other oddities and ask)

- communicate with the outside world (initiate calls, send emails and faxes, including to local businesses)

- optional human oversight and/or permission settings (preventing the AI, say, from ordering groceries for more than $50 a week without a family member approving the order)

Something like your "subtitle mode" could also work:

"Listen to what is currently being spoken in the room (including the TV), and display it on the screen".

My grandma has her TV running all day. So maybe one could ditch the screen and make it a "set top box". Add an IR port to it, so it can control also the TV itself. Something like that might work.


I'm generating translated subtitles internally, before generating the voice-over. Also, generating those subtitles is way, way cheaper. If someone just wants the subtitles, I could offer them.

Bigger question is: What device are your parents using, and what content sources? Because I'd need to be able to download the audio, and inject the subtitles. With a regular TV, I wouldn't know how to do that.


Android device (either phone or tablet) that they can then send over to a chromecast

the chromecast could be a nice to have, not super necessary. they can put the tablet on a table close to them


just realised I didn't answer the second part of the question, what content sources: mainly youtube


I'm looking for a tool that will take a foreign language and automatically generate subtitles in my language. Anyone know of such a tool?


So when is the crunchyroll integration rolling out?


This is interesting because I'm the opposite of you - a German speaking American, who watches a lot of German language content on YouTube. Are you specifically looking for children's content? I ask because almost anything I would watch in English, I can find an equivalent content producer in German.


Can you add Ukrainian?


Hey! As a target/output language? As a source language, it’s already supported.


Yes, as target language. And remove Яussian language, please, unless you are Яepublican. You can add it back after the war.


Wow this is amazing. If there was a locally running version available, I would gladly pay money for it.


Thanks!

Well, to make local happen I'd have to learn more about local app development.

I'd also be worried about having to support a bunch of different platforms, and being beholden to ever changing rules made by App Stores and OS makers. I actually work on a 2015 Mac with a 2019 operating system. There are many great looking AI apps that I'd love to run but can't.

Besides, it seems to me that making this centralized makes economic sense. I can just keep the GPU busy with lots of videos from many customers. I'm sure that's what most people think who build something: "The world would be so much better if everyone just came here and used this." :)


What would be required to add a new output language? E.g Portuguese? I know it’s supported as input.


Would have been nice to get a download link for a program I can just run locally.


Alas! Didn't get the memo? We're in the era where AI tools are all "software as a service," and you must pay for individual inferences from the model. How could they charge for inferences if they gave you the model to download?


I don't have access to any special model that I'm holding back from you in order to rent-seek.

Anyone can learn to build something like this. The parts are all available out there. There's Whisper. There's Mistral-7b. There's Tortoise, Coqui, SV2TTS. There's Python.

The bigger question is:

Would you want to?

I've been building web apps for several years now. I've sunk thousands of dollars into those projects. And literally years of my time. If I calculated my hourly wage, I'd be below a teenager mowing lawns. In Rwanda. And probably by a factor of 10x.

The real ROI here is the learning. And that's not something I'm "taking" from anyone.


You are awesome, man! Keep it going and all the best of luck!


Amen, bro.


This looks amazing! Thanks for sharing. I signed up for the private beta.


Thank you so much!


Wow, this is impressive. Is there anything like it for live translation?


Not on my to do list currently. EzDubs say they do live [0]. Also, a friend of mine mentioned some Samsung / Android app that does this?

[0] https://ezdubs.ai


This might be a game-changer for preserving declining languages.


I'm not sure.

Making this work is limited by the availability of reliable transcription models. Which, in turn, are limited by the availability of large training corpora. Those don't exist for rare languages.

Also, if people choose to listen to a Nepali speaker through an AI translator, that does give speakers of this language "a voice" - but it doesn't really preserve that language. You might argue, on the contrary, that it may remove any remaining incentives to learn that language.


Wonder how accuracy of the translation is measured (if at all).


One idea I have is to use back-translation. After generating the new language audio, feed it back into the transcription, and then have an LLM compare it to the transcript from the original. Penalty for any thought/detail that is missing. If too bad, start from scratch.


Is this using XTTS? I recognize a funny/weird glitch with Italian voices saying "punto" (full stop) at the end of every sentence.


Awesome! Large % of foreign streams have no proper dub.


Can you add Hindi as an output language? I've been meaning to build something like this for my parents. You saved me some work haha!


Does this use seamless4t or similar projects?


This is very impressive. I think the way you are timing the audio is clever. What kind of model are you using?


Would love this for Plex.


Would love to. Can you broker a deal? :)


What a great application of AI. The samples were amazing.


Thank you!


Would be awesome if it could also deep-fake the voice.


This is extremely impressive. Congratulations!


Thank you!


any opensource available?


[deleted]



