LLMs are only capable of hallucinating, whereas humans can hallucinate but can also empirically observe reality. So whatever the human error rate is, it's necessarily lower than that of LLMs.
Tangent here: really? I've found base Whisper has concerning error rates for non-US English accents; I imagine the same is true for other languages whose training data skews heavily toward one regional variety.
Whisper + an LLM can recover some of the gaps by filling in contextually plausible bits, but then it's not a transcript and may contain hallucinations.
There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
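For anyone curious, the N-best flavour is roughly this shape. A minimal sketch, assuming the openai-whisper package; `rerank_with_llm` is just a stand-in for whatever LLM you'd fine-tune or prompt to distill the hypotheses:

```python
import whisper

# Load a Whisper model (any size; larger = slower but fewer errors).
model = whisper.load_model("small")

# Collect several hypotheses by decoding at different temperatures.
# (Beam search with best_of is another way to get alternatives.)
hypotheses = []
for t in (0.0, 0.2, 0.4, 0.6):
    result = model.transcribe("clip.wav", temperature=t)
    hypotheses.append(result["text"].strip())

def rerank_with_llm(candidates: list[str]) -> str:
    # Stand-in: in the approaches above this would be an LLM fine-tuned to
    # distill N-best hypotheses; here we just pick the most common string.
    return max(set(candidates), key=candidates.count)

final_transcript = rerank_with_llm(hypotheses)
```

The expensive part is exactly what's noted above: you're paying for several Whisper decodes plus an LLM pass per clip.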
Language detection in the presence of strong accents is, in my opinion, one of the most under-discussed biases in AI.
Traditional ASR systems struggle when English (or any language) is spoken with a heavy accent, often confusing it with another language. Whisper is also affected by this issue, as you noted.
The root of this problem lies in how language detection typically works: it relies on analyzing audio via MFCCs (Mel-Frequency Cepstral Coefficients), a representation inspired by human auditory perception.
MFCCs come from psychoacoustics, the field that studies how we perceive sound. They emphasize lower frequencies by applying a mel-scale filter bank to a short-time Fourier transform of the audio, then taking log energies and a discrete cosine transform.
However, this approach has a limitation: it's based purely on acoustics. So, if you speak English with a strong accent, the system may not understand the content but instead judge based on your prosody (rhythm, stress, intonation).
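For concreteness, this is roughly what that acoustic-only front end looks like. A minimal sketch using torchaudio (filenames and parameter values are just illustrative); the point is that an acoustic-only language detector only ever sees these coefficients, never the words:

```python
import torchaudio

# Load audio (mono 16 kHz is typical for ASR front ends).
waveform, sample_rate = torchaudio.load("speech.wav")

# MFCCs: mel-scale filter bank over a short-time Fourier transform,
# then log energies and a discrete cosine transform.
mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=sample_rate,
    n_mfcc=13,  # keep the first 13 cepstral coefficients
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)
mfcc = mfcc_transform(waveform)  # shape: (channels, n_mfcc, time_frames)

# A purely acoustic language detector classifies directly on `mfcc`,
# so a strong accent can push the prediction toward the "wrong" language.
```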
With the team at Gladia, we've developed a hybrid approach that combines psycho-acoustic features with content understanding for dynamic language detection.
In simple terms, our system doesn't just listen to how you speak but also understands what you're saying. This dual approach allows for efficient code-switching and doesn't let strong accents fall through the cracks. The system is based on optimized Whisper, among other models.
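To make the "hybrid" idea concrete (this is a generic illustration of combining acoustic and content-based cues, not our actual pipeline), you can blend Whisper's acoustic language probabilities with a text-based language ID run on a draft transcript:

```python
import whisper
from langdetect import detect_langs  # text-based language ID

model = whisper.load_model("small")

# Acoustic cue: Whisper's own language detection on the first 30 s.
audio = whisper.load_audio("clip.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, acoustic_probs = model.detect_language(mel)  # {lang_code: prob}

# Content cue: transcribe a draft, then run text language ID on it.
draft = model.transcribe("clip.wav")["text"]
content_probs = {d.lang: d.prob for d in detect_langs(draft)}

# Blend the two (equal weights here, purely for illustration;
# language codes mostly align since both use ISO 639-1).
langs = set(acoustic_probs) | set(content_probs)
combined = {
    lang: 0.5 * acoustic_probs.get(lang, 0.0) + 0.5 * content_probs.get(lang, 0.0)
    for lang in langs
}
best_language = max(combined, key=combined.get)
```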
In the end, we managed to solve 99% of edge cases involving strong accents, despite the initial Whisper bias there. We've also worked a lot on hallucinations as a separate problem, which resulted in our proprietary model called Whisper-Zero.
If you want to give it a try, there's a free tier available. I'm happy to bounce around ideas on this topic any time; it's super fascinating to me.
Isn't the issue more that traditional ASR systems use US (General American) phonetic transcriptions, and so struggle with accents that have different splits and mergers?
My understanding of Whisper is that it uses a model trained on a range of accents, specifically from LibriVox. The quality would depend on the specific model selected.
The MFCC or other acoustic analysis is to detect the specific phonemes of speech. This is well understood (e.g. the first 3 formants corresponding to the vowels and their relative positions between speakers), and the inverse is used for a lot of the modern TTS engines where the MFCC is predicted and the waveform reconstructed from that (see e.g. https://pytorch.org/audio/stable/transforms.html).
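As a concrete example of that "inverse" direction, here's a rough sketch using the same torchaudio transforms the link above documents: go from a waveform to a mel spectrogram (what a TTS acoustic model predicts) and back with Griffin-Lim phase reconstruction. Parameter values are just illustrative:

```python
import torchaudio

waveform, sr = torchaudio.load("speech.wav")
n_fft, hop, n_mels = 1024, 256, 80

# Forward: waveform -> mel spectrogram (the target a TTS model predicts).
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels
)
mel = to_mel(waveform)

# Inverse: mel -> linear spectrogram -> waveform (Griffin-Lim phase estimate).
to_linear = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sr
)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop)
reconstructed = griffin_lim(to_linear(mel))

torchaudio.save("reconstructed.wav", reconstructed, sr)
```

Modern TTS systems swap Griffin-Lim for a neural vocoder, but the spectrogram-in, waveform-out shape is the same.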
A word's pronunciation can change depending on adjacent words, and other speech phenomena (like H-dropping) can alter it further. Then you have various homophones in different accents. All of this makes it hard to go from the audio/phonetic representation to a transcription.
This is in part why a relatively recent approach is to train the models on the actual spoken text and not the phonetics, so it can learn to disambiguate these issues. Note that this is not perfect, as TTS models like coqui-ai will often mispronounce words in different contexts as the result of a lack of training data or similar issues.
I'm wondering if it makes sense to train the models with the audio, phonetic transcriptions, and the text, and to score them on both phonetic and text accuracy. The idea is that the model can learn what the different phonemes sound like and how they vary between speakers, to try to stabilise the transcriptions and TTS output. The model would then be able to refer to both the audio and the phonemes when making transcriptions; for TTS it could predict the phonemes first, then use them as an additional input alongside the text to generate the audio -- i.e. it can use the text to infer things like prosody.
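A rough sketch of what that multi-task setup might look like, assuming some shared audio encoder and CTC losses over both a phoneme vocabulary and a character vocabulary (all names, sizes, and weights here are made up for illustration):

```python
import torch
import torch.nn as nn

class DualHeadASR(nn.Module):
    """Shared encoder with separate phoneme and text heads, both scored."""

    def __init__(self, feat_dim=80, hidden=256, n_phonemes=70, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.phoneme_head = nn.Linear(2 * hidden, n_phonemes)  # incl. CTC blank
        self.text_head = nn.Linear(2 * hidden, n_chars)        # incl. CTC blank

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        enc, _ = self.encoder(feats)
        return self.phoneme_head(enc), self.text_head(enc)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def joint_loss(phone_logits, char_logits, phone_targets, char_targets,
               input_lens, phone_lens, char_lens, alpha=0.3):
    # CTC expects (time, batch, classes) log-probabilities.
    lp_phone = phone_logits.log_softmax(-1).transpose(0, 1)
    lp_char = char_logits.log_softmax(-1).transpose(0, 1)
    loss_phone = ctc(lp_phone, phone_targets, input_lens, phone_lens)
    loss_char = ctc(lp_char, char_targets, input_lens, char_lens)
    # Score on both, as suggested above: weight phoneme vs. text accuracy.
    return alpha * loss_phone + (1 - alpha) * loss_char
```

At inference, the phoneme head could then serve as the extra input the TTS side conditions on, alongside the text.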
I've found that WhisperX with the medium model has been amazing at subtitling shows containing English dialects (British, Scottish, Australian, New Zealand-ish). It not only nails all the normal speech, but even gets the names and completely made-up slang words. Interestingly, you can tell it was trained on source material in those dialects, because it subtitles using their particular spellings: someone American will get color, and someone British will get colour.
I can't speak to how it performs outside of production quality audio, but in the hundreds of hours of subtitles that I've generated I don't think I've seen a single error.
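In case it helps anyone replicate this, my basic WhisperX flow looks roughly like the snippet below (based on the WhisperX README; argument names may shift between versions):

```python
import whisperx

device = "cuda"  # or "cpu"
audio = whisperx.load_audio("episode.mkv")  # ffmpeg handles video files too

# 1. Transcribe with the chosen Whisper model.
model = whisperx.load_model("medium", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align words to timestamps for clean subtitle timings.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata,
                        audio, device)

for seg in result["segments"]:
    print(f'{seg["start"]:.2f} --> {seg["end"]:.2f}: {seg["text"]}')
```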
IIUC, it's trained mainly on LibriVox audio, along with a few other sources. I'm not sure how it handles spelling, as the spelling will depend on the source content being read, unless the source text has been processed/edited to align with the dialect.
We use Postmark to send emails, and we apparently ran out of our monthly email limit after hitting HN. We just upgraded our plan, and this should now work. If it doesn't, please feel free to ping us directly at support @ ubicloud.com.