LLMs are only capable of hallucinating, whereas humans can hallucinate but can also empirically observe reality. So whatever the human error rate is, it's necessarily lower than that of LLMs.
Tangent here: really? I've found base Whisper has concerning error rates for non-US English accents; I imagine the same is true for other languages whose training data skews heavily toward one regional variety.
Whisper + an LLM can recover some of the gaps by filling in contextually plausible bits, but then it's not a transcript and may contain hallucinations.
There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
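For anyone curious, the N-best flavour is roughly this shape. A minimal sketch, assuming the openai-whisper package; `rerank_with_llm` is just a stand-in for whatever LLM you'd fine-tune or prompt to distill the hypotheses:

```python
import whisper

# Load a Whisper model (any size; larger = slower but fewer errors).
model = whisper.load_model("small")

# Collect several hypotheses by decoding at different temperatures.
# (Beam search with best_of is another way to get alternatives.)
hypotheses = []
for t in (0.0, 0.2, 0.4, 0.6):
    result = model.transcribe("clip.wav", temperature=t)
    hypotheses.append(result["text"].strip())

def rerank_with_llm(candidates: list[str]) -> str:
    # Stand-in: in the approaches above this would be an LLM fine-tuned to
    # distill N-best hypotheses; here we just pick the most common string.
    return max(set(candidates), key=candidates.count)

final_transcript = rerank_with_llm(hypotheses)
```

The expensive part is exactly what's noted above: you're paying for several Whisper decodes plus an LLM pass per clip.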
Language detection in the presence of strong accents is, in my opinion, one of the most under-discussed biases in AI.
Traditional ASR systems struggle when English (or any language) is spoken with a heavy accent, often confusing it with another language. Whisper is also affected by this issue, as you noted.
The root of this problem lies in how language detection typically works: it relies on analyzing audio via MFCCs (Mel-Frequency Cepstral Coefficients), a representation inspired by human auditory perception.
MFCCs come from psychoacoustics, the field that studies how we perceive sound. They emphasize lower frequencies by applying a mel-scale filter bank to a short-time Fourier transform of the audio, then taking log energies and a discrete cosine transform.
However, this approach has a limitation: it's based purely on acoustics. So, if you speak English with a strong accent, the system may not understand the content but instead judge based on your prosody (rhythm, stress, intonation).
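For concreteness, this is roughly what that acoustic-only front end looks like. A minimal sketch using torchaudio (filenames and parameter values are just illustrative); the point is that an acoustic-only language detector only ever sees these coefficients, never the words:

```python
import torchaudio

# Load audio (mono 16 kHz is typical for ASR front ends).
waveform, sample_rate = torchaudio.load("speech.wav")

# MFCCs: mel-scale filter bank over a short-time Fourier transform,
# then log energies and a discrete cosine transform.
mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=sample_rate,
    n_mfcc=13,  # keep the first 13 cepstral coefficients
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)
mfcc = mfcc_transform(waveform)  # shape: (channels, n_mfcc, time_frames)

# A purely acoustic language detector classifies directly on `mfcc`,
# so a strong accent can push the prediction toward the "wrong" language.
```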
With the team at Gladia, we've developed a hybrid approach that combines psycho-acoustic features with content understanding for dynamic language detection.
In simple terms, our system doesn't just listen to how you speak but also understands what you're saying. This dual approach allows for efficient code-switching and doesn't let strong accents fall through the cracks. The system is based on optimized Whisper, among other models.
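To make the "hybrid" idea concrete (this is a generic illustration of combining acoustic and content-based cues, not our actual pipeline), you can blend Whisper's acoustic language probabilities with a text-based language ID run on a draft transcript:

```python
import whisper
from langdetect import detect_langs  # text-based language ID

model = whisper.load_model("small")

# Acoustic cue: Whisper's own language detection on the first 30 s.
audio = whisper.load_audio("clip.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, acoustic_probs = model.detect_language(mel)  # {lang_code: prob}

# Content cue: transcribe a draft, then run text language ID on it.
draft = model.transcribe("clip.wav")["text"]
content_probs = {d.lang: d.prob for d in detect_langs(draft)}

# Blend the two (equal weights here, purely for illustration;
# language codes mostly align since both use ISO 639-1).
langs = set(acoustic_probs) | set(content_probs)
combined = {
    lang: 0.5 * acoustic_probs.get(lang, 0.0) + 0.5 * content_probs.get(lang, 0.0)
    for lang in langs
}
best_language = max(combined, key=combined.get)
```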
In the end, we managed to solve 99% of edge cases involving strong accents, despite the initial Whisper bias there. We've also worked a lot on hallucinations as a separate problem, which resulted in our proprietary model called Whisper-Zero.
If you want to give it a try, there's a free tier available. I'm happy to bounce around ideas on this topic any time; it's super fascinating to me.
Isn't the issue more that traditional ASR systems use US (General American) phonetic transcriptions, and so struggle with accents that have different splits and mergers?
My understanding of Whisper is that it uses a model trained on a range of accents, specifically from LibriVox. The quality would depend on the specific model selected.
The MFCC or other acoustic analysis is to detect the specific phonemes of speech. This is well understood (e.g. the first 3 formants corresponding to the vowels and their relative positions between speakers), and the inverse is used for a lot of the modern TTS engines where the MFCC is predicted and the waveform reconstructed from that (see e.g. https://pytorch.org/audio/stable/transforms.html).
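As a concrete example of that "inverse" direction, here's a rough sketch using the same torchaudio transforms the link above documents: go from a waveform to a mel spectrogram (what a TTS acoustic model predicts) and back with Griffin-Lim phase reconstruction. Parameter values are just illustrative:

```python
import torchaudio

waveform, sr = torchaudio.load("speech.wav")
n_fft, hop, n_mels = 1024, 256, 80

# Forward: waveform -> mel spectrogram (the target a TTS model predicts).
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels
)
mel = to_mel(waveform)

# Inverse: mel -> linear spectrogram -> waveform (Griffin-Lim phase estimate).
to_linear = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sr
)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=hop)
reconstructed = griffin_lim(to_linear(mel))

torchaudio.save("reconstructed.wav", reconstructed, sr)
```

Modern TTS systems swap Griffin-Lim for a neural vocoder, but the spectrogram-in, waveform-out shape is the same.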
A word's pronunciation can change depending on adjacent words, and other speech phenomena (like H-dropping) can alter it further. Then you have various homophones in different accents. All of this makes it hard to go from the audio/phonetic representation to a transcription.
This is in part why a relatively recent approach is to train the models on the actual spoken text and not the phonetics, so it can learn to disambiguate these issues. Note that this is not perfect, as TTS models like coqui-ai will often mispronounce words in different contexts as the result of a lack of training data or similar issues.
I'm wondering if it makes sense to train the models with the audio, phonetic transcriptions, and the text, and to score them on both phonetic and text accuracy. The idea is that the model can learn what the different phonemes sound like and how they vary between speakers, to try to stabilise the transcriptions and TTS output. The model would then be able to refer to both the audio and the phonemes when making transcriptions; for TTS it could predict the phonemes first, then use them as an additional input alongside the text to generate the audio -- i.e. it can use the text to infer things like prosody.
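A rough sketch of what that multi-task setup might look like, assuming some shared audio encoder and CTC losses over both a phoneme vocabulary and a character vocabulary (all names, sizes, and weights here are made up for illustration):

```python
import torch
import torch.nn as nn

class DualHeadASR(nn.Module):
    """Shared encoder with separate phoneme and text heads, both scored."""

    def __init__(self, feat_dim=80, hidden=256, n_phonemes=70, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        self.phoneme_head = nn.Linear(2 * hidden, n_phonemes)  # incl. CTC blank
        self.text_head = nn.Linear(2 * hidden, n_chars)        # incl. CTC blank

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        enc, _ = self.encoder(feats)
        return self.phoneme_head(enc), self.text_head(enc)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def joint_loss(phone_logits, char_logits, phone_targets, char_targets,
               input_lens, phone_lens, char_lens, alpha=0.3):
    # CTC expects (time, batch, classes) log-probabilities.
    lp_phone = phone_logits.log_softmax(-1).transpose(0, 1)
    lp_char = char_logits.log_softmax(-1).transpose(0, 1)
    loss_phone = ctc(lp_phone, phone_targets, input_lens, phone_lens)
    loss_char = ctc(lp_char, char_targets, input_lens, char_lens)
    # Score on both, as suggested above: weight phoneme vs. text accuracy.
    return alpha * loss_phone + (1 - alpha) * loss_char
```

At inference, the phoneme head could then serve as the extra input the TTS side conditions on, alongside the text.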
I've found that WhisperX with the medium model has been amazing at subtitling shows containing English dialects (British, Scottish, Australian, New Zealand-ish). It not only nails all the normal speech, but even gets the names and completely made-up slang words. Interestingly, you can tell it was trained on source material in those dialects, because it subtitles using their particular spellings: someone American will get color, and someone British will get colour.
I can't speak to how it performs outside of production quality audio, but in the hundreds of hours of subtitles that I've generated I don't think I've seen a single error.
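In case it helps anyone replicate this, my basic WhisperX flow looks roughly like the snippet below (based on the WhisperX README; argument names may shift between versions):

```python
import whisperx

device = "cuda"  # or "cpu"
audio = whisperx.load_audio("episode.mkv")  # ffmpeg handles video files too

# 1. Transcribe with the chosen Whisper model.
model = whisperx.load_model("medium", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align words to timestamps for clean subtitle timings.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata,
                        audio, device)

for seg in result["segments"]:
    print(f'{seg["start"]:.2f} --> {seg["end"]:.2f}: {seg["text"]}')
```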
IIUC, it's trained mainly on LibriVox audio, along with a few other sources. I'm not sure how it handles spelling, as the spelling will depend on the source content being read, unless the source text has been processed/edited to align with the dialect.
We use Postmark to send emails, and we apparently ran out of our monthly email limit after hitting HN. We just upgraded our plan, and this should now work. If it doesn't, please feel free to ping us directly at support @ ubicloud.com.