Language detection in the presence of strong accents is, in my opinion, one of the most under-discussed biases in AI.
Traditional ASR systems struggle when English (or any language) is spoken with a heavy accent, often confusing it with another language. Whisper is also affected by this issue, as you noted.
The root of this problem lies in how language detection typically works. It relies on analyzing audio via MFCCs (Mel-Frequency Cepstral Coefficients), a representation inspired by human auditory perception.
MFCCs come from the field of psychoacoustics, which studies how we perceive sound. They emphasize lower frequencies and are computed with a Fourier transform over mel-scaled filter banks (followed by a discrete cosine transform) to turn audio into a compact frequency representation.
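If it helps to make that concrete, here's a minimal sketch of MFCC extraction with torchaudio; the file path and parameter choices are placeholders, not what any particular ASR system actually uses:

    # Minimal MFCC extraction sketch (torchaudio); all values are illustrative.
    import torchaudio

    waveform, sample_rate = torchaudio.load("speech_sample.wav")  # placeholder path

    mfcc_transform = torchaudio.transforms.MFCC(
        sample_rate=sample_rate,
        n_mfcc=13,  # a common choice for the number of coefficients
        melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
    )
    mfcc = mfcc_transform(waveform)  # shape: (channels, n_mfcc, time_frames)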
However, this approach has a limitation: it's based purely on acoustics. So, if you speak English with a strong accent, the system may not grasp the content and may instead judge the language from your prosody (rhythm, stress, intonation).
With the team at Gladia, we've developed a hybrid approach that combines psychoacoustic features with content understanding for dynamic language detection.
In simple terms, our system doesn't just listen to how you speak but also understands what you're saying. This dual approach allows for efficient code-switching and doesn't let strong accents fall through the cracks. The system is based on an optimized Whisper, among other models.
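I obviously can't share our actual pipeline, but as a toy illustration of the general "acoustic + content" idea: you can blend Whisper's own acoustic language-detection probabilities with a text-based language ID run over a rough first-pass transcript. The langdetect library and the 50/50 weighting below are arbitrary choices for the sketch, not what we ship:

    # Toy blend of acoustic and content-based language ID; not a real product pipeline.
    import whisper                       # openai-whisper
    from langdetect import detect_langs  # text-based language ID

    model = whisper.load_model("base")

    # Acoustic signal: Whisper's built-in language detection on a 30 s window.
    audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))  # placeholder path
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, acoustic_probs = model.detect_language(mel)

    # Content signal: language ID over a rough first-pass transcript.
    rough_text = model.transcribe("sample.wav")["text"]
    content_probs = {r.lang: r.prob for r in detect_langs(rough_text)}

    # Naive blend of the two signals (alpha is an arbitrary knob).
    alpha = 0.5
    languages = set(acoustic_probs) | set(content_probs)
    blended = {lang: alpha * acoustic_probs.get(lang, 0.0)
                     + (1 - alpha) * content_probs.get(lang, 0.0)
               for lang in languages}
    print(max(blended, key=blended.get))

With a heavy accent the acoustic probabilities get noisy, but the content-based score tends to stay anchored on the actual language of the words.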
In the end, we managed to solve 99% of the edge cases involving strong accents, despite Whisper's initial bias there. We've also worked a lot on hallucinations as a separate problem, which resulted in our proprietary model, Whisper-Zero.
If you want to give it a try, there's a free tier available. I'm happy to bounce around ideas on this topic any time; it's super fascinating to me.
Isn't the issue more that traditional ASR systems use US (General American) phonetic transcriptions, and so struggle with accents that have different splits and mergers?
My understanding of Whisper is that it uses a model trained on a range of accents, specifically from LibriVox. The quality would depend on the specific model selected.
The MFCC or other acoustic analysis is there to detect the specific phonemes of speech. This is well understood (e.g. the first three formants correspond to the vowels, and their relative positions vary between speakers), and the inverse is used in a lot of modern TTS engines, where the MFCC is predicted and the waveform reconstructed from it (see e.g. https://pytorch.org/audio/stable/transforms.html).
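That inverse direction is easy to play with directly in torchaudio: go to a mel representation and back with InverseMelScale plus Griffin-Lim. In a real TTS engine the mel would be predicted by the model and a neural vocoder would replace Griffin-Lim; the file path and sizes here are placeholders:

    # Round-trip sketch: waveform -> mel spectrogram -> approximate waveform.
    import torchaudio
    import torchaudio.transforms as T

    waveform, sr = torchaudio.load("speech_sample.wav")  # placeholder path

    n_fft, n_mels = 1024, 80
    to_mel = T.MelSpectrogram(sample_rate=sr, n_fft=n_fft, n_mels=n_mels)
    mel = to_mel(waveform)  # in a TTS model this would be predicted, not computed

    inv_mel = T.InverseMelScale(n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sr)
    griffin_lim = T.GriffinLim(n_fft=n_fft)
    reconstructed = griffin_lim(inv_mel(mel))  # phase is estimated, so quality is rough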
Some words change pronunciation depending on the words adjacent to them, and other speech phenomena (like H-dropping) can alter the pronunciation further. Then you have different homophones in different accents. All of these make it hard to go from the audio/phonetic representation to a transcription.
This is partly why a relatively recent approach is to train the models on the actual spoken text rather than the phonetics, so they can learn to disambiguate these issues. Note that this is not perfect, as TTS models like coqui-ai will often mispronounce words in different contexts as a result of a lack of training data or similar issues.
I'm wondering if it makes sense to train models on the audio, the phonetic transcriptions, and the text, and score them on both phonetic and text accuracy. The idea is that a model could learn what the different phonemes sound like and how they vary between speakers, which should help stabilise both transcriptions and TTS output. For transcription it could then refer to both the audio and the phonemes; for TTS it could predict the phonemes first and then use them as an additional input alongside the text when generating the audio -- i.e. it can use the text to infer things like prosody.
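If I were to prototype that, I'd probably start with a shared encoder and two CTC heads, one scored against phoneme targets and one against character targets, trained on a weighted joint loss. Everything below (layer sizes, vocabulary sizes, the loss weighting) is made up purely to show the shape of the idea:

    # Sketch of a shared encoder with phoneme and text heads, joint CTC loss.
    import torch
    import torch.nn as nn

    class DualHeadASR(nn.Module):
        def __init__(self, n_mels=80, hidden=256, n_phonemes=70, n_chars=32):
            super().__init__()
            self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                                   batch_first=True, bidirectional=True)
            self.phoneme_head = nn.Linear(2 * hidden, n_phonemes)
            self.text_head = nn.Linear(2 * hidden, n_chars)

        def forward(self, features):  # features: (batch, time, n_mels)
            encoded, _ = self.encoder(features)
            return self.phoneme_head(encoded), self.text_head(encoded)

    ctc = nn.CTCLoss(blank=0)

    def joint_loss(phoneme_logits, text_logits, phoneme_targets, text_targets,
                   input_lengths, phoneme_lengths, text_lengths, alpha=0.5):
        # CTCLoss expects (time, batch, classes) log-probabilities.
        lp = phoneme_logits.log_softmax(-1).transpose(0, 1)
        lt = text_logits.log_softmax(-1).transpose(0, 1)
        return (alpha * ctc(lp, phoneme_targets, input_lengths, phoneme_lengths)
                + (1 - alpha) * ctc(lt, text_targets, input_lengths, text_lengths))

Scoring on both targets forces the encoder to keep a phoneme-level representation around, which is the stabilising effect you're describing.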