Image generated by the author with OpenAI’s DALL·E 2
“Catch me outside! How about that?” The words appeared beneath Danielle Bregoli during her meme-worthy Dr. Phil interview. Unlike most of the internet, Whisper understood her.
Whisper is OpenAI’s new, open-source automatic speech recognition (ASR) system. Released in late September, Whisper can understand thick accents, pick up technical jargon, and filter out background noise. It can also translate multiple languages into English.
Besides transcribing memes, what else can Whisper do? And how does it work? I read OpenAI’s paper, “Robust Speech Recognition via Large-Scale Weak Supervision,” to find out.
Automated Transcription Before Whisper
The paper begins with a tale of two types of learning: supervised and unsupervised.
A typical ASR model learns from a thousand hours of supervised audio data. Training a supervised model is a painstaking process. A human expert needs to listen to and label all the audio in the dataset. This takes hours and drives up the cost of the experiment, limiting how large training datasets can be.
A model can’t generalize from a small, specific dataset. For example, a supervised model might be great at transcribing English, but fail to recognize other languages. Thick accents and background noise might also confuse the model.
So researchers took the obvious next step: What if they trained an ASR model on an unsupervised dataset? Researchers would save time and money because they wouldn’t need to label data. They could afford to build a dataset orders of magnitude larger than before. With this much data, the model should learn to generalize.
While great in theory, this didn’t work in practice. The unsupervised ASR models built sophisticated speech representations but didn’t work zero-shot. Researchers needed to fine-tune their models with custom datasets before they could complete tasks like language identification, transcription, or translation.
Fine-tuning introduces new problems. Fine-tuned models are more complicated and require human experts to train them. Worse, the model might overfit its dataset and fail to generalize.
Despite these issues, unsupervised learning was state-of-the-art for ASR. Then OpenAI stepped in.
What Is Weak Supervision?
“The goal of Whisper is to develop a single robust speech processing system that works reliably without the need for dataset specific fine-tuning to achieve high-quality results on specific distributions.”
– Robust Speech Recognition via Large-Scale Weak Supervision, pg. 5
Whisper finds the middle ground between supervised and unsupervised datasets with “weakly supervised learning.”
“Weak supervision” refers to datasets with low-quality labels. Labels might be incomplete, inexact, or flat-out inaccurate. Weakly supervised models strike a balance between quantity and quality. The datasets aren’t perfect, but they’re good enough.
“A large amount of the promise in weakly supervised training approaches is their potential to use datasets much larger than those in traditional supervised learning. However, this comes with the cost of using data that is possibly much noisier and lower quality than gold-standard supervision.”
– Robust Speech Recognition via Large-Scale Weak Supervision, pg. 10
OpenAI scraped the web for their good-enough dataset, looking for audio already paired with transcripts. They don’t reveal their sources in the paper, but I’m guessing they scraped subtitled YouTube videos, transcribed podcasts, and similar media. They cobbled together 680,000 hours of labeled audio this way. The dataset spans a diversity of contexts, languages, accents, and recording setups.
Whisper learned it all. The model generalized, and it completes many tasks zero-shot, including speech recognition, language identification, timestamp prediction, transcription, and translation. No fine-tuning necessary.
“The goal of a speech recognition system should be to work reliably “out of the box” in a broad range of environments without requiring supervised fine-tuning of a decoder for every deployment distribution.”
– Robust Speech Recognition via Large-Scale Weak Supervision, pg. 1
How Whisper Works
Diversity, cross-training, and enormous datasets are hallmarks of an OpenAI model. Whisper’s architecture is also familiar to anyone who’s used GPT. Like other OpenAI models, Whisper is an encoder-decoder transformer.
Using a tried-and-true architecture let OpenAI isolate the power of weakly supervised learning. Introducing a new model would’ve muddled their results: if Whisper did well, how could they tell whether the dataset or the model was responsible? By recycling a familiar transformer architecture, OpenAI can credit Whisper’s success to weakly supervised learning.
Transformers take in natural language, encode it as a vector (also known as a semantic embedding), then decode that vector as text. Whisper works like any other transformer, except its input is audio instead of text.
Whisper first transforms audio into something manageable. It resamples the input to 16 kHz, cuts it into 30-second snippets, and converts each snippet into a spectrogram (an 80-channel log-magnitude Mel spectrogram, to be precise). The encoder turns these spectrograms into vectors.
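As a rough sketch of that front end, the snippet below pads or trims audio to exactly 30 seconds and computes a log-Mel spectrogram with NumPy. The constants (16 kHz sampling, 25 ms windows with a 10 ms hop, 80 mel channels) come from the paper; the mel filterbank is passed in rather than derived, so this is an illustration of the pipeline shape, not Whisper’s exact implementation.

```python
import numpy as np

SAMPLE_RATE = 16_000    # Whisper resamples all audio to 16 kHz
CHUNK_SECONDS = 30      # fixed 30-second snippets
N_FFT, HOP = 400, 160   # 25 ms analysis windows, 10 ms hop
N_MELS = 80             # 80 mel channels

def pad_or_trim(audio: np.ndarray) -> np.ndarray:
    """Force every input to exactly 30 seconds of samples."""
    target = SAMPLE_RATE * CHUNK_SECONDS
    if len(audio) >= target:
        return audio[:target]
    return np.pad(audio, (0, target - len(audio)))

def log_mel_spectrogram(audio: np.ndarray, mel_filters: np.ndarray) -> np.ndarray:
    """Sketch of the 80-channel log-Mel front end (mel_filters: 80 x 201)."""
    window = np.hanning(N_FFT)
    frames = []
    for start in range(0, len(audio) - N_FFT, HOP):
        # Power spectrum of one 25 ms window...
        spectrum = np.abs(np.fft.rfft(audio[start:start + N_FFT] * window)) ** 2
        # ...collapsed onto 80 mel channels.
        frames.append(mel_filters @ spectrum)
    mel = np.array(frames).T  # shape (80, ~3000): a time-frequency image
    return np.log10(np.maximum(mel, 1e-10))
```

A 30-second snippet at a 10 ms hop yields roughly 3,000 frames, so each snippet becomes an 80 × ~3000 image for the encoder to consume.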
Decoding comes next.
OpenAI predicted that one model could perform multiple tasks on the same audio. Why encode the same snippet over and over again? The decoder juggles different tasks using special tokens.
First, the decoder predicts tokens for when speech starts and which language is being spoken. Whisper’s dataset includes training data for recognizing 75 languages, and each language has its own token. So does silence, tagged <|nospeech|>.
Next, tokens determine what Whisper should do. If the model detects English, it transcribes the audio with timestamps. If Whisper recognizes another language, it translates instead.
Once the model knows the language and task, it outputs text in plain English.
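The steps above can be sketched as a small function that assembles the decoder’s prompt: a start token, then a language token (or the no-speech tag), then a task token. The token strings follow the multitask format described in the paper, but this is a simplified illustration, not Whisper’s actual tokenizer code.

```python
from typing import List, Optional

def make_decoder_prompt(language: Optional[str],
                        task: str = "transcribe",
                        timestamps: bool = True) -> List[str]:
    """Assemble the special tokens that tell Whisper's decoder what to do."""
    tokens = ["<|startoftranscript|>"]
    if language is None:                  # no speech detected in this snippet
        return tokens + ["<|nospeech|>"]
    tokens.append(f"<|{language}|>")      # one token per supported language
    tokens.append(f"<|{task}|>")          # "transcribe" or "translate"
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens
```

For English audio, the prompt becomes `<|startoftranscript|><|en|><|transcribe|>`, and the decoder then emits interleaved timestamp and text tokens after it.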
How Good Is Whisper?
OpenAI confirmed that as models get bigger, their ability to recognize, identify, and translate other languages improves. Scale is all you need.
Whisper competed with other ASR models to compare performance. The results varied.
Supervised models outperformed Whisper at specific tasks. For example, other models identified languages more accurately, beating Whisper by 13.6%.
Other times, Whisper surpassed its competitors. For instance, OpenAI compared the robustness of several models by feeding them audio contaminated with white noise and “pub noises.” The supervised models listened well at first, but their error rates climbed as the noise worsened. Whisper handled more distraction and stayed focused on the speech.
Now the question you really want to ask: How does Whisper compare to humans?
Whisper was “very close to human-level accuracy” at transcription. Humans’ error rate was “only a fraction of a percentage point better.”
“When compared to a human… the best zero-shot Whisper models roughly match their accuracy and robustness.”
– Robust Speech Recognition via Large-Scale Weak Supervision, pg. 6
Whisper is almost as good at transcription as you are, plus it works faster and doesn’t need breaks. And I don’t know about you, but I can’t identify seventy-five spoken languages! In many ways, Whisper already beats us.
What’s Next for Whisper?
Whisper is a zero-shot ASR model. Everything it can do, it does right out of the box. So what happens if you fine-tune it? This feels like the obvious next question. With a little specialized training, what else could Whisper do?
OpenAI wants to improve Whisper’s decoding structure to lower its error rate. Whisper also needs more training data for languages like Chinese and Hebrew that have little in common with English.
Meanwhile, users experiment and play with their new toy. They’ve discovered that Whisper can generate YouTube subtitles and transcribe podcasts. Of course, you can always transcribe your favorite memes.
