What is OpenAI Whisper?

by Stephen M. Walker II, Co-Founder / CEO

What is OpenAI Whisper?

OpenAI Whisper is an automatic speech recognition (ASR) system. It's designed to convert spoken language into written text, making it a valuable tool for transcribing audio files. Whisper is trained on a massive dataset of 680,000 hours of multilingual and multitask supervised data collected from the web.

The Whisper system is robust to accents, background noise, and technical language, making it versatile and adaptable to various audio conditions. It supports transcription in multiple languages and can translate non-English languages into English.

Whisper is implemented as an encoder-decoder Transformer. The input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is then trained to predict the corresponding text.

How can I run OpenAI Whisper myself?

To run OpenAI Whisper yourself, you need to install the necessary Python packages and system dependencies. Here's a step-by-step guide:

  1. Install the Whisper Python package — You can download and install the latest release of Whisper using pip, a package manager for Python. Use the following command in your terminal:
pip install -U openai-whisper

Alternatively, you can install the latest commit from the Whisper repository with its Python dependencies using:

pip install git+https://github.com/openai/whisper.git

To update the package to the latest version of the repository, run:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
  2. Install ffmpeg — Whisper requires the command-line tool ffmpeg to be installed on your system. On macOS, you can install it using Homebrew with the following command (on Ubuntu or Debian, use sudo apt install ffmpeg instead):
brew install ffmpeg
  3. Load the model and transcribe audio — Once the necessary packages are installed, you can use the Whisper model to transcribe audio. Here's a basic example of how to do this in Python:
import whisper

# Load the model
model = whisper.load_model("base")

# Load the audio and pad or trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Convert the audio into a log-Mel spectrogram on the model's device
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# Print the recognized text
print(result.text)
In this code, mel is a log-Mel spectrogram, a representation of the short-term power spectrum of a sound. Whisper's pad_or_trim and log_mel_spectrogram helpers convert raw audio into this form before it is fed to the model.
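To make the preprocessing step concrete, here is a minimal NumPy sketch of a log-Mel spectrogram. The frame and filterbank parameters mirror Whisper's defaults (16 kHz audio, 400-sample FFT, 160-sample hop, 80 mel bins), but the filterbank construction is simplified for illustration; in practice you would just call whisper.log_mel_spectrogram:

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Simplified log-Mel features: windowed STFT power -> mel filterbank -> log."""
    # Short-time Fourier transform via a sliding Hann window
    window = np.hanning(n_fft)
    frames = [audio[i:i + n_fft] * window
              for i in range(0, len(audio) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, n_fft//2 + 1)

    # Triangular mel filterbank spanning 0 .. sr/2
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    mel = power @ fb.T
    return np.log10(np.maximum(mel, 1e-10))

# One second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
feats = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 80)
```

Each row is one 10 ms frame and each column one mel band; for a full 30-second chunk, the encoder consumes a tensor of exactly this kind.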

Remember, Whisper is a compute-intensive model, so for the larger, more capable model sizes it's recommended to run it on a GPU, whether locally or in the cloud.

How does OpenAI Whisper work?

Whisper was trained on 680,000 hours of multilingual and multitask supervised data, which has led to its robustness to accents, background noise, and technical language. It can transcribe in multiple languages and translate those languages into English.

The Whisper architecture is implemented as an encoder-decoder Transformer. The input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is then trained to predict the corresponding text.

Whisper is open-source and can be used as a foundation for building useful applications. It has been used to create various tools and applications, such as audio transcription apps for iOS and macOS, YouTube subtitle generation tools, and more.

For developers who program in Python, the Whisper model can be used with the Hugging Face library. The library provides pre-trained models and processors for Whisper, which can be used to transcribe and translate speech.
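As a sketch of that route, the following uses the transformers library with the openai/whisper-tiny checkpoint; the one-second clip of silence is just a placeholder for real audio, so the decoded text is not meaningful:

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the pretrained processor (feature extractor + tokenizer) and model
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# One second of silence at 16 kHz stands in for real audio here
audio = np.zeros(16000, dtype=np.float32)

# Convert the raw waveform into log-Mel input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate token IDs and decode them back to text
predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```

Swapping the checkpoint name for a larger one (e.g. openai/whisper-large-v3) trades speed for accuracy without changing the code.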

In terms of performance, while Whisper does not outperform models that specialize in LibriSpeech, a standard speech recognition benchmark, it has shown strong performance across diverse datasets.

How does OpenAI Whisper compare to other speech recognition systems?

OpenAI's Whisper is an automatic speech recognition (ASR) system that has been trained on 680,000 hours of multilingual and multitask supervised data. It is known for its robust performance, especially when dealing with diverse data, multiple speakers, accents, background noise, or technical terminology. However, its performance can vary by language, with high-resource languages generally faring better than low-resource ones.

Comparatively, Deepgram, another speech-to-text API, claims to be 36% more accurate than OpenAI Whisper. It also boasts faster transcription speeds, native streaming support, and lower costs. Deepgram also offers extensive multilingual support and advanced formatting features.

Another comparison comes from a project called "faster-whisper", which has managed to significantly reduce the inference time for transcribing audio with Whisper models. For instance, the classic OpenAI Whisper small model can transcribe 13 minutes of audio in 10 minutes and 31 seconds, while faster-whisper can do the same in just 2 minutes and 44 seconds.
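A quick back-of-the-envelope check of those quoted times shows the speedup works out to nearly 4x:

```python
# Times quoted above: 10 min 31 s for openai-whisper vs 2 min 44 s for faster-whisper
baseline = 10 * 60 + 31   # 631 seconds
faster = 2 * 60 + 44      # 164 seconds

speedup = baseline / faster
print(f"{speedup:.2f}x faster")  # prints 3.85x faster
```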

In terms of limitations, Whisper only offers batch processing for pre-recorded audio, which may not be suitable for use cases that rely on real-time speech processing. It also lacks built-in diarization, word-level timestamps, or keyword detection. Additionally, the Whisper API only accepts files up to 25MB in size, although there are no limits on the duration of the audio.

In terms of hardware requirements, Whisper's performance can vary depending on the GPU used. For instance, the Whisper large model requires around 10GB of VRAM.

Is OpenAI Whisper open source?

Yes, OpenAI's Whisper is open source. Whisper is a general-purpose speech recognition model trained on a large dataset of diverse audio. It was first released as open-source software in September 2022. The model and inference code are available on GitHub. Whisper is capable of transcribing speech in English and several other languages, and it can also translate several non-English languages into English. It's designed to be robust to accents, background noise, and technical language. The open-source nature of Whisper allows developers and researchers to use and modify it for their specific needs, contributing to the advancement of speech recognition technology.

More terms

Mistral "Mixtral" 8x7B 32k

The Mistral "Mixtral" 8x7B 32k model is an 8-expert Mixture of Experts (MoE) architecture with a 32k-token context window. This model is designed for high performance and efficiency, surpassing the 13B Llama 2 in all benchmarks and outperforming the 34B Llama 1 in reasoning, math, and code generation. It uses grouped-query attention for quick inference and sliding-window attention to handle long sequences; Mistral 7B Instruct is a variant fine-tuned for following directions.


Precision vs Recall

Precision tells us how many of the items we identified as correct were actually correct, while recall tells us how many of the correct items we were able to identify. It's like looking for gold: precision is our accuracy in finding only gold instead of rocks, and recall is our success in finding all the pieces of gold in the dirt.
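Continuing the gold analogy, here is a small sketch of how the two metrics are computed from sets of predicted and relevant items (the single-letter item names are arbitrary placeholders):

```python
def precision_recall(predicted, relevant):
    """Precision: fraction of predicted items that are actually relevant.
    Recall: fraction of relevant items that we managed to predict."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted)
    recall = true_positives / len(relevant)
    return precision, recall

# We flagged 5 nuggets as gold; 4 really are gold, out of 8 gold pieces in the dirt
p, r = precision_recall(predicted={"a", "b", "c", "d", "e"},
                        relevant={"a", "b", "c", "d", "f", "g", "h", "i"})
print(p, r)  # 0.8 0.5
```

High precision with low recall means we rarely pick up rocks but leave most of the gold behind; the reverse means we scoop up everything, gold and rocks alike.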

