Running Whisper on an M1 Mac to transcribe audio data locally

If you ever find yourself having to transcribe hours upon hours of audio data, for example for your thesis or user research, Whisper might be of help. However, running Whisper in the cloud, or using cloud transcription services, can cause compliance issues, for example under the GDPR.

Fortunately, you have another option. Running Whisper locally saves you from exposing potentially sensitive information to a third party, and on an M1 Mac the performance is pretty decent: on an M1 Max, transcription takes around 1/10th of the audio file's playback time.

Getting there does require some finessing, though, and I hope this blog post can help you along the way and provide some tips for processing the data for greater accuracy.

What is Whisper?

Whisper is an open-source neural network for speech recognition, trained by OpenAI, that reaches impressive levels of accuracy across many languages. Since its original release, OpenAI has open sourced both the model and the accompanying runtime, allowing anyone to run Whisper either on cloud hardware or locally.

Large neural networks usually require powerful GPUs, so for most people running them is limited to cloud hardware. On the M1 MacBooks, however, and I suspect on more powerful x86 CPUs as well, Whisper runs with acceptable performance for personal use.

Running Whisper on an M1 Mac

OpenAI does not provide a native ARM version of Whisper, but Georgi Gerganov helpfully provides whisper.cpp, a plain C/C++ port of OpenAI's original Python implementation.

First, make sure you have your build dependencies set up by running xcode-select --install, and that you have Homebrew installed on your Mac.
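If you're setting these up from scratch, the two commands below should cover it. Note that the Homebrew install command changes occasionally, so check brew.sh for the current one:

xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"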

To install Whisper, open up your terminal and clone the repository:

git clone https://github.com/ggerganov/whisper.cpp.git

With that done, you can cd whisper.cpp, and we're ready to fetch our model. If any of the instructions below differ from the README, please refer to that instead.

I prefer using the large model, but the base.en model should work quite well and presumably be much faster.

bash ./models/download-ggml-model.sh <model name, e.g. large>

This will take a while depending on your network connection, and the large model requires around 3GB of storage.

You now have to compile the program. To do that you can run:

make large

which will also run a test audio file through the model. If all goes well you should see output like below.

[Image: a screenshot of the expected output of the command. There are around 50-60 lines of text, the most important being the transcription towards the bottom.]
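If you want to re-run the bundled sample yourself later without rebuilding, the invocation should look roughly like this (assuming you downloaded the large model; the jfk.wav sample ships with the repository):

./main -m models/ggml-large.bin -f samples/jfk.wav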

Now Whisper is ready! However, Whisper is a bit picky about its input, and your audio files may need some pre-processing before Whisper can analyze them.

Pre-processing audio files for analysis by Whisper

This version of Whisper only accepts a very specific input format: 16-bit, 16 kHz, mono WAV files. Unlike some other versions of Whisper, it has no convenient audio conversion built in.

The conversion isn’t too hard if you’re familiar with the command line, and following this guide should help even novices prepare their audio data in no time. Let’s get started.

With Homebrew installed, you can install ffmpeg, a popular open-source media converter that runs locally.

brew install ffmpeg (This can take a while)

With ffmpeg installed, open your whisper.cpp folder in Finder using open . in the terminal and create a new folder called data. Copy the audio file(s) you want to convert into the data folder; if you prefer, the same step can be done entirely in the terminal, as sketched below.
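That terminal version looks roughly like this, with interview.m4a standing in for whatever file you actually have:

mkdir -p data
cp ~/Downloads/interview.m4a ./data/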

Open your terminal again in the whisper.cpp folder, and run the following command:

mkdir -p output
for i in ./data/*; do
  filename="${i##*/}"        # strip the directory path
  filename="${filename%.*}"  # strip the file extension
  ffmpeg -i "$i" -acodec pcm_s16le -ac 1 -ar 16000 "./output/${filename}.wav"
done

This command creates an output directory if it doesn’t already exist, then takes every audio file in data and converts it with ffmpeg into a Whisper-friendly format, writing the result to the output directory under a matching filename.

Running the speech recognition on our processed audio files

With all that said and done, we are now ready for Whisper to do its thing. While we could run Whisper directly, for transcription work it’s useful to specify the language being spoken and a suitable output format. I tend to stick to WebVTT, a standardized way of expressing a transcript synced to a media file on the web. This also lets you use a plethora of tools to edit any inaccuracies later; stay tuned for that!

To run Whisper on our file(s), we can use the following command. Note the language flag: the spoken language should be specified using its two-letter ISO 639-1 code. So far I’ve tested en and no with decent results!

for i in output/*.wav; do ./main -m ./models/<model.bin> -l <language code> --output-vtt -f "$i"; done

For an English audio file using the large model:

for i in output/*.wav; do ./main -m ./models/ggml-large.bin -l en --output-vtt -f "$i"; done

This will now run for a while. In my tests, transcription took around 1/10th of the audio file’s playback time on an M1 Max using the large model.

Finally, once everything is done, you can open the output folder in Finder using open output/ and find your audio file and WebVTT file. For the JFK sample included in the repo the WebVTT file will contain:

WEBVTT

00:00:00.000 --> 00:00:11.000
 And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

Verifying the transcription

While the per-word accuracy is quite impressive, as shown below, we cannot completely rely on the model’s output. If you are doing a word-for-word transcription, you may want to verify it. However, doing so can be cumbersome and take almost as long as transcribing manually in the first place. That is, unless you have the right tools! See more after the break:

[Image: a graph detailing the accuracy of several languages using OpenAI’s speech recognition. The most impressive are English and Italian, which boast error rates below 5%; even Japanese and Russian score around 6-7%.]

The speech recognition accuracy is impressive. The numbers here represent the word error rate (WER) per language: the proportion of words the model transcribes incorrectly, so lower is better. (source)
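For context, WER is typically calculated by comparing the model’s output against a human-made reference transcript:

WER = (substitutions + deletions + insertions) / number of words in the reference

so a WER of 5% means roughly one error in every twenty words.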

One such tool is HappyScribe, which offers a free online subtitle editor that processes things locally by default. It allows you to upload the WebVTT file Whisper generated together with the audio file, and easily select a text segment to start listening from its timestamp in the audio.

[Image: a screenshot of the HappyScribe tool showing side-by-side text and audio previews.]

Once you’re done, you can download the resulting edited WebVTT file. The tool also autosaves locally, so you don’t have to worry about losing any work!

Also, shoutout to HappyScribe for offering this excellent tool for free! I recommend thanking them by using their automatic speech recognition services next time, if your budget and compliance requirements allow it.

Closing thoughts

Following this blog post, you’ve now seen how far we’ve come with automatic speech recognition, even allowing for on-device transcription with very low error rates across multiple languages. The speed at which AI is developing is mind-blowing, and I expect to see more use cases pop up in the near future.

In the meantime, support those that open source these models and allow anyone to run them on their own hardware. The democratization of AI will fuel the next wave of applications and innovation in our space, and I’m so thrilled to be part of it!
