
Transcription: local video subtitling with Whisper

View the code

Introduction

With OpenAI’s Whisper model being as good as it is when run locally, there is no longer any need to ship your family videos and company meeting recordings off to a cloud service.

After watching The Hobbit, whose questionable audio mixing swung from whisper-quiet dialogue carrying key plot points to booming instrumentals, I was motivated to build Transcription, a CLI tool that runs entirely locally on your machine. It takes a video file, extracts the audio, runs it through OpenAI’s Whisper model locally, and outputs subtitles (translating into English if necessary). There are no network requests, no API keys, and no per-minute billing. Once the tool and model are installed, it works entirely offline.

You can start transcribing videos and generating .srt files with the following:

transcription -v ./a-video-to-transcribe.mp4

# See all configuration options available with:
transcription --help

How it works

The pipeline is straightforward (a rough code sketch follows the list):

  1. Input: the user provides path(s) to video files via the CLI.
  2. Audio extraction: FFmpeg extracts the audio from the video, converting it to WAV for compatibility with Whisper.
  3. Translation and transcription: the extracted audio is passed to a locally running Whisper model, which translates speech into English and produces timestamped text segments.
  4. Output: the timestamped segments are formatted into an .srt file and optionally embedded into the video’s metadata (soft embedding) or burned into the video stream itself (hard embedding).
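
For illustration, here is a minimal sketch of that pipeline using FFmpeg and the openai-whisper Python API. The function names are hypothetical and this is not the tool’s actual code, just the shape of it:

import subprocess

import whisper  # pip install openai-whisper


def extract_audio(video_path: str, wav_path: str) -> None:
    # Strip the video stream (-vn) and resample to 16 kHz mono WAV,
    # the format Whisper works with internally.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )


def to_srt_timestamp(seconds: float) -> str:
    # SRT timestamps look like HH:MM:SS,mmm.
    ms = int(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def video_to_srt(video_path: str, srt_path: str, model_name: str = "medium") -> None:
    extract_audio(video_path, "audio.wav")
    model = whisper.load_model(model_name)
    # task="translate" asks Whisper for English output regardless of the spoken language.
    result = model.transcribe("audio.wav", task="translate")
    with open(srt_path, "w", encoding="utf-8") as srt:
        for index, segment in enumerate(result["segments"], start=1):
            srt.write(f"{index}\n")
            srt.write(f"{to_srt_timestamp(segment['start'])} --> {to_srt_timestamp(segment['end'])}\n")
            srt.write(f"{segment['text'].strip()}\n\n")


video_to_srt("./a-video-to-transcribe.mp4", "a-video-to-transcribe.srt")

The embedding step is then plain FFmpeg work: soft embedding muxes the .srt into the container as an extra subtitle stream without re-encoding the video, whereas hard embedding burns the text into the frames via FFmpeg’s subtitles filter.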

Technology stack

The tool is a CLI built around FFmpeg for audio extraction and OpenAI’s Whisper model, run locally, for translation and transcription; its output is a standard .srt subtitle file.

Performance

By default, the tool uses the 769M-parameter medium multilingual model, which requires around 5GB of VRAM. On my RTX 3080, it translated and transcribed a 27-minute video in 2 minutes and 11 seconds. The quality of the translation itself is good, but it does miss some of the subtlety of the manual translation. The large model requires around 10GB of VRAM, and unfortunately it hits an out-of-memory error on my hardware.

If you don’t need to translate before transcribing, you can use the 809M-parameter turbo model, which requires around 6GB of VRAM. On my hardware, it transcribed (but did not translate) the same 27-minute video in 36 seconds, which is fairly close to the roughly 4x speedup OpenAI claims over the medium model.

The tiny model actually took the same amount of time as the medium model to translate the same video, but with a noticeably worse translation. When I tested it on a 23-minute English video (i.e., no translation needed), the tiny model took 52 seconds and the medium model took 3 minutes, which is closer to, but still short of, the 5x relative speedup that OpenAI claims. The transcription the tiny model produced was better than its translation.

For transcription tasks, the turbo model should be the go-to thanks to its significant speed advantage over the medium model whilst still retaining impressive accuracy. For translation tasks, the medium model produced sufficient accuracy, but it may be worth exploring the large model if your hardware supports it.
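
To put that recommendation into code, model choice can hinge entirely on the task. The helper below is purely illustrative; the model names are the ones the openai-whisper package ships, not options this tool necessarily exposes:

def pick_model(translate: bool) -> str:
    # Hypothetical helper: turbo is transcription-only but much faster,
    # while medium (~5GB of VRAM) handles translation; swap in "large"
    # (~10GB of VRAM) if your GPU has the memory for it.
    return "medium" if translate else "turbo"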

What’s next

I would like to test the tool on weaker hardware, such as running it on the CPU. Additionally, I suspect I could get faster performance for transcription-only tasks by changing the task flag passed to the Whisper model; it is currently fixed to translation, as that still works well for transcription tasks and keeps the CLI options simple. Making the task configurable would also allow the tool to transcribe videos in their original language.
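
Assuming the tool drives the openai-whisper Python API, exposing that choice could be as small as passing the task through. This is a hypothetical sketch rather than the current implementation:

import whisper

def run_whisper(audio_path: str, translate: bool, model_name: str = "medium") -> dict:
    model = whisper.load_model(model_name)
    # task="translate" forces English output; task="transcribe" keeps the
    # original spoken language and may be faster for English-only input.
    task = "translate" if translate else "transcribe"
    return model.transcribe(audio_path, task=task)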


Please open an issue on the repository if you have any comments, questions, or problems.