I’m very pleased to announce the availability of Whisper.jl, a Julia package for speech recognition. It uses the Whisper model developed by OpenAI and runs inference on the CPU using Georgi Gerganov’s whisper.cpp. That C library is provided as a JLL package, and the model weights are downloaded on demand.
Currently, the package exposes a transcribe function that takes raw audio data and produces a text transcription. Suggestions or contributions for other kinds of interfaces will be gratefully received. The low-level whisper.cpp functions are, of course, already available, but a high-level Julia streaming interface still needs to be added, hopefully soon.
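To make "raw audio data" a bit more concrete: whisper.cpp works on 16 kHz mono Float32 samples, so audio in another layout has to be converted first. The following is only a minimal sketch of that step. The helper names (to_mono, resample_linear, prepare_audio) are made up for illustration and are not part of Whisper.jl, and the linear interpolation is a stand-in for a proper resampler.

```julia
# Hypothetical helpers, not part of Whisper.jl: convert audio into the
# 16 kHz mono Float32 layout that whisper.cpp works with.
# Assumes the input samples are already floating point in [-1, 1].
const TARGET_RATE = 16_000

# Average the channels of a samples-by-channels matrix into one mono track.
to_mono(x::AbstractMatrix) = vec(sum(Float32.(x), dims=2) ./ size(x, 2))
to_mono(x::AbstractVector) = Float32.(x)

# Naive linear-interpolation resampler. Good enough for a quick test; a
# resampler with proper low-pass filtering is preferable for real recordings.
function resample_linear(x::AbstractVector{Float32}, from_rate::Real,
                         to_rate::Real=TARGET_RATE)
    isempty(x) && return Float32[]
    from_rate == to_rate && return copy(x)
    n_out = max(1, round(Int, length(x) * to_rate / from_rate))
    out = Vector{Float32}(undef, n_out)
    for i in 1:n_out
        # Position of output sample i on the input time axis (1-based).
        pos = 1 + (i - 1) * from_rate / to_rate
        lo = clamp(floor(Int, pos), 1, length(x))
        hi = min(lo + 1, length(x))
        frac = Float32(pos - lo)
        out[i] = (1 - frac) * x[lo] + frac * x[hi]
    end
    return out
end

# Downmix, then resample to the target rate.
prepare_audio(x, rate) = resample_linear(to_mono(x), rate)

# Example: one second of 48 kHz stereo noise becomes 16 000 mono samples.
stereo = rand(Float32, 48_000, 2) .* 2 .- 1
data = prepare_audio(stereo, 48_000)
```

The resulting vector is the kind of input transcribe expects; please check the README for the exact arguments, since I am not spelling out the call here.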
The package is awaiting registration in the General Registry. Please try it out and let me know how it goes.
There are five model sizes (tiny, base, small, medium, and large); the four smaller ones also have English-only versions.
Which languages and models do you support (the large-v2 model?), and does that include, e.g., my native Icelandic? I see the best word error rate (WER) is 3.2% for Spanish, followed by Italian, then English at 4.2%; Icelandic is far down the list at 38.2%, and Nepali is last.
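For anyone unfamiliar with the metric: WER is the word-level edit distance between the reference transcript and the model output (substitutions, deletions, and insertions), divided by the number of words in the reference. A minimal sketch, assuming plain whitespace tokenisation and lowercasing (published benchmarks usually apply more careful text normalisation, so numbers from this function will not match them exactly); the function name wer is my own:

```julia
# Word error rate: word-level Levenshtein distance divided by the number
# of words in the reference. Tokenisation here is a simple whitespace split.
function wer(reference::AbstractString, hypothesis::AbstractString)
    ref = split(lowercase(reference))
    hyp = split(lowercase(hypothesis))
    isempty(ref) && throw(ArgumentError("reference must contain at least one word"))
    # prev[j + 1] is the distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = collect(0:length(hyp))
    for i in eachindex(ref)
        curr = similar(prev)
        curr[1] = i
        for j in eachindex(hyp)
            cost = ref[i] == hyp[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev = curr
    end
    return prev[end] / length(ref)
end

# One substituted word out of six reference words.
score = wer("the cat sat on the mat", "the cat sat on a mat")
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, since the denominator is the reference length only.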
However, the state of the art is now:
It makes “43% fewer errors on noisy data on average” and reaches 3.3% WER (presumably for English on non-noisy data; “human transcriptionists” get 4%, and Conformer-1 gets 9.9% on noisy data):
Very cool! Any plans to bring the model inference code into Julia natively? Reading through the whisper.cpp code, it’s mainly just two C++ source files, and most of it is doing stuff that would be easier and cleaner in Julia (e.g. there are currently separate functions for different input data types, which could be replaced by multiple dispatch).
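To illustrate the multiple dispatch point, here is a rough sketch of how one generic Julia function can cover what C handles with one function per element type. The name vecdot is made up for this example and is not taken from whisper.cpp or Whisper.jl; it only shows the pattern, not a port of any actual kernel.

```julia
using LinearAlgebra

# Hypothetical example, not a whisper.cpp or Whisper.jl function.
# Generic method: works for any floating-point storage type (Float16,
# Float32, mixed, ...) and always accumulates in Float32.
function vecdot(x::AbstractVector{<:AbstractFloat}, y::AbstractVector{<:AbstractFloat})
    length(x) == length(y) || throw(DimensionMismatch("vectors must have equal length"))
    acc = 0.0f0
    @inbounds for i in eachindex(x, y)
        acc += Float32(x[i]) * Float32(y[i])
    end
    return acc
end

# Specialised method: dispatch picks this one automatically when both
# arguments are Float32 vectors, so it can hand off to the BLAS-backed dot.
vecdot(x::Vector{Float32}, y::Vector{Float32}) = dot(x, y)

half = Float16[1, 2, 3]
single = Float32[4, 5, 6]
mixed_result = vecdot(half, single)     # uses the generic method
fast_result = vecdot(single, single)    # uses the Float32 specialisation
```

The caller writes vecdot(a, b) in every case, and the compiler selects the method from the argument types, so there is no need for a family of type-suffixed function names.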
That would indeed be very cool, and I think it is quite feasible. However, that is currently beyond the limits of my time and skills. I hope someone does this, and I’ll be happy to retire the current ccall-based codebase.
Just gave it a try and it works quite well.
I was surprised that the result was automatically translated into English from a German reading. So I tried to figure out how to change some parameters to get German text instead, but failed to find a solution. It’s not so easy to work out just by looking at the C interface. Some hints or explanations on this would be helpful. Nothing too deep, just enough to find a starting point.