Does anyone have (or work on) an implementation of modern automatic speech recognition tools like conformers or wav2vec 2.0 in one of Julia’s deep learning frameworks (Flux/Knet/Avalon)? Ideally pretrained on LibriSpeech. I couldn’t find anything here or on GitHub.
I couldn’t find anything in Julia either. Why do you need it implemented (fully) in Julia? Wouldn’t it be good enough to call, say, PyTorch from Julia and use whatever is available (Avalon.jl might be helpful there)? I’m not up to speed on moving models from Python to Julia, i.e. just the parameters (weights and biases). Shouldn’t that be possible, and wasn’t there even a standard for it, ONNX? It might only work for certain types of networks, though; e.g. I believe it’s an older standard than Transformers, so those might be excluded?
What I did find, however, is this brand-new paper from 31 March 2022:
Comprehensive experiments on the LibriSpeech corpus show that the proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training, and also outperforms significantly the state-of-the-art wav2vec 2.0 and HuBERT on finetuning subsets of 10h and 100h
“Wav2Vec 2.0” was state-of-the-art in 2020, according to its 2020 paper. Is that still the case? This other paper from Feb 2022 says so (or is it just an evaluation/survey paper, and those tend to repeat claims?):
If someone DOES want to reimplement something in Julia, I at least would want them to find the state-of-the-art and use that…
Might be a helpful thread:
SincNet was also intriguing when I noticed it (it might be outdated, or not; I hadn’t heard of SpeechBrain before):
SincNet is implemented in the SpeechBrain (https://speechbrain.github.io/) project as well.
sinc (and sin) looked intriguing for periodic functions, but may actually be outdated. SIREN is, if I recall correctly, newer and better, and there is something even more recent that is better still (though the applications I saw were in computer vision).
I hadn’t heard of conformers (thanks for the tip), only of transformers, which they are a variant of, but they might also be too old:
Thanks!
I was thinking about backpropagating through $h(\mathrm{asr}(g_\theta(x)))$, where $h$ and $g_\theta$ are functions written in Julia, $\mathrm{asr}$ is some automatic speech recognition network, and $\theta$ are the parameters to be adjusted. But you’re right, maybe I should just pass the gradient of $h$ to PyTorch, compute the PyTorch gradient of $\mathrm{asr}$ with respect to its input, and pass the result to $g_\theta$. Having everything in Julia just feels a bit easier.
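
For what it’s worth, here is a minimal sketch of that gradient hand-off, written in plain Python so it runs without any framework. Everything in it is a made-up stand-in: `g` is an elementwise scaling, `asr` is a tiny fixed tanh layer (not a real speech model), and `h` is a sum of squares. The point is only the order in which the pieces are chained: the gradient of `h` goes into the vector-Jacobian product of `asr`, and that result goes into the vector-Jacobian product of `g`. In a real setup the `asr_vjp` step would presumably be a call along the lines of `torch.autograd.grad(y, z, grad_outputs=v_y)` on the PyTorch side, while `g` and `h` stay in Julia.

```python
import math

# Fixed weights of the toy stand-in for the ASR network (hypothetical values).
W = [[0.5, -0.3, 0.8],
     [0.1, 0.7, -0.2]]

def g(theta, x):
    """Stand-in for the Julia-side g_theta: elementwise scaling of the input."""
    return [t * xi for t, xi in zip(theta, x)]

def g_vjp(x, v):
    """Pull v = dL/dz back to dL/dtheta; here dz_i/dtheta_i = x_i."""
    return [vi * xi for vi, xi in zip(v, x)]

def asr(z):
    """Toy stand-in for the network: y = tanh(W z)."""
    return [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in W]

def asr_vjp(z, v):
    """Pull v = dL/dy back to dL/dz (the step PyTorch would perform)."""
    y = asr(z)
    scaled = [vj * (1.0 - yj * yj) for vj, yj in zip(v, y)]  # tanh' = 1 - tanh^2
    return [sum(scaled[j] * W[j][i] for j in range(len(W)))
            for i in range(len(z))]

def h(y):
    """Stand-in for the Julia-side scalar loss: sum of squares."""
    return sum(yj * yj for yj in y)

def h_grad(y):
    return [2.0 * yj for yj in y]

def loss(theta, x):
    return h(asr(g(theta, x)))

def loss_grad(theta, x):
    # Forward pass, keeping the intermediates needed for the backward pass.
    z = g(theta, x)
    y = asr(z)
    # Backward pass: gradient of h -> through asr -> through g.
    v_y = h_grad(y)
    v_z = asr_vjp(z, v_y)
    return g_vjp(x, v_z)

def finite_difference(theta, x, eps=1e-6):
    """Central differences, used only to check the chained gradient."""
    grads = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x) - loss(down, x)) / (2 * eps))
    return grads

theta = [0.4, -1.2, 0.9]
x = [1.0, 2.0, -0.5]
analytic = loss_grad(theta, x)
numeric = finite_difference(theta, x)
```

Comparing `analytic` with `numeric` is a cheap sanity check that the hand-off between the two sides is wired up correctly, and the same kind of check should carry over when the middle step is replaced by a real PyTorch model.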