Best practices for Speech-to-Text conversion?

I’m working on a project where I plan to convert speech into text to perform very basic NLP tasks.

At first I searched for a Julia package for speech-to-text conversion, but that didn’t turn up anything.

Next I considered Google’s Cloud Speech-to-Text API, but I didn’t like that it requires a constant internet connection (the program I’m making should retain at least minimal functionality even when offline). Then I looked at DeepSpeech, which would require some non-Julia tooling to set up.
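For what it’s worth, on the offline route: DeepSpeech’s pretrained English models expect 16 kHz, 16-bit, mono PCM audio. Below is a minimal Python sketch (the repo’s own tooling is Python) that checks a WAV file against that format using only the standard library; the file name `demo.wav` and the helper `check_wav_format` are just illustrative, and the actual `deepspeech` call is left in a comment since it needs the non-stdlib `deepspeech` package plus a downloaded model file.

```python
import math
import struct
import wave

def check_wav_format(path, rate=16000, width=2, channels=1):
    """Return True if the WAV at `path` matches the PCM format
    DeepSpeech's pretrained English models expect (16 kHz, 16-bit, mono)."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getsampwidth() == width
                and w.getnchannels() == channels)

# Self-contained demo: synthesize one second of a 440 Hz tone, 16 kHz mono.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    samples = (int(10000 * math.sin(2 * math.pi * 440 * t / 16000))
               for t in range(16000))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

print(check_wav_format("demo.wav"))  # True for the file we just wrote

# With the (non-stdlib) deepspeech package and a downloaded model, the
# transcription step itself would look roughly like:
#   import numpy as np, deepspeech
#   model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
#   with wave.open("demo.wav", "rb") as w:
#       audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
#   print(model.stt(audio))
```

From Julia you could drive the same Python code through PyCall, so the non-Julia part stays limited to installing the package and model once.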

And just a while ago I noticed that most operating systems now have built-in dictation. Would it be a good idea to record audio and pass it to the OS’s dictation software?

Which of these approaches have you used before or which would you recommend? Do you have any other ideas?


https://github.com/buriburisuri/speech-to-text-wavenet has a Python implementation of this using TensorFlow. You might be able to port it to Julia fairly easily and use one of their pretrained nets.


While I have considered training a model myself, current circumstances mean I need to avoid hardware-intensive training. I’ll still look into it, though, thank you.

Part of my suggestion was that the repository includes pretrained nets; if you’re willing to put in a bit of work to load them, you won’t have to retrain a network.

Sorry, that last part seems to have slipped my mind :sweat_smile:

WaveNet inference can still be pretty slow even at runtime, but I might consider it if I don’t have any other options.

I did happen to find a WaveNet implementation in Flux, I think? I could work from there, but again I’m concerned that my poor potato of a machine won’t be able to run it at all.