AI: STT, TTS and PromptingTools

Hello,

I’m passionate about learning languages and experiment every now and then with PromptingTools to create language resources. It’s an impressive package, though it currently lacks audio support.

For Python, I’ve come across great packages like KoljaB/RealtimeSTT (a robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation, and instant transcription) and KoljaB/RealtimeTTS (converts text to speech in real time). However, I prefer working within the Julia ecosystem.

How would you approach speech-to-text (STT) and text-to-speech (TTS) in Julia? I’m more of an AI user than someone who trains models from scratch.

Another important aspect is streaming the audio; think of talking to an AI tutor.

Could you provide guidance or recommend approaches for implementing STT and TTS with audio streaming in Julia?

Thank you!


Hi there,

PromptingTools author here. I’ve always rolled my own solutions by simply calling the APIs (e.g., Whisper) directly. The hard part is usually getting access to the audio stream, so I’ve been relying on the browser to capture the audio for me (e.g., when building AI apps with Stipple.jl) and then simply passing the audio file to my backend, which calls the OpenAI API.

Nowadays, it’s really easy to get started with the new OpenAI GPT-4o APIs: see the curl commands here and translate them into HTTP.jl calls instead.
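To make that concrete, here is a minimal sketch of such a direct call: transcribing an audio file against OpenAI’s /v1/audio/transcriptions endpoint with HTTP.jl. The file name and the whisper-1 model are just example values, and error handling is left out.

```julia
# Minimal sketch: speech-to-text via OpenAI's transcription endpoint using HTTP.jl.
# Assumes OPENAI_API_KEY is set and "recording.mp3" exists; no error handling.
using HTTP, JSON3

function transcribe(path::AbstractString; model::AbstractString = "whisper-1")
    form = HTTP.Form(Dict(
        "file"  => HTTP.Multipart(basename(path), open(path), "audio/mpeg"),
        "model" => model,
    ))
    resp = HTTP.post("https://api.openai.com/v1/audio/transcriptions",
        ["Authorization" => "Bearer $(ENV["OPENAI_API_KEY"])"],
        form)  # HTTP.jl sets the multipart Content-Type (with boundary) for a Form body
    return JSON3.read(resp.body).text
end

println(transcribe("recording.mp3"))
```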

The goal is to make PromptingTools work with all modalities by having unified multipart/multimodal message types (all providers and frameworks seem to be converging on that), but that might take a while, since I have very little time for it these days.

I’d recommend starting with direct calls. The AI parts tend to be easier than dealing with the hardware.
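For the TTS direction, a similar sketch works against the /v1/audio/speech endpoint; the tts-1 model and alloy voice below are just example values, and the response body is the audio file itself.

```julia
# Minimal sketch: text-to-speech via OpenAI's speech endpoint using HTTP.jl.
# Model and voice are example values; the response body is MP3 audio by default.
using HTTP, JSON3

function speak(text::AbstractString; voice = "alloy", outfile = "speech.mp3")
    resp = HTTP.post("https://api.openai.com/v1/audio/speech",
        ["Authorization" => "Bearer $(ENV["OPENAI_API_KEY"])",
         "Content-Type"  => "application/json"],
        JSON3.write((model = "tts-1", input = text, voice = voice)))
    write(outfile, resp.body)   # save the returned audio bytes to disk
    return outfile
end

speak("Hello! Let's practice some Spanish today.")
```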

Hope it helps!


Thanks for your reply!

I’m also using GenieFramework, and through the browser I can access the microphone.

I’ve reviewed most of OpenAI’s API documentation, and I think it’s feasible to use. However, there’s an issue with pricing (and privacy): if you need a lot of STT and TTS, it can get quite expensive. So running Whisper locally, getting the translation via the OpenAI API (or Ollama), and then re-synthesizing the audio locally seems to be a more cost-effective solution.
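For the translation step in the middle, here is a rough sketch of both options with PromptingTools; the model names are placeholders for whatever is available, and the {{text}} substitution is the usual PromptingTools templating.

```julia
# Rough sketch of the translation step only, via PromptingTools.
# Model names are placeholders; pick whatever you have access to / have pulled.
using PromptingTools
const PT = PromptingTools

text = "Guten Morgen, wie geht es dir?"

# Option 1: OpenAI (requires OPENAI_API_KEY)
msg = aigenerate("Translate the following text to English:\n\n{{text}}";
    text = text, model = "gpt-4o-mini")
println(msg.content)

# Option 2: a local Ollama server (requires `ollama serve` and a pulled model)
msg = aigenerate(PT.OllamaSchema(), "Translate the following text to English:\n\n{{text}}";
    text = text, model = "llama3.1")
println(msg.content)
```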

With PromptingTools, I’ve created an automated cookbook generator from a menu card, as well as an automated language course generator. For now, these are just fun projects. However, if I want to make the lessons more interactive (AI tutor), adding audio would be amazing. I’d also like to adapt language courses to certain dialects later.

Real-time (low-latency) translation could also be great for conferences.

@lazarusA did something using PortAudio.jl (I think) to drive transcription and a cool visual a while ago…
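For anyone who wants to try that route, a minimal sketch of capturing a few seconds from the default microphone with PortAudio.jl, following the recording pattern from its README; the WAV step, the duration, and the file names are just for illustration.

```julia
# Minimal sketch: record ~5 seconds from the default input device with PortAudio.jl
# and dump it to a WAV file that could then be sent off for transcription.
# Mirrors the PortAudio.jl README recording example; no device selection or streaming.
using PortAudio, SampledSignals, WAV

stream = PortAudioStream(1, 0)   # 1 input channel, 0 output channels
buf = read(stream, 5s)           # `s` (seconds) comes from SampledSignals
close(stream)

wavwrite(buf.data, "clip.wav"; Fs = round(Int, samplerate(buf)))
```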


Indeed, I started Evie.jl about a year ago, but I probably only worked on it for a week, so it was just a fun project for the holidays. And I want to have my own offline AI (we should go offline and mobile :smiley: ). I should probably try again.

And well, I just went and did some updates, and it still works :partying_face: (at least the listening and transcribing). Responses should work as well; I will need to check later, since I need to find the LLM model again :sweat_smile: .

A year ago, models still had a lot of parameters, so please suggest:

  • new (smaller) LLM models.
  • a text-to-speech model, please [this is still missing, but I want it :smiley: ]. Suggestions?

Will update here later with a demo!


In Home Assistant I have chosen Piper for speech synthesis. Maybe take a look here: rhasspy/piper (a fast, local neural text-to-speech system).
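Piper is also easy to drive from Julia by shelling out to its CLI; a minimal sketch, assuming the piper binary is on PATH and a voice model has been downloaded (the model file name below is a placeholder):

```julia
# Minimal sketch: local TTS by shelling out to the `piper` CLI from Julia.
# Assumes the piper binary is on PATH and a voice model has been downloaded;
# the model name is a placeholder. Piper reads text on stdin and writes a WAV file.
text  = "Willkommen zur heutigen Lektion!"
model = "de_DE-thorsten-medium.onnx"   # placeholder voice model

run(pipeline(`piper --model $model --output_file lesson.wav`; stdin = IOBuffer(text * "\n")))
```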

Short feedback on what I plan to do, since some might be interested. First, @svilupp suggested going with the API endpoints, which is better than writing my own wrappers. After a long search, I found speaches-ai/speaches, a project that follows OpenAI’s API specification and aims to become the Ollama of speech-to-text and text-to-speech. The Docker image already includes important models like faster-whisper, piper, and Kokoro. I tried it, and it works. I haven’t tried their real-time API yet.

So basically, I can use OpenAI’s API, and when I need to run it locally, I can use Ollama + Speaches. Maybe it could help you too, @lazarusA.
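Since speaches exposes an OpenAI-compatible endpoint, the same HTTP.jl call from above should only need a different base URL. A minimal sketch; the port and model id below are assumptions, so check the speaches docs for the real defaults.

```julia
# Minimal sketch: pointing the OpenAI-style transcription request at a local
# speaches server. Port, path, and model id are assumptions; local servers
# typically ignore the API key.
using HTTP, JSON3

base = "http://localhost:8000/v1"
form = HTTP.Form(Dict(
    "file"  => HTTP.Multipart("recording.wav", open("recording.wav"), "audio/wav"),
    "model" => "Systran/faster-whisper-small",   # assumed model id
))
resp = HTTP.post("$base/audio/transcriptions",
    ["Authorization" => "Bearer local"],
    form)
println(JSON3.read(resp.body).text)
```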