The eventual goal is a pure-Julia Whisper inference implementation, but I got stuck before the first step.
Most ingredients of the spectrogram are simply a combination of a Hann (Hanning) window and an STFT.
In Julia, both can be found in DSP.jl (see "Periodograms - periodogram estimation" in the DSP.jl docs),
but where can I find the Mel filterbank matrix, i.e. the audio-specific frequency scaling that maps the linear FFT bins onto Mel bands?
To make a log-Mel spectrogram from a DSP spectrogram, I believe you need to transform the frequency axis according to the Mel-scale formula and display the amplitudes in dB.
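Here is a minimal sketch of that idea in plain Julia, in case it helps. It assumes the HTK-style Mel formula and unnormalized triangular filters; the names `hz_to_mel`, `mel_to_hz`, `mel_filterbank` and `log_mel` are my own and not from any package. As far as I know, Whisper's reference filters are generated with librosa (Slaney-style scale and normalization), so this will not reproduce them bit for bit.

```julia
# HTK-style Mel scale (an assumption; other variants such as Slaney's exist).
hz_to_mel(f) = 2595 * log10(1 + f / 700)
mel_to_hz(m) = 700 * (10^(m / 2595) - 1)

# Build an n_mels × (n_fft ÷ 2 + 1) matrix of triangular Mel filters.
# Each row is one filter; each column corresponds to one one-sided FFT bin.
function mel_filterbank(sr, n_fft, n_mels; fmin = 0.0, fmax = sr / 2)
    n_bins = n_fft ÷ 2 + 1
    fft_freqs = range(0, sr / 2; length = n_bins)   # bin centre frequencies in Hz
    # n_mels + 2 band edges, equally spaced on the Mel scale, converted back to Hz
    edges = mel_to_hz.(range(hz_to_mel(fmin), hz_to_mel(fmax); length = n_mels + 2))
    fb = zeros(n_mels, n_bins)
    for m in 1:n_mels
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        for (k, f) in enumerate(fft_freqs)
            rising  = (f - lo) / (center - lo)    # left slope of the triangle
            falling = (hi - f) / (hi - center)    # right slope of the triangle
            fb[m, k] = max(0.0, min(rising, falling))
        end
    end
    return fb
end

# power_spec is a (n_fft ÷ 2 + 1) × n_frames power spectrogram.
# Returns Mel-band power in dB, clamped below at amin to avoid log10(0).
log_mel(fb, power_spec; amin = 1e-10) = 10 .* log10.(max.(fb * power_spec, amin))
```

With Whisper-like settings (16 kHz audio, `n_fft = 400`, 80 Mel bands) you would call `fb = mel_filterbank(16000, 400, 80)` and then `log_mel(fb, P)`, where `P` is a one-sided power spectrogram, for example what `power(spectrogram(...))` from DSP.jl returns when computed with the same `n_fft`.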
Linking also this “related” thread.