New job, new problems!
I’m currently trying to figure out the best way to read a folder of very large WAV sound files and process them. Reading an entire file causes me to run out of memory. Thankfully, the WAV package allows me to read smaller chunks at a time, so I can get around it. I was wondering, though: is there a well-thought-out way of doing this?
I need to:
Read each file in a folder.
Process each file (calculate spectrograms).
Somehow downsample.
Save results.
For now, I assume that each file can be processed independently, but it would be nice to have an approach that would allow for treating all files as one distributed file.
I have so far been considering mmap, but it seems to work only if I have already gotten all data into one file? Is there perhaps something like a distributed mmap?
If the files really need to be processed separately, then you are doing the right thing: open them and work on them in chunks.
A spectrogram is a “chunked” FFT, so you have to do that anyway.
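To make the "chunked FFT" point concrete, here is a minimal sketch using `spectrogram` from DSP.jl on a single in-memory chunk. The random `chunk`, the 8000 Hz sample rate, and the window settings are placeholders picked for illustration, not values from this thread.

```julia
using DSP

fs = 8000                              # assumed sample rate in Hz
chunk = randn(Float32, 8192)           # stand-in for one chunk read from a WAV file

# Split the chunk into 256-sample windows with 128 samples of overlap,
# then take the FFT of each window.
spec = spectrogram(chunk, 256, 128; fs=fs)

P = power(spec)                        # matrix: frequency bins x time frames
f = freq(spec)                         # frequency axis in Hz
```

Windows that would straddle a chunk boundary are not computed here. One possible workaround is to prepend the tail of the previous chunk before calling `spectrogram`.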
As for the downsampling, the Julia DSP library has filtering and downsampling routines that preserve state, i.e. they can be used in a streaming fashion, so you can read a few samples at a time and process them.
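A rough sketch of what state-preserving filtering and downsampling could look like with DSP.jl's `DF2TFilter` and `FIRFilter` objects. The filter design (4th-order Butterworth, cutoff at 0.2 of Nyquist), the decimation factor of 4, and the 1000-sample chunk size are all arbitrary assumptions for illustration.

```julia
using DSP

x = randn(10_000)                       # stand-in for a long signal

# Stateful IIR filtering: DF2TFilter remembers its state between calls.
# The 4th-order Butterworth low-pass at 0.2 x Nyquist is an arbitrary choice.
coeffs = digitalfilter(Lowpass(0.2), Butterworth(4))
stateful = DF2TFilter(coeffs)
y_chunked = reduce(vcat, [filt(stateful, x[i:i+999]) for i in 1:1000:length(x)])
y_whole = filt(coeffs, x)               # same filter applied in one go, for comparison

# Stateful decimation by 4: a FIRFilter built with a rational rate also keeps state.
h = resample_filter(1//4)               # anti-aliasing FIR taps
decimator = FIRFilter(h, 1//4)
y_down = reduce(vcat, [filt(decimator, x[i:i+999]) for i in 1:1000:length(x)])
```

Because the filter objects carry their state, the chunked output of the IIR filter should match filtering the whole signal at once, which is what makes this usable on files that do not fit in memory.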
Since your result will not fit in memory, you’ll have to stream the output to an open file.
It seems like you are taking the correct approach.
If the real problem is that all of those large files are really sections of a still larger data set, then it should be relatively simple to write something that queues up the data files and manages the chunks as they transition from one file to the next.
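The queueing idea could be sketched roughly as follows, using only Base Julia. Note the assumptions: `each_chunk` is a hypothetical helper (not part of any package), and the files are raw, headerless Float32 samples standing in for real WAV files, so the WAV header handling is left out.

```julia
# Hypothetical helper: call f on fixed-size chunks drawn from several files in
# order, as if they were one long stream. Assumes raw, headerless Float32 files.
function each_chunk(f, paths, chunklen::Integer)
    buf = Vector{Float32}(undef, chunklen)
    filled = 0                                  # samples currently held in buf
    for path in paths
        nsamples = filesize(path) ÷ sizeof(Float32)
        open(path, "r") do io
            nread = 0                           # samples consumed from this file
            while nread < nsamples
                # Take as much as fits in the buffer or remains in the file.
                n = min(chunklen - filled, nsamples - nread)
                tmp = read!(io, Vector{Float32}(undef, n))
                copyto!(buf, filled + 1, tmp, 1, n)
                filled += n
                nread += n
                if filled == chunklen
                    f(copy(buf))                # full chunk, may span two files
                    filled = 0
                end
            end
        end
    end
    filled > 0 && f(buf[1:filled])              # final partial chunk, if any
    return nothing
end

# Usage: two small files of 5 and 7 samples, read back in chunks of 4.
dir = mktempdir()
a, b = rand(Float32, 5), rand(Float32, 7)
write(joinpath(dir, "part1.bin"), a)
write(joinpath(dir, "part2.bin"), b)

chunks = Vector{Vector{Float32}}()
each_chunk([joinpath(dir, "part1.bin"), joinpath(dir, "part2.bin")], 4) do chunk
    push!(chunks, chunk)
end
```

The second chunk here contains the last sample of the first file followed by the first three of the second, so downstream processing never sees the file boundary.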
Yeah, your summary pretty much agrees with what I am doing. I was mostly asking to see if there was a smooth method implemented somewhere, where I only need to specify a folder and say that I would like to treat all files within it as one large memory-mapped array.
The downsampling I’m doing is such that the result will fit in memory. If the results did not fit, it seems HDF5 supports appending to existing files, as well as serving as the backend for a memory-mapped array.
LibSndFile.jl also supports reading and writing audio files in a streaming fashion. See loadstreaming in the example. There’s a corresponding savestreaming as well.
Thanks @ssfrr. I can’t seem to get the following code working. Do you have any advice on what I might be doing wrong? It fails on the first call to savestreaming.
using FileIO: load, save, loadstreaming, savestreaming
import LibSndFile
d = mktempdir()
a,b = randn(Float32,10000,4), randn(Float32,10000,4)
save(joinpath(d,"f1.wav"), a, Fs=8000)
save(joinpath(d,"f2.wav"), b, Fs=8000)
savestream = savestreaming(joinpath(d,"s1.wav"))
for wavfile in ["f1.wav", "f2.wav"]
    loadstreaming(joinpath(d,wavfile)) do audio
        while !eof(audio)
            chunk = read(audio, 100) # read 100 frames
            # process the chunk
            chunk -= .001
            write(savestream, chunk)
        end
    end
end
close(savestream)
I got it to work with the following code, but the load function doesn’t return a SampleBuf. Is that correct?
using FileIO: load, save, loadstreaming, savestreaming
import LibSndFile
d = mktempdir()
a,b = randn(Float32,10000,4), randn(Float32,10000,4)
save(joinpath(d,"f1.wav"), a, Fs=8000)
save(joinpath(d,"f2.wav"), b, Fs=8000)
savestreaming(joinpath(d,"s1.wav"), 4, 8000, Float32) do dest
    for wavfile in ["f1.wav", "f2.wav"]
        loadstreaming(joinpath(d,wavfile)) do src
            while write(dest, float(read(src, 2048))) == 2048 end
        end
    end
end
s1 = load(joinpath(d,"s1.wav"))
s1[1] == vcat(a,b)