Reading and processing multiple very large WAV files

New job, new problems! :)
I’m currently trying to figure out the best way to read a folder of very large WAV sound files and process them. Reading an entire file causes me to run out of memory. Thankfully, the WAV package lets me read smaller chunks at a time, so I can work around that. I was wondering, though: is there a well-thought-out way of doing this?

I need to:

  1. Read each file in a folder.
  2. Process each file (calculate spectrograms)
  3. Somehow downsample
  4. Save results

For now, I assume that each file can be processed independently, but it would be nice to have an approach that would allow for treating all files as one distributed file.

I have so far been considering mmap, but it seems to work only if I have already gotten all data into one file? Is there perhaps something like a distributed mmap?

It’s not exactly clear what you are doing here.

If the files really need to be processed separately, then you are doing the right thing: open them and work on them in chunks.

A spectrogram is a “chunked” FFT, so you have to do that anyway.
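
To make the chunking concrete, here is a minimal sketch of cutting a stream of chunks into overlapping frames, keeping the unconsumed tail between calls so that no frame is lost at a chunk boundary. `Framer` and `push_chunk!` are hypothetical names invented for this illustration, not part of any package. Each returned frame would then be windowed and passed to an FFT (for example `rfft` from FFTW.jl, which is not loaded here).

```julia
# Hypothetical helper: holds samples that have not yet been turned into frames.
mutable struct Framer
    n::Int                  # frame length in samples
    hop::Int                # step between consecutive frame starts
    buf::Vector{Float64}    # unconsumed samples carried over between chunks
end

function Framer(n, hop)
    @assert 1 <= hop <= n "this sketch assumes overlapping or adjacent frames"
    return Framer(n, hop, Float64[])
end

# Append a chunk and return every complete frame that is now available.
function push_chunk!(f::Framer, chunk::AbstractVector{<:Real})
    append!(f.buf, chunk)
    frames = Vector{Vector{Float64}}()
    start = 1
    while start + f.n - 1 <= length(f.buf)
        push!(frames, f.buf[start:start + f.n - 1])
        start += f.hop
    end
    deleteat!(f.buf, 1:start - 1)   # keep only the tail for the next chunk
    return frames
end

# Usage: feeding the signal in two uneven pieces gives the same frames
# as feeding it in one go.
x = collect(1.0:20.0)
f = Framer(8, 4)
frames = vcat(push_chunk!(f, x[1:7]), push_chunk!(f, x[8:20]))
```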

As for the downsampling, the Julia DSP library (DSP.jl) has filtering and resampling routines that preserve state, i.e. they can be used in a streaming fashion so that you can read a few samples at a time and process them.
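
A minimal sketch of what that can look like, assuming DSP.jl is installed. It uses `FIRFilter` with a rational ratio as a stateful decimator and calls `filt` once per chunk. The 4-point moving-average taps and the random input are stand-ins chosen for brevity, not a recommended anti-aliasing design.

```julia
using DSP  # assumes DSP.jl is installed

# Crude lowpass taps (a 4-point moving average), for illustration only.
# A real application would design a proper anti-aliasing filter.
h = fill(0.25, 4)

x = randn(10_000)                # stand-in for samples read from disk
decimator = FIRFilter(h, 1//4)   # ratio 1//4: one output sample per 4 inputs

out = Float64[]
for i in 1:1_000:length(x)
    chunk = x[i:min(i + 999, length(x))]
    # The FIRFilter object remembers its history, so consecutive calls
    # behave like one long filtering operation.
    append!(out, filt(decimator, chunk))
end
```

Because the state lives in `decimator`, the chunk size does not have to be a multiple of the decimation factor.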

Since your result will not fit in memory, you’ll have to stream the output to an open file.
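
One way to sketch this with only the standard library: append each processed block to a raw binary file, then memory-map the file later to look at parts of it. The raw Float32 layout, the file name, and the helper `write_blocks` are assumptions made for this example. A real pipeline would also have to record the element type and dimensions somewhere.

```julia
using Mmap

# Hypothetical helper: append each block (rows × some columns) to a raw file.
# Julia arrays are column-major, so writing blocks one after another is the
# same as appending columns to one big matrix.
function write_blocks(path, blocks)
    ncols = 0
    open(path, "w") do io
        for block in blocks
            write(io, block)
            ncols += size(block, 2)
        end
    end
    return ncols
end

# Stand-in for processed chunks: five blocks of 4 rows × 3 columns each.
blocks = (fill(Float32(k), 4, 3) for k in 1:5)
path = joinpath(mktempdir(), "result.bin")
ncols = write_blocks(path, blocks)

# Map the whole file as a matrix, but only touch the last column.
last_col = open(path) do io
    A = Mmap.mmap(io, Matrix{Float32}, (4, ncols))
    A[:, end]   # indexing copies just this column into memory
end
```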

It seems like you are taking the correct approach.

If the real problem is that all of those large files are really sections of a still larger data set, then it should be relatively simple to write something that queues up the data files and manages the chunks as they transition from one file to the next.
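
A rough sketch of that queueing idea, using raw Float32 files as a stand-in for WAV files so that it runs with the standard library alone. `foreach_chunk` is a hypothetical helper. With real audio, the `read` call would be replaced by whatever chunked reader the WAV package provides.

```julia
# Hypothetical helper: call f on fixed-size chunks drawn from a list of files,
# letting a chunk span the boundary between two files.
# Assumes each file holds a whole number of Float32 samples.
function foreach_chunk(f, paths, chunklen)
    buf = Float32[]   # samples carried across file boundaries
    for p in paths
        open(p) do io
            while !eof(io)
                need = chunklen - length(buf)
                bytes = read(io, need * sizeof(Float32))   # may return fewer bytes at end of file
                append!(buf, reinterpret(Float32, bytes))
                if length(buf) == chunklen
                    f(copy(buf))
                    empty!(buf)
                end
            end
        end
    end
    isempty(buf) || f(copy(buf))   # final, possibly shorter, chunk
    return nothing
end

# Usage: two files with 5 and 7 samples, read back in chunks of 4.
dir = mktempdir()
paths = [joinpath(dir, "a.bin"), joinpath(dir, "b.bin")]
write(paths[1], Float32.(1:5))
write(paths[2], Float32.(6:12))

chunks = Vector{Vector{Float32}}()
foreach_chunk(c -> push!(chunks, c), paths, 4)
```

The second chunk here contains the last sample of the first file followed by the first three samples of the second file, which is the transition case described above.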

Yeah, your summary pretty much agrees with what I am doing. I was mostly asking to see if there was a smooth method implemented somewhere, where I only need to specify a folder and say that I would like to treat all files within it as one large memory-mapped array.

The downsampling I’m doing is such that the result will fit in memory. If results would not fit, it seems HDF5 supports appending to already existing files, as well as serving as the backend for a memory-mapped array.

Oh, OK, I get what you are saying now. I definitely don’t know of any way to do that, in Julia or otherwise.

I really like your HDF5 idea! I haven’t tried using it yet, but I have some work that might benefit from that idea.

The relevant docs for HDF5.jl are slightly hard to spot; you can find them here:
https://github.com/JuliaIO/HDF5.jl/blob/master/doc/hdf5.md#memory-mapping

I ended up creating a package for lazy, distributed WAV files acting as AbstractArrays:
https://github.com/baggepinnen/LazyWAVFiles.jl

Thank you! That’s very useful. How do you recommend writing the files out after processing them in chunks?

LibSndFile.jl also supports reading and writing audio files in a streaming fashion. See loadstreaming in the example. There’s a corresponding savestreaming as well.

Thanks @ssfrr. I can’t seem to get the following code working. Do you have any advice on what I might be doing wrong? It fails on the first call to savestreaming.

using FileIO: load, save, loadstreaming, savestreaming
import LibSndFile

d = mktempdir()
a,b = randn(Float32,10000,4), randn(Float32,10000,4)
save(joinpath(d,"f1.wav"), a, Fs=8000)
save(joinpath(d,"f2.wav"), b, Fs=8000)

savestream = savestreaming(joinpath(d,"s1.wav"))
for wavfile in ["f1.wav", "f2.wav"]
    loadstreaming(joinpath(d,wavfile)) do audio
        while !eof(audio)
            chunk = read(audio, 100) # read 100 frames
            # process the chunk
            chunk -= .001
            write(savestream, chunk)
        end
    end
end
close(savestream)

I got it to work with the following code but the load function doesn’t return a SampleBuf. Is that correct?

using FileIO: load, save, loadstreaming, savestreaming
import LibSndFile

d = mktempdir()
a,b = randn(Float32,10000,4), randn(Float32,10000,4)
save(joinpath(d,"f1.wav"), a, Fs=8000)
save(joinpath(d,"f2.wav"), b, Fs=8000)

savestreaming(joinpath(d,"s1.wav"), 4, 8000, Float32) do dest
    for wavfile in ["f1.wav", "f2.wav"]
        loadstreaming(joinpath(d,wavfile)) do src
            while write(dest, float(read(src, 2048))) == 2048 end
        end
    end
end

s1 = load(joinpath(d,"s1.wav"))
s1[1] == vcat(a,b)

If you have WAV.jl installed, FileIO will default to that rather than LibSndFile. That might be what’s happening here.
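
If WAV.jl is indeed what ends up reading the file, that would explain why indexing with `[1]` is needed: `wavread` returns a tuple whose first element is the sample matrix, not a SampleBuf. A small sketch, assuming WAV.jl is installed (the file name and the random data are made up for the example):

```julia
using WAV  # assumes WAV.jl is installed

path = joinpath(mktempdir(), "demo.wav")
wavwrite(0.1f0 .* randn(Float32, 1000, 2), path; Fs=8000)

# wavread returns a tuple: samples, sample rate, bits per sample, extra chunks.
y, fs, nbits, opt = wavread(path)

# Query the size without loading the samples, then read only a range of frames.
nframes, nchannels = wavread(path; format="size")
chunk, _ = wavread(path; subrange=1:100)
```

The `subrange` keyword is also one way to do the chunked reading discussed at the top of the thread.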