Reading and processing multiple very large Wav files

baggepinnen · August 5, 2019, 12:38am

New job, new problems!
I’m currently trying to figure out the best way to read a folder of very large WAV sound files and process them. Reading an entire file causes me to run out of memory. Thankfully, the WAV package allows me to read smaller chunks at a time, so I can get around it. I was wondering, though: is there a well-thought-out way of doing this?
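
For readers following along, chunked reading with WAV.jl could look roughly like the sketch below. It assumes WAV.jl's `wavread` keywords `format="size"` and `subrange`; the helper name `process_in_chunks`, the chunk size, and the small demo file are made up for illustration.

```julia
using WAV

# Call `f(chunk, fs)` on successive blocks of at most `chunksize` frames.
# `process_in_chunks` is a hypothetical helper name, not part of WAV.jl.
function process_in_chunks(f, path; chunksize=4096)
    # Ask only for the dimensions, so the sample data is not loaded here.
    nsamples, nchannels = wavread(path, format="size")
    for start in 1:chunksize:nsamples
        stop = min(start + chunksize - 1, nsamples)
        chunk, fs = wavread(path, subrange=start:stop)  # frames x channels matrix
        f(chunk, fs)
    end
    return nsamples
end

# Small stand-in file so the sketch runs on its own.
path = joinpath(mktempdir(), "demo.wav")
wavwrite(0.1 .* randn(10_000, 2), path, Fs=8000)

sizes = Int[]
process_in_chunks(path) do chunk, fs
    push!(sizes, size(chunk, 1))   # replace with the real per-chunk processing
end
```

Each call to `wavread` reopens the file, which keeps the sketch simple; a streaming reader avoids that overhead.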

I need to:

Read each file in a folder.
Process each file (calculate spectrograms)
Somehow downsample
Save results

For now, I assume that each file can be processed independently, but it would be nice to have an approach that would allow for treating all files as one distributed file.

I have so far been considering mmap, but it seems to work only if I have already gotten all data into one file? Is there perhaps something like a distributed mmap?

purplishrock · August 6, 2019, 5:49am

it’s not exactly clear what you are doing here.

if the files really need to be processed separately, then you are doing the right thing. open them and work on them in chunks.

a spectrogram is a “chunked” FFT, so you have to do that.
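
As a rough illustration of the "chunked FFT" idea, here is a minimal sketch assuming DSP.jl is installed. The window length (256), the overlap (128), and the random stand-in signal are arbitrary choices, not values from this thread.

```julia
using DSP

fs = 8000
x = randn(fs)                         # one second of stand-in audio

# Split x into overlapping windows of 256 samples and take an FFT of each.
S = spectrogram(x, 256, 128; fs=fs)

P = power(S)    # matrix: one row per frequency bin, one column per window
f = freq(S)     # frequency of each row, in Hz
t = time(S)     # centre time of each column, in seconds
```

Because each window only needs its own samples plus the overlap, the same call can be applied to one block of a large file at a time.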

as for the downsample, the Julia DSP library has filtering and downsampling that preserve state, i.e. they can be used in a streaming fashion so that you can read a few samples at a time and process them.
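
A minimal sketch of state-preserving downsampling, assuming DSP.jl's stateful `FIRFilter` object and `resample_filter`. The ratio of 1//4, the block size of 1000, and the random input are illustrative assumptions.

```julia
using DSP

ratio = 1//4                       # one output sample per four input samples
h = resample_filter(ratio)         # anti-aliasing FIR taps for that ratio
decimator = FIRFilter(h, ratio)    # stateful: keeps its history between calls

x = randn(8000)                    # stand-in for samples read from a file
y = Float64[]
for block in Iterators.partition(x, 1000)
    # Each call continues where the previous block left off.
    append!(y, filt(decimator, collect(block)))
end
```

The filter object carries the tail of each block into the next call, so the blocks do not need to overlap.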

since your result will not fit in memory you’ll have to stream the output to an open file.

it seems like you are taking the correct approach.

if the real problem is that all of those large files are really sections of a still larger data-set then it should be relatively simple to write something that queues up the data files and manages the chunks as they transition from one file to the next.
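
One way to picture that queueing is the plain-Julia sketch below, with no audio packages involved. The name `chunks_across` is hypothetical, and each "file" is just an iterator of small blocks standing in for a real reader.

```julia
# Yield fixed-size chunks from a sequence of sources, carrying leftover
# samples across source boundaries. Each source is any iterable of blocks
# (here a stand-in for reading a few samples at a time from one file).
function chunks_across(sources, chunksize::Integer)
    Channel{Vector{Float64}}(0) do ch
        buffer = Float64[]
        for src in sources, block in src
            append!(buffer, block)
            while length(buffer) >= chunksize
                put!(ch, buffer[1:chunksize])
                deleteat!(buffer, 1:chunksize)
            end
        end
        isempty(buffer) || put!(ch, buffer)   # final, possibly shorter chunk
    end
end

# Two "files" of 5 and 7 samples, each delivered 3 samples at a time.
files = [collect(1.0:5.0), collect(6.0:12.0)]
readers = [Iterators.partition(file, 3) for file in files]

chunks = collect(chunks_across(readers, 4))
```

Here the second chunk straddles the boundary between the two sources, which is the case the buffer exists to handle.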

baggepinnen · August 6, 2019, 5:59am

Yeah, your summary pretty much agrees with what I am doing. I was mostly asking to see if there was a smooth method implemented somewhere where I only need to specify a folder and say that I would like to treat all files within it as one large memory-mapped array.

The downsampling I’m doing is such that the result will fit in memory. If results would not fit, it seems HDF5 supports appending to already existing files, as well as serving as the backend for a memory-mapped array.

purplishrock · August 6, 2019, 6:10am

oh, ok, I get what you are saying now. I definitely don’t know of any way to do that, in julia or otherwise.

I really like your HDF5 idea! I haven’t tried using it yet, but I have some work that might benefit from that idea.

baggepinnen · August 6, 2019, 6:21am

The relevant docs for HDF5.jl are slightly hard to spot. You can find them here:
https://github.com/JuliaIO/HDF5.jl/blob/master/doc/hdf5.md#memory-mapping

baggepinnen · August 8, 2019, 4:07am

I ended up creating a package for lazy, distributed wav files acting as AbstractArrays
https://github.com/baggepinnen/LazyWAVFiles.jl
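
To give a feel for the general idea of several arrays acting as one `AbstractArray`, here is a minimal plain-Julia sketch. It is not the implementation used in LazyWAVFiles.jl, and the type name `ChainedVector` is made up for illustration.

```julia
# A read-only vector that presents several arrays as one, without copying them.
struct ChainedVector{T,A<:AbstractVector{T}} <: AbstractVector{T}
    parts::Vector{A}
    offsets::Vector{Int}   # offsets[k] = number of elements before parts[k]
end

function ChainedVector(parts::Vector{A}) where {T,A<:AbstractVector{T}}
    offsets = cumsum([0; length.(parts)])
    return ChainedVector{T,A}(parts, offsets)
end

Base.size(v::ChainedVector) = (v.offsets[end],)

function Base.getindex(v::ChainedVector, i::Int)
    @boundscheck checkbounds(v, i)
    k = searchsortedlast(v.offsets, i - 1)   # which part holds global index i
    return v.parts[k][i - v.offsets[k]]
end

v = ChainedVector([collect(1.0:5.0), collect(6.0:12.0)])
window = v[4:7]   # a range that crosses the boundary between the two parts
```

The parts here are ordinary in-memory vectors; the same wrapper would work over any `AbstractVector`, such as arrays that load their data lazily.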

abhayap · February 17, 2022, 9:03pm

Thank you! That’s very useful. How do you recommend writing the files out after processing them in chunks?

ssfrr · February 18, 2022, 12:03am

LibSndFile.jl also supports reading and writing audio files in a streaming fashion. See loadstreaming in the example. There’s a corresponding savestreaming as well.

abhayap · February 21, 2022, 4:51am

Thanks @ssfrr. I can’t seem to get the following code working. Do you have any advice on what I might be doing wrong? It fails on the first call to savestreaming.

using FileIO: load, save, loadstreaming, savestreaming
import LibSndFile

d = mktempdir()
a,b = randn(Float32,10000,4), randn(Float32,10000,4)
save(joinpath(d,"f1.wav"), a, Fs=8000)
save(joinpath(d,"f2.wav"), b, Fs=8000)

savestream = savestreaming(joinpath(d,"s1.wav"))
for wavfile in ["f1.wav", "f2.wav"]
    loadstreaming(joinpath(d,wavfile)) do audio
        while !eof(audio)
            chunk = read(audio, 100) # read 100 frames
            # process the chunk
            chunk -= .001
            write(savestream, chunk)
        end
    end
end
close(savestream)

abhayap · February 22, 2022, 12:24am

I got it to work with the following code but the load function doesn’t return a SampleBuf. Is that correct?

using FileIO: load, save, loadstreaming, savestreaming
import LibSndFile

d = mktempdir()
a,b = randn(Float32,10000,4), randn(Float32,10000,4)
save(joinpath(d,"f1.wav"), a, Fs=8000)
save(joinpath(d,"f2.wav"), b, Fs=8000)

savestreaming(joinpath(d,"s1.wav"), 4, 8000, Float32) do dest
    for wavfile in ["f1.wav", "f2.wav"]
        loadstreaming(joinpath(d,wavfile)) do src
            while write(dest, float(read(src, 2048))) == 2048 end
        end
    end
end

s1 = load(joinpath(d,"s1.wav"))
s1[1] == vcat(a,b)

ssfrr · February 22, 2022, 2:30am

If you have WAV.jl installed, FileIO will default to that rather than LibSndFile. That might be what’s happening here.
