I’ve got a large dataset of several million short audio clips (1D waveform arrays plus metadata). The total size is about 500 GB, stored across 140,000 .jld files. I need to go through and calculate a set of summary features from each clip, which will then be used for clustering, classification, etc.
I haven’t used JuliaDB and OnlineStats yet, but they seem like the natural way to handle the big table of features. As I understand it, the simplest/naive way to get these data into JuliaDB would be:
- Load each .jld file, extract features from each clip, save the resulting table in a .csv file
- Ingest all .csv files into JuliaDB
- Re-save the dataset in binary format
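Roughly, I'm picturing something like the following (the feature extractor and directory names are just placeholders, and I'm guessing at the JuliaDB API from the docs):

```julia
using JLD, CSV, JuliaDB, Statistics

# Hypothetical feature extractor -- a couple of summary stats per clip
extract_features(clip) = (rms = sqrt(mean(abs2, clip)), peak = maximum(abs, clip))

# Step 1: one CSV of features per .jld file
# (assuming each .jld file holds a Dict of clip arrays)
for (i, f) in enumerate(readdir("clips"; join = true))
    clips = JLD.load(f)
    rows = [extract_features(c) for c in values(clips)]
    CSV.write("features/clips_$i.csv", rows)
end

# Steps 2-3: ingest all the CSVs, then re-save in JuliaDB's binary format
t = loadtable("features")
JuliaDB.save(t, "features.juliadb")
```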
My questions:
Is there a way to build the binary-format database without writing and reading all those intermediate .csv files?
Are there other tools or approaches I should be looking at?
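To make the first question concrete: since the features are tiny compared to the raw audio, I'm wondering if I can just accumulate them in memory and build the table directly, skipping CSV entirely. Something like this (again, guessing at the API, and assuming each .jld file holds a Dict of clip arrays):

```julia
using JLD, JuliaDB, Statistics

rms, peak = Float64[], Float64[]
for f in readdir("clips"; join = true)
    for clip in values(JLD.load(f))           # extract features clip by clip
        push!(rms, sqrt(mean(abs2, clip)))
        push!(peak, maximum(abs, clip))
    end
end

t = table((rms = rms, peak = peak))           # build the table in memory...
JuliaDB.save(t, "features.juliadb")           # ...and write the binary format directly
```

Is that a reasonable pattern, or is there a more idiomatic way to build the table incrementally?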
Thanks in advance!