Ingesting data to JuliaDB without .csv files

I’ve got a large dataset of several million short audio clips (i.e., waveforms in 1D arrays, plus metadata). Total size is about 500 GB, stored across 140,000 .jld files. I need to go through and calculate a set of summary features from each audio clip, which will then be used for clustering/classification/etc.

I haven’t used JuliaDB and OnlineStats yet, but they seem like the natural way to handle the big table of features. As I understand it, the simplest/naive way to get these data into JuliaDB would be:

  1. Load each .jld file, extract features from each clip, save the resulting table in a .csv file
  2. Ingest all .csv files into JuliaDB
  3. Re-save the dataset in binary format

My questions:
Is there a way to build the binary-format database without writing and reading all those intermediate .csv files?
Are there other tools or approaches I should be looking at?

Thanks in advance!


1.4e5 data points should fit easily in memory if you are just after summary statistics, so I would just iterate through the files, extract the statistics, push! them into Vectors, and then perhaps convert those to a DataFrame.
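A minimal sketch of that loop (the file names and "features" here are placeholders for illustration, not real audio features):

```julia
# Iterate over files, push! one value per file into pre-allocated Vectors.
filenames = ["a.jld2", "b.jld2"]   # stand-in for the real file list
lens  = Int[]
freqs = Float64[]
for fname in filenames
    # real code would load the file and compute features from its clips
    push!(lens, 1)
    push!(freqs, 2.3)
end
# afterwards e.g. DataFrame(filename = filenames, len = lens, freq = freqs)
```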

The simplest approach is to use map:

julia> t = table(@NT(filename = ["a.jld2", "b.jld2"]));
julia> f(filename) = @NT(filename=filename, length=1, frequency=2.3);

julia> map(f, t, select=:filename)
Table with 2 rows, 3 columns:
filename  length  frequency
"a.jld2"  1       2.3
"b.jld2"  1       2.3

Of course, in your case f should be the function that loads the file and computes your relevant features. Note that if t is stored as a table distributed across several processes, map will run in parallel.
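For comparison, here is a sketch of the same per-file map done with the stdlib Distributed module's pmap. This is plain pmap rather than JuliaDB's distributed table, but it parallelizes the same way (one file per task), and it uses post-0.7 NamedTuple literals instead of @NT:

```julia
using Distributed
addprocs(2)  # start two worker processes

# the extractor must be defined on every worker
@everywhere f(filename) = (filename = filename, length = 1, frequency = 2.3)

results = pmap(f, ["a.jld2", "b.jld2"])  # one NamedTuple per file
```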

Thanks for the responses. I guess I didn’t make it clear, but there are 1.4e5 .jld files, each of which contains a variable number of waveforms, most between 10 and 1000. The table of summary features may well fit in RAM, but I’d hate to try processing them all and then get an out-of-memory error 95% of the way through :sweat_smile:

I’ve ended up splitting the list of files so I can process and save one medium-sized chunk at a time. I’ll be back in a day or two if it doesn’t work!
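The splitting itself can be done with Iterators.partition (a sketch; process_and_save is a hypothetical stand-in for the per-chunk feature extraction and output):

```julia
files = ["file$i.jld" for i in 1:10]   # stand-in for the 1.4e5 file names
chunks = collect(Iterators.partition(files, 4))
for (i, chunk) in enumerate(chunks)
    # process_and_save(chunk, "features_$i.jdb")  # hypothetical per-chunk step
end
# peak memory is now bounded by the largest chunk, not the whole dataset
```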

I’m a bit stuck here. I have an HDF5 data store written by pandas that looks like this:

Dict{String,Any} with 6 entries:
  "axis1"         => [0, 1, 2, 3, 4, 5, 6, 7, 8, 9  …  29592, 29593, 29594, 295…
  "axis0"         => String["run", "event", "moduleID", "pixelID", "time"]
  "block0_values" => [30.4698 31.4345 … 42.8309 30.2701]
  "block1_items"  => String["run", "event", "moduleID", "pixelID"]
  "block1_values" => [4118 4118 … 4118 4118; 39248 39248 … 78452 78452; 1 1 … 1…
  "block0_items"  => String["time"]

What I’ve come up with so far is this:

using HDF5  # provides h5read

function readDigitsTableFromHDF5_File(fname)
    d = h5read(fname, "digits")
    names = [collect(Symbol(i) for i in d["block1_items"]); collect(Symbol(i) for i in d["block0_items"])]
    data = [collect(d["block1_values"][i, :] for i in 1:length(d["block1_items"])); collect(d["block0_values"][i, :] for i in 1:length(d["block0_items"]))]
end

But this doesn’t work; a map over filenames returns this shape:

Table with 1 rows, 1 columns:
#  colname  type
1  names    Array{Array{Float64,1},1}

This isn’t what I was hoping for. What is the right syntax to generate the NamedTuple?
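One way to build the NamedTuple from runtime names and columns is Base's NamedTuple{names}(values) constructor (available since Julia 0.7; the column names and values below are illustrative, not taken from the file above):

```julia
names = (:run, :event, :time)          # column names as a Tuple of Symbols
data  = ([1, 1], [10, 20], [0.5, 0.7]) # one Vector per column

cols = NamedTuple{names}(data)
# each name now maps to its own column vector (cols.run == [1, 1]),
# so a table built from `cols` gets one column per name rather than
# a single column holding Vectors
```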