Ingesting data to JuliaDB without .csv files

ElOceanografo · June 1, 2018, 6:50pm

I’ve got a large dataset of several million short audio clips (i.e., waveforms in 1D arrays, plus metadata). Total size is about 500 GB, stored across 140,000 .jld files. I need to go through and calculate a set of summary features from each audio clip, which will then be used for clustering/classification/etc.

I haven’t used JuliaDB and OnlineStats yet, but they seem like the natural way to handle the big table of features. As I understand it, the simplest/naive way to get these data into JuliaDB would be:

Load each .jld file, extract features from each clip, save the resulting table in a .csv file
Ingest all .csv files into JuliaDB
Re-save the dataset in binary format

My questions:
Is there a way to build the binary-format database without writing and reading all those intermediate .csv files?
Are there other tools or approaches I should be looking at?

Thanks in advance!

Tamas_Papp · June 2, 2018, 12:01pm

1.4e5 data points should fit easily in memory if you are just after summary statistics, so I would just iterate through the files, extract the statistics, and push! them into Vectors, which I would then perhaps convert to a DataFrame.
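
A minimal sketch of that loop, assuming Julia 1.0 or later. `load_clips` is a hypothetical stand-in that fabricates two sine waves per file so the example runs on its own; in practice it would read the waveforms out of a real .jld file. The feature names (`nsamples`, `avg`, `sd`) are likewise only illustrative.

```julia
using Statistics  # stdlib: mean, std

# Hypothetical stand-in for loading one .jld file: returns a vector of waveforms.
# Replace with the real loading code for your files.
load_clips(path) = [sin.(range(0, stop=2π, length=n)) for n in (100, 200)]

function summarize(files)
    # One growing vector per feature column.
    file = String[]
    nsamples = Int[]
    avg = Float64[]
    sd = Float64[]
    for path in files
        for clip in load_clips(path)
            push!(file, path)
            push!(nsamples, length(clip))
            push!(avg, mean(clip))
            push!(sd, std(clip))
        end
    end
    # A named tuple of equal-length vectors acts as a simple column table.
    return (file = file, nsamples = nsamples, avg = avg, sd = sd)
end

features = summarize(["a.jld", "b.jld"])
```

Only the summary rows are kept in memory, not the waveforms themselves. A named tuple of columns like `features` should be accepted by most table constructors (for example `DataFrame(features)`), though that step is not shown here.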

piever · June 2, 2018, 2:42pm

The simplest is to use map:

julia> t = table(@NT(filename = ["a.jld2", "b.jld2"]));
julia> f(filename) = @NT(filename=filename, length=1, frequency=2.3);

julia> map(f, t, select=:filename)
Table with 2 rows, 3 columns:
filename  length  frequency
───────────────────────────
"a.jld2"  1       2.3
"b.jld2"  1       2.3

Of course in your case f should be the function that loads the file and computes your relevant features. Note that if t is stored as a distributed table across several processors, map will be done in parallel.

ElOceanografo · June 5, 2018, 2:18am

Thanks for the responses. I guess I didn’t make it clear, but there are 1.4e5 .jld files, each of which contains a variable number of waveforms, mostly between 10 and 1000. The table of summary features may well fit in RAM, but I’d hate to try processing them all and then get an out-of-memory error 95% of the way through.

I’ve ended up splitting the list of files so I can process and save one medium-sized chunk at a time. I’ll be back in a day or two if it doesn’t work!
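
One way the chunked approach could look, as a sketch rather than the code actually used here. It assumes Julia 1.0 or later; `extract_features`, the output file naming, and the choice of the `Serialization` stdlib as the on-disk format are all assumptions made for illustration.

```julia
using Serialization  # stdlib: serialize / deserialize

# Hypothetical placeholder for "load one file and compute its feature rows".
extract_features(path) = [(file = path, clip = i, energy = float(i)) for i in 1:3]

function process_in_chunks(files, outdir; chunk_size = 1000)
    outputs = String[]
    for (k, chunk) in enumerate(Iterators.partition(files, chunk_size))
        # Feature rows for this chunk only; earlier chunks are already on disk.
        rows = reduce(vcat, [extract_features(f) for f in chunk])
        out = joinpath(outdir, "features_$(k).jls")
        open(io -> serialize(io, rows), out, "w")
        push!(outputs, out)
    end
    return outputs
end

outdir = mktempdir()
files = ["clip_$(i).jld" for i in 1:5]
outputs = process_in_chunks(files, outdir; chunk_size = 2)

# Read one chunk back to check that it round-trips.
first_chunk = open(deserialize, outputs[1])
```

Because each chunk is written before the next one starts, a failure late in the run loses at most one chunk of work, and the saved chunks can be concatenated or loaded into a table afterwards.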

jstrube · August 30, 2018, 5:52am

I’m a bit stuck here. I have a pandas data store that looks like this:

Dict{String,Any} with 6 entries:
  "axis1"         => [0, 1, 2, 3, 4, 5, 6, 7, 8, 9  …  29592, 29593, 29594, 295…
  "axis0"         => String["run", "event", "moduleID", "pixelID", "time"]
  "block0_values" => [30.4698 31.4345 … 42.8309 30.2701]
  "block1_items"  => String["run", "event", "moduleID", "pixelID"]
  "block1_values" => [4118 4118 … 4118 4118; 39248 39248 … 78452 78452; 1 1 … 1…
  "block0_items"  => String["time"]

What I’ve come up with so far is this:

function readDigitsTableFromHDF5_File(fname)
    d = h5read(fname, "digits")
    names = [(Symbol(i) for i in d["block1_items"])...,collect(Symbol(i) for i in d["block0_items"])...]
    data = [(d["block1_values"][i,:] for i in 1:length(d["block1_items"]))...,(d["block0_values"][i,:] for i in 1:length(d["block0_items"]))...]
    @NT(names)(data)
end

But this doesn’t work: a map over filenames returns a table of this shape:

Table with 1 rows, 1 columns:
Columns:
#  colname  type
─────────────────────────────────────
1  names    Array{Array{Float64,1},1}

This isn’t what I was hoping for. What is the right syntax to generate the NamedTuple?
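
Judging from the output above, `@NT(names)(data)` seems to treat `names` as a literal field name rather than as a variable holding the names. A possible way around it, sketched under the assumption of Julia 1.0 or later (where named tuples are built in and the `@NT` macro is not needed), is to build the type from a tuple of symbols. The `Dict` below is a toy stand-in for the result of `h5read(fname, "digits")` with made-up values, and `columns_from_blocks` is a hypothetical helper name.

```julia
# Toy stand-in for the Dict returned by h5read(fname, "digits"); values are made up.
d = Dict(
    "block1_items"  => ["run", "event"],
    "block1_values" => [4118 4118 4118; 1 2 3],  # one row per item
    "block0_items"  => ["time"],
    "block0_values" => [30.5 31.4 42.8],         # 1×3 matrix
)

function columns_from_blocks(d)
    names = Symbol[]
    cols = Any[]
    for blk in ("block1", "block0")
        items = d["$(blk)_items"]
        vals = d["$(blk)_values"]
        for (i, item) in enumerate(items)
            push!(names, Symbol(item))
            push!(cols, vals[i, :])  # row i of the block holds the column for this item
        end
    end
    # The field names must be a tuple of Symbols in the type parameter.
    return NamedTuple{Tuple(names)}(Tuple(cols))
end

nt = columns_from_blocks(d)
```

Here `nt` has one field per item name, each holding a column vector, which is the column-oriented shape that table constructors generally expect.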
