Storing huge amounts of data efficiently

Hello,

I am performing some numerical computations and I need to run a huge for loop (say, several thousand iterations). Each iteration produces a certain amount of data that I need to store on disk. For clarity, let’s say that each iteration produces a list of pairs (called graph) and a list of around 1000 real numbers (called spectrum). My current way to save the data is to create, before starting the loop, a file called, say, datafile.jld with two labels (the label spectra and the label graphs) and then, at each iteration, store the data in it in the following way:

using JLD

# load the whole file, append the new results, and write everything back
file = JLD.load("datafile.jld")
push!(file["graphs"], graph)
push!(file["spectra"], spectrum)
JLD.save("datafile.jld", file)

However, although this method allows me to store the data in a clean way, I realized that it is very inefficient in terms of time and, more importantly, of memory. Indeed, the file datafile.jld soon becomes very large, and when I load it the RAM saturates very quickly and the system crashes.

What should I do? Of course, a solution would be to store the data temporarily in many small files (one file created for each iteration of the loop), and only at the end of the computation store everything in the jld file. But I am wondering whether there are better procedures.

Thanks

I am using Arrow.jl (GitHub - apache/arrow-julia: Official Julia implementation of Apache Arrow) to store my simulation results. See: User Manual · Arrow.jl

This can be very fast and efficient if your data can be represented in the form of tables.

But I don’t think that it works if your data is larger than your (virtual) memory.

Arrow allows larger-than-RAM datasets, but getting that right can be tricky: https://lists.apache.org/thread/cbfc9hjl4nofqkfk1lmo6bwlqnxz5x86

2 Likes

With Arrow.jl, if you don’t use compression, reading is just mmapped, so you can easily handle larger-than-RAM datasets.
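For instance, a minimal sketch of what that looks like (the file name and column names here are just placeholders):

using Arrow, Tables

# Arrow.Table memory-maps an uncompressed file, so the columns are lazy
# views into the file rather than copies sitting in RAM.
tbl = Arrow.Table("datafile.arrow")
spectra = tbl.spectrum        # an arrow-backed column, not a materialized Vector

# Record batches can also be processed one at a time without loading the rest:
for batch in Arrow.Stream("datafile.arrow")
    cols = Tables.columntable(batch)
    # ... work with `cols` for this batch only ...
end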

1 Like

As an HPC guy… “Many small files” Arooogah! Arooogah! Shields up! Red alert!

Seriously - many filesystems do not cope well with huge numbers of small files.
Have a thought also for deep storage of your files - which eventually might be on tape.

As others have replied here, give some thought to a suitable file format which keeps your data in a single file. You are of course doing this by asking the question.

1 Like

Thanks! I will definitely check it. For me the important thing is to have the data well organized, with fast and memory-efficient storage. In principle, the jld files were simply perfect, but it looks like, to add new data, I need to load the full file every time. This makes it simply impossible to use this solution.

Ehehehe I know very well that many small files is a terrible solution :slight_smile:

Have a thought also for deep storage of your files - which eventually might be on tape.

What does that mean? Can you give me an example?

As others have replied here, give some thought to a suitable file format which keeps your data in a single file.

Yes, this is exactly what I want, but with jld files it looks like I need to load the full file every time to write something new. Do you know whether I am doing something wrong?

Can you point to an example of how to use Arrow.jl to write a dataset incrementally?

1 Like

Thanks. I am confused though: after this line of code

writer = open(Arrow.Writer, tempname())

isn’t writer containing the full vector? In other words, am I in this case loading the whole file, then appending the new part and finally saving everything?

My guess is you can do multiple Arrow.write(writer, partition) calls, and after each one of them partition no longer needs to live in memory.
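Something along these lines should work as a sketch (the file name and table layout are just an illustration, assuming each partition fits in memory on its own):

using Arrow

# Open a writer once, then append one record batch per iteration.
open(Arrow.Writer, "datafile.arrow") do writer
    for k in 1:10_000
        spectrum = rand(1000)                       # placeholder for the real computation
        partition = (iter = fill(k, 1000), spectrum = spectrum)
        Arrow.write(writer, partition)              # appended to the file; `partition` can now be garbage-collected
    end
end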

@Dario-Rosa85 Tape is an inherently linear medium (that is of course stating the obvious). It performs best when data is streamed off it when reading.
I WAS now going to talk about ‘shoe shining’, which is the behaviour of the tape stopping and starting as many small files are read… even worse when the tape drive has to seek for a file somewhere along the length of the tape, then go back and seek for another file.

However a quick Google tells me that shoe shining is not a problem as LTO tape drives have Dynamic Read Matching. As someone who managed a big tape library with LTO drives in the past I hang my head in shame that I didn’t look into the technology at that depth.

Also a warning that creating oodles and oodles of small files uses up inodes on your filesystem.

I like to use HDF5 files (https://www.hdfgroup.org/, Home · HDF5.jl) for this kind of thing. Then each iteration can write its own “dataset” (a notion in HDF5) to a single HDF5 file, and datasets can then be read later in an independent fashion, that is, you don’t need to read the entire HDF5 file to get a single dataset out of it.

HDF5 also allows you to incrementally append to a given dataset, and to read only a chunk of a given dataset, so you can also have a single dataset that doesn’t fit into RAM.
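For instance, a rough sketch with HDF5.jl (file name, group names, and the fake data are placeholders for your real loop):

using HDF5

# Append one group per iteration to a single file
# ("cw": create the file if it doesn't exist, otherwise open it read-write).
for k in 1:100                                                # however many iterations you have
    graph = [rand(1:1000) => rand(1:1000) for _ in 1:1000]   # stand-ins for the real results
    spectrum = rand(1000)
    h5open("results.h5", "cw") do fid
        g = create_group(fid, "iter_$k")
        g["graph"] = hcat(first.(graph), last.(graph))        # store the pairs as an N×2 matrix
        g["spectrum"] = spectrum
    end
end

# Later, read back a single dataset without touching the rest of the file:
spec_17 = h5open("results.h5", "r") do fid
    read(fid["iter_17/spectrum"])
end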

(Of course, if you don’t need the data to be super portable, you can also just read/write binary data from/to an mmapped file.)
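For completeness, a minimal sketch of the mmap route (the sizes and file name are made up):

using Mmap

# Pre-size a file-backed matrix: one column of spectrum data per iteration.
# The data lives in the file, so it doesn't need to fit in RAM all at once.
n_iter, spec_len = 10_000, 1_000
io = open("spectra.bin", "w+")
spectra = Mmap.mmap(io, Matrix{Float64}, (spec_len, n_iter))
spectra[:, 1] = rand(spec_len)   # writes go to the mapped file via the OS page cache
Mmap.sync!(spectra)              # flush dirty pages to disk
close(io)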

1 Like

Note that JLD2 files are HDF5 files, so you can write data to an existing file without having to read its contents first. Here is an example setup…

using JLD2: JLD2, jldopen
using FileIO: load

struct Result
    graph::Vector{Pair{Int, Int}}
    spectrum::Vector{Float64}
end

function make_rand_result(len_g=1000, len_s=1000)
    graph = [rand(1:len_g) => rand(1:len_g) for _ in 1:len_g]
    spectrum = rand(len_s)
    return Result(graph, spectrum)
end


"""
    append_result_data(fname::AbstractString, gname::String, result::Result)

Append a `Result` instance to a result file under the group named `gname`.

## Arguments

- `fname`: The name of the result file to be appended to.
- `gname`: The unique `JLD2` group name to be used in the file for grouping the data 
  associated with this particular `Result`.
- `result`:  The `Result` data to be written to the file.
"""
function append_result_data(fname::AbstractString, gname::String, result::Result)
    jldopen(fname, "a") do fid
        group = JLD2.Group(fid, gname)
        group["result"] = result
    end
    return
end

"""
Read a result file (in JLD2 format) and return a vector of results.    
"""
function read_result_file(fname::AbstractString)::Vector{Result}
    dat = load(fname) # a Dict
    ks = collect(keys(dat))
    sort!(ks, by = x -> parse(Int, split(x, '/')[1]))
    Result[dat[k] for k in ks]
end


fname = "results.jld2"

@time for k in 1:100
    result = make_rand_result()
    append_result_data(fname, string(k), result)
end

The result of executing this code is

0.153978 seconds (90.18 k allocations: 11.762 MiB)

so each write takes about 0.0015 seconds on my machine (using an SSD). For this small amount of data one can read in the entire file at once using the read_result_file function:

julia> results = read_result_file("results.jld2")
100-element Vector{Result}:
 Result([101 => 539, 595 => 14, 157 => 177, 217 => 605, 936 => 594, 34 => 199, 798 => 597, 635 => 95, 149 => 669, 46 => 289  …  553 => 829, 463 => 229, 496 => 658, 298 => 627, 236 => 862, 154 => 48, 736 => 729, 512 => 277, 653 => 141, 913 => 978], [0.5249898601896728, 0.4312370366565087, 0.073513832333578, 0.9307962520861972, 0.570758524298132, 0.29993065399764673, 0.461428212039214, 0.48548053201183095, 0.9545877485556933, 0.2801239021403443  …  0.4034577980889896, 0.08557405938710971, 0.8975983515249012, 0.10602304568819776, 0.04273325330287514, 0.015438071286775767, 0.9906598021539139, 0.18758080699422763, 0.963555146837086, 0.39262228477199157])
 Result([74 => 40, 250 => 120, 579 => 190, 691 => 108, 925 => 668, 675 => 141, 510 => 240, 389 => 320, 12 => 641, 531 => 372  …  67 => 563, 444 => 506, 817 => 139, 737 => 163, 518 => 588, 133 => 688, 279 => 535, 747 => 827, 695 => 684, 974 => 837], [0.3158561440694494, 0.9601344915366097, 0.003653937261339224, 0.2926859090457542, 0.2827751952536224, 0.9779077388680282, 0.263547348130297, 0.27975998760694254, 0.7767543085049818, 0.9597931494721555  …  0.9377423040933572, 0.21667656142445002, 0.636526779508497, 0.20705611746552255, 0.6387448271827161, 0.3646310839980138, 0.190373271599928, 0.5071072365335462, 0.39990795434930937, 0.10745351609508336])
 ⋮
 Result([937 => 530, 906 => 467, 238 => 866, 609 => 25, 935 => 290, 872 => 503, 170 => 9, 894 => 365, 784 => 409, 807 => 327  …  229 => 506, 780 => 405, 321 => 948, 547 => 420, 122 => 30, 45 => 889, 245 => 420, 818 => 867, 299 => 420, 761 => 395], [0.17656883617243935, 0.4507217403285453, 0.04200052075311711, 0.6329462072026806, 0.11094795406276392, 0.057792051835904745, 0.4985762857207552, 0.1979714208282517, 0.9372049973093541, 0.022649614794672535  …  0.895620703038924, 0.03189989532073445, 0.012352709138806706, 0.9485344498486584, 0.5289750480073121, 0.607563134722566, 0.9184455893097113, 0.6339606316843024, 0.06139500890461502, 0.6184410686790852])

Since you will be generating too much data to fit in RAM, you can read in selected groups of data as shown in this section of the JLD2 documentation.
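For example, something along these lines reads a single result back without loading the rest of the file:

using JLD2: jldopen

# Read only one group's Result; nothing else in the file is materialized.
result_17 = jldopen("results.jld2", "r") do fid
    fid["17/result"]
end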

4 Likes

Are this programmer’s concerns in this blog post about JLD2 format and Julia serialization valid?

JLD2 maintainer here.
I’d say JLD2 is reasonably well suited for this.

The blog post claims JLD2 doesn’t have compression - it does.
It claims it is some unique/customized format: output files are plain HDF5. (It does store additional metadata if you want JLD2 to directly store structs rather than limiting yourself to numbers, strings, and arrays.)

Long-term support is not really an issue in my eyes.
If you want to store data for a long time, you should not rely on external (package) structs to not change (e.g. DiffEq solution objects or DataFrames) and generally avoid overly complex structures.
If you adhere to that, your files will remain readable with JLD2 and in particular also with HDF5.
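As a quick sanity check, a file like the results.jld2 from the example above can be opened as ordinary HDF5 (sketch, assuming HDF5.jl is installed):

using HDF5

# A .jld2 file written with basic types is a valid HDF5 file,
# so other HDF5 tooling can at least see its structure.
h5open("results.jld2", "r") do fid
    @show keys(fid)   # the JLD2 groups appear as plain HDF5 groups
end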

Note: If you know you will be appending new datasets to your file many times, you can construct the containing group to “leave space” for future entries. This improves loading time.
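For example, something along these lines (sketch; the keyword names are the ones I recall from the JLD2 documentation, so please double-check them there):

using JLD2: JLD2, jldopen

# Give the group an estimate of how many entries it will eventually hold,
# so JLD2 can reserve space for the links up front.
jldopen("results.jld2", "a") do fid
    g = JLD2.Group(fid, "spectra"; est_num_entries=10_000, est_link_name_len=8)
    g["1"] = rand(1000)
end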

12 Likes