Storing huge amount of data efficiently

Dario-Rosa85 · February 18, 2023, 2:00am

Hello,

I am performing some numerical computations and need to run a very long for loop (say, several thousand iterations). Each iteration produces data that I need to store on disk. For concreteness, let’s say that each iteration produces a list of pairs (called graph) and a list of around 1000 real numbers (called spectrum). My current approach is to create, before starting the loop, a file called, say, datafile.jld with two labels (spectra and graphs), and then at each iteration to store the data in it as follows:

file = JLD.load("datafile.jld")
push!(file["graphs"], graph)
push!(file["spectra"], spectrum)
JLD.save("datafile.jld", file)

However, although this method stores the data cleanly, I realized that it is very inefficient in time and, more importantly, in memory. Indeed, the file datafile.jld soon becomes very large, and when I load it the RAM saturates very quickly and the system crashes.

What should I do? Of course, one solution would be to store the data temporarily in many small files (one per iteration of the loop) and only at the end of the computation store everything in the jld file. But I am wondering whether there are better procedures.

Thanks

ufechner7 · February 18, 2023, 2:37am

I am using Arrow.jl (apache/arrow-julia, the official Julia implementation of Apache Arrow) to store my simulation results. See the Arrow.jl user manual.

This can be very fast and efficient if your data can be represented in form of tables.

But I don’t think that it works if your data is larger than your (virtual) memory.

Arrow allows larger-than-RAM datasets, but getting that right can be tricky: https://lists.apache.org/thread/cbfc9hjl4nofqkfk1lmo6bwlqnxz5x86

jling · February 18, 2023, 3:48am

With Arrow.jl, if you don’t use compression, reading is just mmapped, so you can easily handle larger-than-RAM data sets.

johnh · February 18, 2023, 5:15am

As an HPC guy… “Many small files” Arooogah! Arooogah! Shields up! Red alert!

Seriously - many filesystems do not cope well with huge numbers of small files.
Have a thought also for deep storage of your files - which eventually might be on tape.

As others reply here, give some thought to a suitable file format which keeps your data in single files. You are of course doing this by asking the question.

Dario-Rosa85 · February 18, 2023, 7:07am

Thanks! I will definitely check it. For me the important thing is to have the data well organized, with fast and memory-efficient storage. In principle, the jld files were simply perfect, but it looks like I need to load the full file every time I add new data. This makes it simply impossible to use this solution.

Dario-Rosa85 · February 18, 2023, 7:09am

Ehehehe I know very well that many small files is a terrible solution

Have a thought also for deep storage of your files - which eventually might be on tape.

What does it mean? Can you give me an example?

As others reply here, give some thought to a suitable file format which keeps your data in single files.

Yes, this is exactly what I want, but with jld files it looks like I need to load the full file every time to write something new. Do you know whether I am doing something wrong?

ufechner7 · February 18, 2023, 9:25am

Can you point to an example of how to use Arrow.jl to write a dataset incrementally?

jling · February 18, 2023, 7:34pm

github.com

apache/arrow-julia/blob/8da59e4ca0c07403029a7b39c0ce2bb4a7ee11df/src/write.jl#L81

struct Block
    offset::Int64
    metaDataLength::Int32
    bodyLength::Int64
end

"""
    Arrow.Writer{T<:IO}

An object that can be used to incrementally write Arrow partitions

# Examples
```julia
julia> writer = open(Arrow.Writer, tempname())

julia> partition1 = (col1 = [1, 2], col2 = ["A", "B"])
(col1 = [1, 2], col2 = ["A", "B"])

julia> Arrow.write(writer, partition1)
```

Dario-Rosa85 · February 20, 2023, 6:34pm

Thanks. I am confused though: after this line of code

writer = open(Arrow.Writer, tempname())

Doesn’t writer contain the full vector? In other words, am I in this case loading the whole file, then appending the new part, and finally saving everything?

jling · February 20, 2023, 8:23pm

My guess is that you can call Arrow.write(writer, partition) multiple times, and after each call partition no longer needs to live in memory.
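A minimal sketch of that pattern, assuming a recent Arrow.jl that provides `Arrow.Writer` as in the docstring quoted above. The file path, the loop sizes, and the long-format column names `iter` and `value` are illustrative choices, not anything from the original posts. The writer is an open output stream, not a copy of the data already written, so nothing is loaded back between iterations.

```julia
using Arrow

path = tempname() * ".arrow"          # hypothetical output file
n_iter, n_vals = 5, 1000              # small sizes, for illustration only

# Open the writer once; each Arrow.write call appends one partition to the file.
writer = open(Arrow.Writer, path)
for k in 1:n_iter
    spectrum = rand(n_vals)           # stand-in for the real computation
    # Long-format table: one row per spectrum value, tagged with its iteration.
    partition = (iter = fill(k, n_vals), value = spectrum)
    Arrow.write(writer, partition)
    # User code does not need to keep `partition` around after this call.
end
close(writer)                         # finishes the file

# Reading back: all partitions appear as one table with chained columns.
tbl = Arrow.Table(path)
```

After `close(writer)`, `tbl.iter` and `tbl.value` each hold `n_iter * n_vals` entries, and rows belonging to a given iteration can be selected through the `iter` column.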

johnh · February 23, 2023, 9:33am

@Dario-Rosa85 Tape is an inherently linear medium (that is of course stating the obvious). It performs best when data is streamed while reading from it.
I WAS now going to talk about ‘shoe shining’, which is the behaviour of the tape stopping and starting as many small files are read… even worse when the tape drive has to seek for a file somewhere along the length of the tape, then go back and seek for another file.

However a quick Google tells me that shoe shining is not a problem as LTO tape drives have Dynamic Read Matching. As someone who managed a big tape library with LTO drives in the past I hang my head in shame that I didn’t look into the technology at that depth.

Also a warning that creating oodles and oodles of small files uses up inodes on your filesystem.

emil_hedevang_sgre · February 23, 2023, 10:02am

I like to use HDF5 files (https://www.hdfgroup.org/, HDF5.jl) for this kind of thing. Then each iteration can write its own “dataset” (a notion in HDF5) to a single HDF5 file, and datasets can then be read later in an independent fashion, that is, you don’t need to read the entire HDF5 file to get a single dataset out of it.
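A rough sketch of this one-dataset-per-iteration layout with HDF5.jl. The group names `iter_k`, the file path, and the choice to store the graph as a 2×N integer matrix are assumptions made for illustration (HDF5 has no native type for Julia `Pair`s).

```julia
using HDF5

fname = tempname() * ".h5"            # hypothetical output file

for k in 1:5
    graph = [rand(1:100) => rand(1:100) for _ in 1:20]   # stand-in data
    spectrum = rand(1000)
    # "cw" opens the file for read/write and creates it if it does not exist.
    h5open(fname, "cw") do f
        g = create_group(f, "iter_$k")
        # Store the edge list as a 2×N matrix: row 1 = sources, row 2 = targets.
        g["graph"] = [first.(graph)'; last.(graph)']
        g["spectrum"] = spectrum
    end
end

# Read a single dataset without touching the rest of the file.
spectrum3 = h5open(fname, "r") do f
    read(f["iter_3/spectrum"])
end
```

Only the dataset that is requested is read from disk, so memory use during the loop stays at roughly one iteration's worth of data.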

stevengj · February 23, 2023, 3:20pm

HDF5 also allows you to incrementally append to a given dataset, and to read only a chunk of a given dataset, so you can also have a single dataset that doesn’t fit into RAM.
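One possible way to do the incremental append with HDF5.jl is a chunked dataset whose second dimension is unlimited. The helper name `append_column!` and the dataset name `spectra` are hypothetical; the `((dims), (max_dims))` form of `create_dataset` and `HDF5.set_extent_dims` are used as described in the HDF5.jl documentation, but the exact spelling may differ between package versions, so treat this as a sketch.

```julia
using HDF5

# Append `col` as a new column of the 2-D dataset `name`, creating it on first use.
function append_column!(f, name::AbstractString, col::Vector{Float64})
    n = length(col)
    if !haskey(f, name)
        # Initial size (n, 1); a maximum of -1 means "unlimited" along that axis.
        # Extendible datasets must be chunked; here one chunk holds one column.
        d = create_dataset(f, name, Float64, ((n, 1), (n, -1)); chunk=(n, 1))
        d[:, 1] = col
    else
        d = f[name]
        k = size(d)[2] + 1
        HDF5.set_extent_dims(d, (n, k))   # grow the dataset by one column
        d[:, k] = col
    end
    return nothing
end

fname = tempname() * ".h5"            # hypothetical output file
for k in 1:5
    h5open(fname, "cw") do f
        append_column!(f, "spectra", rand(1000))
    end
end

# Read the dataset size and only columns 2 and 3, not the whole dataset.
dims, part = h5open(fname, "r") do f
    d = f["spectra"]
    size(d), d[:, 2:3]
end
```

Indexing the dataset object (`d[:, 2:3]`) reads just that hyperslab, which is what makes a larger-than-RAM dataset workable.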

(Of course, if you don’t need the data to be super portable, you can also just read/write binary data from/to an mmapped file.)
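For the raw memory-mapped variant, a small sketch using Julia's `Mmap` standard library. It assumes every spectrum has the same length and that the number of iterations is known up front; the file path and sizes are made up for the example, and the array shape has to be recorded somewhere else because the file itself carries no metadata.

```julia
using Mmap

path = tempname()                     # hypothetical raw binary file
n_vals, n_iter = 1000, 50             # the layout must be known when reading back

# "w+" creates the file; mmap grows it to hold the full n_vals × n_iter matrix.
io = open(path, "w+")
spectra = Mmap.mmap(io, Matrix{Float64}, (n_vals, n_iter))
for k in 1:n_iter
    spectra[:, k] = rand(n_vals)      # stand-in for the real computation
end
Mmap.sync!(spectra)                   # flush modified pages to disk
close(io)

# Later: map the file read-only and touch only the column that is needed.
col7 = open(path, "r") do io
    A = Mmap.mmap(io, Matrix{Float64}, (n_vals, n_iter))
    A[:, 7]                           # copies one column into ordinary memory
end
```

The operating system pages data in and out as needed, so the mapped matrix can be larger than physical RAM, at the cost of a file that only Julia code knowing the element type and shape can interpret.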

PeterSimon · February 24, 2023, 5:19am

Note that JLD2 files are HDF5 files, so you can write data to an existing file without having to read its contents first. Here is an example setup…

using JLD2: JLD2, jldopen
using FileIO: load

struct Result
    graph::Vector{Pair{Int, Int}}
    spectrum::Vector{Float64}
end

function make_rand_result(len_g=1000, len_s=1000)
    graph = [rand(1:len_g) => rand(1:len_g) for _ in 1:len_g]
    spectrum = rand(len_s)
    return Result(graph, spectrum)
end


"""
    append_result_data(fname::AbstractString, gname::String, result::Result)

Append a `Result` instance to a result file, storing it in its own group.

## Arguments

- `fname`: The name of the result file to be appended to.
- `gname`: The unique `JLD2` group name to be used in the file for grouping the data 
  associated with this particular `Result`.
- `result`:  The `Result` data to be written to the file.
"""
function append_result_data(fname::AbstractString, gname::String, result::Result)
    jldopen(fname, "a") do fid
        group = JLD2.Group(fid, gname)
        group["result"] = result
    end
    return
end

"""
Read a result file (in JLD2 format) and return a vector of results.    
"""
function read_result_file(fname::AbstractString)::Vector{Result}
    dat = load(fname) # a Dict
    ks = collect(keys(dat))
    sort!(ks, by = x -> parse(Int, split(x, '/')[1]))
    Result[dat[k] for k in ks]
end


fname = "results.jld2"

@time for k in 1:100
    result = make_rand_result()
    append_result_data(fname, string(k), result)
end

The result of executing this code is

0.153978 seconds (90.18 k allocations: 11.762 MiB)

so each write takes about 0.0015 seconds on my machine (using an SSD). For this small amount of data one can read in the entire file at once using the read_result_file function:

julia> results = read_result_file("results.jld2")
100-element Vector{Result}:
 Result([101 => 539, 595 => 14, 157 => 177, 217 => 605, 936 => 594, 34 => 199, 798 => 597, 635 => 95, 149 => 669, 46 => 289  …  553 => 829, 463 => 229, 496 => 658, 298 => 627, 236 => 862, 154 => 48, 736 => 729, 512 => 277, 653 => 141, 913 => 978], [0.5249898601896728, 0.4312370366565087, 0.073513832333578, 0.9307962520861972, 0.570758524298132, 0.29993065399764673, 0.461428212039214, 0.48548053201183095, 0.9545877485556933, 0.2801239021403443  …  0.4034577980889896, 0.08557405938710971, 0.8975983515249012, 0.10602304568819776, 0.04273325330287514, 0.015438071286775767, 0.9906598021539139, 0.18758080699422763, 0.963555146837086, 0.39262228477199157])
 Result([74 => 40, 250 => 120, 579 => 190, 691 => 108, 925 => 668, 675 => 141, 510 => 240, 389 => 320, 12 => 641, 531 => 372  …  67 => 563, 444 => 506, 817 => 139, 737 => 163, 518 => 588, 133 => 688, 279 => 535, 747 => 827, 695 => 684, 974 => 837], [0.3158561440694494, 0.9601344915366097, 0.003653937261339224, 0.2926859090457542, 0.2827751952536224, 0.9779077388680282, 0.263547348130297, 0.27975998760694254, 0.7767543085049818, 0.9597931494721555  …  0.9377423040933572, 0.21667656142445002, 0.636526779508497, 0.20705611746552255, 0.6387448271827161, 0.3646310839980138, 0.190373271599928, 0.5071072365335462, 0.39990795434930937, 0.10745351609508336])
 ⋮
 Result([937 => 530, 906 => 467, 238 => 866, 609 => 25, 935 => 290, 872 => 503, 170 => 9, 894 => 365, 784 => 409, 807 => 327  …  229 => 506, 780 => 405, 321 => 948, 547 => 420, 122 => 30, 45 => 889, 245 => 420, 818 => 867, 299 => 420, 761 => 395], [0.17656883617243935, 0.4507217403285453, 0.04200052075311711, 0.6329462072026806, 0.11094795406276392, 0.057792051835904745, 0.4985762857207552, 0.1979714208282517, 0.9372049973093541, 0.022649614794672535  …  0.895620703038924, 0.03189989532073445, 0.012352709138806706, 0.9485344498486584, 0.5289750480073121, 0.607563134722566, 0.9184455893097113, 0.6339606316843024, 0.06139500890461502, 0.6184410686790852])

Since you will be generating too much data to fit in RAM, you can read in selected groups of data as shown in this section of the JLD2 documentation.
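A sketch of such a selective read, assuming the same `Result` struct and the same `"<k>/result"` layout that `append_result_data` produces. The temporary file and the tiny stand-in graphs are only there to make the example self-contained.

```julia
using JLD2

struct Result
    graph::Vector{Pair{Int, Int}}
    spectrum::Vector{Float64}
end

fname = tempname() * ".jld2"          # hypothetical result file
for k in 1:10
    jldopen(fname, "a") do fid
        fid["$k/result"] = Result([k => k + 1], rand(1000))
    end
end

# Open read-only and load just one entry; the other groups stay on disk.
spectrum7, ngroups = jldopen(fname, "r") do fid
    fid["7/result"].spectrum, length(keys(fid))
end
```

`keys(fid)` lists the top-level group names without loading their contents, so it can be used to iterate over the results one at a time.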

rafael.guerra · February 24, 2023, 8:55am

Are the concerns this programmer raises in this blog post about the JLD2 format and Julia serialization valid?

JonasIsensee · February 24, 2023, 9:21am

JLD2 maintainer here.
I’d say JLD2 is reasonably well suited for this.

The blog post claims JLD2 doesn’t have compression - it does.
It claims it is some unique/customized format: output files are plain HDF5. (It does store additional metadata if you want JLD2 to directly store structs rather than limiting yourself to numbers, strings, and arrays.)

Long-term support is not really an issue in my eyes.
If you want to store data for a long time, you should not rely on external (package) structs to not change (e.g. DiffEq solution objects or DataFrames) and generally avoid overly complex structures.
If you adhere to that, your files will remain readable with JLD2 and in particular also with HDF5.

Note: If you know you will be appending new datasets to your file many times, you can construct the containing group to “leave space” for future entries. This improves loading time.

github.com

JuliaIO/JLD2.jl/blob/a0ecb14c1383a8893a34544ea34b16da96c3223b/src/JLD2.jl#L180

FilterPipeline() = FilterPipeline(Filter[])
iscompressed(fp::FilterPipeline) = !isempty(fp.filters)

"""
    Group(file)

JLD2 group object.

## Advanced Usage
Takes two optional keyword arguments:
    est_num_entries::Int=4
    est_link_name_len::Int=8
These determine how much (additional) empty space should be allocated for the group description. (list of entries)
This can be useful for performance when one expects to append many additional datasets after first writing the file.
"""
mutable struct Group{T}
    f::T
    last_chunk_start_offset::Int64
    continuation_message_goes_here::Int64
    last_chunk_checksum_offset::Int64
    next_link_offset::Int64
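How these keyword arguments might be used in practice, as a sketch only: it assumes that `JLD2.Group(file, name; est_num_entries, est_link_name_len)` accepts the keywords named in the docstring above, and the group name `results`, the file path, and the sizes are made up for illustration.

```julia
using JLD2

fname = tempname() * ".jld2"          # hypothetical result file
n_iter = 100

# Create the containing group once, reserving room for the expected number
# of entries and for link names of up to a few characters ("1" … "100").
jldopen(fname, "w") do fid
    g = JLD2.Group(fid, "results"; est_num_entries=n_iter, est_link_name_len=4)
    g["1"] = rand(1000)               # first entry, written in the same session
end

# Later iterations reopen the file in append mode and add to the same group.
for k in 2:n_iter
    jldopen(fname, "a") do fid
        fid["results"][string(k)] = rand(1000)
    end
end

n_entries = jldopen(fname, "r") do fid
    length(keys(fid["results"]))
end
```

The estimates only affect how much header space is reserved; entries beyond `est_num_entries` can still be added later.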
