Maximizing Input File Read Performance

Hi everyone,

I’m writing a simple simulation program, and trying to figure out the best way of designing a format for my input files. The simulation calculates the effect of multiple types of sources (each defined in a different manner) on a set of observation points in 3D space. The number of sources and the number of observation points will be very large. Not all types of sources may be present in every input file. The data are all Float64.

The naive implementation I came up with is as follows:

  • Use a CSV file with comment lines as ‘flags’ that tell the program the type of the data rows that follow
  • Use CSV.jl to read the data line by line and react to the flags as required

For example, with just a few sample data lines:

# observation-pts (number, x, y, z)
1,0,0,0
2,0,0,1
3,1,2,3
# source-type-1 (x0,y0,z0,x1,y1,z1,a)
0,0,0,1,1,1,1000
0,1,2,4,5,6,2000
# source-type-2 (x0,y0,z0,r,b)
10,20,30,50,4000

My concern is that with very large quantities of data, reading line-by-line is a very inefficient method of accessing the data from the file and loading it into memory.
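For reference, the naive approach described above can be sketched without any packages. This is a minimal baseline under stated assumptions, not a recommended design: `read_flagged` is a hypothetical helper name, every flag line is assumed to start with `#` followed by the section name, and every data row is assumed to hold only comma-separated numbers. Timing something like this on a realistic file is probably the simplest way to find out whether text parsing is actually the bottleneck.

```julia
# Hypothetical helper: groups the rows of a flagged CSV stream by section name.
function read_flagged(io::IO)
    sections = Dict{String,Vector{Vector{Float64}}}()
    current = ""
    for raw in eachline(io)
        line = strip(raw)
        isempty(line) && continue
        if startswith(line, "#")
            # "# source-type-1 (x0,...)" -> "source-type-1"
            current = String(first(split(strip(line[2:end]))))
            sections[current] = Vector{Float64}[]
        else
            # Rows seen before any flag end up under the empty name "".
            row = parse.(Float64, split(line, ','))
            push!(get!(sections, current, Vector{Float64}[]), row)
        end
    end
    return sections
end

sample = """
# observation-pts (number, x, y, z)
1,0,0,0
2,0,0,1
3,1,2,3
# source-type-2 (x0,y0,z0,r,b)
10,20,30,50,4000
"""
sections = read_flagged(IOBuffer(sample))
```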

Are there standard/typical methods for handling data files like this?

Thanks in advance!

Use a binary format, e.g. HDF5.jl, or even a custom binary format via read and write (or memory-mapped arrays).
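To make the custom-binary option concrete, here is a rough sketch using only `read`, `write`, and the `Mmap` standard library. The layout (two `Int64` dimensions followed by the raw `Float64` values) and the names `write_block`, `read_block`, and `mmap_block` are made up for illustration. Bytes are written in the machine's native byte order, so a file like this is not portable across architectures the way HDF5 is.

```julia
using Mmap

# Hypothetical layout: Int64 row count, Int64 column count,
# then the Float64 values in column-major order.
function write_block(path, A::Matrix{Float64})
    open(path, "w") do io
        write(io, Int64(size(A, 1)))
        write(io, Int64(size(A, 2)))
        write(io, A)
    end
end

# Read the whole matrix into memory in one call.
function read_block(path)
    open(path, "r") do io
        m = read(io, Int64)
        n = read(io, Int64)
        read!(io, Matrix{Float64}(undef, m, n))
    end
end

# Map the matrix instead of copying it; pages are loaded on access.
# The mapping starts at the current stream position, i.e. after the header.
function mmap_block(path)
    open(path, "r") do io
        m = read(io, Int64)
        n = read(io, Int64)
        Mmap.mmap(io, Matrix{Float64}, (m, n))
    end
end

path = tempname()
A = [0.0 0.0 1.0; 0.0 0.0 2.0; 0.0 1.0 3.0]  # one observation point per column
write_block(path, A)
B = read_block(path)
C = mmap_block(path)
```

Each source type would need its own block (or its own file) plus some way to say which blocks are present, which is exactly the bookkeeping an existing format handles for you.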

Adding to the previous answer:

  1. What are your requirements, e.g., human-readable, portable, etc.?
  2. Don’t invent your own data format, but use an existing one that is flexible enough to cover your data needs:
    • JSON might be an option if it needs to be human readable
    • SQL(ite) DB with observations and sources in separate tables
    • Binary format, e.g., HDF5 as already suggested, or serialization (e.g., JLD2) if it does not need to be portable
    • …

Thanks! Human-readable was an initial goal but not required.

HDF5 looks like it could be an answer for me.

JSON or YAML was my first thought, because they allow me to “collect” different kinds of data using an existing human-readable standard. Are either of these performant in Julia when loading large arrays into memory?

Both of those are text-based, so they are likely to be way slower than a binary format like HDF5. (A single HDF5 file can also contain multiple datasets and metadata. Another option is to put your human-readable metadata into a JSON file, and put the name of an accompanying HDF5 file with the big dataset in a field in the JSON file.)
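The JSON-plus-HDF5 split could look roughly like the following. This is only a sketch: it assumes the JSON.jl and HDF5.jl packages are installed, and the file names and the `description`, `data_file`, and `dataset` keys are invented for the example.

```julia
using HDF5, JSON

dir = mktempdir()

# The big numeric data goes into the HDF5 file (one observation point per column).
pts = [0.0 0.0 1.0; 0.0 0.0 2.0; 0.0 1.0 3.0]
h5write(joinpath(dir, "points.h5"), "observation-pts", pts)

# The small human-readable part goes into JSON and points at the HDF5 file.
meta = Dict("description" => "sample run",
            "data_file"   => "points.h5",
            "dataset"     => "observation-pts")
meta_path = joinpath(dir, "run.json")
open(meta_path, "w") do io
    write(io, JSON.json(meta))
end

# Loading: parse the metadata first, then follow it to the data.
info = JSON.parsefile(meta_path)
data_path = joinpath(dirname(meta_path), info["data_file"])
loaded = h5read(data_path, info["dataset"])
```

Resolving `data_file` relative to the JSON file keeps the pair movable as a unit.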

I would avoid YAML if at all possible: it is overly complex (for example, it has 6+ ways to write a string), and the current YAML.jl library has several bugs. See Issues · JuliaData/YAML.jl · GitHub.

Here’s a demonstration for HDF5.

julia> using HDF5

julia> struct source_type_2
           x0::Int64
           y0::Int64
           z0::Int64
           r::Int64
           b::Int64
       end

julia> h5open("points.h5", "w") do h5f
           obs_pts = create_group(h5f, "observation-pts")
           obs_pts["number"] = [1,2,3]
           obs_pts["x"] = [0,0,1]
           obs_pts["y"] = [0,0,2]
           obs_pts["z"] = [0,1,3]
           
           source_type_1 = create_group(h5f, "source-type-1")
           source_type_1["x0"] = [0,0]
           source_type_1["y0"] = [0,1]
           source_type_1["z0"] = [0,2]
           source_type_1["x1"] = [1,4]
           source_type_1["y1"] = [1,5]
           source_type_1["z1"] = [1,6]
           source_type_1["a"] = [1000, 2000]
           
           h5f["source-type-2"] = [source_type_2(10,20,30,50,4000)]
           nothing
       end

julia> h5f = h5open("points.h5")
πŸ—‚οΈ HDF5.File: (read-only) points.h5
β”œβ”€ πŸ“‚ observation-pts
β”‚  β”œβ”€ πŸ”’ number
β”‚  β”œβ”€ πŸ”’ x
β”‚  β”œβ”€ πŸ”’ y
β”‚  └─ πŸ”’ z
β”œβ”€ πŸ“‚ source-type-1
β”‚  β”œβ”€ πŸ”’ a
β”‚  β”œβ”€ πŸ”’ x0
β”‚  β”œβ”€ πŸ”’ x1
β”‚  β”œβ”€ πŸ”’ y0
β”‚  β”œβ”€ πŸ”’ y1
β”‚  β”œβ”€ πŸ”’ z0
β”‚  └─ πŸ”’ z1
└─ πŸ”’ source-type-2

julia> h5f["observation-pts"]
📂 HDF5.Group: /observation-pts (file: points.h5)
├─ 🔢 number
├─ 🔢 x
├─ 🔢 y
└─ 🔢 z

julia> h5f["observation-pts"]["x"][]
3-element Vector{Int64}:
 0
 0
 1

julia> h5f["source-type-1"]["a"][]
2-element Vector{Int64}:
 1000
 2000

julia> h5f["source-type-2"][1]
(x0 = 10, y0 = 20, z0 = 30, r = 50, b = 4000)

julia> read(h5f["source-type-2"])
1-element Vector{@NamedTuple{x0::Int64, y0::Int64, z0::Int64, r::Int64, b::Int64}}:
 (x0 = 10, y0 = 20, z0 = 30, r = 50, b = 4000)

julia> read(h5f["source-type-2"], source_type_2)
1-element Vector{source_type_2}:
 source_type_2(10, 20, 30, 50, 4000)

julia> read(h5f["source-type-2"], source_type_2)[1]
source_type_2(10, 20, 30, 50, 4000)

You may also want to consider JLD.jl or JLD2.jl.
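A minimal JLD2 round trip might look like this, assuming JLD2.jl is installed. The variable names and the matrix layout are placeholders chosen for the example. As noted earlier in the thread, this is closer to Julia serialization than to a portable interchange format.

```julia
using JLD2

obs  = [0.0 0.0 1.0; 0.0 0.0 2.0; 0.0 1.0 3.0]   # one observation point per column
src2 = [10.0, 20.0, 30.0, 50.0, 4000.0]          # a single source of type 2

path = joinpath(mktempdir(), "points.jld2")
jldsave(path; obs, src2)   # each keyword becomes a named entry in the file

# Read the entries back by name.
obs_back, src2_back = jldopen(path, "r") do f
    f["obs"], f["src2"]
end
```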