Maximizing Input File Read Performance

freestatelabs · July 14, 2024, 3:33pm

Hi everyone,

I’m writing a simple simulation program, and trying to figure out the best way of designing a format for my input files. The simulation calculates the effect of multiple types of sources (each defined in a different manner) on a set of observation points in 3D space. The number of sources and the number of observation points will be very large. Not all types of sources may be present in every input file. The data are all Float64.

The naive implementation I came up with is as follows:

Use a CSV file with comment lines as ‘flags’ that tell the program the type of the data rows that follow
Use CSV.jl to read the data line by line and react to the flags as required

For example, with just a few sample data lines:

# observation-pts (number, x, y, z)
1,0,0,0
2,0,0,1
3,1,2,3
# source-type-1 (x0,y0,z0,x1,y1,z1,a)
0,0,0,1,1,1,1000
0,1,2,4,5,6,2000
# source-type-2 (x0,y0,z0,r,b)
10,20,30,50,4000
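
As a rough illustration of the approach described above, a line-by-line reader might look like the sketch below. It uses only Base Julia rather than CSV.jl, `read_flagged` is a hypothetical helper name, and it assumes every data row is preceded by a `#` flag line whose first word names the section.

```julia
# Minimal sketch of a flag-driven, line-by-line reader (Base Julia only).
# `read_flagged` is a hypothetical helper, not part of CSV.jl.
function read_flagged(io::IO)
    sections = Dict{String,Vector{Vector{Float64}}}()
    current = ""                          # name of the section being filled
    for raw in eachline(io)
        line = strip(raw)
        isempty(line) && continue
        if startswith(line, "#")
            # "# source-type-1 (x0,...)" -> "source-type-1"
            current = String(first(split(lstrip(line, ['#', ' ']))))
            sections[current] = Vector{Float64}[]
        else
            # Assumes a flag line has already been seen for this row.
            push!(sections[current], parse.(Float64, split(line, ',')))
        end
    end
    return sections
end

sample = """
# observation-pts (number, x, y, z)
1,0,0,0
2,0,0,1
3,1,2,3
# source-type-1 (x0,y0,z0,x1,y1,z1,a)
0,0,0,1,1,1,1000
0,1,2,4,5,6,2000
# source-type-2 (x0,y0,z0,r,b)
10,20,30,50,4000
"""

data = read_flagged(IOBuffer(sample))
```

Every value goes through text parsing and lands in a small per-row vector, which is the overhead in question.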

My concern is that with very large quantities of data, reading line-by-line is a very inefficient method of accessing the data from the file and loading it into memory.

Are there standard/typical methods for handling data files like this?

Thanks in advance!

stevengj · July 14, 2024, 4:55pm

Use a binary format. e.g. HDF5.jl, or even a custom binary format via read and write (or memory-mapped arrays).
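
A minimal sketch of the custom-binary and memory-mapped options, using only `read`/`write` and the `Mmap` standard library. The file layout here (two `Int64` sizes followed by the raw `Float64` values) and the helper names `write_block`, `read_block`, and `mmap_block` are assumptions made for illustration, not an established format.

```julia
using Mmap  # standard library

# Assumed layout (hypothetical): Int64 row count, Int64 column count,
# then the Float64 values in Julia's column-major order.
function write_block(path, A::Matrix{Float64})
    open(path, "w") do io
        write(io, Int64(size(A, 1)), Int64(size(A, 2)))
        write(io, A)
    end
end

# Read the whole matrix into memory with a single bulk read.
function read_block(path)
    open(path, "r") do io
        nrows = read(io, Int64)
        ncols = read(io, Int64)
        read!(io, Matrix{Float64}(undef, nrows, ncols))
    end
end

# Map the data instead of copying it; pages are loaded on access.
function mmap_block(path)
    open(path, "r") do io
        nrows = read(io, Int64)
        ncols = read(io, Int64)
        # The mapping starts at the current position, just past the header.
        Mmap.mmap(io, Matrix{Float64}, (nrows, ncols))
    end
end

path = joinpath(mktempdir(), "obs.bin")
A = [0.0 0.0 0.0; 0.0 0.0 1.0; 1.0 2.0 3.0]
write_block(path, A)
B = read_block(path)
M = mmap_block(path)
```

A raw layout like this is tied to the byte order of the machine that wrote it, which is one reason to prefer HDF5 when files need to move between systems.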

bertschi · July 14, 2024, 5:07pm

Adding to the previous answer:

What are your requirements, e.g., human-readable, portable, etc.?
Don’t invent your own data format, but use an existing one that is flexible enough to cover your data needs:
- JSON might be an option if it needs to be human readable
- SQL(ite) DB with observations and sources in separate tables
- Binary format, e.g., HDF5 as already suggested, or serialization (e.g., JLD2) if it does not need to be portable
- …
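
For the serialization option, here is a small sketch using the `Serialization` standard library rather than JLD2. The `SimInput` struct and its field names are hypothetical, chosen only to mirror the sample data earlier in the thread.

```julia
using Serialization  # standard library

# Hypothetical container for one input file; field names are illustrative.
struct SimInput
    obs_pts::Matrix{Float64}   # 3 x N, one column per observation point
    src1::Matrix{Float64}      # 7 x M, one column per type-1 source
end

input = SimInput([0.0 0.0 1.0; 0.0 0.0 2.0; 0.0 1.0 3.0],
                 [0.0 0.0; 0.0 1.0; 0.0 2.0; 1.0 4.0; 1.0 5.0; 1.0 6.0; 1000.0 2000.0])

path = joinpath(mktempdir(), "input.jls")
serialize(path, input)          # write the whole object in one call
restored = deserialize(path)    # read it back with its original type
```

The serialized format is not guaranteed to stay readable across Julia versions, so this fits caches and short-lived files better than long-term storage.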

freestatelabs · July 16, 2024, 7:01pm

Thanks! Human-readable was an initial goal but not required.

HDF5 looks like it could be an answer for me.

JSON or YAML was my first thought, because they allow me to “collect” different kinds of data using an existing human-readable standard. Is either of these performant in Julia when loading large arrays into memory?

stevengj · July 16, 2024, 7:25pm

Both of those are text-based, so they are likely to be way slower than a binary format like HDF5. (A single HDF5 file can also contain multiple datasets and metadata. Another option is to put your human-readable metadata into a JSON file, and put the name of an accompanying HDF5 file with the big dataset in a field in the JSON file.)
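
To illustrate the point about one HDF5 file holding several datasets plus metadata, here is a hedged sketch with HDF5.jl. Storing each record type as a `Float64` matrix with one record per column, and keeping the field names in a `"fields"` attribute, are assumptions made for this example rather than anything the format requires.

```julia
using HDF5

path = joinpath(mktempdir(), "sim.h5")

# One Float64 matrix per record type, one record per column (assumed layout).
obs = [0.0 0.0 1.0;
       0.0 0.0 2.0;
       0.0 1.0 3.0]                                  # rows are x, y, z
src2 = reshape([10.0, 20.0, 30.0, 50.0, 4000.0], 5, 1)

h5open(path, "w") do f
    f["observation-pts"] = obs
    f["source-type-2"] = src2
    # Metadata is stored next to the data as HDF5 attributes.
    attributes(f["observation-pts"])["fields"] = "x,y,z"
    attributes(f["source-type-2"])["fields"] = "x0,y0,z0,r,b"
end

fields, pts = h5open(path, "r") do f
    d = f["observation-pts"]
    read_attribute(d, "fields"), read(d)
end
```

A source type that is absent from a given input file is simply a dataset that is not present, which can be checked with `haskey(f, "source-type-1")` before reading.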

nhz2 · July 16, 2024, 7:56pm

I would avoid YAML if at all possible: it is overly complex (for example, it has 6+ ways to write a string), and the current YAML.jl library has several bugs. See the JuliaData/YAML.jl issue tracker on GitHub.

mkitti · July 16, 2024, 8:14pm

Here’s a demonstration for HDF5.

julia> using HDF5

julia> struct source_type_2
           x0::Int64
           y0::Int64
           z0::Int64
           r::Int64
           b::Int64
       end

julia> h5open("points.h5", "w") do h5f
           obs_pts = create_group(h5f, "observation-pts")
           obs_pts["number"] = [1,2,3]
           obs_pts["x"] = [0,0,1]
           obs_pts["y"] = [0,0,2]
           obs_pts["z"] = [0,1,3]
           
           source_type_1 = create_group(h5f, "source-type-1")
           source_type_1["x0"] = [0,0]
           source_type_1["y0"] = [0,1]
           source_type_1["z0"] = [0,2]
           source_type_1["x1"] = [1,4]
           source_type_1["y1"] = [1,5]
           source_type_1["z1"] = [1,6]
           source_type_1["a"] = [1000, 2000]
           
           h5f["source-type-2"] = [source_type_2(10,20,30,50,4000)]
           nothing
       end

julia> h5f = h5open("points.h5")
🗂️ HDF5.File: (read-only) points.h5
├─ 📂 observation-pts
│  ├─ 🔢 number
│  ├─ 🔢 x
│  ├─ 🔢 y
│  └─ 🔢 z
├─ 📂 source-type-1
│  ├─ 🔢 a
│  ├─ 🔢 x0
│  ├─ 🔢 x1
│  ├─ 🔢 y0
│  ├─ 🔢 y1
│  ├─ 🔢 z0
│  └─ 🔢 z1
└─ 🔢 source-type-2

julia> h5f["observation-pts"]
📂 HDF5.Group: /observation-pts (file: points.h5)
├─ 🔢 number
├─ 🔢 x
├─ 🔢 y
└─ 🔢 z

julia> h5f["observation-pts"]["x"][]
3-element Vector{Int64}:
 0
 0
 1

julia> h5f["source-type-1"]["a"][]
2-element Vector{Int64}:
 1000
 2000

julia> h5f["source-type-2"][1]
(x0 = 10, y0 = 20, z0 = 30, r = 50, b = 4000)

julia> read(h5f["source-type-2"])
1-element Vector{@NamedTuple{x0::Int64, y0::Int64, z0::Int64, r::Int64, b::Int64}}:
 (x0 = 10, y0 = 20, z0 = 30, r = 50, b = 4000)

julia> read(h5f["source-type-2"], source_type_2)
1-element Vector{source_type_2}:
 source_type_2(10, 20, 30, 50, 4000)

julia> read(h5f["source-type-2"], source_type_2)[1]
source_type_2(10, 20, 30, 50, 4000)

You may also want to consider JLD.jl or JLD2.jl.
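
For comparison, a short sketch of the same idea with JLD2.jl. The variable names `obs_x` and `src1_a` are only illustrative.

```julia
using JLD2

path = joinpath(mktempdir(), "points.jld2")
obs_x = [0.0, 0.0, 1.0]
src1_a = [1000.0, 2000.0]

# Each keyword argument becomes a named dataset in the file.
jldsave(path; obs_x, src1_a)

# Open read-only and pull out a single dataset by name.
a = jldopen(path, "r") do f
    f["src1_a"]
end
```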

freestatelabs · September 13, 2024, 1:20pm

Thank you all for the suggestions!

I will be using HDF5 for my project.
