HDF5 speed?

I have a long-running simulation that logs to an HDF5 file on each sample. I preallocate space, and write, say, a vector to index 1 on the first sample, then to index 2 on the next sample, etc. I’m finding that logging doubles my simulation’s runtime. That’s not unexpected, but I also wonder if that’s necessary. Using the Profiler, it seems my simulation is spending a ton of time in HDF5.jl:1748; setindex!. I’m hoping we can find a way to make this faster, ideally by changing how I’m logging, or potentially by finding something that we can do to HDF5.jl itself.

Here’s how I’m calling it:

    colons = (Colon() for i in 1:num_dims) # patterned after EllipsisNotation.jl
    dataset[colons..., index] = data # write to, e.g., dataset[:, :, 100]

Note that data can have any number of dimensions (I might use it for scalars or vectors or matrices, etc.), but it will always have the same dimensions from sample to sample.
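To make the indexing pattern concrete, here is a minimal sketch that uses plain Julia arrays as stand-ins for the HDF5 datasets. The helper name `log_sample!` is hypothetical, and `ntuple` is used in place of the generator only for illustration; with a real dataset the same line would go through HDF5.jl's `setindex!`.

```julia
# Write one sample into slot `index` along the last dimension of `dest`.
# `dest` is a plain Array here, standing in for a preallocated dataset.
function log_sample!(dest::AbstractArray, data, index::Integer)
    # One Colon per dimension of the sample: () for a scalar,
    # (:,) for a vector, (:, :) for a matrix.
    colons = ntuple(_ -> Colon(), ndims(data))
    dest[colons..., index] = data
    return dest
end

scalars  = zeros(5)        # 5 scalar samples
vectors  = zeros(3, 5)     # 5 samples of a length-3 vector
matrices = zeros(2, 2, 5)  # 5 samples of a 2x2 matrix

log_sample!(scalars, 1.5, 2)                    # scalars[2] = 1.5
log_sample!(vectors, [1.0, 2.0, 3.0], 2)        # vectors[:, 2] = ...
log_sample!(matrices, [1.0 2.0; 3.0 4.0], 2)    # matrices[:, :, 2] = ...
```

The sample index is always the last dimension, so each sample occupies a contiguous block in Julia's column-major layout.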

Here are the results of the Profiler for one instance:

              171 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1748; setindex!(::HDF5.HDF5Dataset, ::Array{Float64,1}, ::Colon, ::Int64)
               26  .\tuple.jl:108; ntuple(::HDF5.##16#17{HDF5.HDF5Dataset,Tuple{Colon,Int64}}, ::Int64)
                25 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1748; #16
                 25 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1045; size(::HDF5.HDF5Dataset, ::Int64)
                  14 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1047; ndims(::HDF5.HDF5Dataset)
                   3  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1041; size(::HDF5.HDF5Dataset)
                    2 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2220; h5s_get_simple_extent_dims(::Int32)
                   11 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1043; size(::HDF5.HDF5Dataset)
                    1 .\essentials.jl:0; cnvt_all(::Type{T} where T, ::UInt64, ::UInt64, ::Vararg{UInt64,N} where N)
                    3 .\essentials.jl:35; cnvt_all(::Type{T} where T, ::UInt64, ::UInt64, ::Vararg{UInt64,N} where N)
                  1  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1040; size(::HDF5.HDF5Dataset)
                   1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1162; dataspace(::HDF5.HDF5Dataset)
                    1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2152; h5d_get_space
                  2  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1041; size(::HDF5.HDF5Dataset)
                   1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2218; h5s_get_simple_extent_dims(::Int32)
                   1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2220; h5s_get_simple_extent_dims(::Int32)
                  1  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1042; size(::HDF5.HDF5Dataset)
                   1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:791; close(::HDF5.HDF5Dataspace)
                    1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2053; h5s_close
                  7  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1043; size(::HDF5.HDF5Dataset)
                   1 .\essentials.jl:0; cnvt_all(::Type{T} where T, ::UInt64, ::UInt64, ::Vararg{UInt64,N} where N)
                   3 .\essentials.jl:35; cnvt_all(::Type{T} where T, ::UInt64, ::UInt64, ::Vararg{UInt64,N} where N)
                    1 .\essentials.jl:35; cnvt_all(::Type{T} where T, ::UInt64)
               13  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1700; setindex!(::HDF5.HDF5Dataset, ::Array{Float64,1}, ::UnitRange{Int64}, ::Int64)
                6 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1816; hdf5_to_julia(::HDF5.HDF5Dataset)
                 3 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1853; hdf5_to_julia_eltype(::HDF5.HDF5Datatype)
                  3 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2152; h5t_get_native_type(::Int32, ::Int64)
                 3 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1861; hdf5_to_julia_eltype(::HDF5.HDF5Datatype)
                  1 .\dict.jl:473; getindex
                   1 .\dict.jl:322; ht_keyindex(::Dict{Any,DataType}, ::Tuple{Int32,Void,UInt64})
                    1 .\dict.jl:210; hashindex
                     1 .\tuple.jl:296; hash(::Tuple{Int32,Void,UInt64}, ::UInt64)
                      1 .\hashing.jl:10; hash
                1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1820; hdf5_to_julia(::HDF5.HDF5Dataset)
                3 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1824; hdf5_to_julia(::HDF5.HDF5Dataset)
                 3 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1162; dataspace(::HDF5.HDF5Dataset)
                  3 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:406; Type
                   3 .\base.jl:129; finalizer(::Any, ::Any)
                2 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1828; hdf5_to_julia(::HDF5.HDF5Dataset)
                1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1835; hdf5_to_julia(::HDF5.HDF5Dataset)
                 1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:791; close(::HDF5.HDF5Dataspace)
                  1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2053; h5s_close
               130 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1701; setindex!(::HDF5.HDF5Dataset, ::Array{Float64,1}, ::UnitRange{Int64}, ::Int64)
                2  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1704; _setindex!(::HDF5.HDF5Dataset, ::Type{T} where T, ::Array{Float64,1}, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                2  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1707; _setindex!(::HDF5.HDF5Dataset, ::Type{T} where T, ::Array{Float64,1}, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                1  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1711; _setindex!(::HDF5.HDF5Dataset, ::Type{T} where T, ::Array{Float64,1}, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                1  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1714; _setindex!(::HDF5.HDF5Dataset, ::Type{T} where T, ::Array{Float64,1}, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                 1 .\operators.jl:107; !=(::Type{T} where T, ::Type{T} where T)
                17 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1717; _setindex!(::HDF5.HDF5Dataset, ::Type{T} where T, ::Array{Float64,1}, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                 3 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1752; hyperslab(::HDF5.HDF5Dataset, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                  1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:0; dataspace(::HDF5.HDF5Dataset)
                  2 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1162; dataspace(::HDF5.HDF5Dataset)
                   2 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2152; h5d_get_space
                 8 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1754; hyperslab(::HDF5.HDF5Dataset, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                  8 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2220; h5s_get_simple_extent_dims(::Int32)
                 1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1755; hyperslab(::HDF5.HDF5Dataset, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                 1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1765; hyperslab(::HDF5.HDF5Dataset, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                 1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1770; hyperslab(::HDF5.HDF5Dataset, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                 1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1784; hyperslab(::HDF5.HDF5Dataset, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                  1 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2053; h5s_select_hyperslab(::Int32, ::Int64, ::Array{UInt64,1}, ::Array{UInt64,1}, ::Array{UInt64,1}, ::Ptr{Void})
                1  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1718; _setindex!(::HDF5.HDF5Dataset, ::Type{T} where T, ::Array{Float64,1}, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                3  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1721; _setindex!(::HDF5.HDF5Dataset, ::Type{T} where T, ::Array{Float64,1}, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                 3 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2053; h5d_write(::Int32, ::Int32, ::Int32, ::Int32, ::Int64, ::Array{Float64,1})
                2  C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:1724; _setindex!(::HDF5.HDF5Dataset, ::Type{T} where T, ::Array{Float64,1}, ::UnitRange{Int64}, ::Vararg{Union{Int64, Range{Int64}},N} where N)
                 2 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:791; close(::HDF5.HDF5Dataspace)
                  2 C:\Users\Tucker\.julia\HDF5\src\HDF5.jl:2053; h5s_close

It seems a lot of time is spent inside this function of HDF5.jl, but not inside either of the two functions that it calls:

    1699  function setindex!(dset::HDF5Dataset, X::Array, indices::Union{AbstractRange{Int},Int}...)
    1700      T = hdf5_to_julia(dset)
    1701      _setindex!(dset, T, X, indices...)
    1702  end

I need to triple the amount of logging I’m doing in this simulation, and that would make for a very slow simulation. Does anyone see the culprit here? Am I doing something wrong, or is there a general improvement we could make in HDF5.jl?

HDF5 is really slow at small writes. I think it is the library, not the Julia wrapper, but I’m not 100% sure. Your profiling does suggest a Julia issue, but I’m not expert enough to be sure. I guess there is a large overhead per write. How much data are you writing per write? For my own use I made a BufferedHDF5Writer that collects data in a Vector and makes a single write per second, which solved the problem for me. Here is the code in a gist.
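For readers without the gist, the buffering idea can be sketched roughly as follows. This is not the BufferedHDF5Writer code: the names `BufferedLogger`, `buffered_logger`, `push_sample!`, and `flush!` are hypothetical, it flushes after a fixed number of samples rather than once per second, and the sink writes to a plain Array standing in for the dataset.

```julia
# Hypothetical buffered logger: samples accumulate in memory and are handed
# to `sink` one block at a time, so the costly write happens once per block.
mutable struct BufferedLogger{T,F}
    buffer::Matrix{T}  # one column per pending sample
    count::Int         # number of columns currently filled
    next_index::Int    # destination index of the first pending sample
    sink::F            # called as sink(first_index, block)
end

buffered_logger(::Type{T}, sink, sample_len, capacity) where {T} =
    BufferedLogger(Matrix{T}(undef, sample_len, capacity), 0, 1, sink)

# Hand all pending samples to the sink in a single call.
function flush!(logger::BufferedLogger)
    if logger.count > 0
        logger.sink(logger.next_index, logger.buffer[:, 1:logger.count])
        logger.next_index += logger.count
        logger.count = 0
    end
    return logger
end

# Store one sample; flush automatically when the buffer is full.
function push_sample!(logger::BufferedLogger, sample::AbstractVector)
    logger.count += 1
    logger.buffer[:, logger.count] = sample
    logger.count == size(logger.buffer, 2) && flush!(logger)
    return logger
end

# Usage: `dest` stands in for the preallocated dataset (assumption).
dest = zeros(3, 10)
nwrites = Ref(0)
sink = (i0, block) -> begin
    dest[:, i0:i0 + size(block, 2) - 1] = block
    nwrites[] += 1
end

logger = buffered_logger(Float64, sink, 3, 4)
for k in 1:10
    push_sample!(logger, fill(Float64(k), 3))
end
flush!(logger)  # write whatever is still pending at the end of the run
```

With a capacity of 4, the ten samples reach the sink in three block writes instead of ten single-sample writes. The trade-off is that samples still in the buffer are lost if the simulation crashes before a flush.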

I think you can also mmap HDF5 datasets of fixed size, which is probably a better solution. My case didn’t have a fixed size so I didn’t look into it.

I use mmap and this solved the issue for me. The interesting bit is that the mmapped object is actually just a plain Array, so writing to it has zero HDF5 overhead.
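To illustrate the "just a plain Array" point, here is a small sketch using Julia's standard-library Mmap on a raw binary scratch file. It deliberately does not use HDF5: the file name, the sizes, and the raw-file layout are assumptions for the demo, and mapping an actual HDF5 dataset would go through HDF5.jl's own memory-mapping support instead.

```julia
using Mmap

num_samples = 100
path = tempname()   # hypothetical scratch file standing in for the log file

io = open(path, "w+")
# Map a 3 x num_samples Float64 matrix onto the file; the file is grown to fit.
samples = Mmap.mmap(io, Matrix{Float64}, (3, num_samples))

# Logging a sample is now an ordinary array assignment, with no library call.
for k in 1:num_samples
    samples[:, k] .= k
end

Mmap.sync!(samples)  # ask the OS to flush the mapped pages to disk
close(io)

# Read the file back through normal I/O to confirm the data landed on disk.
stored = read!(path, Matrix{Float64}(undef, 3, num_samples))
```

Because the mapped object is an ordinary `Matrix{Float64}`, each per-sample write costs about the same as writing to an in-memory array, and the operating system decides when the pages are written out.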

I just wanted to cross-reference a related discussion for future reference: Appending an element to a JLD2 file