# Optimisation Problems with HDF5 Data

Dear all,

this problem might be a bit too general, but I hope you can point me into the right direction to learn what’s going on behind the scenes and how to get out the best performance.

I am still a Julia newbie and did almost everything science related with Python/Cython/Numpy/Numba in the past years. I decided to switch, for obvious reasons.

So, I have a large dataset stored in an HDF5 file and another one which holds the information how the data needs to be split.
The main data is basically a stream of hits in a neutrino detector and the second dataset stores the indices and lengths of the events (aka time slices). To read an event or a bunch of events, I need to look up the hit indices and pull out the corresponding hits, while all hits of an event are stored continuously.
I usually have to deal with something like 900mio hits and about 25000 events per file, roughly.

The final goal is to create a high level access to the data, so that analysis and reconstructions are doable event-by-event with a vector of `Hits`.

My very first baby step was defining a struct for the hits:

``````immutable Hit
channel_id
time
tot
end
``````

And with kind help from the gitter channel (thanks again!), I got this function to be able to unpack the values of the matrices within a list comprehension:

`rows(x) = (x[i, :] for i in indices(x,1))`

And this is my attempt to create an array of Hit-vectors from a file:

``````function read_timeslices(filename, start, stop)
f = h5open(filename, "r")
ts_info = f["timeslice_info"][:, start:stop]'   # index and length of the time slices
first = ts_info[1, 1] + 1                       # index of first hit
last = ts_info[end, 1] + ts_info[end, 2]        # index of last hit
data = f["timeslice_hits"][1:3, first:last]'    # raw hit data as matrix

# this line creates 18mio allocations for 70000 hits
hits = [Hit(r...) for r in rows(data)]          # create Hit vector

# slice hits into events,
[hits[x+1-first+1:x+y - first+1] for (x, y) in rows(ts_info)]
end
``````

The function is of course totally slow. It takes a few seconds to read out a handful events.

The crucial line is the one with the Hit-vector creation:

``````hits = [Hit(r...) for r in rows(data)]          # create Hit vector
``````

Without that line and the following one, the readout is super fast (100seconds for 25000 events, which is the same I get in C++).

Now I am of course aware of the fact that the lines does a copy by value and need to allocate each hit. But why are there 18mio allocations?

I tried to dig further and looked at the `@code_warntype` output, where I saw that many things are `Any` and I don’t understand why Julia does not infer the types correctly. See the output at the end.

When I read in the datasets manually, I get `Array{Int32, 1}` etc. but inside the function it seems to do no inference. So my guess is also that the hit-array is of course not an efficient `Array{Hit,1}`.

I would really appreciate if anyone could give a hint where to start to solve this. Thanks in advance!

``````Variables:
filename::String
start::Int64
stop::Int64
f::HDF5.HDF5File
ts_info::Any
first::Any
last::Any
data::Any
hits::Any
#83::##83#84{first,hits}
#temp#@_12::LambdaInfo
#temp#@_13::Any
#temp#@_14::LambdaInfo
#1::##1#2{x}
Body:
begin
f::HDF5.HDF5File = \$(Expr(:invoke, LambdaInfo for h5open(::String, ::String), :(Main.h5open), :(filename), "r")) # line 3:
SSAValue(0) = \$(Expr(:invoke, LambdaInfo for o_open(::HDF5.HDF5File, ::String), :(HDF5.o_open), :(f), "timeslice_info"))
unless (Core.isa)(SSAValue(0),HDF5.HDF5Dataset)::Any goto 7
#temp#@_12::LambdaInfo = LambdaInfo for getindex(::HDF5.HDF5Dataset, ::Colon, ::UnitRange{Int64})
goto 12
7:
goto 9
9:
#temp#@_13::Any = (Main.getindex)(SSAValue(0),Main.:,\$(Expr(:new, UnitRange{Int64}, :(start), :((Base.select_value)((Base.sle_int)(start,stop)::Bool,stop,(Base.box)(Int64,(Base.sub_int)(start,1)))::Int64))))::Any
goto 14
12:
#temp#@_13::Any = \$(Expr(:invoke, :(#temp#@_12), :(Main.getindex), SSAValue(0), :(Main.:), :(\$(Expr(:new, UnitRange{Int64}, :(start), :((Base.select_value)((Base.sle_int)(start,stop)::Bool,stop,(Base.box)(Int64,(Base.sub_int)(start,1)))::Int64))))))
14:
SSAValue(1) = #temp#@_13::Any
ts_info::Any = (Main.ctranspose)(SSAValue(1))::Any # line 5:
first::Any = ((Main.getindex)(ts_info::Any,1,1)::Any + 1)::Any # line 7:
last::Any = ((Main.getindex)(ts_info::Any,(Base.size)(ts_info::Any,1)::Any,1)::Any + (Main.getindex)(ts_info::Any,(Base.size)(ts_info::Any,1)::Any,2)::Any)::Any # line 9:
SSAValue(2) = \$(Expr(:invoke, LambdaInfo for o_open(::HDF5.HDF5File, ::String), :(HDF5.o_open), :(f), "timeslice_hits"))
SSAValue(3) = (Main.getindex)(SSAValue(2),\$(Expr(:new, UnitRange{Int64}, 1, :((Base.select_value)((Base.sle_int)(1,3)::Bool,3,(Base.box)(Int64,(Base.sub_int)(1,1)))::Int64))),(Main.colon)(first::Any,last::Any)::Any)::Any
data::Any = (Main.ctranspose)(SSAValue(3))::Any # line 10:
hits::Any = (Main.parse_hits)(data::Any)::Any # line 11:
#83::##83#84{first,hits} = \$(Expr(:new, :((Core.apply_type)(Main.##83#84,(Core.typeof)(first)::DataType,(Core.typeof)(hits)::DataType)::Type{_<:##83#84}), :(first), :(hits)))
SSAValue(4) = #83::##83#84{first,hits}
# meta: location In[4] rows 6
#1::##1#2{x} = \$(Expr(:new, :((Core.apply_type)(Main.##1#2,(Core.typeof)(ts_info)::DataType)::Type{_<:##1#2}), :(ts_info)))
SSAValue(7) = #1::##1#2{x}
SSAValue(8) = (Main.indices)(ts_info::Any,1)::Any
# meta: pop location
SSAValue(5) = (Base.Generator)(SSAValue(7),SSAValue(8))::Base.Generator{I,F}
SSAValue(6) = (Base.Generator)(SSAValue(4),SSAValue(5))::Base.Generator{I,F}
return (Base.collect)(SSAValue(6))::Any
end::Any
``````

Your immutable is all `Any`. You need to strictly type this if you want performance.

``````immutable Hit{T1,T2,T3}
channel_id::T1
time::T2
tot::T3
end
``````
2 Likes

Ok thanks so far, but why are `ts_info` or `data` also of type `Any`? When I try that on the REPL I get correct array types

There is no way to know what type the variable will be since you are just getting it wíth a string. You can either type assert it by adding something like `::Int`, or you can use what is called a function barrier where you extract your data in one function and then you call the function that does all computations with the extracted data. Inside the function that does the computations, the types will then be known.

1 Like

Ah, I did try that function barrier, I forgot to mention. I created this function:

``````parse_hits(data) = [Hit(r...) for r in rows(data)]
``````

and then called it within `read_timeslices` as:

``````hits = parse_hits(data)
``````

but it didn’t change the types. The `@code_warntype` above is actually from that version, as you can see, sorry for the confusion.

So instead, I should create a function barrier for retrieving `data` if I understand correctly? And of course fix the types in `Hit`