Optimisation Problems with HDF5 Data

Dear all,

this problem might be a bit too general, but I hope you can point me into the right direction to learn what’s going on behind the scenes and how to get out the best performance.

I am still a Julia newbie and did almost everything science related with Python/Cython/Numpy/Numba in the past years. I decided to switch, for obvious reasons.

So, I have a large dataset stored in an HDF5 file and another one which holds the information how the data needs to be split.
The main data is basically a stream of hits in a neutrino detector and the second dataset stores the indices and lengths of the events (aka time slices). To read an event or a bunch of events, I need to look up the hit indices and pull out the corresponding hits, while all hits of an event are stored continuously.
I usually have to deal with something like 900mio hits and about 25000 events per file, roughly.

The final goal is to create a high level access to the data, so that analysis and reconstructions are doable event-by-event with a vector of Hits.

My very first baby step was defining a struct for the hits:

immutable Hit

And with kind help from the gitter channel (thanks again!), I got this function to be able to unpack the values of the matrices within a list comprehension:

rows(x) = (x[i, :] for i in indices(x,1))

And this is my attempt to create an array of Hit-vectors from a file:

function read_timeslices(filename, start, stop)
    f = h5open(filename, "r")
    ts_info = f["timeslice_info"][:, start:stop]'   # index and length of the time slices
    first = ts_info[1, 1] + 1                       # index of first hit
    last = ts_info[end, 1] + ts_info[end, 2]        # index of last hit
    data = f["timeslice_hits"][1:3, first:last]'    # raw hit data as matrix

    # this line creates 18mio allocations for 70000 hits
    hits = [Hit(r...) for r in rows(data)]          # create Hit vector

    # slice hits into events, 
    [hits[x+1-first+1:x+y - first+1] for (x, y) in rows(ts_info)]

The function is of course totally slow. It takes a few seconds to read out a handful events.

The crucial line is the one with the Hit-vector creation:

hits = [Hit(r...) for r in rows(data)]          # create Hit vector

Without that line and the following one, the readout is super fast (100seconds for 25000 events, which is the same I get in C++).

Now I am of course aware of the fact that the lines does a copy by value and need to allocate each hit. But why are there 18mio allocations?

I tried to dig further and looked at the @code_warntype output, where I saw that many things are Any and I don’t understand why Julia does not infer the types correctly. See the output at the end.

When I read in the datasets manually, I get Array{Int32, 1} etc. but inside the function it seems to do no inference. So my guess is also that the hit-array is of course not an efficient Array{Hit,1}.

I would really appreciate if anyone could give a hint where to start to solve this. Thanks in advance!

      f::HDF5.HDF5File = $(Expr(:invoke, LambdaInfo for h5open(::String, ::String), :(Main.h5open), :(filename), "r")) # line 3:
      SSAValue(0) = $(Expr(:invoke, LambdaInfo for o_open(::HDF5.HDF5File, ::String), :(HDF5.o_open), :(f), "timeslice_info"))
      unless (Core.isa)(SSAValue(0),HDF5.HDF5Dataset)::Any goto 7
      #temp#@_12::LambdaInfo = LambdaInfo for getindex(::HDF5.HDF5Dataset, ::Colon, ::UnitRange{Int64})
      goto 12
      goto 9
      #temp#@_13::Any = (Main.getindex)(SSAValue(0),Main.:,$(Expr(:new, UnitRange{Int64}, :(start), :((Base.select_value)((Base.sle_int)(start,stop)::Bool,stop,(Base.box)(Int64,(Base.sub_int)(start,1)))::Int64))))::Any
      goto 14
      #temp#@_13::Any = $(Expr(:invoke, :(#temp#@_12), :(Main.getindex), SSAValue(0), :(Main.:), :($(Expr(:new, UnitRange{Int64}, :(start), :((Base.select_value)((Base.sle_int)(start,stop)::Bool,stop,(Base.box)(Int64,(Base.sub_int)(start,1)))::Int64))))))
      SSAValue(1) = #temp#@_13::Any
      ts_info::Any = (Main.ctranspose)(SSAValue(1))::Any # line 5:
      first::Any = ((Main.getindex)(ts_info::Any,1,1)::Any + 1)::Any # line 7:
      last::Any = ((Main.getindex)(ts_info::Any,(Base.size)(ts_info::Any,1)::Any,1)::Any + (Main.getindex)(ts_info::Any,(Base.size)(ts_info::Any,1)::Any,2)::Any)::Any # line 9:
      SSAValue(2) = $(Expr(:invoke, LambdaInfo for o_open(::HDF5.HDF5File, ::String), :(HDF5.o_open), :(f), "timeslice_hits"))
      SSAValue(3) = (Main.getindex)(SSAValue(2),$(Expr(:new, UnitRange{Int64}, 1, :((Base.select_value)((Base.sle_int)(1,3)::Bool,3,(Base.box)(Int64,(Base.sub_int)(1,1)))::Int64))),(Main.colon)(first::Any,last::Any)::Any)::Any
      data::Any = (Main.ctranspose)(SSAValue(3))::Any # line 10:
      hits::Any = (Main.parse_hits)(data::Any)::Any # line 11:
      #83::##83#84{first,hits} = $(Expr(:new, :((Core.apply_type)(Main.##83#84,(Core.typeof)(first)::DataType,(Core.typeof)(hits)::DataType)::Type{_<:##83#84}), :(first), :(hits)))
      SSAValue(4) = #83::##83#84{first,hits}
      # meta: location In[4] rows 6
      #1::##1#2{x} = $(Expr(:new, :((Core.apply_type)(Main.##1#2,(Core.typeof)(ts_info)::DataType)::Type{_<:##1#2}), :(ts_info)))
      SSAValue(7) = #1::##1#2{x}
      SSAValue(8) = (Main.indices)(ts_info::Any,1)::Any
      # meta: pop location
      SSAValue(5) = (Base.Generator)(SSAValue(7),SSAValue(8))::Base.Generator{I,F}
      SSAValue(6) = (Base.Generator)(SSAValue(4),SSAValue(5))::Base.Generator{I,F}
      return (Base.collect)(SSAValue(6))::Any

Your immutable is all Any. You need to strictly type this if you want performance.

immutable Hit{T1,T2,T3}

Ok thanks so far, but why are ts_info or data also of type Any? When I try that on the REPL I get correct array types :confused:

There is no way to know what type the variable will be since you are just getting it wíth a string. You can either type assert it by adding something like ::Int, or you can use what is called a function barrier where you extract your data in one function and then you call the function that does all computations with the extracted data. Inside the function that does the computations, the types will then be known.

1 Like

Ah, I did try that function barrier, I forgot to mention. I created this function:

parse_hits(data) = [Hit(r...) for r in rows(data)]

and then called it within read_timeslices as:

hits = parse_hits(data)

but it didn’t change the types. The @code_warntype above is actually from that version, as you can see, sorry for the confusion.

So instead, I should create a function barrier for retrieving data if I understand correctly? And of course fix the types in Hit