Storing and accessing large jagged arrays with Julia

All ~50k high-energy-physics scientists still use ROOT since it’s best in class [page 4] regarding:

  1. compression ratio (an exabyte of tape is expensive to store and find space for)
  2. read speed (crucial for reducing bandwidth pressure when reading remotely on the LHC Computing Grid)
  3. arbitrary data structures
3 Likes

So, I decided to use JLD2, based on HDF5. Each delimited file gets translated into a vector of vectors, and that vector of vectors gets stored in the JLD2 format. This format is fast enough that reading and writing take about 10% of the total time for the simplest analyses I will do on my data.
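For concreteness, the translation step boils down to something like this minimal sketch, assuming purely numeric, whitespace-delimited rows (the file names and the "events" key are made up, and the real files also carry a string column, which this ignores):

using JLD2

# Parse each line of a delimited file into a Float64 vector; rows may have
# different lengths, so the result is a ragged Vector{Vector{Float64}}.
read_ragged(path) =
    [parse.(Float64, split(line; keepempty=false)) for line in eachline(path)]

events = read_ragged("events.csv")
jldopen("events.jld2", "w") do f       # store the whole ragged array under one key
    f["events"] = events
end
events2 = jldopen(f -> f["events"], "events.jld2", "r")   # read it back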

ROOT may well be faster; however, reading and writing files now accounts for only a small fraction of the total compute time, and I’d have to write a fair bit more code to write my objects to ROOT and read them back. If reading and writing becomes more of an issue, I may well consider switching.

As for the point about .jld2 files only being accessible from Julia, I agree. If other people want to do their own analyses, well, anybody can read the delimited .csv-style files. If you want to use my library to do the analysis, you’ll be bound to Julia anyway. I don’t see an issue here.

One thing to note is that JLD2 currently doesn’t support multithreaded/parallel access to files. I have to put a lock around every open and close of a file. This isn’t currently limiting performance, but it may in some circumstances.
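In case it helps anyone, the locking amounts to something like this sketch (the wrapper name is made up):

using JLD2

const jld_lock = ReentrantLock()   # one lock guarding all JLD2 file access

# Serialise every open/read/close so two tasks never touch a file at once.
function locked_read(path, key)
    lock(jld_lock) do
        jldopen(path, "r") do f
            f[key]
        end
    end
end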

1 Like

@Luapulu Would you please tell us about your experiment and what analysis you are performing?
I was a CERN PhD student in the LEP era.

I’m working on A/E cuts to distinguish single-site from multi-site events in germanium detectors, for background discrimination in the search for neutrinoless double beta decay.

So, lots of histograms over events and the simulation of the signal for each event.

If you’re curious: My code

Ok, having used JLD2 and JLD a bit now, I’ve actually gone back to just reading in the raw text and using that.

First, I had some issues with JLD2 where sometimes my data would become corrupted or somehow couldn’t be read. Maybe I did something wrong, maybe it’s a bug, who knows. When I switched to JLD everything worked, so I used that format for a bit. However, I like making changes to my code, and this is an issue since the way data is saved is intimately linked to implementation details in the JLD format. For example, if you rename an object in your code, you can no longer read old data using the old name, so you’re forced to load and save the data again using the new implementation. This gets tedious fast.

This issue could be solved by having some default representation to which you convert when saving and from which you convert when loading. The problem here is that I’d have to decide on some default representation (which I know I’m going to be tempted to change to get that x% performance improvement), and I don’t really want to spend my time benchmarking the three or four different ways I can think of to store and load ragged arrays. Also, if performance matters, well, converting all your data is going to cost some time. Another issue is the likelihood of introducing errors: while I do have unit tests and I try to be as thorough as I can, every time you load and save a file there’s some chance you mess up the file path, accidentally overwrite a file, etc.
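To be clear about what I mean by a default representation, it could look something like this sketch (types and names are hypothetical, using the usual flat-array-plus-offsets layout):

# Hypothetical stable on-disk schema for ragged arrays
struct StoredEvents
    flat::Vector{Float64}     # all samples concatenated
    offsets::Vector{Int}      # event i occupies flat[offsets[i]:offsets[i+1]-1]
end

to_stored(v::Vector{Vector{Float64}}) =
    StoredEvents(reduce(vcat, v), cumsum([1; length.(v)]))

from_stored(s::StoredEvents) =
    [s.flat[s.offsets[i]:s.offsets[i+1]-1] for i in 1:length(s.offsets)-1]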

So, at this point in the thought process I just decided to use the existing CSV-like file format. I optimised my parsing a bit more, so it’s a good bit faster now, and reading files is only a bottleneck for the fastest of analyses anyhow. The simulations I’m doing take orders of magnitude more time per file than reading the input files, for example. I still use JLD to save results, though I may change that too. Maybe use pure HDF5 so I’m not so bound to implementation details. :thinking:

In case you revisit the storage format at some point and want to give Zarr a try, because you need simpler parallel read/write or object-store support: we have now implemented the VLenArray filter for Julia. Here is a small MWE workflow with the latest Zarr version.

First we create some artificial ragged-array data and prepare an output dataset.

using Zarr
# Generate a random array-of-arrays with variable inner lengths
data = [rand(Float64, rand(2:20)) for i=1:100, j=1:80];
# Create the dataset on disk, picking a chunk size
s = (100, 80)
cs = (30, 25)
indata = zcreate(Vector{Float64}, s...,
    chunks = cs,
    compressor = Zarr.BloscCompressor(clevel=3),
    path = "myinputs.zarr"
)
# Write the data to disk
indata[:, :] = data;
# Create the output array; it is important here that the chunk sizes match
outdata = zcreate(Float64, s..., chunks = cs, compressor = Zarr.NoCompressor(), path = "myoutputs.zarr")

Then we launch a computation on multiple workers by mapping over the chunks of the data. Here we simply compute the mean of every vector in the array. No locking is necessary as long as you don’t write across chunk boundaries:

using Distributed
using DiskArrays: eachchunk
addprocs(3)
@everywhere using Zarr, Statistics
# Each worker opens the arrays itself and processes whole chunks only
pmap(eachchunk(indata)) do cinds
    zin = zopen("myinputs.zarr")
    zout = zopen("myoutputs.zarr", "w")
    zout[cinds] = mean.(zin[cinds])
    nothing
end
rmprocs(workers())

And you can access the results from Julia:

zopen("myoutputs.zarr")[1:2,1:2]
2×2 Array{Float64,2}:
 0.400068  0.576094
 0.507728  0.436327

or from Python, so you are not on a Julia island as with JLD:

using PyCall
py"""
import zarr
zin = zarr.open('myinputs.zarr')
zout = zarr.open('myoutputs.zarr')
"""
py"zin[0,0], zout[0,0]"
(PyObject array([0.18110972, 0.7273475 , 0.12056501, 0.44523791, 0.2457608 ,
       0.91214274, 0.16907038, 0.83184576, 0.22446035, 0.0884847 ,
       0.64998661, 0.20480489]), 0.400068030073534)
2 Likes

Maybe it’s too late now, but you could store your data in a different, more homogeneous way:

Add the three numbers to the left of each row:

40 181 5 -3.74677 1.8462 -200.582 0.03819 0 22 12 9 physiDet
40 181 5 -3.36715 2.31202 -201.09 0.17925 0 22 12 9 physiDet
40 181 5 -2.93906 2.24399 -200.797 0.03819 0 22 12 9 physiDet

Or, if memory is an issue, you can replace those three numbers with a reference to an external table:

1 -3.74677 1.8462 -200.582 0.03819 0 22 12 9 physiDet
1 -3.36715 2.31202 -201.09 0.17925 0 22 12 9 physiDet
1 -2.93906 2.24399 -200.797 0.03819 0 22 12 9 physiDet

You may even include the “physiDet” in that external table.
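For illustration, the normalised variant could be held in memory like this (purely a sketch; the table and column layout are made up):

# External table: one row per unique (num1, num2, num3, volume) combination
groups = [(40, 181, 5, "physiDet")]

# Hit data: the first column is the reference into `groups`
hits = [
    1 -3.74677 1.8462   -200.582 0.03819 0 22 12 9
    1 -3.36715 2.31202  -201.09  0.17925 0 22 12 9
    1 -2.93906 2.24399  -200.797 0.03819 0 22 12 9
]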

Yes, if I do end up needing to read the files faster, I’ll probably use either ROOT, or one big array of the main data with a separate array for those three extra numbers and an index into the big array, or even just one big array of everything, as you suggest. Stored that way, my file format no longer needs to be good at dealing with ragged arrays, so I can do whatever I want in terms of choosing a file format. I think somebody else suggested something similar as well.

Thanks, yes, that would be another choice. If reading speed becomes an issue I’ll likely try a few of the suggested methods and do some benchmarking. :smiley:

Some update,

Here’s a toy experiment we’re collaborating on.

For anyone actually in HEP: use UnROOT.jl if you have to deal with ROOT files (GitHub - JuliaHEP/UnROOT.jl: Native Julia I/O package to work with CERN ROOT files objects (TTree and RNTuple)); if you can control the format, just use Arrow.jl.

1 Like

You wrote in 2020:

This is likely still correct. I recently found out about Awkward Array (and watched an interesting YouTube video on it; Python is by now used by 50%+ in HEP, building on faster languages…):

Very much not a toy, and most HEP people use it, while the new Julia reimplementation is a toy, I guess, since it’s new. Do you think that will change, and would you even now use it (for jagged data) rather than ROOT? Or Julia with its Python equivalent above? I note you can use both together. Or would you rather recommend Arrow?

I could write an essay on this. (We did write a white paper on the topic of Julia x HEP, with the author (Jim) of Awkward Array also contributing perspectives.) The following is heavily biased towards experience at the LHC experiments; HEP is a large field that includes experiments like IceCube and KM3NeT, which move much faster than the LHC experiments.

A few things for context first. ROOT is used everywhere, and I mean not just the file format (.root)[1]: ROOT is an integral part of the O(10M)-line physics data processing pipelines (going from raw electronics signals to grad-student-readable physics data). And unfortunately, .root is strongly coupled to C++ implementation details in lots of places.

Where Awkward Array comes in is around the end of the whole pipeline: when you have physics data and you want to make some histograms for plots and statistical fits. This is doable because one can largely decouple from the behemoth ROOT/C++ framework, and that’s the goal of Scikit-HEP. I can certainly imagine a future where most grad students write mostly Python/Julia for garden-variety physics analysis.

Another thing to keep in mind is that Awkward Array started out as a necessity because NumPy doesn’t support jagged arrays as a model at all. Julia doesn’t have this barrier, so some physicists are actually happier with imperative loops[2]. All this said, if we go a bit upstream in the HEP data pipeline, you might not prefer Awkward Array, because the computing becomes more and more “within a row of the table” as opposed to columnar. Julia technically won’t have this problem either. But the shared problem is dealing with highly non-trivial .root files and also replicating the O(M) lines of C++ code of physics logic… which may sociologically never happen.


  1. btw, .root is a filesystem-in-a-file; that’s another essay if you want. Also, the TTree object in .root files (most data are in this format) does not even have a spec; it’s all implementation detail in C++. This is about to change in maybe 3-5 years, when we have a new object called RNTuple. ↩︎

  2. the same way people dislike NumPy vectorized-style gymnastics; Numba and JAX fill the gap a bit, but getting them to work for a large physics analysis program is non-trivial ↩︎

1 Like

Just a side note: we use ROOT as our primary data format in KM3NeT to store events, MC, and reconstruction output. We do have jagged data, and we also use HDF5 to store it, especially as input for machine-learning analyses.

What we did was define a structure for HDF5 datasets and groups; to access them, we have lightweight libraries for Python and Julia.

The main idea is that each dataset is either an HDF5 compound structure (an array of structs with primitive types as fields) or split up into individual 1D arrays of primitive types put into a group, with an extra table called _index which has one row per sub-dataset (e.g. event) with the fields Index and n_items. These tables are discovered and loaded into memory by the small I/O library, and this way you can easily group and randomly access different datasets.
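A minimal sketch of the second variant with HDF5.jl, assuming a single Float64 column called energy (the names and exact table layout here are illustrative, not our actual schema):

using HDF5

# Write a ragged array as one flat 1D dataset plus an "_index" table with
# one (start, n_items) row per event.
function write_grouped(path, events::Vector{Vector{Float64}})
    h5open(path, "w") do f
        g = create_group(f, "hits")
        g["energy"] = reduce(vcat, events)      # flat 1D array of all samples
        n = length.(events)
        starts = cumsum([1; n[1:end-1]])        # 1-based start of each event
        g["_index"] = hcat(starts, n)           # columns: start, n_items
    end
end

# Random access to event i: read only its slice of the flat array.
function read_event(path, i)
    h5open(path, "r") do f
        idx = read(f["hits/_index"])
        s, m = idx[i, 1], idx[i, 2]
        f["hits/energy"][s:s+m-1]
    end
end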

You can make it work with HDF5 and it can be made really fast, but if your data is multiply jagged, it becomes complex very quickly… that’s where ROOT has its strengths. Btw, Arrow/Parquet/Feather are serious alternatives.

For people reading this after 2023: if you prefer ROOT, I would say check out RNTuple (a talk I gave, with a bunch of pointers in the first slide).

IF, big if, RNTuple can be made into a standalone file format with a specification, I would just recommend it flat out, since it’s pretty good and you can expect CERN to support it for a few decades.

But it’s a huge pain if you’re ecosystem-locked into C++ ROOT, and you will be, because there’s no separate I/O lib.so you can wrap; as I said above, .root files are strongly coupled to the C++ ROOT implementation :slight_smile:


If you’re not chained to C++ ROOT, I would just recommend Apache Arrow, for three reasons (see the sketch after this list):

  1. ROOT RNTuple is pretty much identical in design to Arrow (same good performance)
  2. Arrow is going ballistic now, with Python pandas moving to it; I expect a much better ecosystem around it compared to whatever C++ ROOT has, since they have ~0 interest in interoperating with the wider ecosystem (non-ROOT applications)
  3. It’s actually specced and tested against multiple implementations.
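To show why Arrow fits jagged data well, here is a minimal sketch (file and column names are made up): Arrow’s List type stores ragged columns natively, so a vector of variable-length vectors round-trips without any manual flattening.

using Arrow

events = [rand(3), rand(5), rand(2)]              # ragged column
Arrow.write("events.arrow", (; energy = events))
tbl = Arrow.Table("events.arrow")
tbl.energy[2]                                     # the 5-element vector, read back lazily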

Unfortunately, Arrow.jl isn’t as polished as its Python or Rust counterparts, because the main maintainer is super busy and we are also slowed down by Apache bureaucracy.

3 Likes