Storing and accessing large jagged array with julia

All ~50k high-energy-physics scientists still use ROOT, since it is best in class [page 4] regarding:

  1. compression ratio (an exabyte of tape is expensive to store and find space for)
  2. read speed (crucial for reducing bandwidth pressure when reading remotely on the LHC Computing Grid)
  3. support for arbitrary data structures.

So, I decided to use JLD2, which is based on HDF5. Each delimited file gets translated into a vector of vectors, and that vector of vectors is stored in the JLD2 format. This is fast enough that reading and writing take about 10% of the total time for the simplest analyses I will run on my data.
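As a minimal sketch of that workflow (the file name, dataset name, and data shape here are made up for illustration, not taken from my actual code):

```julia
using JLD2

# Hypothetical stand-in for one parsed delimited file:
# each inner vector holds one event's values
events = [rand(Float64, rand(5:50)) for _ in 1:1000]

# Write the vector of vectors as a single dataset
jldopen("events.jld2", "w") do f
    f["events"] = events
end

# Read it back
loaded = jldopen("events.jld2", "r") do f
    f["events"]
end
```

JLD2 serializes the ragged structure directly, so no manual flattening is needed at this stage.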

ROOT may well be faster; however, reading and writing files now accounts for only a small fraction of the total compute time, and I’d have to write a fair bit more code to write my objects to ROOT and read them back. If reading and writing becomes more of an issue, I may well consider switching.

As for the point about .jld2 files being accessible only from Julia, I agree. If other people want to do their own analyses, well, anybody can read the delimited .csv-style files. And if you want to use my library to do the analysis, you’ll be bound to Julia anyway. I don’t see an issue here.

One thing to note is that JLD2 currently doesn’t support multithreaded/parallel access to files. I have to put a lock around every open and close of a file. This isn’t limiting performance at the moment, but it might in some circumstances.
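A sketch of what that locking pattern looks like (the helper names and `"a+"` append mode are my choices here, not necessarily how my code does it), assuming all file access goes through a pair of helpers:

```julia
using JLD2

# JLD2 doesn't support concurrent file access, so serialize every
# open/close through a single global lock
const FILE_LOCK = ReentrantLock()

function save_locked(path, name, value)
    lock(FILE_LOCK) do
        jldopen(path, "a+") do f
            f[name] = value
        end
    end
end

function load_locked(path, name)
    lock(FILE_LOCK) do
        jldopen(path, "r") do f
            f[name]
        end
    end
end
```

The lock costs little as long as file I/O is a small fraction of total runtime, which matches the 10% figure above.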


@Luapulu Would you tell us about your experiment please and what analysis you are performing?
I was a CERN PhD student in the LEP era.

I’m working on A/E cuts to distinguish single-site from multi-site events in germanium detectors, for background discrimination in the search for neutrinoless double beta decay.

So, lots of histograms over events and the simulation of the signal for each event.

If you’re curious: My code

Ok, having used JLD2 and JLD a bit now, I’ve actually gone back to just reading in the raw text and using that.

First, I had some issues with JLD2 where sometimes my data would become corrupted or somehow couldn’t be read. Maybe I did something wrong, maybe it’s a bug, who knows. When I switched to JLD everything worked, so I used that format for a while. However, I like making changes to my code, and that’s a problem, because the way data is saved is intimately linked to implementation details in the JLD format. For example, if you rename an object in your code, you can no longer read old data saved under the old name, so you’re forced to load and re-save the data using the new implementation. This gets tedious fast.

This issue could be solved by having some default representation: you convert from it when loading and to it when saving. The problem is that I’d have to commit to some default representation (which I know I’ll be tempted to change to get that x% performance improvement), and I don’t really want to spend my time benchmarking the three or four different ways I can think of to store and load ragged arrays. Also, if performance matters, converting all your data is going to cost time. Another issue is the likelihood of introducing errors. While I do have unit tests and try to be as thorough as I can, every time you load and save a file there’s some chance you mess up the file path, accidentally overwrite a file, etc.
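For instance, one library-agnostic representation (a sketch of one of the "three or four ways", not what I actually settled on) is a flat data vector plus an offsets vector:

```julia
# Flatten a ragged array into (data, offsets); row i lives at
# data[offsets[i]:offsets[i+1]-1]
function to_flat(ragged::Vector{Vector{Float64}})
    data = isempty(ragged) ? Float64[] : reduce(vcat, ragged)
    offsets = cumsum(vcat(1, length.(ragged)))
    return data, offsets
end

# Invert the flattening back into a vector of vectors
function from_flat(data, offsets)
    [data[offsets[i]:offsets[i+1]-1] for i in 1:length(offsets)-1]
end
```

Two plain numeric arrays like these can be written by essentially any format (HDF5, Zarr, raw binary), which is exactly what decouples the stored data from the code’s type definitions.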

So, at this point in the thought process I decided to just keep using the existing CSV-like file format. I optimised my parsing a bit more, so it’s a good bit faster now, and reading files is only a bottleneck for the fastest analyses anyway. The simulations I’m doing take orders of magnitude more time per file than reading the input files, for example. I still use JLD to save results, though I may change that too. Maybe use pure HDF5 so I’m not so bound to implementation details. :thinking:
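The parsing itself is not complicated; a stripped-down sketch of reading whitespace-delimited numeric rows (the real files also carry non-numeric columns like the detector volume name, omitted here) might look like:

```julia
# Parse each whitespace-delimited line into a vector of Float64;
# the result is a ragged vector of vectors, one per line
parse_rows(io::IO) = [parse.(Float64, split(line)) for line in eachline(io)]

# Example on an in-memory "file"
rows = parse_rows(IOBuffer("1.0 2.0 3.0\n4.0 5.0\n"))
```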

In case you revisit the storage format at some point and want to give Zarr a try, because you need simpler parallel read/write or object-store support: we have now implemented the VLenArray filter for Julia. Here is a small MWE workflow with the latest Zarr version:

First, we create some artificial ragged-array data and prepare an output dataset.

using Zarr
#Generate a random array of arrays with variable inner lengths
data = [rand(Float64, rand(2:20)) for i=1:100, j=1:80];
#Create dataset on disk, pick a chunk size
s = (100,80)
cs = (30,25)
indata = zcreate(Vector{Float64}, s...,
    chunks = cs,
    path = "myinputs.zarr")
#Write data to disk
indata[:,:] = data;
#Create output array, it is important here that chunk sizes match
outdata = zcreate(Float64, s..., chunks = cs, compressor=Zarr.NoCompressor(), path = "myoutputs.zarr")

Then we launch a computation on multiple workers by mapping over the chunks of the data. Here we simply compute the mean of every vector in the array. No locking is necessary as long as you don’t write across chunk boundaries:

using Distributed
using DiskArrays: eachchunk
@everywhere using Zarr, Statistics
pmap(eachchunk(indata)) do cinds
    zin = zopen("myinputs.zarr")
    zout = zopen("myoutputs.zarr", "w")
    zout[cinds] = mean.(zin[cinds])
end

And you can access the results from Julia:

outdata[1:2, 1:2]
2×2 Array{Float64,2}:
 0.400068  0.576094
 0.507728  0.436327

or from Python, so you are not on a Julia island as with JLD:

using PyCall
py"""
import zarr
zin ='myinputs.zarr')
zout ='myoutputs.zarr')
"""
py"zin[0,0], zout[0,0]"
(PyObject array([0.18110972, 0.7273475 , 0.12056501, 0.44523791, 0.2457608 ,
       0.91214274, 0.16907038, 0.83184576, 0.22446035, 0.0884847 ,
       0.64998661, 0.20480489]), 0.400068030073534)

Maybe it’s too late now but you could store your data in a different, more homogeneous way:

One option is to repeat the three header numbers at the left of each row:

40 181 5 -3.74677 1.8462 -200.582 0.03819 0 22 12 9 physiDet
40 181 5 -3.36715 2.31202 -201.09 0.17925 0 22 12 9 physiDet
40 181 5 -2.93906 2.24399 -200.797 0.03819 0 22 12 9 physiDet

Or, if memory is an issue, you can replace those three numbers with a reference to an external table:

1 -3.74677 1.8462 -200.582 0.03819 0 22 12 9 physiDet
1 -3.36715 2.31202 -201.09 0.17925 0 22 12 9 physiDet
1 -2.93906 2.24399 -200.797 0.03819 0 22 12 9 physiDet

You may even move the “physiDet” column into that external reference table.
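Sketched in Julia (the table layout and names here are invented to illustrate the idea), the external-table suggestion amounts to normalizing the repeated columns:

```julia
# The repeated per-event columns move into a lookup table keyed by id
header = Dict(1 => (40, 181, 5, "physiDet"))

# Each row keeps only the id plus the per-hit columns
rows = [(1, -3.74677, 1.8462, -200.582, 0.03819, 0, 22, 12, 9),
        (1, -3.36715, 2.31202, -201.09, 0.17925, 0, 22, 12, 9)]

# Reconstruct the original wide row on demand
function widen(row, header)
    h = header[row[1]]
    (h[1], h[2], h[3], row[2:end]..., h[4])
end
```

Every row now has the same number of columns, so the on-disk representation is a plain rectangular table.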

Yes, if I do end up needing to read the files faster, I’ll probably either use ROOT, or keep one big array of the main data plus a separate array of those three extra numbers with an index into the big array, or even just one big array of everything, as you suggest. Stored that way, my file format no longer needs to handle ragged arrays, so I can do whatever I want in terms of choosing a format. I think somebody else suggested something similar as well.

Thanks, yes, that would be another choice. If reading speed becomes an issue I’ll likely try a few of the suggested methods and do some benchmarking. :smiley: