Storing and accessing a large jagged array with Julia

I have a bunch of data, currently stored in 2000 separate .csv files. Each file contains thousands of events. Each event starts with a header line of three numbers, one of which specifies the length of the event; all following rows then contain 9 columns. After a few hundred lines, the next event starts. This is the heterogeneous part: events have different lengths.

Example:

40 181 5 # 181 rows in this event
-3.74677 1.8462 -200.582 0.03819 0 22 12 9 physiDet
-3.36715 2.31202 -201.09 0.17925 0 22 12 9 physiDet
-2.93906 2.24399 -200.797 0.03819 0 22 12 9 physiDet
...
-3.74766 1.84347 -200.586 1.16036 0 11 97 72 physiDet
-3.74772 1.84354 -200.586 4.33292 0 11 71 13 physiDet
-3.74845 1.84468 -200.584 0.986089 0 11 70 13 physiDet
232 595 6 # 595 rows here
1.48232 -3.12787 -196.664 0.07638 0 22 9 6 physiDet
1.10072 -3.0131 -196.344 0.125 0 22 9 6 physiDet
1.10448 -3.0174 -196.335 0.07701 0 22 221 9 physiDet
...

Most of the operations I will perform occur event by event. So, I need to loop over each event and do something. Usually, I will loop over the rows in the event and aggregate all rows down to a one-dimensional array or even one value per event.
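To make the access pattern concrete, here is roughly what I do per event (the Event struct and the energy sum are just placeholders for whatever the analysis actually needs):

# Hypothetical per-event container: the three header numbers plus a matrix of rows.
struct Event
    h1::Int
    h2::Int                  # number of rows in the event (181 and 595 in the example above)
    h3::Int
    rows::Matrix{Float64}    # h2 x 8 numeric columns; the 9th (string) column is dropped here
end

# Typical analysis: reduce every event to a single number, e.g. summing column 4.
total_energy(ev::Event) = sum(@view ev.rows[:, 4])
per_event_totals(events) = [total_energy(ev) for ev in events]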

I don’t have enough RAM to be able to load the entire dataset at once. I do have 32 cores to use for distributed computing.

So here’s the issue:

Problem

When I use @profile, I find that a lot of time is lost to parsing floats from the csv files (ints too, but less so). I was hoping there would be a more efficient way to store my data, so I can access and parse it faster.
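My current reader is essentially the sketch below (simplified, no error handling), and according to @profile the parse(Float64, ...) calls in the inner loop dominate:

# Read one .csv file into a vector of the Event structs sketched above.
function read_events(path)
    events = Event[]
    open(path) do io
        while !eof(io)
            h1, nrows, h3 = parse.(Int, split(readline(io))[1:3])   # header line: three ints
            rows = Matrix{Float64}(undef, nrows, 8)
            for i in 1:nrows
                fields = split(readline(io))
                for j in 1:8
                    rows[i, j] = parse(Float64, fields[j])           # <- the hot spot
                end
            end
            push!(events, Event(h1, nrows, h3, rows))
        end
    end
    return events
end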

Possible Solutions and Questions

  • Use JuliaDB with a DataFrame/table-type approach

Here, I’m worried about my data being three-dimensional with variable length in one dimension, since events differ in length. I don’t believe DataFrames, tables and JuliaDB are really intended for this purpose; they seem to do best on two-dimensional data.

  • JuliaData (HDF5) or JLD2

This seems to be closest to what I want, but I still have some questions. If I treat my data as an array of events and store it as one dataset in HDF5, I would have to know the total number of events beforehand, create a dataset of that size, and start writing to it. Once that’s done, I think I’d have a nice solution. Problem: finding the total number of events in my dataset is a non-trivial computation. Ideally, I’d like to be able to grow the dataset as I go. I did find that the raw HDF5 specification has a resize function, but it doesn’t seem to exist for JuliaData.

If I don’t treat my data as one big dataset but instead make each event a separate dataset, I’d have more flexibility, but I’m worried about losing performance. Each event is relatively small, so I’d really like to store them in chunks; otherwise I’d spend a lot of time in my loop looking for the next event in the file. Maybe store 1000 events per dataset?
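I haven’t tested it, but what I have in mind with JLD2 is roughly this (chunk size and dataset names are arbitrary):

using JLD2

# Write events in chunks of 1000 per dataset, so one disk read pulls in a whole chunk.
function write_chunks(outfile, events; chunksize = 1000)
    jldopen(outfile, "w") do f
        for (i, chunk) in enumerate(Iterators.partition(events, chunksize))
            f["chunk_$(lpad(i, 5, '0'))"] = collect(chunk)   # a Vector of events per dataset
        end
    end
end

# Read back one chunk at a time and process event by event.
function foreach_event(g, outfile)
    jldopen(outfile, "r") do f
        for name in keys(f), ev in f[name]
            g(ev)
        end
    end
end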

How does distributed (multi-core) writing and reading work when accessing one file? Can multiple workers write to and read from the same file? Would it be advantageous to have a one-to-one translation of .csv file to .jld, resulting in 2000 .jld files rather than one? I’m very unsure of the best way to proceed here. It might even be the case that csv is perfectly fine and switching to a different format would only buy me, say, a 5% improvement.

Summary

If I use JuliaData, how do I format my data inside to make it easy to write and read, while significantly improving performance? Is there a better solution out there?

Edit: reading is really the big deal. I’ll write my data to the new format once, and after that everything is read-only.

HEP? This is called a jagged array, and it is the nemesis of many in-memory/permanent data structures. If this is for a science project, you can take a look at ROOT, which was developed specifically to deal with irregular data. The I/O can be done with Python’s uproot (you can use PyCall for this), which in turn was developed because Numpy cannot handle jagged arrays. The underlying implementation is in C++.
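Roughly like this from Julia (file and tree names are placeholders):

using PyCall

uproot = pyimport("uproot")      # needs `pip install uproot` for the Python that PyCall uses

f = uproot.open("events.root")
tree = get(f, "Events")          # PyCall's get(o, key) is Python's f["Events"]
branches = tree.arrays()         # jagged branches come back as awkward/jagged arrays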


Try FileTrees.jl


Yup, HEP :). Good to know about the name; I think I’ll change the title.

Why use uproot and not just ROOT directly? Is it just more practical to install PyROOT compared to ROOT itself? I ask because the server I’m working on has a ROOT distribution. There’s an argument for using PyROOT, since it would allow me to test locally (I don’t want to have to install a ROOT distribution locally).

Also, how does Python’s global interpreter lock mesh with distributed computing? If only one worker can read at a time, is that an issue?

FileTrees looks really cool, and seems like a cleaner way to deal with many files in parallel. I guess if I used FileTrees.jl, I’d reformat my data so that each file stores, say, 10000 events in HDF5 format, and then loop over it all.
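If I read the FileTrees docs right, that loop would look roughly like this (load_events and total_energy are my own placeholder functions):

using FileTrees

tree = FileTree("data")                        # directory of e.g. 200 .h5 files, 10000 events each
loaded = FileTrees.load(tree; lazy=true) do file
    load_events(path(file))                    # my reader, returns a Vector of events
end
per_event = mapvalues(evs -> map(total_energy, evs), loaded)
result = exec(reducevalues(vcat, per_event))   # runs lazily on the workers, then collects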

uproot does NOT use C++ ROOT. C++ cannot be used easily with Julia (1.5), and C++ ROOT is heavy and hard to use. If you just want I/O, uproot is fast and light. It belongs to the scikit-hep family, so it has good long-term support.

While threading may be weird, distributed computing works fine, since each worker has its own PyCall and thus its own Python interpreter.
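I.e. something along these lines (file names, the tree name and the per-file process function are placeholders):

using Distributed
addprocs(32)

@everywhere using PyCall
@everywhere const uproot = pyimport("uproot")   # one Python interpreter per worker, so no GIL contention

files = ["run_$(i).root" for i in 1:2000]

results = pmap(files) do fname
    f = uproot.open(fname)
    tree = get(f, "Events")
    process(tree.arrays())                      # your per-file analysis
end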

Btw, if you’re doing HEP, shouldn’t you already be using C++ ROOT one way or another?


Alright, good to know about uproot and calling C++ ROOT.

So I guess I’d make one giant .root file and then loop over it as needed with pmap or @distributed for?

And as far as using ROOT already: not really at the moment. I’m not in the field as a fully fledged member; I’m just doing my bachelor thesis in the field right now, so I haven’t used ROOT itself much yet. The data I have was created by software based on ROOT, but we’re using Julia for the analysis, which is the part I’m working on.

In that case, just use uproot or UnROOT.jl for the I/O and keep using .root to store the data. Bonus points if you guys can help with UnROOT.jl development; it’s the Julia equivalent of uproot (pure Julia for reading .root files).

Alright, thanks for the help so far! If I use UnROOT.jl, I’ll probably do some contributing. We’ll see.

I still wonder if I’m overcomplicating things and one dataset in HDF5 would really be the easiest. I can read and write Julia objects directly with JuliaData, which is a significant advantage; with every other format I have to write a bunch of code to read data into the structs I’m using and to write from Julia objects to the file format. The fact that I have to know the total number of events is only an issue once: once it’s written to the file, everything just works. I don’t know how much slower HDF5 is compared to ROOT, though. There’s also JLD2, which is apparently faster.
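As far as I understand, saving and loading a vector of my event structs with JLD2 would be as simple as this (file name made up):

using JLD2

# events is a Vector of my event structs, built up elsewhere.
@save "events.jld2" events

# later, in the analysis code:
@load "events.jld2" events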

HDF5 doesn’t support jagged arrays.

What do you mean, not supported? With JuliaData I can save arbitrary Julia objects to a file. isbits is true for each event, so each event by itself should be stored very efficiently. I lose performance switching from event to event because they’re not stored contiguously; I don’t know how big that effect is.

I’m OK with losing performance if my code is easier to maintain. The current format is bad in both respects: I need a bunch of custom code to read it, and it is slow.

What is JuliaData? Also, only Julia will be able to open that if you do it => not sure that’s a good thing inside HEP. Plain HDF5 just doesn’t support irregular matrices.

I suspect that under the hood each event will be made into a separate dataset, but I’d have to check.

Yes, the fact that you need Julia to open it is an argument. If I want to make my code into a package for others to use, I’ll probably have to write parsers/writers for some more generic format, like ROOT.

Just FYI, there is interesting work from particle physics people called Awkward Array, for jagged arrays and nested data in general. IIUC, it was initially designed/created in Python (GitHub - scikit-hep/awkward-0.x: Manipulate arrays of complex data structures as easily as Numpy.) and is now written in C++ (GitHub - scikit-hep/awkward: Manipulate JSON-like data with NumPy-like idioms.). This Strange Loop talk is a good introduction:

They were talking about using Apache Arrow as a serialization format. I wonder if a similar approach could be used in Julia, or maybe we could even have an Awkward Array-compatible reader/writer in Julia. Or create a binding to the Awkward Array C++ library? I think it’d be awesome to have something like Awkward Array in Julia.

It’s not an out-of-the-box solution for Julia, but I thought it could be nice food for thought.


I used ROOT files 15 years ago; interesting to see that they are still a thing :wink:

Alternatively, you could transform your data to a “normalized” table format: one table for the events and one for the tracks (linked by event_id). This way, you could use the existing tools for tabular data, e.g. databases (like PostgreSQL, SQLite) or (chunked) DataFrames (with e.g. JDF for read/write to disk).
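For example (column names are just guesses based on your sample):

using DataFrames

# one row per event: the three header numbers plus an id
events = DataFrame(event_id = Int[], h1 = Int[], h2 = Int[], h3 = Int[])

# one row per track/hit, linked to its event by event_id
tracks = DataFrame(event_id = Int[],
                   x = Float64[], y = Float64[], z = Float64[], e = Float64[],
                   c1 = Int[], c2 = Int[], c3 = Int[], c4 = Int[],
                   volume = String[])

# event-by-event processing then becomes a groupby
for sub in groupby(tracks, :event_id)
    # aggregate sub down to one value, e.g. sum(sub.e)
end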


Yes, I saw Awkward Array while I was looking at uproot (Python). Looks interesting, and yes, that’d be a cool thing to have in Julia.

I think that’s actually the cleanest solution that I can quickly code up. I’d store an array of metadata and an array of the actual rows. The metadata would just contain the three numbers from the header line of each event plus an index for where the event starts in the big array.
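Roughly this layout (names made up):

# Per-event metadata: the three header numbers plus where the event starts in the big matrix.
struct EventMeta
    h1::Int
    nrows::Int   # second header number = number of rows in the event
    h3::Int
    start::Int   # first row of this event in the big `rows` matrix
end

meta = EventMeta[]                       # one entry per event
rows = Matrix{Float64}(undef, 0, 8)      # all rows of all events, stacked

# Looping event by event is then just taking views into the big matrix:
for m in meta
    ev = @view rows[m.start : m.start + m.nrows - 1, :]
    # ... aggregate ev down to one value ...
end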

I don’t know if I want to use JuliaDB or DataFrames for this, though, or whether I should just code up a serialiser myself.

No, I should use JuliaDB or something similar. I just looked into writing my own serialiser and it seems like a pain.

The problem I run into here is that I can’t append to an out-of-memory array. So I’d have to load the entire dataset into memory and then save it, but that’s not feasible.

The Zarr format might be an option as well; it already supports ragged arrays.

However, we currently don’t have an implementation of the VLenArray codec in Zarr.jl; it has been on the todo list for a while. If you consider using this as a backend, feel free to open an issue in Zarr.jl so we can implement the codec.
