I have a bunch of data, currently stored in 2000 separate .csv files. Each file contains thousands of events. Each event starts with a line of three numbers, one of which specifies the length (number of rows) of the event; all following rows contain 9 columns. After a few hundred lines, the next event starts. This is the heterogeneous part: events have different lengths.
Example:
40 181 5 # 181 rows in this event
-3.74677 1.8462 -200.582 0.03819 0 22 12 9 physiDet
-3.36715 2.31202 -201.09 0.17925 0 22 12 9 physiDet
-2.93906 2.24399 -200.797 0.03819 0 22 12 9 physiDet
...
-3.74766 1.84347 -200.586 1.16036 0 11 97 72 physiDet
-3.74772 1.84354 -200.586 4.33292 0 11 71 13 physiDet
-3.74845 1.84468 -200.584 0.986089 0 11 70 13 physiDet
232 595 6 # 595 rows here
1.48232 -3.12787 -196.664 0.07638 0 22 9 6 physiDet
1.10072 -3.0131 -196.344 0.125 0 22 9 6 physiDet
1.10448 -3.0174 -196.335 0.07701 0 22 221 9 physiDet
...
Most of the operations I perform happen event by event: I loop over the events and do something with each one. Usually that means looping over the rows of the event and aggregating them down to a one-dimensional array, or even a single value per event.
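For concreteness, the loop over events looks roughly like this (a simplified sketch; `process_file` and `process_event` are placeholder names, and only the four float columns are parsed):

```julia
# Simplified sketch of the per-event loop over one .csv file.
# `process_event` is a stand-in for whatever aggregation I actually do.
process_event(ev) = sum(view(ev, :, 4))          # e.g. sum the 4th column per event

function process_file(path)
    results = Float64[]                          # one aggregated value per event
    open(path, "r") do io
        while !eof(io)
            header = split(readline(io))         # e.g. "40 181 5"; second number = row count
            nrows  = parse(Int, header[2])
            event  = Matrix{Float64}(undef, nrows, 4)
            for i in 1:nrows
                cols = split(readline(io))
                for j in 1:4                     # only the four float columns; the rest are ints/strings
                    event[i, j] = parse(Float64, cols[j])   # this parsing is the hot spot
                end
            end
            push!(results, process_event(event))
        end
    end
    return results
end
```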
I don’t have enough RAM to be able to load the entire dataset at once. I do have 32 cores to use for distributed computing.
So here’s the issue:
Problem
When I use @profile, I find that a lot of time is lost to parsing floats from the .csv files (ints too, but less so). I was hoping there would be a more efficient way to store my data, so I can access and parse it faster.
Possible Solutions and Questions
- Use JuliaDB with a DataFrame/table-type approach
Here I'm worried about my data being three-dimensional with variable lengths. I don't believe DataFrames, tables, and JuliaDB are really intended for this: they seem to do best on two-dimensional data, while mine is three-dimensional with variable length in one dimension, since events have different lengths.
- JuliaData (hdf5) or JLD2
This seems to be closest to what I want, but I still have some questions. If I treat my data as an array of events and store it as one dataset in HDF5, I would have to know the total number of events beforehand, create a dataset of that size, and then write into it. Once that's done, I think I'd have a nice solution. Problem: finding the total number of events in my dataset is a non-trivial computation. Ideally, I'd like to be able to expand as I go. I did find that the pure HDF5 specification has a resize function, but it doesn't seem to exist in JuliaData.
If I don’t treat my data as one big dataset but instead make each event a separate dataset, I would have more flexibility, but I’m worried about losing performance. Each event is relatively small, so I’d really like to store them in chunks; otherwise I’d spend a lot of time in my loop looking for the next event in the file. Maybe store 1000 events at a time per dataset (see the first sketch below)?
How does distributed (multi-core) writing and reading work when accessing one file? Can multiple workers write to and read from the same file? Would it be advantageous to have a one-to-one translation of .csv file to .jld file, resulting in 2000 .jld files rather than one (see the second sketch below)? I'm very unsure as to the best way to proceed here. It might even be the case that csv is perfectly fine and switching to a different format would only get me, say, a 5% improvement.
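To make the chunking question concrete, this is roughly the layout I have in mind, written as an untested JLD2 sketch (the `write_chunk`/`read_chunk` names and the `chunk_N` keys are just placeholders): concatenate each batch of ~1000 events into one matrix and store the per-event row counts next to it, so events can be recovered by slicing.

```julia
using JLD2

# Untested sketch: write ~1000 events per "chunk" of a JLD2 file, storing the
# concatenated numeric rows plus the per-event row counts needed to split them again.
# (The integer/string columns would be handled the same way or dropped; omitted here.)
function write_chunk(jldfile::String, chunkid::Int, events::Vector{Matrix{Float64}})
    lengths = [size(ev, 1) for ev in events]     # rows per event
    data    = reduce(vcat, events)               # all rows stacked into one big matrix
    jldopen(jldfile, "a+") do f                  # append, creating the file if needed
        f["chunk_$chunkid/data"]    = data
        f["chunk_$chunkid/lengths"] = lengths
    end
end

# Read one chunk back and split it into per-event views using the stored lengths.
function read_chunk(jldfile::String, chunkid::Int)
    data, lengths = jldopen(jldfile, "r") do f
        f["chunk_$chunkid/data"], f["chunk_$chunkid/lengths"]
    end
    stops  = cumsum(lengths)
    starts = [1; stops[1:end-1] .+ 1]
    return [view(data, starts[i]:stops[i], :) for i in eachindex(lengths)]
end
```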
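And for the one-file-per-.csv route, the multi-core reading pattern I'm imagining is just a pmap over the 2000 converted files, something like this sketch (assuming Distributed plus JLD2; `reduce_file` and the file names are made up):

```julia
using Distributed
addprocs(32)                                     # one worker per core

@everywhere using JLD2

# Placeholder per-file reduction, assuming each converted file holds one
# concatenated `data` matrix plus a `lengths` vector (as in the sketch above,
# just without the chunk groups).
@everywhere function reduce_file(path::String)
    data, lengths = jldopen(path, "r") do f
        f["data"], f["lengths"]
    end
    stops  = cumsum(lengths)
    starts = [1; stops[1:end-1] .+ 1]
    # stand-in per-event reduction: sum of the 4th column of each event
    return [sum(view(data, starts[i]:stops[i], 4)) for i in eachindex(lengths)]
end

files   = ["events_$(i).jld2" for i in 1:2000]   # hypothetical names, one per original .csv
results = pmap(reduce_file, files)               # each worker reads whole files; no shared-file access
```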
Summary
If I use JuliaData, how do I format my data inside to make it easy to write and read, while significantly improving performance? Is there a better solution out there?
Edit: reading is really the big deal here. I’ll write my data to the new format once, and after that everything is read-only.