Best Data format for Tabular Data plus Vector

I come from a Matlab background, where I mostly do signal processing, and I am considering switching to Julia. I am looking for good options for data storage in Julia. The type of data I would like to store to begin with is:

Measurement location
Node number
Node direction
Measurement type
Measurement units
Measurement start time
Sampling rate Fs
Measurement comments
Vector containing the time history of the measurements

The vector could have 10^5 to 10^6 elements. The data will then be post-processed, initially in Matlab, but the Matlab algorithms will probably move to Julia in due time. The processing will consist of STFT, TVDFT, processing a tach signal into a speed map, etc.

In Matlab I can do this with an array of structures or a table.

Browsing through the Julia documentation, it does not seem that a core language structure is ideal for this. I did an initial review of package data structures (A Tour of the Data Ecosystem in Julia – Traitement de Données). At this point it is not clear to me whether a DataFrame, a Table, or JuliaDB will accept the vector in addition to the tabular data for each channel collected.

Once collected, it may make sense to store the data in a Matlab file. What are the other good alternatives for storing this type of data? Note that there can be large amounts of data, so smaller file sizes are preferable.

Some guidance on the data structure and file storage will be appreciated. Thanks in advance.

Depending on how many modalities/sample rates you’re handling, I’d imagine an array of structs or table wouldn’t be ideal either :slight_smile:

Perhaps the most purpose-built format I’ve seen is https://github.com/beacon-biosignals/OndaFormat. There’s a native Julia SDK at https://github.com/beacon-biosignals/Onda.jl. Matlab files are either a basic flat array + metadata or a dialect of HDF5 anyhow, so you could roll something custom on top of HDF5.jl or Zarr.jl as well. A couple of folks from my lab were recently working on a multi-modality format for health data; it might be worth a look if that’s what you’re working with.

Why not define a custom type that holds a DataFrame (for the tabular data) and a Vector{Float64}, for example?
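A minimal sketch of what such a custom type could look like. Note that it swaps the DataFrame for plain typed fields so it runs with only the standard library; the type name `Channel`, the field names, and the `time_axis` helper are all hypothetical and simply mirror the list in the question.

```julia
using Dates

# Hypothetical container for one measurement channel.
# Field names are illustrative; they follow the list in the question.
struct Channel
    location::String
    node::Int
    direction::String
    kind::String            # measurement type, e.g. "acceleration"
    units::String
    start_time::DateTime
    fs::Float64             # sampling rate in Hz
    comments::String
    samples::Vector{Float64}
end

# Time axis in seconds relative to start_time, derived from the sampling rate.
time_axis(ch::Channel) = (0:length(ch.samples)-1) ./ ch.fs

# Made-up example values, with random data standing in for a real signal.
ch = Channel("bearing housing", 12, "+Z", "acceleration", "m/s^2",
             DateTime(2021, 3, 1, 9, 30), 1000.0, "run-up test",
             randn(10^5))

# A plain Vector{Channel} plays the role of Matlab's array of structures.
channels = [ch]
```

Each channel can carry a vector of a different length and sampling rate, which is the part that is awkward in a flat table.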

As for saving the data to a file, my suggestion would be to save two separate files for robustness. Robustly storing user defined types is still a bit iffy in Julia.
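To illustrate the two-file idea without committing to a particular package, here is a rough sketch that uses only Base I/O: metadata goes into a small text file and the samples into a raw binary file. The file names, the metadata keys, and the `key=value` layout are assumptions made up for this example, not an established format.

```julia
# Hypothetical metadata for one channel; keys and values are illustrative.
meta = (location = "bearing housing", node = 12, units = "m/s^2", fs = 1000.0)
samples = randn(10^5)

dir = mktempdir()
meta_path = joinpath(dir, "channel01_meta.txt")
data_path = joinpath(dir, "channel01_samples.bin")

# File 1: one "key=value" line per metadata field, plus the sample count.
open(meta_path, "w") do io
    for (k, v) in pairs(meta)
        println(io, k, "=", v)
    end
    println(io, "n_samples=", length(samples))
end

# File 2: raw Float64 values in the machine's native byte order.
write(data_path, samples)

# Reading back: parse the text file first to learn the vector length.
fields = Dict{String,String}()
for line in eachline(meta_path)
    key, value = split(line, "="; limit = 2)
    fields[key] = value
end

n = parse(Int, fields["n_samples"])
restored = Vector{Float64}(undef, n)
read!(data_path, restored)
```

The raw file costs 8 bytes per `Float64` sample and is not portable across machines with different byte order, so a real format would want to record the element type and endianness as well.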

If you store the tabular data in a DataFrame, JDF.jl is a good option. For the Vector{Float64} there are many options. I like JLD2, but that’s just my preference. More info can be found by searching Discourse for threads on serialization.

Because of the length of the vector, especially if it is variable, a DataFrame is not ideal for this. You could make it tidy for some applications, but then every metadata field is repeated on each row, which may be wasteful.
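As a back-of-the-envelope check on that overhead, the sketch below counts stored values for a single channel in a naive tidy layout versus a layout that keeps the metadata once. It assumes the eight metadata fields listed in the question and ignores compression or pooled columns, which would reduce the gap in practice.

```julia
n_samples = 10^6    # upper end of the vector length quoted in the question
n_meta = 8          # metadata fields listed in the question

# Tidy layout: one row per sample, every metadata field repeated on each row.
tidy_cells = n_samples * (n_meta + 1)

# Nested layout: metadata stored once, samples stored once.
nested_cells = n_meta + n_samples

ratio = tidy_cells / nested_cells
```

Under these assumptions the tidy layout holds roughly nine times as many values as the nested one.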

I would decouple the choice of Julia data type (for which I would just start with a struct) from the externalized representation, where you can experiment with various options. For example, JSON may work as a portable and lightweight (if somewhat verbose) option, or HDF5 if speed is important.
