Best data interchange format?

nickeubank · June 24, 2018, 2:59am

Seems like people recently realized we shouldn’t be passing around data in CSV formats, but now it seems like we have too many solutions: Parquet, Feather, HDF5 (and JLD, which as I understand it is HDF5 but only for use with Julia?).

Any chance the community is converging towards one of these as a go-to format for tabular data?

ScottPJones · June 24, 2018, 3:15am

That was already true well before RFC 4180 came out (in 2005) to try to “standardize” the ad hoc CSV formats, but amazingly, people still continue to do it.

Probably the best standardized (text-based) format for data transfer these days would be JSON, although it’s not as compact for tabular data as CSV can be.
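To make the size trade-off concrete, here is a minimal Python sketch (standard library only; the three-row table and its column names are made up for illustration). It serializes the same rows as CSV and as a JSON array of objects and compares the encoded lengths. JSON repeats every key in every row, which is where the extra bytes come from.

```python
import csv
import io
import json

# Hypothetical sample table; names and values are illustrative only.
rows = [
    {"id": 1, "price": 9.5, "label": "a"},
    {"id": 2, "price": 10.25, "label": "b"},
    {"id": 3, "price": 11.0, "label": "c"},
]


def to_csv(records):
    """Serialize a list of dicts as CSV text with one header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records):
    """Serialize the same records as a JSON array of objects."""
    return json.dumps(records)


csv_text = to_csv(rows)
json_text = to_json(rows)

# Compare encoded sizes; the column names appear once in CSV
# but once per row in the JSON array-of-objects layout.
csv_bytes = len(csv_text.encode("utf-8"))
json_bytes = len(json_text.encode("utf-8"))
print("CSV bytes:", csv_bytes, "JSON bytes:", json_bytes)
```

A column-oriented JSON layout (one array per column) would narrow the gap, at the cost of being less conventional.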

bernhard · June 24, 2018, 5:51am

Best data interchange with which software (or people)?
I actually find CSV quite practical. It works with any software, is very reliable, and one can very easily check “the truth”.

nickeubank · June 24, 2018, 6:21am

Fair question. I find CSV OK for small datasets, but it runs into lots of parsing issues, text encoding issues, etc., and it’s SUPER inefficient in terms of data size. So I guess my preference is a format that:

1. Works easily for R, Python, Julia, and ideally Stata
2. Maintains type information (so floats stay floats)
3. Loses little metadata (column names, etc.) if I re-load into Julia
4. Loads and saves quickly
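As an illustration of the type-information point, here is a small Python sketch using only the standard library (the record and its field names are hypothetical). It writes one row to CSV and reads it back, then does the same through JSON, and records which Python type each value comes back as.

```python
import csv
import io
import json

# Hypothetical single record; field names are illustrative only.
record = {"id": 7, "ratio": 0.5, "flag": True}

# Round trip through CSV: write a header plus one row, then read it back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record.keys()))
writer.writeheader()
writer.writerow(record)
buf.seek(0)
from_csv = next(csv.DictReader(buf))

# Round trip through JSON for comparison.
from_json = json.loads(json.dumps(record))

# Record the type name of every value after each round trip.
csv_types = {k: type(v).__name__ for k, v in from_csv.items()}
json_types = {k: type(v).__name__ for k, v in from_json.items()}
print("after CSV: ", csv_types)
print("after JSON:", json_types)
```

After the CSV round trip every value is a string, so the reader has to guess or be told the column types. JSON keeps the basic scalar types here, and binary formats such as Feather, Parquet, and HDF5 store a type for each column or dataset in the file itself.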

Tamas_Papp · June 24, 2018, 6:51am

The solution closest to requirements 1, 2, and 4 (works for multiple languages, lossless and efficient for simple bits types) is HDF5.

To maintain more metadata across languages, you would need some (implicit) protocol, e.g., that HDF5 nodes at some level are columns of a data frame, etc.

Feather is becoming rather standard for column data, if you are OK with data types that, e.g., R can handle natively.

At the moment, I would recommend the following for moderately sized datasets:

1. Maintain your parsed and cleaned “master” data, possibly ingested from CSV or similar, in HDF5. This format will most likely endure for decades, and can be shared with people who use different languages/frameworks.
2. If you find that some other format works better for day-to-day work and can be generated easily from the above dataset, use it in the short run, with the implicit assumption that in the worst-case scenario you just regenerate everything from the master data above. E.g., JLD2 works fine for saving interim results within Julia.
3. For “final” results, again archive in HDF5.

nickeubank · June 24, 2018, 6:58am

Perfect! Thanks!
