Best data interchange format?


#1

Seems like people recently realized we shouldn’t be passing around data as CSV, but now it seems like we have too many solutions: Parquet, Feather, HDF5 (and JLD, which as I understand it is HDF5 but only for use with Julia?).

Any chance the community is converging towards one of these as a go-to format for tabular data?


#2

That’s been true since well before RFC 4180 came out (in 2005) to try to “standardize” the ad hoc CSV formats :grinning:, but amazingly, people still continue to do it.

Probably the best standardized (text-based) format for data transfer these days would be JSON, although it’s not as compact for tabular data as CSV can be.
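
To make the compactness point concrete, here is a rough sketch using only Python's standard library. The three-row table is made up for illustration. It serializes the same rows once as CSV and once as a JSON array of objects, then compares the sizes. The JSON layout chosen here (one object per row) is just one common option; it repeats every column name in every record, which is where the extra characters come from.

```python
import csv
import io
import json

# Made-up example table: column names plus three rows.
columns = ["id", "name", "price"]
rows = [
    [1, "apple", 0.5],
    [2, "banana", 0.25],
    [3, "cherry", 4.0],
]


def to_csv(columns, rows):
    """Serialize the table as CSV text (header line plus one line per row)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue()


def to_json_records(columns, rows):
    """Serialize the table as a JSON array with one object per row."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


csv_text = to_csv(columns, rows)
json_text = to_json_records(columns, rows)

print(len(csv_text), "characters as CSV")
print(len(json_text), "characters as JSON")
```

On the other hand, parsing the JSON text back yields numbers as numbers and strings as strings, which plain CSV does not record.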


#3

Best data interchange with which software (or people)?
I actually find CSV quite practical. It works with any software, is very reliable, and one can very easily check “the truth”.


#4

Fair question. I find CSV OK for small datasets, but it runs into lots of parsing issues, text encoding issues, etc., and it’s SUPER inefficient in terms of data sizes. So I guess my preference is:

  • works easily for R, Python, Julia, and ideally Stata
  • maintains type information (so floats stay floats)
  • If I re-load into Julia, not much is lost in terms of metadata, column names, etc.
  • Loads and saves quickly
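
As an illustration of the type-information point, here is a minimal sketch (standard-library Python, made-up values) of what a plain CSV round trip does to a row: every field comes back as a string, so the reader has to guess or be told the types.

```python
import csv
import io

# Made-up row with an integer, a float, and a string that looks like a number.
original = {"id": 7, "ratio": 2.0, "label": "007"}

# Write the row to CSV text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(original), lineterminator="\n")
writer.writeheader()
writer.writerow(original)

# Read it back with the same module.
buf.seek(0)
restored = next(csv.DictReader(buf))

# Show the type of each field before and after the round trip.
for key in original:
    print(key, type(original[key]).__name__, "->", type(restored[key]).__name__)
```

A reader that infers types then has to decide whether `007` is a string or the integer 7; a typed format stores that decision in the file instead.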

#5

The solution closest to requirements 1, 2, and 4 (works for multiple languages, lossless and efficient for simple bits types) is HDF5.

To maintain more metadata across languages, you would need some (implicit) protocol, e.g. that HDF5 nodes at some level are columns of a data frame, etc.

Feather is becoming rather standard for column data, if you are OK with data types that e.g. R can handle natively.

At the moment, I would recommend the following for moderately sized datasets:

  1. Maintain your parsed and cleaned “master” data, possibly ingested from CSV or similar, in HDF5. This format will most likely endure for decades, and can be shared with people who use different languages/frameworks.

  2. If you find that some other format works better for day-to-day work and can be generated easily from the above dataset, use it in the short run, with the implicit assumption that in the worst-case scenario you just regenerate everything from the master data above. For example, JLD2 works fine for saving interim results within Julia.

  3. For “final” results, again archive in HDF5.
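
The regenerate-from-master pattern in the steps above could be sketched roughly as follows. This is only an illustration in standard-library Python: a JSON file stands in for the HDF5 master and a pickle file stands in for a language-specific interim format such as JLD2, purely so the example runs without extra packages. The names `derive` and `load_interim` and the file names are hypothetical.

```python
import json
import pickle
import tempfile
from pathlib import Path


def derive(master_rows):
    """Hypothetical day-to-day transformation computed from the master data."""
    return [row for row in master_rows if row["price"] is not None]


def load_interim(master_path, cache_path):
    """Return derived data, rebuilding the interim file from the master if needed."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    with master_path.open() as f:
        master_rows = json.load(f)
    derived = derive(master_rows)
    with cache_path.open("wb") as f:
        pickle.dump(derived, f)
    return derived


workdir = Path(tempfile.mkdtemp())
master_path = workdir / "master.json"  # stand-in for the HDF5 master file
cache_path = workdir / "interim.pkl"   # stand-in for a JLD2 interim file

# Made-up master data.
master_path.write_text(json.dumps([
    {"id": 1, "price": 0.5},
    {"id": 2, "price": None},
]))

first = load_interim(master_path, cache_path)   # builds the interim file
cache_path.unlink()                             # worst case: interim file is lost
second = load_interim(master_path, cache_path)  # regenerated from the master
print(first == second)
```

The point of the design is that only the master file needs to be in a long-lived, cross-language format; the interim file can be anything convenient, because it is disposable.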


#6

Perfect! Thanks!