Data Import Types and Compression

Hello all,

Sorry if this is a bad question. I use datasets from ICPSR (a social science repository). As an example, a dataset I recently used is about 200 MB as an uncompressed CSV, but I can get it down to about 30 MB in RDS format in R, and Stata's .dta format gives a similar size. When I try to save it as JLD with `save("dta.jld", "dta", dta, compress=true)`, it is still around 200 MB. I suspect it would help to convert many of the variables that Julia imports as floats into integers, since most variables contain only a dozen or fewer unique values, or perhaps something else would help, but I have no idea how to do that at scale (this example dataset has a little over 10,000 variables).

  1. What methods should I look into for this problem?
  2. What is the most user-friendly way of doing this?
  3. What is the current state of data file formats in Julia, and future plans?

Although I’m familiar with R and Stata, I’m no computer programmer, but I want to learn Julia for the speed benefits while data wrangling. Getting more manageable file sizes is one of those quality-of-life things I want to figure out.

Thank you!

Try the fst format, which works with both R and Julia:

https://github.com/xiaodaigh/fstformat.jl
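
On the idea from the question of storing float columns as integers: here is a minimal sketch in plain Julia, not a tested recipe for any particular package. `narrow_column` is a hypothetical helper name I made up for illustration. It assumes each variable is available as a plain vector, and it only converts a column when every value is a whole number that fits in the target integer type.

```julia
# Hypothetical helper (not part of any package): convert a float column to the
# smallest signed integer type that can hold all of its values.
function narrow_column(col::AbstractVector{<:AbstractFloat})
    # Leave the column alone if it is empty or has non-integer values (1.5, NaN, Inf).
    (isempty(col) || !all(isinteger, col)) && return col
    lo, hi = extrema(col)
    for T in (Int8, Int16, Int32, Int64)
        # Pick the first (smallest) type whose range covers the data.
        if typemin(T) <= lo && hi <= typemax(T)
            return T.(col)
        end
    end
    return col  # values too large for Int64: keep the floats
end

# Fallback: any other kind of column (strings, columns with `missing`, ...) is returned unchanged.
narrow_column(col::AbstractVector) = col

codes = [1.0, 2.0, 3.0, 2.0, 1.0]   # a typical survey variable with a few codes
small = narrow_column(codes)
println(eltype(small), " uses ", sizeof(small), " bytes instead of ", sizeof(codes))
```

A `Float64` takes 8 bytes per value and an `Int8` takes 1, so in memory a column of small codes shrinks by up to a factor of eight. How much that helps on disk depends on the file format and its compression, so I would check it on a subset of the data first. Columns that contain `missing` are passed through untouched in this sketch and would need separate handling.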
