What is a good format to store Julia DataFrames efficiently (in terms of size on disk)? I have a dataset that is about 2.1GB as a Feather file. It is almost 1.9GB in Stata .dta format. But if I open it in R and save it in the .rds format, the size decreases to 239MB. Is there something like the RDS format for Julia? I tried JuliaDB, but it does nothing to decrease the size on disk.
Any recommendations for how the data could be stored on disk?
You could try compressing Feather or another format with gzip or another general compression algorithm.
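As a rough sketch of that suggestion, the snippet below gzips an existing file from Julia. It assumes the CodecZlib.jl package is installed; `gzip_file` and `gunzip_bytes` are hypothetical helper names made up for illustration, not part of any package.

```julia
using CodecZlib  # assumption: the CodecZlib.jl package is installed

# Hypothetical helper: write a gzip-compressed copy of `path` to `path * ".gz"`.
function gzip_file(path::AbstractString)
    out = path * ".gz"
    open(path, "r") do src
        open(GzipCompressorStream, out, "w") do dst
            write(dst, src)  # stream the bytes through the compressor
        end
    end
    return out
end

# Hypothetical helper: read a gzip-compressed file back into memory as raw bytes.
function gunzip_bytes(path::AbstractString)
    open(GzipDecompressorStream, path) do stream
        read(stream)
    end
end
```

Note that the file would most likely have to be decompressed again before a reader such as Feather can open it, so this trades load time for disk space.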
Feather supports a form of compression via CategoricalArrays. If you make columns of your table CategoricalArrays, they will be stored in the Feather file in a similar format. Of course, you will only get significant compression from this if your table happens to have lots of repeated values.
Thanks. Will look into this. I suppose this only works if columns contain Categorical data. Is that right? In general, is R achieving the small size because of compression? Does it compress and decompress automatically?
You can make a CategoricalArray out of anything, but it’ll only benefit you if the number of distinct values is much smaller than the number of elements in the array and the values are larger than a few bytes. It’s usually only useful for strings.
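To make that concrete, here is a minimal sketch with made-up data, assuming CategoricalArrays.jl is installed. Each distinct string is stored once as a level, and every row only holds a small integer code.

```julia
using CategoricalArrays  # assumption: CategoricalArrays.jl is installed

# Made-up data: 300,000 strings, but only three distinct values.
labels = repeat(["California", "New York", "Texas"], 100_000)

# compress=true asks for the smallest integer type that can hold the codes.
cat = categorical(labels; compress=true)

levels(cat)          # the distinct values, each stored once
length(levels(cat))  # number of distinct values
length(cat)          # number of rows, each stored as a small integer code
```

With a column of mostly unique values (IDs, free text) the list of levels would be about as long as the column itself, so nothing would be gained.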
From the sizes you are seeing, it’s obvious that R’s RDS format is compressed somehow (the Feather format is very simple, so it’s very unlikely to “inflate” the data much), but I have no idea how. In extreme cases you can achieve that much from the CategoricalArray encoding, but more likely they are using some general-purpose compression algorithm (there’s a chance it’s gzip, since that is the most commonly used compression algorithm).
It’s actually an R-native format, so saving will be slow. Better to use my other package https://github.com/xiaodaigh/JDF.jl if you are happy to stay in the Julia ecosystem.
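For anyone who wants to try it, a minimal round trip might look like the sketch below. It assumes DataFrames.jl and JDF.jl are installed and uses `JDF.save` and `JDF.load` as I recall them from the package README; function names may differ between versions, so check the README for the release you have. The table and the path are made up for illustration.

```julia
using DataFrames, JDF  # assumption: both packages are installed

# Made-up example table with a repetitive string column.
df = DataFrame(id = 1:1_000, state = repeat(["California", "Texas"], 500))

# Hypothetical output location; as far as I know a .jdf "file" is a folder on disk.
path = joinpath(mktempdir(), "example.jdf")

# Save the table, then load it back and convert the result to a DataFrame.
JDF.save(path, df)
df2 = DataFrame(JDF.load(path))
```

Comparing the size of the resulting folder with the Feather file for the same table is the quickest way to see whether it helps for your data.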
RDS does compression by default. Please try out JDF.jl. Based on one benchmark that I did, it’s much smaller than Feather if you have lots of strings. See
@xiaodai So sorry for not replying sooner. Thank you so much for these responses. I got busy with something else. But I look forward to trying this out.