Saving a DataFrame to disk is tricky business, especially for large datasets. The current choices are JLD, JLD2, JLSO, Feather, and CSV. The first four are preferred to CSV as they are meant to preserve type information. However, in practice, I have encountered issues with all of these solutions. The issues are varied and include failing to save due to errors, failing to reload the data faithfully, taking too long to reload, and not being able to handle edge cases, e.g. Vector{Missing}.
JDF.jl was born because the existing solutions just didn’t work for me. The situation is more acute when dealing with large datasets like the Fannie Mae data. To me, JDF.jl is the more “reliable” solution, because it works on all the datasets I have tested. You may disagree, but please test it out and let me know!
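To show what “just works” looks like, here is a minimal usage sketch. It assumes JDF.jl and DataFrames.jl are installed and uses the `JDF.save`/`JDF.load` entry points; the exact API may differ between JDF.jl versions.

```julia
using DataFrames, JDF  # assumes JDF.jl and DataFrames.jl are installed

df = DataFrame(id = 1:1_000_000, x = rand(1_000_000))

# Save to a JDF "file" (on disk it is a folder of compressed column blobs)
JDF.save("df.jdf", df)

# Load it back; JDF loads the columns into RAM rather than mmap-ing them
df2 = DataFrame(JDF.load("df.jdf"))
```

Because each column is stored separately, type information survives the round trip, which is exactly where CSV falls down.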
Ideally, I would be contributing to Parquet.jl (or the like), but I didn’t feel confident enough in my serialization coding knowledge. So I wanted to do a prototype in pure Julia to learn. But the learning process led me to believe that there is definitely a place for a pure-Julia serialization format to show the world what tricks Julia has up its sleeve! E.g. doing a Run-Length Encoding (RLE) is easy; using Blosc compression is easy; defining your own Vector type is easy. Everything is so easy, and there are still so many tricks I haven’t applied! I can make it faster, make it compress better, and make it more usable than solutions in R and Python!
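To give a flavour of how little code such tricks take, here is a toy run-length encoder/decoder in plain Julia. This is an illustration I wrote for this post, not JDF.jl’s actual implementation:

```julia
# Toy Run-Length Encoding: compress runs of repeated values into
# (value, run_length) pairs. `isequal` is used so `missing` values
# also compress correctly.
function rle_encode(v::AbstractVector{T}) where T
    values = T[]
    lengths = Int[]
    for x in v
        if !isempty(values) && isequal(last(values), x)
            lengths[end] += 1      # extend the current run
        else
            push!(values, x)       # start a new run
            push!(lengths, 1)
        end
    end
    values, lengths
end

# Inverse: expand each run back into repeated values
rle_decode(values, lengths) =
    reduce(vcat, (fill(v, n) for (v, n) in zip(values, lengths)))

v = [1, 1, 1, 2, 2, 3]
vals, lens = rle_encode(v)         # ([1, 2, 3], [3, 2, 1])
@assert rle_decode(vals, lens) == v
```

RLE shines on sorted or low-cardinality columns (IDs, dates, categorical codes), which is why columnar formats reach for it so often.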
Did I mention it’s only ~500 lines of code? And it achieves the below performance on Julia 1.3 with multi-threading. It achieves a reasonable file size and is almost as fast as R’s {fst} (which I consider the state of the art in this space). It may look like Feather.jl has faster read performance, but Feather.jl’s load is doing an mmap, whereas JDF.jl actually loads the data into RAM. I am sure Feather.jl will improve its write performance, but until then JDF.jl looks like a pretty good choice!
Enjoy! And report bugs and issues!