[ANN] JDF.jl - Experimental Julia DataFrames serialization format

Saving a DataFrame to disk is tricky business, especially for large datasets. The current choices are JLD, JLD2, JLSO, Feather, and CSV. The first four are preferred to CSV as they are meant to preserve type information. However, in practice, I have encountered issues with all of these solutions. The issues are varied and include failing to save due to an error, failing to reload the data faithfully, taking too long to reload, and not being able to handle edge cases, e.g. Vector{Missing}.
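To make the Vector{Missing} edge case concrete, here is a small illustration in plain Julia (no packages needed; the variable names are only for illustration). A column with no observed values has element type Missing, so a serializer effectively has only the length to record for it:

```julia
# A column where every entry is missing has element type Missing.
all_missing = [missing, missing, missing]

# A column with some observed values gets a Union element type instead.
partly_missing = [1, missing, 3]

println(eltype(all_missing))
println(eltype(partly_missing))
```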

JDF.jl was born because the existing solutions just didn’t work for me. The situation is more acute if I am dealing with large datasets like Fannie Mae. To me, JDF.jl is the more “reliable” solution, because it works on all the datasets I have tested. You may disagree, but please test it out and let me know!

Ideally, I would be contributing to Parquet.jl (or the like), but I didn’t feel confident enough in my serialization coding knowledge, so I wanted to do a prototype in pure Julia to learn. But the learning process led me to believe that there is definitely a place for a pure-Julia serialization format to show the world what tricks Julia has up its sleeve! E.g. doing a Run-Length Encoding (RLE) is easy; using Blosc compression is easy; defining your own Vector type is easy. Everything is so easy and there are still so many tricks I haven’t applied! I can make it faster, make it compress better, and make it more usable than solutions in R and Python!
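As a rough illustration of the "RLE is easy" and "your own Vector type is easy" points, here is a minimal sketch in plain Julia. The names `RLEVector` and `rle` are hypothetical and made up for this example; this is not how JDF.jl implements it internally, and Blosc compression is left out.

```julia
# Run-length encoded vector: each run's value is stored once,
# together with the index at which that run ends.
struct RLEVector{T} <: AbstractVector{T}
    values::Vector{T}   # one entry per run
    ends::Vector{Int}   # cumulative end index of each run
end

# Encode an ordinary (1-based) vector into an RLEVector.
function rle(v::AbstractVector{T}) where {T}
    vals = T[]
    ends = Int[]
    for (i, x) in enumerate(v)
        if !isempty(vals) && isequal(vals[end], x)
            ends[end] = i          # extend the current run
        else
            push!(vals, x)         # start a new run
            push!(ends, i)
        end
    end
    return RLEVector{T}(vals, ends)
end

# The two methods below are enough to make it behave like a read-only vector.
Base.size(r::RLEVector) = (isempty(r.ends) ? 0 : r.ends[end],)

function Base.getindex(r::RLEVector, i::Int)
    @boundscheck checkbounds(r, i)
    # The first run whose end index is >= i contains element i.
    return r.values[searchsortedfirst(r.ends, i)]
end

v = [1, 1, 1, 2, 2, 3]
r = rle(v)
println(length(r.values), " runs for ", length(r), " elements")
```

Because `RLEVector` is an `AbstractVector`, things like `collect(r)`, `sum(r)`, and `r == v` work without any extra code. Using `isequal` for the run comparison means runs of `missing` are collapsed too.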

Did I mention it’s only ~500 lines of code? And that it achieves the performance below on Julia 1.3 with multi-threading? It achieves a reasonable file size and is almost as fast as R’s {fst} (which I consider state of the art in this space). It may look like Feather.jl has faster read performance, but the Feather.jl load is doing a mmap, whereas JDF.jl actually loads the data into RAM. I am sure Feather.jl will improve its write performance, but until then JDF.jl looks like a pretty good choice!

Enjoy! And report bugs and issues!

[Benchmark plots]


What exactly is meant by “DataFrame serialization” here? Serializing a DataFrame of primitive types, like numbers and strings? A DataFrame which includes e.g. dicts, 1-d or n-d arrays as values? A DataFrame containing arbitrary values?

Serialization, to me, means saving an in-memory structure to disk.

JDF.jl only supports a few types. Please go to the README on GitHub for more info. I think supporting arbitrary types in JDF.jl would make things a lot harder to do.


Thanks for sharing this! I found it via Google while looking for the equivalent of pandas.DataFrame.to_hdf, pandas.read_hdf, xarray.Dataset.to_netcdf, etc. in Julia. It would be nice to have interop with other languages via HDF5, which seems to be a fairly standard way to serialize tables. As a bonus, it makes it easy to do chunking & blosc:zstd compression, which is necessary for serializing large data structures.

Seems there’s an issue for this already: https://github.com/JuliaIO/HDF5.jl/issues/92.

Python and R already have decent DataFrame interop. It’d be great to be able to have interop with Julia, too!

edit: just read more about Parquet, looks like good interop across R, Python & Julia. But sounds like writing from Julia is still a bit painful. It looks like Arrow (backend for Feather) 1.0.0 is just around the corner, though! https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/ and stability guarantees here: https://github.com/apache/arrow/blob/master/docs/source/format/Versioning.rst