JDF, an experimental DataFrame serialization format, is ready for beta testing

https://github.com/xiaodaigh/JDF.jl

I wanted to contribute to Parquet.jl and Arrow.jl but don’t feel confident that I know enough about serialization and disk formats. And reading the Parquet and Arrow docs still leaves me pretty clueless as to how the formats work. So I decided to write my own DataFrame serialization format to learn and test things out.

It’s working pretty well so far. It can serialize String, Int*, and Float* types, and it supports Union{Missing, T} as well.
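For context, here is a minimal round-trip sketch with the kinds of columns I mean (String, Int, Float64, and a Union{Missing, Float64} column). The savejdf/loadjdf entry points are the ones shown in the JDF.jl README; treat the exact names and the sample data as illustrative rather than authoritative.

```julia
# Minimal round trip (sketch): savejdf/loadjdf are assumed from the JDF.jl
# README; the DataFrame below is just illustrative.
using DataFrames, JDF

df = DataFrame(
    id    = collect(1:3),            # Int column
    name  = ["a", "b", "c"],         # String column
    score = [1.5, missing, 3.0],     # Union{Missing, Float64} column
)

savejdf("example.jdf", df)               # writes a folder called example.jdf
df2 = DataFrame(loadjdf("example.jdf"))  # reads it back as a DataFrame
```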

Generally, I am pretty happy with it: the file sizes are smaller than Feather’s, it’s much more reliable than JLD2, JLSO, and Feather, and it’s much faster than JLD2 and JLSO.

Update

It’s only more “reliable” in the narrow sense of serializing the Fannie Mae datasets. In general, JLD2 seems to have issues reloading DataFrames from the Fannie Mae data, last time I checked. JLSO was really slow for really large data frames. And there are outstanding issues with Feather.jl on the Fannie Mae dataset and other datasets that I have used.

2 Likes

Don’t get me wrong, even though I’m a maintainer on Feather.jl, there are lots of things I do not like about that format and our current implementation of it, and I’m happy to see other packages for serializing tabular data, but it seems a little presumptuous that this brand new package is in some sense “more reliable”. We have pretty extensive unit tests in Arrow.jl and Feather.jl and yes, even interoperability tests with the pyarrow feather reader. There seems to be some issue that occasionally pops up with R, but evidence so far points to that being a problem on the R end. The package has been around in the community for quite a while and has been through several major iterations. If there are problems with it severe enough that it should be perceived as generally “unreliable”, we aren’t hearing about them.

10 Likes

IIRC the Arrow/Feather devs say that it’s an in-memory or temporary data format and recommend using Parquet as the long-term storage format. Doesn’t that mean it’s “unreliable” in the sense that it’s not designed for long-term storage?

There’s a whole story behind it, which I have pieced together over the time I’ve been involved with it, and which I’m not sure is worth getting into here. I think what they are getting at is that Feather was basically deprecated in favor of another arrow file format, which has no name and which they never bothered to publicize (it’s basically just the arrow streaming format written into a file). Feather seems to have been a pretty early attempt at an arrow data format, and as such is only slightly more specialized to arrow than parquet is.

From what I can tell, they are unlikely to continue to evolve the Feather format itself, unless they decide to slap that label on the arrow file format (which I think would be a bad idea, because it would be very confusing). If this is true, it makes Feather pretty safe for long term storage.

Anyway, I’d have to admit I’d be a little nervous about deliberately writing out a Feather file, not touching it for 10 years, and then having to load it up without a problem, but I wouldn’t exactly say it’s “unreliable”.

(By the way, I can already serialize and deserialize the arrow file format from my dev branch of Arrow.jl, but I haven’t finished getting it into a state that’s appropriate for a registered package and I don’t know when I might go back to it.)

3 Likes

I wrote this in haste, as I wanted to go to sleep before 2am. I meant to be more precise: from my own test set, JDF seems more “reliable” in the sense that it works for my own datasets. Given it’s a beta, it must have bugs! Just not on my datasets. I actually recommended Feather.jl, but it seems not to work on the Fannie Mae data. JLD2 and JLSO either fail to restore the data frame correctly or take forever!

JDF is slower than fst by a long shot but at least it works for me.

Put it this way: I really wanted to contribute to Arrow.jl and Parquet.jl, and I really wouldn’t have created JDF if Feather.jl, JLD2, or JLSO had worked for my use-case. So at least now I know something about serialization.

The best thing you can do is to throw some big datasets at it and see how it performs.
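For anyone who wants to try, this is roughly the shape of my benchmark runs. The CSV path is a placeholder and the savejdf/loadjdf names are assumed from the README, so read this as a sketch rather than the exact benchmark script.

```julia
# Rough benchmark shape (sketch): load a big CSV once, then time a JDF write
# and a JDF read of it. The path is a placeholder; savejdf/loadjdf are assumed.
using CSV, DataFrames, JDF

df = DataFrame(CSV.File("airontime.csv"))        # placeholder path to a large CSV

@time savejdf("airontime.jdf", df)               # write timing
@time df2 = DataFrame(loadjdf("airontime.jdf"))  # read timing
```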

The “unreliable” claim holds only in the narrow sense that I have tried the packages on just two sets of real-world data, and at least one package failed to read or write those datasets correctly.

Feather.jl has this issue, which I assume will have a solution soon. JLD2 failed to read back the AirOnTime dataset that it had written.

I have published the benchmark code in this gist.

Note that when the benchmarks show 0, it means that particular package failed with an error on my machine. Feather.jl throws an error on AirOnTime as well, but I’ve replaced all Missing columns with Union{Missing, Bool} to circumvent the above-referenced issue.
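For reference, the workaround is essentially the following sketch in plain DataFrames code (the function name is mine): find columns whose element type is Missing and widen them to Union{Missing, Bool}.

```julia
# Widen all-missing columns (eltype Missing) to Union{Missing, Bool} so that
# Feather.jl can write them. Sketch only; the function name is made up.
using DataFrames

function widen_missing_columns!(df::DataFrame)
    for name in names(df)
        if eltype(df[!, name]) === Missing
            df[!, name] = convert(Vector{Union{Missing, Bool}}, df[!, name])
        end
    end
    return df
end
```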

So for me, both Feather.jl and JLD2.jl failed on a reasonably popular dataset available in the wild (see the gist for the download URL). That is, 2 of 3 packages have issues with the AirOnTime dataset, so “unreliable” was, from my (albeit limited) sample, a pretty fair statement. It was unfair to JLSO, because it worked every time, but its performance isn’t suitable for my use-case at the moment.

To be fair, I had previously recommended Feather.jl as the best format for Julia, and when it works, the performance is decent, although I assume write performance can be improved.

One advantage that JDF has over Feather.jl is that the files are compressed and hence smaller. On the other hand, Feather files can be memory mapped. The con of JDF is that a DataFrame is stored as a folder containing more than one file, which makes it harder to move across machines.
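If you do need to move a JDF folder between machines, one workaround (my own suggestion, not a JDF feature) is to bundle the folder into a single archive first:

```julia
# Pack the JDF folder into one archive for transfer; the folder name is illustrative.
run(`tar -czf mydf.jdf.tar.gz mydf.jdf`)
# ...copy mydf.jdf.tar.gz to the other machine, then unpack before loading:
run(`tar -xzf mydf.jdf.tar.gz`)
```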

Benchmarks: AirOnTime data
[write and read timing plots]

Benchmarks: Fannie Mae
[write and read timing plots]

File size
[file size plots for the AirOnTime (CSV) and Fannie Mae (TXT) datasets]

2 Likes

On a 2.8GB Fannie Mae file, JLSO failed on a machine with 64GB of RAM. Here are the timings:

[read timing plot, plus read and write timing plots excluding CSV and JLSO]

1 Like

Currently, it looks like Feather.jl reads are faster, and it isn’t multithreaded yet, so there is room for performance to improve further. On the other hand, JDF is consistent at both read and write and has decent performance in general, so I would use JDF for now as it seems reasonably stable. The source code is only a few hundred lines, so there aren’t as many places for bugs to hide yet.

There are further opportunities to improve JDF read performance by serializing StringArrays better.

1 Like