JDF, an experimental DataFrame serialization format, is ready for beta testing

https://github.com/xiaodaigh/JDF.jl

I wanted to contribute to Parquet.jl and Arrow.jl but don’t feel confident that I know enough about serialization and disk formats. And reading the Parquet and Arrow docs still leaves me pretty clueless as to how the formats work. So I decided to write my own DataFrame serialization format to learn and test things out.

It’s working pretty well so far. It can serialize String, Int*, and Float* types, and it supports Union{Missing, T} as well.
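For context, here is a minimal round-trip sketch with the kinds of columns I mean (String, Int, Float64, and a Union{Missing, Float64} column). The savejdf/loadjdf entry points are the ones shown in the JDF.jl README; treat the exact names and the sample data as illustrative rather than authoritative.

```julia
# Minimal round trip (sketch): savejdf/loadjdf are assumed from the JDF.jl
# README; the DataFrame below is just illustrative.
using DataFrames, JDF

df = DataFrame(
    id    = collect(1:3),            # Int column
    name  = ["a", "b", "c"],         # String column
    score = [1.5, missing, 3.0],     # Union{Missing, Float64} column
)

savejdf("example.jdf", df)               # writes a folder called example.jdf
df2 = DataFrame(loadjdf("example.jdf"))  # reads it back as a DataFrame
```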

Generally, I am pretty happy with it: the file sizes are smaller than Feather’s, it’s much more reliable than JLD2, JLSO, and Feather, and it’s much faster than JLD2 and JLSO.

Update

It’s only more “reliable” in the narrow sense of serializing the Fannie Mae datasets. In general, JLD2 seems to have issues reloading DataFrames from the Fannie Mae data, last time I checked. JLSO was really slow for really large data frames. And there are outstanding issues with Feather.jl on the Fannie Mae dataset and other datasets that I have used.

2 Likes

Don’t get me wrong, even though I’m a maintainer on Feather.jl, there are lots of things I do not like about that format and our current implementation of it, and I’m happy to see other packages for serializing tabular data, but it seems a little presumptuous that this brand new package is in some sense “more reliable”. We have pretty extensive unit tests in Arrow.jl and Feather.jl and yes, even interoperability tests with the pyarrow feather reader. There seems to be some issue that occasionally pops up with R, but evidence so far points to that being a problem on the R end. The package has been around in the community for quite a while and has been through several major iterations. If there are problems with it severe enough that it should be perceived as generally “unreliable”, we aren’t hearing about them.

10 Likes

IIRC the Arrow/Feather devs say that it’s an in-memory or temporary data format and recommend using Parquet as the long-term storage format. Doesn’t that mean it’s “unreliable” in the sense that it’s not designed for long-term storage?

There’s a whole story behind it, which I have pieced together over the time I’ve been involved with it, and which I’m not sure is worth getting into here. I think what they are getting at is that Feather was basically deprecated in favor of another arrow file format, which has no name and which they never bothered to publicize (it’s basically just the arrow streaming format written into a file). Feather seems to have been a pretty early attempt at an arrow data format, and as such is only slightly more specialized to arrow than parquet is.

From what I can tell, they are unlikely to continue to evolve the Feather format itself, unless they decide to slap that label on the arrow file format (which I think would be a bad idea, because it would be very confusing). If this is true, it makes Feather pretty safe for long term storage.

Anyway, I’d have to admit I’d be a little nervous about deliberately writing out a Feather file, not touching it for 10 years, and then having to load it up without a problem, but I wouldn’t exactly say it’s “unreliable”.

(By the way, I can already serialize and deserialize the arrow file format from my dev branch of Arrow.jl, but I haven’t finished getting it into a state that’s appropriate for a registered package and I don’t know when I might go back to it.)

3 Likes

I wrote this in haste, as I wanted to go to sleep before 2am. I meant to be more precise: from my own test set, JDF seems more “reliable” in the sense that it works for my own datasets. Given it’s a beta, it must have bugs! Just not on my datasets. I actually recommended Feather.jl, but it seems not to work on the Fannie Mae data. JLD2 and JLSO either fail to restore the data frame correctly or take forever!

JDF is slower than fst by a long shot but at least it works for me.

Put it this way: I really wanted to contribute to Arrow.jl and Parquet.jl, and I really wouldn’t have created JDF if Feather.jl, JLD2, or JLSO had worked for my use-case. So at least now I know something about serialization.

The best thing you can do is to throw some big datasets at it and see how it performs.
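For anyone who wants to try, this is roughly the shape of my benchmark runs. The CSV path is a placeholder and the savejdf/loadjdf names are assumed from the README, so read this as a sketch rather than the exact benchmark script.

```julia
# Rough benchmark shape (sketch): load a big CSV once, then time a JDF write
# and a JDF read of it. The path is a placeholder; savejdf/loadjdf are assumed.
using CSV, DataFrames, JDF

df = DataFrame(CSV.File("airontime.csv"))        # placeholder path to a large CSV

@time savejdf("airontime.jdf", df)               # write timing
@time df2 = DataFrame(loadjdf("airontime.jdf"))  # read timing
```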

The “unreliable” claim holds only in the narrow sense that I have tried the packages on just two sets of real-world data, and at least one package failed to read or write those datasets correctly.

Feather.jl has this issue, which I assume will have a solution soon. JLD2 failed to read back the AirOnTime dataset that it had written.

I have published the benchmark code in this gist.

Note that when the benchmarks show 0, it means that particular package failed with an error on my machine. Feather.jl throws an error on AirOnTime as well, but I’ve replaced all Missing columns with Union{Missing, Bool} to circumvent the above-referenced issue.
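For reference, the workaround is essentially the following sketch in plain DataFrames code (the function name is mine): find columns whose element type is Missing and widen them to Union{Missing, Bool}.

```julia
# Widen all-missing columns (eltype Missing) to Union{Missing, Bool} so that
# Feather.jl can write them. Sketch only; the function name is made up.
using DataFrames

function widen_missing_columns!(df::DataFrame)
    for name in names(df)
        if eltype(df[!, name]) === Missing
            df[!, name] = convert(Vector{Union{Missing, Bool}}, df[!, name])
        end
    end
    return df
end
```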

So for me, both Feather.jl and JLD2.jl failed on a reasonably popular dataset available in the wild (see the gist for the download URL). That is, 2 of 3 packages have issues with the AirOnTime dataset, so “unreliable” was, from my (albeit limited) sample, a pretty fair statement. It was unfair to JLSO, because it worked every time, but its performance isn’t suitable for my use-case at the moment.

To be fair, I had previously recommended Feather.jl as the best format for Julia, and when it works, the performance is decent, although I assume write performance can be improved.

One advantage that JDF has over Feather.jl is that the files are compressed and hence smaller. On the other hand, Feather files can be memory mapped. The con of JDF is that a DataFrame is stored as a folder containing more than one file, which makes it harder to move across machines.
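If you do need to move a JDF folder between machines, one workaround (my own suggestion, not a JDF feature) is to bundle the folder into a single archive first:

```julia
# Pack the JDF folder into one archive for transfer; the folder name is illustrative.
run(`tar -czf mydf.jdf.tar.gz mydf.jdf`)
# ...copy mydf.jdf.tar.gz to the other machine, then unpack before loading:
run(`tar -xzf mydf.jdf.tar.gz`)
```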

Benchmarks: AirOnTime data
[write and read timing plots]

Benchmarks: Fannie Mae
[write and read timing plots]

File size
[file size plots for the AirOnTime (CSV) and Fannie Mae (TXT) datasets]

2 Likes

On a 2.8GB Fannie Mae file, JLSO failed on a machine with 64GB of RAM. Here are the timings:

[read timing plot, plus read and write timing plots excluding CSV and JLSO]

1 Like

Currently, it looks like Feather.jl reads are faster, and it isn’t multithreaded yet, so there is room for performance to improve further. On the other hand, JDF is consistent at both read and write and has decent performance in general, so I would use JDF for now as it seems reasonably stable. The source code is only a few hundred lines, so there aren’t as many places for bugs to hide yet.

There are further opportunities to improve JDF read performance by serializing StringArrays better.

1 Like