Benchmarking ways to write/load DataFrames/IndexedTables to disk

As of the last couple of releases, Arrow.jl only works on Julia 1.0, so no worries there.

Note that this package was built primarily with Feather in mind, so we are still missing all the Arrow IPC functionality (it's really just the array types at the moment). As far as I know, Feather.jl is currently the fastest way to read and write tabular data from disk in pure Julia. Strings are known to be slower than the other column types, but they are still reasonably fast. Also note that Feather.jl now does lazy loading by default. Unfortunately, we are still limited to 4 GB per Feather file because the Feather standard badly needs to be upgraded, but stay tuned for that.
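For reference, the basic workflow is just a read/write pair, something like the sketch below (based on the Feather.jl API around this time; `Feather.materialize` as the eager counterpart to the lazy `Feather.read` is an assumption that may not hold across versions):

```julia
using DataFrames, Feather

df = DataFrame(a = 1:10^6, b = rand(10^6), c = string.(1:10^6))

# write the table to disk in the Feather format
Feather.write("data.feather", df)

# lazy by default: columns are memory-mapped and only read on access
lazy = Feather.read("data.feather")

# pull the whole table into memory at once
eager = Feather.materialize("data.feather")
```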

What about Parquet? It seems more promising to me, as it allows compression of the files.

If my memory serves me correctly, I don’t think this is true anymore.

It depends on what you are doing. Parquet is definitely a more capable format than Feather, but it is also quite a bit more complicated. I think the intended use case for Feather was just "well, I have a few GB of data that I'm messing around with, and I need to store it while I'm working on it for the next few weeks", whereas Parquet was intended as a data source for "production" processes.

To answer your question: yes, you should be able to use the array types provided by Arrow.jl to wrap the arrays that appear in Parquet. I have yet to take a serious look at the Parquet metadata, so I don't really know what would be involved in doing that. I'm pretty confident that it would be significantly more complicated than Feather.
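To sketch the idea (this is plain Julia illustrating the zero-copy wrapping involved, not the actual Arrow.jl or Parquet.jl API):

```julia
# Bytes as they might appear in the values region of a (hypothetical,
# uncompressed) Parquet column chunk.
raw = collect(reinterpret(UInt8, Int64[10, 20, 30, 40]))

# View the same bytes as an Int64 column without copying; the Arrow.jl
# array types do essentially this over a memory-mapped file, plus the
# handling of validity bitmaps, offsets for strings, etc.
col = reinterpret(Int64, raw)

col[2]  # == 20
```

The hard part wouldn't be the wrapping itself, but parsing the Parquet metadata to figure out where each buffer lives, and undoing any compression or encoding first.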

If there has been any change to the standard, it doesn't seem to be reflected here. Like I said, I have a number of thoughts on how we should go about this sort of thing, so I may have some news in the future.