Benchmarking ways to write/load DataFrames/IndexedTables to disk

As of the last couple of releases, Arrow.jl only works on Julia 1.0, so no worries there.

Note that this package was built primarily with Feather in mind, so we are still missing all the Arrow IPC functionality (it's really just the array types at the moment). As far as I know, Feather.jl is currently the fastest way to read and write tabular data from disk in pure Julia. Strings are known to be slower than the other column types, but they are still reasonably fast. Also note that Feather.jl now does lazy loading by default. Unfortunately, we are still limited to 4 GB per Feather file because the Feather standard badly needs to be upgraded, but stay tuned for that.
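For reference, the basic workflow is just a read/write pair, something like the sketch below (based on the Feather.jl API around this time; `Feather.materialize` as the eager counterpart to the lazy `Feather.read` is an assumption that may not hold across versions):

```julia
using DataFrames, Feather

df = DataFrame(a = 1:10^6, b = rand(10^6), c = string.(1:10^6))

# write the table to disk in the Feather format
Feather.write("data.feather", df)

# lazy by default: columns are memory-mapped and only read on access
lazy = Feather.read("data.feather")

# pull the whole table into memory at once
eager = Feather.materialize("data.feather")
```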

What about Parquet? It seems more promising to me, as it allows compression of the files.

If my memory serves me correctly, I don’t think this is true anymore.

It depends on what you are doing. Parquet is definitely a more capable format than Feather, but it is also quite a bit more complicated. I think the intended use case for Feather was just "well, I have a few GB of data that I'm messing around with, and I need to store it while I'm working on it for the next few weeks", whereas Parquet was intended as a data source for "production" processes.

To answer your question: yes, you should be able to use the array types provided by Arrow.jl to wrap the arrays that appear in Parquet. I have yet to take a serious look at the Parquet metadata, so I don't really know what would be involved in doing that. I'm pretty confident that it would be significantly more complicated than Feather.
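To sketch the idea (this is plain Julia illustrating the zero-copy wrapping involved, not the actual Arrow.jl or Parquet.jl API):

```julia
# Bytes as they might appear in the values region of a (hypothetical,
# uncompressed) Parquet column chunk.
raw = collect(reinterpret(UInt8, Int64[10, 20, 30, 40]))

# View the same bytes as an Int64 column without copying; the Arrow.jl
# array types do essentially this over a memory-mapped file, plus the
# handling of validity bitmaps, offsets for strings, etc.
col = reinterpret(Int64, raw)

col[2]  # == 20
```

The hard part wouldn't be the wrapping itself, but parsing the Parquet metadata to figure out where each buffer lives, and undoing any compression or encoding first.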

If there has been any change to the standard, it doesn't seem to be reflected here. Like I said, I have a number of thoughts on how we should go about this sort of thing, so I may have some news in the future.