Version 1.0.0 of the Apache Arrow specification, along with implementations in several languages, has been released (see "Apache Arrow 1.0.0 Release | Apache Arrow"). As part of this release, the Feather file specification (a file format for DataFrame-like tables) has been changed to the Arrow IPC file format.
Coming from an R background, where packages typically include data sets for testing and illustration, I have wanted to be able to access data sets for Julia packages. We do this in MixedModels using Artifacts and Feather files, but right now there are some difficulties in creating Feather files from Julia. (Feather.write goes through Arrow.jl, which is looking for outdated information in CategoricalArrays or PooledArrays.)
I would really like to see Arrow.jl updated to the new spec and would be happy to contribute effort to make that happen. Alternatively, are there other serialized, language-agnostic formats for tabular structures we could consider?
Yeah, I’m very bullish on Apache Arrow; I think it does a lot of things right in terms of binary formats. It’s pretty high up on my list to help get Arrow.jl fully fleshed out (including the IPC format) and integrated into the main Apache Arrow repo. I think Julia can also be one of the very best clients for Arrow, because we can support the format natively (no requirement to go through C/C++) and use Arrow vectors natively in DataFrames, for example.
I’m sure Julia is perfectly capable of supporting the format natively, but I’m wondering if it wouldn’t be simpler to use a wrapper around the C++ implementation. The Dataset API is under active development (among other things, like Flight for high-throughput bidirectional data exchange). Many analytical kernels have already been implemented in C++, and it would be much faster to build on the existing API than to reimplement them in pure Julia. Moreover, bugs are bound to happen, which means that if you’re copying an existing implementation, you will have to keep tracking all the Jira/GitHub issues to keep the Julia implementation behaving exactly like the C++ one.
Can you please explain in more detail the use cases for Arrow: what problems does it solve, and what other products/libraries can it replace? How does it differ from just another columnar format?
In some respects, it’s “just another columnar format”, but a very well-designed one. It supports arbitrary “structs”, unions of types, primitive types, compression, well-aligned memory, and memory blob sharing between processes (in theory; I haven’t actually seen this used yet). In principle, it’s the culmination of the columnar formats that came before; in practice, time will tell if it actually catches on and brings enough advantages to make people switch to it from other formats.
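To make the “primitive types” and “well-aligned memory” points a bit more concrete, here is a rough sketch of how a nullable 32-bit integer column can be laid out in the Arrow style: one validity bitmap (1 bit per value, least-significant bit first, 1 meaning “not null”) plus one contiguous data buffer, each padded to a multiple of 8 bytes. This is plain Python using only the standard library, not an actual Arrow implementation; the function names `pad_to`, `encode_int32_column` and `decode_int32_column` are made up for illustration, and the 8-byte padding is an assumption chosen for brevity (the spec recommends 64-byte alignment).

```python
import struct


def pad_to(n: int, alignment: int = 8) -> int:
    """Round n up to the next multiple of `alignment` (hypothetical helper)."""
    return (n + alignment - 1) // alignment * alignment


def encode_int32_column(values):
    """Encode a list of ints (or None for null) into two padded buffers.

    Returns (validity_bitmap, data_buffer, null_count).
    """
    n = len(values)
    bitmap = bytearray(pad_to((n + 7) // 8))  # 1 bit per value, zero-filled
    data = bytearray(pad_to(4 * n))           # 4 bytes per int32, zero-filled
    null_count = 0
    for i, v in enumerate(values):
        if v is None:
            null_count += 1                   # bit stays 0, slot stays zeroed
            continue
        bitmap[i // 8] |= 1 << (i % 8)        # LSB-first bit numbering
        struct.pack_into("<i", data, 4 * i, v)  # little-endian int32
    return bytes(bitmap), bytes(data), null_count


def decode_int32_column(bitmap, data, length):
    """Rebuild the Python list from the two buffers."""
    out = []
    for i in range(length):
        valid = (bitmap[i // 8] >> (i % 8)) & 1
        out.append(struct.unpack_from("<i", data, 4 * i)[0] if valid else None)
    return out


column = [1, None, 3, -7, None]
bitmap, data, null_count = encode_int32_column(column)
roundtrip = decode_int32_column(bitmap, data, len(column))
```

Because every value sits at a fixed offset in one flat buffer, a reader in any language can index into it without parsing, and the same bytes could in principle be handed to another process unchanged, which is the basis of the sharing idea mentioned above.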