Apache Arrow 1.0 release

dmbates · July 27, 2020, 7:47pm

Version 1.0.0 of the Apache Arrow specification, along with implementations in several languages, has been released (see "Apache Arrow 1.0.0 Release | Apache Arrow"). As part of this release, the Feather file specification (a file format for DataFrame-like tables) has been changed to the Arrow IPC file format.

Coming from an R background, where packages typically include data sets for testing and illustration, I have wanted to be able to access data sets for Julia packages. We do this in MixedModels using Artifacts and Feather files, but right now there are some difficulties in creating Feather files from Julia. (Feather.write goes through Arrow.jl, which is looking for outdated information in CategoricalArrays or PooledArrays.)

I would really like to see Arrow.jl updated to the new spec and would be happy to contribute effort to make that happen. Alternatively, are there other serialized, language-agnostic formats for tabular structures we could consider?

visr · July 27, 2020, 8:52pm

@ExpandingMan has been working on Arrow.jl in the completion branch: https://github.com/ExpandingMan/Arrow.jl/tree/completion
The completion branch still has to be brought to completion. I’m sure extra help here would be welcome.

quinnj · July 28, 2020, 3:50am

Yeah, I’m very bullish on Apache Arrow; I think it does a lot of things right in terms of binary formats. It’s pretty high up on my list to help get Arrow.jl fully fleshed out (including the IPC format) and integrated into the main Apache Arrow repo as well. I think Julia can be one of the very best clients for Arrow, because we can support the format natively (no requirement to go through C/C++) and use Arrow vectors natively in DataFrames, for example.

xiaodai · July 28, 2020, 11:04am

Start a company where the main product is data. You can build a pipeline using any language and pass data using Arrow.

aschmu · August 11, 2020, 11:50am

I’m sure Julia is perfectly capable of supporting the format natively, but I’m wondering if it wouldn’t be simpler to use a wrapper around the C++ implementation. The Dataset API is under active development (among other things, like Flight for high-throughput bidirectional data exchange). Many analytical kernels have already been implemented in C++, and it would be much faster to build on the existing API than to reimplement them in pure Julia. Moreover, bugs are bound to happen, which means that if you’re copying an existing implementation, you will again have to track all Jira/GitHub issues to keep the Julia implementation exactly in line with the C++ one.

sairus7 · August 11, 2020, 2:22pm

Can you please explain in more detail the use cases for Arrow: what problems does it solve, and what other products/libraries can it replace? How does it differ from just another columnar format?

quinnj · September 4, 2020, 10:37pm

In some respects, it’s “just another columnar format”, but a very well-designed one. It supports arbitrary “structs”, unions of types, primitive types, compression, well-aligned memory, and memory blob sharing between processes (in theory; I haven’t actually seen this used yet). In principle, it’s a culmination of the columnar formats that came before; in practice, time will tell if it actually catches on and brings enough advantages to make people switch to it from other formats.
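
To make the "columnar" part a bit more concrete, here is a rough sketch in plain Python (standard library only, not any Arrow library) of how a nullable Int32 column and a string column might be laid out as contiguous buffers in the Arrow style: a validity bitmap plus a values buffer, and a validity bitmap plus offsets and data buffers. The helper names (`pack_validity`, `pack_int32_column`, `pack_string_column`, `read_int32`, `read_string`) are made up for illustration, and the alignment/padding that the real format requires is left out.

```python
import struct


def pack_validity(values):
    """Validity bitmap: bit i (least-significant bit first) is 1 when values[i] is present."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def pack_int32_column(values):
    """Return (validity, data): one little-endian int32 slot per row, 0 as filler for nulls."""
    data = struct.pack(f"<{len(values)}i", *(0 if v is None else v for v in values))
    return pack_validity(values), data


def pack_string_column(values):
    """Return (validity, offsets, data): n + 1 int32 offsets into one UTF-8 byte buffer."""
    offsets = [0]
    data = bytearray()
    for v in values:
        if v is not None:
            data += v.encode("utf-8")
        offsets.append(len(data))  # a null row repeats the previous offset
    return pack_validity(values), struct.pack(f"<{len(offsets)}i", *offsets), bytes(data)


def is_valid(validity, i):
    """Check bit i of the validity bitmap."""
    return bool((validity[i // 8] >> (i % 8)) & 1)


def read_int32(validity, data, i):
    """Random access to row i without touching any other row."""
    if not is_valid(validity, i):
        return None
    return struct.unpack_from("<i", data, 4 * i)[0]


def read_string(validity, offsets, data, i):
    """Row i spans data[offsets[i]:offsets[i + 1]]."""
    if not is_valid(validity, i):
        return None
    start, end = struct.unpack_from("<2i", offsets, 4 * i)
    return data[start:end].decode("utf-8")
```

Because every column is just a handful of flat byte buffers with a fixed interpretation, any language that can read bytes at an offset can consume the data without a parsing step, which is roughly what makes passing it between processes or languages cheap.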

xiaodai · September 5, 2020, 1:28am

In some ways it already has. We write Python UDFs, and the data gets sent from Spark by Arrow.
