[ANN] Arrow.jl 0.3 Release

A new 0.3 release has just been made for the Arrow.jl package.

This is a from-scratch rewrite of the entire package, which now lives under the JuliaData organization. With this release, Arrow.jl fully implements version 1.0 of the Apache Arrow format in native Julia. In more detail, support now includes:

  • All primitive data types
  • All nested data types
  • Dictionary encodings and messages
  • Extension types
  • Streaming, file, record batch, and replacement and isdelta dictionary messages

It currently doesn’t include support for:

  • Tensors or sparse tensors
  • Flight RPC
  • C data interface

Third-party data formats:

  • CSV and Parquet support via the existing CSV.jl and Parquet.jl packages
  • Other Tables.jl-compatible packages automatically supported (DataFrames.jl, JSONTables.jl, JuliaDB.jl, SQLite.jl, MySQL.jl, JDBC.jl, ODBC.jl, XLSX.jl, etc.)
  • No current Julia packages support ORC or Avro data formats

This 0.3 release is meant as a “beta” release of the newly rewritten code, and we invite everyone to give it a try and report any issues you run into. Also feel free to post questions/issues in the #data Slack channel.

The plan is to let the 0.3 release help shake out any glaring issues in the rewritten code before doing an official 1.0 release. In the meantime, I’ll also be working on integrating the Julia implementation into the official Apache Arrow repository.

For the really adventurous among you, I recorded a 90-minute video deep-dive into the Arrow.jl implementation of the Arrow format; it walks through the code and also gives some high-level ideas/uses for arrow data in general.

Cheers!

-Jacob

This looks great! I have a question - is it possible to dict-encode a single column only (e.g. if I have a categorical column of strings in an otherwise numeric table)?

The docs state that this is controlled by a single option for all columns, but I’m curious whether this is a choice made in the package or by the standard itself.

Thanks again for working on this; Arrow support is a really useful addition to the Julia data ecosystem.

Edit: Looking at the code it seems that the underlying format supports choosing to dict-encode or not dict-encode individual columns.

Can we get at least one small benchmarking plot?

pretty please

Of what exactly? Comparison reading a file vs. csv or something? Comparison vs. other language implementations?

Yeah, I was interested in a comparison with other file formats.

But perhaps since you say this is a pure Julia implementation of the library, might also be good to compare it to other languages as well. That’s less for people on this board though, more for people who find Arrow and then hopefully happen to see that the Julia implementation is one of the fastest!

Cool! Is there any future to Feather.jl, or should we just switch to Arrow? It says

NOTE: Feather V1 has been deprecated by Apache Arrow in favor of Feather V2, which is just the Arrow IPC format written to disk. A complete rewrite of Arrow.jl is actively being worked on which will support reading and writing Feather V2… Currently Feather V2 will not be recognized as a valid feather file by this package.

But it seems it hasn’t been worked on for several months.

Sorry for the slow response; yes, you can definitely dict-encode single columns. The Arrow.jl package provides a wrapper type, DictEncode, which you can use to wrap your existing array so that it will be dict-encoded. If your array is already a PooledArray/CategoricalArray type, it will be dict-encoded automatically (specifically, this applies to any array type that implements the DataAPI.refarray and DataAPI.refpool APIs). But yeah, the simplest way to dict-encode a single column is to just wrap it, like df.col1 = Arrow.DictEncode(df.col1)
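As a minimal sketch of the wrapping approach described above: the column names and values here are made up for illustration, and a plain NamedTuple of vectors stands in for a DataFrame. Only the label column is wrapped, so the numeric column is written as an ordinary primitive column.

```julia
using Arrow

# Hypothetical table: one numeric column and one repetitive string column.
# Only `label` is wrapped in Arrow.DictEncode.
tbl = (
    id = [1, 2, 3, 4],
    label = Arrow.DictEncode(["low", "high", "low", "high"]),
)

# Round-trip through a temporary file.
path = tempname() * ".arrow"
Arrow.write(path, tbl)
t = Arrow.Table(path)

# The values read back are unchanged; inspecting typeof(t.label)
# shows how the column is stored.
println(collect(t.id))
println(collect(t.label))
```

With a DataFrame the same idea should apply by assigning the wrapped array back to the column before calling Arrow.write.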

It’s a good question. I think Feather.jl is still technically alive, but I’m not exactly sure why. As far as I can tell, it’s just a very limited subset of the arrow format, on-disk only, and only really supported in C++, R, Python, and Julia; and the non-Julia implementations just use C++ arrow under the hood anyway.

My guess is it will be retired at some point and people will be encouraged to switch to arrow.

Thanks for the response! Cool, it’s great that this is already possible.

Here’s a very quick comparison of a file with 70K rows of ints, floats, strings, and dates:

julia> @time f = CSV.File(file);
  0.004444 seconds (141.48 k allocations: 8.359 MiB)

julia> @time f2 = Arrow.Table(file2);
  0.000247 seconds (419 allocations: 26.016 KiB)

Thank you! This is going to end up being one of my new favorite packages :D.

Can you provide more details of your benchmark, please?
Is “file2” a CSV file too? or a file with Arrow format?
Could you also compare timings with data.table’s fread()?

No, file2 is the same data as in randoms.csv, but in arrow format, so you’re seeing a comparison of CSV.jl reading a csv file, and Arrow.jl reading an arrow file. I don’t really have the time or desire to do detailed benchmarks right now vs. other languages/data formats, but if someone else would like to, feel free!
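For anyone who wants to reproduce this kind of comparison, here is a rough sketch. The generated data, column names, and file paths are placeholders rather than the file used above, and a single @time call includes compilation on first use, so run each line twice or use a benchmarking package for meaningful numbers.

```julia
using Arrow, CSV

# Generate a small placeholder CSV file (not the 70K-row file from the post).
csvfile = tempname() * ".csv"
open(csvfile, "w") do io
    println(io, "a,b,c")
    for i in 1:1000
        println(io, i, ",", i / 2, ",", "s", i)
    end
end

# Convert the CSV data to an arrow file once...
arrowfile = tempname() * ".arrow"
Arrow.write(arrowfile, CSV.File(csvfile))

# ...then time reading each format.
@time f = CSV.File(csvfile);
@time f2 = Arrow.Table(arrowfile);
```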

This is awesome and another big step forward for the Julia package ecosystem! Well done.

Hi,

Many thanks for your work on this - I am trying to incorporate Julia into a (“toy”) ETL of CSV -> Arrow format, since my goal is to consume the data in memory-mapped form (via Julia) after the fact.

To this end I have imported the Arrow.jl package and read through the GitHub and essentially tried to recreate the above (using Julia v1.5) with the following code:

using Arrow, Tables, CSV, DataFrames
parfile = "C:\\Desktop\\juliaData\\csv\\gtscopyv4.csv"
arrowfile = "C:\\Desktop\\juliaData\\csv\\gtscopyv4.arrow"
#df = DataFrame!(CSV.File(parfile, header = false))
#io = open(parfile)
Arrow.write(arrowfile, CSV.File(parfile, header = false))

However, I consistently receive the following error:

ERROR: LoadError: `write` is not supported on non-isbits arrays
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] write(::IOStream, ::CSV.File{false}) at .\io.jl:634
 [3] (::Base.var"#292#293"{CSV.File{false},Tuple{}})(::IOStream) at .\io.jl:396
 [4] open(::Base.var"#292#293"{CSV.File{false},Tuple{}}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at .\io.jl:325
 [5] open(::Function, ::String, ::String) at .\io.jl:323
 [6] write(::String, ::CSV.File{false}) at .\io.jl:396
 [7] top-level scope at c:\Users\doliver\Documents\JuliaRepo\TestCode\ArrowTest.jl:10
 [8] include(::String) at .\client.jl:457

The file itself contains some strings and ints:

[image: sample of the file contents]

Am I doing anything obviously wrong? I do note that this has been referenced in some earlier issue posts that appear to be resolved.

Apologies if this is the wrong place to be posting - this happens to be my first post here.

Regards

Hey @djholiver, sorry you’re having trouble. From the looks of it, your Arrow.jl version isn’t quite up to date. You can check this by doing ] st which will list all the packages in your environment w/ their versions. You’ll want to make sure your Arrow.jl is 0.3; anything earlier won’t work the same.

Do you have any examples of using a Python script to access an IN-MEMORY Arrow.jl structure? We were going to use Python to read a CSV into memory on an in-house server (256 GB RAM) and use Feather to offer up the data as an in-memory structure. Happy to use Arrow.jl instead.

Hi @quinnj, thank you for your rapid response - I had switched off the laptop for the weekend (a rarity) so didn’t catch it in time.

I had run Pkg.update("Arrow") before working with this initially; however, it didn’t actually update until I did a full Pkg.update() this morning.

Now that it has updated, I am able to run your example. This opens up myriad opportunities for query and analytical needs: my next steps involve reducing the data into separate query and index spaces, then binding this to a Julia service for on-demand data extraction.

What an amazing piece of code; kudos to you.

Regards

Hi,

I’ve been making a lot of progress with this package and now have working PoCs of a data query layer that reads directly from an arrow file, producing a parameter-based filtered output via an exported function.

What I want to do next is pass the filtered rows and a subset of the columns to one or more other compute processes. I intend to define the column subset before calling the filter result, but I’m unsure how best to proceed: should I wrap both the query and compute modules in a web service and do the communication over HTTP, or would Arrow.write allow me to do IPC with zero copy etc.?

How would I write the latter? Would I call a while loop over the rows produced by the query layer from the compute layer(s)? My end goal is to have a distributed, load balanced query and compute approach.

Regards,

Yeah, I think passing the data via IPC/on disk would work well. Once you perform the query on arrow data, you would just write the result via Arrow.write("datadir/result.arrow", query_result).

The compute process could utilize FileWatching.watch_folder to watch the data directory datadir for new query results to be written out. Once a new query result is written out, it would just call Arrow.Table("datadir/result.arrow") to get the query result arrow data.
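A rough sketch of that arrangement follows. The helper names publish_result and next_result are hypothetical, not part of Arrow.jl. Writing to a temporary name and renaming afterwards is an extra precaution assumed here, so that the watcher never opens a half-written file. The exact events and file names reported by watch_folder can vary by platform, so treat the consumer loop as a starting point.

```julia
using Arrow, FileWatching

# Producer side (hypothetical helper): write the query result under a
# temporary name, then rename it so readers only ever see complete files.
function publish_result(dir, name, query_result)
    tmp = joinpath(dir, name * ".tmp")
    final = joinpath(dir, name * ".arrow")
    Arrow.write(tmp, query_result)
    mv(tmp, final; force = true)
    return final
end

# Consumer side (hypothetical helper): wait for a change in `dir` and load
# the first *.arrow file that appears. Returns `nothing` on timeout;
# timeout_s = -1 waits indefinitely.
function next_result(dir; timeout_s = -1)
    while true
        fname, event = watch_folder(dir, timeout_s)
        event.timedout && return nothing
        path = joinpath(dir, fname)
        if endswith(fname, ".arrow") && isfile(path)
            return Arrow.Table(path)
        end
    end
end
```

The query process would call publish_result(datadir, "result", query_result), while each compute process loops on next_result(datadir). For several compute workers, some scheme for claiming files (for example, unique result names per worker) would still be needed.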