How to read Arrow.jl files from R?

I wrote a file with Arrow.write() from Arrow.jl 1.2.4 and tried to read it with read_feather() from R arrow 3.0.0, but the read failed with:

Error in ipc___feather___Reader__Open(file) : 
  Invalid: Not a Feather V1 or Arrow IPC file

How can I read this file from R?

I don’t have difficulty with that combination on Ubuntu 20.10:

bates$ julia-1.5.3 -t auto -O3
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.5.3 (2020-11-09)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Arrow

(@v1.5) pkg> status Arrow
Status `~/.julia/environments/v1.5/Project.toml`
  [69666777] Arrow v1.2.4

julia> Arrow.write("/tmp/arrowtest.arrow", (x = rand(6), f = repeat(["A","B"], inner=3)))
"/tmp/arrowtest.arrow"

julia> 
bates$ R

R version 4.0.4 (2021-02-15) -- "Lost Library Book"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> require("arrow")
Loading required package: arrow

Attaching package: ‘arrow’

The following object is masked from ‘package:utils’:

    timestamp

> read_feather("/tmp/arrowtest.arrow")
          x f
1 0.9617653 A
2 0.4586787 A
3 0.3014748 A
4 0.4049147 B
5 0.6989243 B
6 0.3541891 B
Arrow.write("/tmp/my.arrow", RDatasets.dataset("ggplot2", "diamonds"))

works with R’s read_feather("/tmp/my.arrow") but

open("/tmp/my.arrow", "w") do f
 Arrow.write(f, RDatasets.dataset("ggplot2", "diamonds")) 
end

gives an error in R

> df = read_feather("/tmp/my.arrow")
Error in ipc___feather___Reader__Open(file) : 
  Invalid: Not a Feather V1 or Arrow IPC file

I think we may need @quinnj to weigh in here. Arrow.write can write either the file format or the memory format, and I suspect that is the distinction here. The file format begins with a 6-character magic number, “ARROW1”:

julia> String(read("/tmp/my.arrow")[1:6])
"ARROW1"
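
To make that check repeatable, here is a minimal sketch in plain Julia (no Arrow.jl needed). The helper name looks_like_arrow_file is hypothetical, not part of any package. It relies on the Arrow file format both starting and ending with the “ARROW1” magic bytes, so it only tells you which layout a file appears to use, not whether the file is valid.

```julia
# Hypothetical helper (not part of Arrow.jl): report whether a file on disk
# appears to use the Arrow *file* format, which starts and ends with the
# magic bytes "ARROW1". A raw IPC stream does not carry this magic.
function looks_like_arrow_file(path::AbstractString)
    magic = Vector{UInt8}("ARROW1")
    bytes = read(path)                      # whole file as Vector{UInt8}
    n = length(magic)
    length(bytes) >= 2n || return false     # too short to hold both magics
    return bytes[1:n] == magic && bytes[end-n+1:end] == magic
end
```

Calling looks_like_arrow_file("/tmp/my.arrow") after each of the two ways of writing the file should show whether the two outputs really differ in format.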

Hmmm, weird. So there are 3 distinct formats possible:

  • Feather V1: this is pre-arrow 1.0, which supported only a subset of full arrow type spec and had a few other minor differences
  • Arrow IPC: this is the arrow “in memory” format in raw bytes, often used by being sent over the wire via HTTP request or gRPC, or you could just save these bytes to disk
  • Arrow file: this is basically the same as “in memory” format written to disk, but includes a little extra metadata to enable “random access” to specific record batches within a file. This is also known as Feather V2

So it seems weird that the R package says it expects Feather V1 or Arrow IPC but apparently can’t read Feather V2. Or maybe I’m getting this backwards: from your example, writing to a filename given as a String (which produces Feather V2) works, while writing to an IO (which produces the IPC format) doesn’t?

I think it may be the error message from the R arrow package that is incorrect. I certainly have had no problem reading the Feather V2 format files in R.

Yeah, looking back over the examples, I think that’s right. I think they don’t support reading the raw arrow IPC messages.

I think another function, read_ipc_stream, is used for reading the raw arrow IPC messages.
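
Pulling the thread together, here is a rough sketch of the three ways to write the same table from Julia. It assumes that Arrow.write accepts a file keyword when given an IO; that keyword is documented in current Arrow.jl, but I have not checked it against 1.2.4, so treat it as an assumption. The output paths are made up for illustration.

```julia
using Arrow

tbl = (x = rand(6), f = repeat(["A", "B"], inner = 3))

# 1. Path given as a String: writes the Arrow file format (Feather V2).
Arrow.write("/tmp/arrowtest_file.arrow", tbl)

# 2. IO handle: writes the raw IPC stream format by default.
open("/tmp/arrowtest_stream.arrow", "w") do io
    Arrow.write(io, tbl)
end

# 3. IO handle with file=true (assumed keyword, see above): asks for the
#    file format even though the destination is an IO.
open("/tmp/arrowtest_io_file.arrow", "w") do io
    Arrow.write(io, tbl; file = true)
end
```

On the R side, read_feather() should handle outputs 1 and 3, while output 2 is the case for read_ipc_stream() mentioned above.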