Setindex! issue with DataFrame saved as Arrow file

Working with a DataFrame saved as an Arrow file:
fn = download("https://www.dropbox.com/s/322g26p3apdeqpf/granules.remote?dl=0")

Loading in the file as a df
g = DataFrame(Arrow.Table(fn))

Copy the DataFrame
g0 = copy(g)

Set row 1 of g0 to row 1 of g
g0[1,:] = g[1,:]

Results in the following error:
ERROR: CanonicalIndexError: setindex! not defined for Arrow.List{String, Int32, Vector{UInt8}}
Stacktrace:
[1] error_if_canonical_setindex(#unused#::IndexLinear, A::Arrow.List{String, Int32, Vector{UInt8}}, #unused#::Int64)
@ Base ./abstractarray.jl:1352
[2] setindex!(A::Arrow.List{String, Int32, Vector{UInt8}}, v::String, I::Int64)
@ Base ./abstractarray.jl:1343
[3] insert_single_entry!(df::DataFrame, v::String, row_ind::Int64, col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/dataframe/dataframe.jl:631
[4] setindex!(df::DataFrame, v::String, row_ind::Int64, col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/dataframe/dataframe.jl:659
[5] setindex!(df::DataFrame, v::DataFrameRow{DataFrame, DataFrames.Index}, row_ind::Int64, col_inds::Colon)
@ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/dataframerow/dataframerow.jl:255
[6] top-level scope
@ REPL[53]:1

I cannot reproduce it. I get:

julia> g0[1,:] = g[1,:]
DataFrameRow
 Row │ id                      center                     polygon                            granules
     │ String                  NamedTup…                  NamedTup…                          Array…
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ lat[+72+74]lon[-76-74]  (lat = 73.0, lon = -75.0)  (min_x = -76.0, min_y = 72.0, ma…  NamedTuple{(:id, :url, :bbox, :i…

under Julia 1.8.2 and:

(@v1.8) pkg> st DataFrames Arrow
Status `C:\Users\bogum\.julia\environments\v1.8\Project.toml`
  [69666777] Arrow v2.3.0
  [a93c6f00] DataFrames v1.4.3

Sigh… I was using DataFrames v1.3.2; updating to v1.3.6 resolves the issue. Thanks @bkamins

@bkamins looks like the issue persists using:
[69666777] Arrow v2.3.0
[a93c6f00] DataFrames v1.4.3

This works:
fn = download("https://www.dropbox.com/s/322g26p3apdeqpf/granules.remote?dl=0")
g = DataFrame(Arrow.Table(fn))
g0 = copy(g)
g0[1,:] = g[1,:]

But when I load g0 directly from the file, it does not:
g0 = DataFrame(Arrow.Table(fn))
g0[1,:] = g[1,:]

throws the following error:

ERROR: CanonicalIndexError: setindex! not defined for Arrow.List{String, Int32, Vector{UInt8}}
Stacktrace:
[1] error_if_canonical_setindex(#unused#::IndexLinear, A::Arrow.List{String, Int32, Vector{UInt8}}, #unused#::Int64)
@ Base ./abstractarray.jl:1352
[2] setindex!(A::Arrow.List{String, Int32, Vector{UInt8}}, v::String, I::Int64)
@ Base ./abstractarray.jl:1343
[3] insert_single_entry!(df::DataFrame, v::String, row_ind::Int64, col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/dataframe/dataframe.jl:659
[4] setindex!(df::DataFrame, v::String, row_ind::Int64, col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/dataframe/dataframe.jl:688
[5] setindex!(df::DataFrame, v::DataFrameRow{DataFrame, DataFrames.Index}, row_ind::Int64, col_inds::Colon)
@ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/dataframerow/dataframerow.jl:255
[6] top-level scope
@ REPL[37]:1

But that’s expected: by default the data isn’t copied and is read-only, so you can’t mutate it.

Try

g = DataFrame(Arrow.Table(fn); copycols=true)

Then you shouldn’t need the copy later.
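
A minimal sketch of the difference, assuming a tiny table written to a temporary file stands in for the downloaded one (the path and the column names id and x are made up for illustration):

```julia
using Arrow, DataFrames

# Hypothetical stand-in for the downloaded file: a small table in a temp path.
path = tempname() * ".arrow"
Arrow.write(path, (id = ["a", "b"], x = [1, 2]))

ro = DataFrame(Arrow.Table(path))                 # columns stay Arrow-backed
rw = DataFrame(Arrow.Table(path); copycols=true)  # columns copied into plain Vectors

# Writing into the Arrow-backed String column is expected to throw,
# as in the stack trace above.
write_failed = try
    ro[1, :id] = "z"
    false
catch
    true
end

# The copied frame accepts in-place row assignment.
rw[1, :] = ro[2, :]
```

If this behaves as assumed, write_failed is true and rw has the values of row 2 in both of its rows.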

Ahh… ok. Thank you all for the guidance.

  1. I wrongly assumed it was lazy loading and would load once operated on.
  2. This error seems really cryptic for “DataFrame is read only”… maybe a new error could be added

By default, Arrow.jl uses memory-mapping when loading a table from a file, as you say, and copycols=true tells it to eagerly load all the data into Vectors. So if you are memory constrained, you may not want to use copycols=true.

The problem is that Arrow.jl presents a read-only view of the data, so you can’t do in-place operations like g0[1,:] = g[1,:] on it. You need your DataFrame to be backed by mutable columns to do that. But you can load the table with copycols=false to get the memory-map backed version, then use select (with copycols=false) and subset (with view=true) to pick columns or rows without making a copy, reducing the data size. Then, once you have only the data you need, you can make a copy to move that data into mutable vectors.
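
A rough sketch of that workflow, again using a small temporary file in place of the real one (the path, the column names, and the filter are illustrative assumptions):

```julia
using Arrow, DataFrames

# Hypothetical stand-in file with three columns.
path = tempname() * ".arrow"
Arrow.write(path, (id = ["a", "b", "c"], x = [1, 2, 3], y = [0.1, 0.2, 0.3]))

# 1. Load without copying: columns remain backed by the Arrow data.
tbl = DataFrame(Arrow.Table(path); copycols=false)

# 2. Narrow down without copying: keep two columns, then take a row view.
cols = select(tbl, :id, :x; copycols=false)
small = subset(cols, :x => ByRow(>(1)); view=true)

# 3. Materialize only the reduced data into ordinary, mutable Vectors.
mutable = copy(small)
mutable[1, :x] = 10   # in-place edits now work
```

Only the rows and columns that survive step 2 get copied in step 3, which is where the memory saving comes from.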

The problem is that DataFrames.jl is not aware of this error. It is Arrow.jl that throws the error.

@quinnj - maybe a setindex! method could be defined for immutable Arrow types that would produce a more informative error?
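
Roughly what that could look like, sketched on a hypothetical ReadOnlyVector type that only stands in for an immutable Arrow column (this is not Arrow.jl code, and the message wording is an assumption):

```julia
# Hypothetical read-only vector, standing in for an immutable Arrow column type.
struct ReadOnlyVector{T} <: AbstractVector{T}
    data::Vector{T}
end

Base.size(v::ReadOnlyVector) = size(v.data)
Base.IndexStyle(::Type{<:ReadOnlyVector}) = IndexLinear()
Base.getindex(v::ReadOnlyVector, i::Int) = v.data[i]

# Instead of falling through to Base's generic "setindex! not defined" error,
# throw a message that tells the user what to do.
function Base.setindex!(v::ReadOnlyVector, x, i::Int)
    throw(ArgumentError(
        "$(typeof(v)) is read-only; call `copy` to get a mutable Vector first"))
end

v = ReadOnlyVector(["a", "b"])
msg = try
    v[1] = "z"
    ""
catch err
    sprint(showerror, err)
end
```

Here msg ends up holding the read-only explanation, and copy(v) still returns a plain Vector that can be modified.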

Yeah, that’s probably a good idea.
