I am working with a DataFrame saved as an Arrow file:
using Arrow, DataFrames
fn = download("https://www.dropbox.com/s/322g26p3apdeqpf/granules.remote?dl=0")
Load the file as a DataFrame:
g = DataFrame(Arrow.Table(fn))
Copy the DataFrame:
g0 = copy(g)
Set row 1 of g0 equal to row 1 of g:
g0[1,:] = g[1,:]
This results in the following error:
ERROR: CanonicalIndexError: setindex! not defined for Arrow.List{String, Int32, Vector{UInt8}}
Stacktrace:
[1] error_if_canonical_setindex(#unused #::IndexLinear, A::Arrow.List{String, Int32, Vector{UInt8}}, #unused #::Int64)
@ Base ./abstractarray.jl:1352
[2] setindex!(A::Arrow.List{String, Int32, Vector{UInt8}}, v::String, I::Int64)
@ Base ./abstractarray.jl:1343
[3] insert_single_entry!(df::DataFrame, v::String, row_ind::Int64, col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/dataframe/dataframe.jl:631
[4] setindex!(df::DataFrame, v::String, row_ind::Int64, col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/dataframe/dataframe.jl:659
[5] setindex!(df::DataFrame, v::DataFrameRow{DataFrame, DataFrames.Index}, row_ind::Int64, col_inds::Colon)
@ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/dataframerow/dataframerow.jl:255
[6] top-level scope
@ REPL[53]:1
bkamins
November 15, 2022, 6:25pm
2
I cannot reproduce it. I get:
julia> g0[1,:] = g[1,:]
DataFrameRow
 Row │ id                      center                     polygon                            granules
     │ String                  NamedTup…                  NamedTup…                          Array…
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ lat[+72+74]lon[-76-74]  (lat = 73.0, lon = -75.0)  (min_x = -76.0, min_y = 72.0, ma…  NamedTuple{(:id, :url, :bbox, :i…
under Julia 1.8.2 and:
(@v1.8) pkg> st DataFrames Arrow
Status `C:\Users\bogum\.julia\environments\v1.8\Project.toml`
[69666777] Arrow v2.3.0
[a93c6f00] DataFrames v1.4.3
Sigh… I was using DataFrames v1.3.2; updating to v1.3.6 resolves the issue. Thanks @bkamins
@bkamins looks like the issue persists using:
[69666777] Arrow v2.3.0
[a93c6f00] DataFrames v1.4.3
This works:
fn = download("https://www.dropbox.com/s/322g26p3apdeqpf/granules.remote?dl=0")
g = DataFrame(Arrow.Table(fn))
g0 = copy(g)
g0[1,:] = g[1,:]
But when I define g0 from file it does not:
g0 = DataFrame(Arrow.Table(fn))
g0[1,:] = g[1,:]
throws the following error:
ERROR: CanonicalIndexError: setindex! not defined for Arrow.List{String, Int32, Vector{UInt8}}
Stacktrace:
[1] error_if_canonical_setindex(#unused #::IndexLinear, A::Arrow.List{String, Int32, Vector{UInt8}}, #unused #::Int64)
@ Base ./abstractarray.jl:1352
[2] setindex!(A::Arrow.List{String, Int32, Vector{UInt8}}, v::String, I::Int64)
@ Base ./abstractarray.jl:1343
[3] insert_single_entry!(df::DataFrame, v::String, row_ind::Int64, col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/dataframe/dataframe.jl:659
[4] setindex!(df::DataFrame, v::String, row_ind::Int64, col_ind::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/dataframe/dataframe.jl:688
[5] setindex!(df::DataFrame, v::DataFrameRow{DataFrame, DataFrames.Index}, row_ind::Int64, col_inds::Colon)
@ DataFrames ~/.julia/packages/DataFrames/KKiZW/src/dataframerow/dataframerow.jl:255
[6] top-level scope
@ REPL[37]:1
nilshg
November 15, 2022, 8:32pm
5
But that's expected: by default the data isn't copied and is read-only, so you can't mutate it.
Try
g = DataFrame(Arrow.Table(fn); copycols=true)
Then you shouldn't need the copy later.
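A minimal sketch of that suggestion, assuming Arrow.jl and DataFrames.jl are installed. The small table and temporary file path are made up for illustration and stand in for the downloaded file; the names ro and rw are hypothetical.

```julia
using Arrow, DataFrames

# Hypothetical stand-in for the downloaded file: a tiny table written to a temp path.
path = joinpath(mktempdir(), "example.arrow")
Arrow.write(path, (id = ["a", "b", "c"], x = [1, 2, 3]))

# Default construction keeps the columns as Arrow's read-only array types.
ro = DataFrame(Arrow.Table(path))

# copycols=true copies each column into a regular, mutable Julia vector.
rw = DataFrame(Arrow.Table(path); copycols=true)

# Row assignment works here because the destination columns are mutable.
rw[1, :] = ro[2, :]
```

Only the destination needs mutable columns; the source row can still come from the read-only DataFrame.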
Ahh… ok. Thank you all for the guidance.
I wrongly assumed it was lazy loading and would load once operated on.
This error seems really cryptic for "DataFrame is read only"… maybe a new error could be added.
By default, Arrow.jl uses memory-mapping when loading a table from a file, as you say, and copycols=true tells it to eagerly load all the data into Vectors. So if you are memory constrained, you may not want to use copycols=true.
The problem is that Arrow.jl presents a read-only view of the data, so you can't do in-place operations like g0[1,:] = g[1,:] on it. You need your DataFrame to be backed by mutable columns to do that. But you can load the table with copycols=false to get the memory-map backed version, then use select and subset (with view=true) to pick columns or rows without making a copy, reducing the data size. Once you have only the data you need, you can make a copy to move that data into mutable vectors.
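The workflow described above might look roughly like the sketch below. It assumes Arrow.jl and DataFrames.jl are installed; the table contents, the temporary path, and the variable names are hypothetical. Passing copycols=false to select is my assumption about how to avoid copying at the column-selection step.

```julia
using Arrow, DataFrames

# Hypothetical example data written to a temp file.
path = joinpath(mktempdir(), "example.arrow")
Arrow.write(path, (id = ["a", "b", "c", "d"],
                   x = [1, 2, 3, 4],
                   y = [0.1, 0.2, 0.3, 0.4]))

# Memory-map backed, read-only columns; nothing is copied eagerly.
full = DataFrame(Arrow.Table(path); copycols=false)

# Keep two columns, reusing the source columns instead of copying them.
cols = select(full, :id, :x; copycols=false)

# Keep matching rows as a view (SubDataFrame) rather than a new DataFrame.
rows = subset(cols, :x => ByRow(>(2)); view=true)

# Copy only the reduced data into ordinary mutable vectors.
small = copy(rows)
small[1, :x] = 10
```

Only the final copy allocates new column storage, and by then it covers just the rows and columns that were kept.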
bkamins
November 15, 2022, 9:50pm
9
The problem is that DataFrames.jl is not aware of this error. It is Arrow.jl that throws the error.
@quinnj - maybe a setindex! method could be defined for immutable Arrow types that would produce a more informative error?
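To illustrate the idea, here is a sketch using a hypothetical ReadOnlyVector type as a stand-in for an immutable Arrow column. This is not Arrow.jl's actual code, and the error message wording is invented.

```julia
# Hypothetical read-only vector type standing in for an immutable Arrow column.
struct ReadOnlyVector{T} <: AbstractVector{T}
    data::Vector{T}
end

Base.size(v::ReadOnlyVector) = size(v.data)
Base.getindex(v::ReadOnlyVector, i::Int) = v.data[i]

# Without this method, assignment falls through to Base's generic
# "setindex! not defined" error. With it, the message names the cause
# and points at a fix.
function Base.setindex!(v::ReadOnlyVector, x, i::Int)
    throw(ArgumentError("$(typeof(v)) is read-only; copy it into a Vector before mutating"))
end

v = ReadOnlyVector(["a", "b"])

# Capture the error text produced by an attempted in-place assignment.
msg = try
    v[1] = "z"
    ""
catch err
    sprint(showerror, err)
end
```

The stored data is left untouched, and the captured message explains that the container is read-only instead of reporting a missing method.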
quinnj
November 16, 2022, 1:03am
10
Yeah, that's probably a good idea.