Why does Appending a Dataframe to an Arrow file Change the column type?

phantom · June 9, 2024, 5:16am

Hi,

Suppose I have the following DataFrame with unspecified column types.

julia> using Arrow, DataFrames, Dates

julia> Q = DataFrame(datmonth = [], datstat = [], qrytime = [])
0×3 DataFrame
 Row │ datmonth  datstat  qrytime 
     │ Any       Any      Any     
─────┴────────────────────────────

julia> push!(Q,(datmonth = "2022-03", datstat = true, qrytime = now()))
1×3 DataFrame
 Row │ datmonth  datstat  qrytime                 
     │ Any       Any      Any                     
─────┼────────────────────────────────────────────
   1 │ 2022-03   true     2024-06-08T21:32:32.774

Once I append this to an .arrow file, the column types change:

Arrow.append("filepath.arrow", Q)
DataFrame(Arrow.Table("filepath.arrow"))
1×3 DataFrame
 Row │ datmonth  datstat  qrytime                 
     │ String    Bool     DateTime                
─────┼────────────────────────────────────────────
   1 │ 2022-03      true  2024-06-08T21:32:32.774

If I have a function that generates new DataFrames, they cannot be appended to the .arrow file, whether their column types differ from those of Q or are exactly the same.

# appending the same DataFrame a second time returns an error
Arrow.append("filepath.arrow", Q)
ERROR: ArgumentError: Table schema does not match existing arrow file schema
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:190 [inlined]
  [2] macro expansion
    @ ./task.jl:479 [inlined]
  [3] append(io::IOStream, source::DataFrame, arrow_schema::Tables.Schema{…}, compress::Nothing, largelists::Bool, denseunions::Bool, dictencode::Bool, dictencodenested::Bool, alignment::Int64, maxdepth::Int64, ntasks::Float64, meta::Nothing, colmeta::Nothing)
    @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/append.jl:179
  [4] 
    @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/append.jl:125
  [5] append
    @ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:70 [inlined]
  [6] #149
    @ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:64 [inlined]
  [7] open(::Arrow.var"#149#150"{@Kwargs{}, DataFrame}, ::String, ::Vararg{String}; kwargs::@Kwargs{})
    @ Base ./io.jl:396
  [8] open
    @ ./io.jl:393 [inlined]
  [9] #append#148
    @ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:63 [inlined]
 [10] append(file::String, tbl::DataFrame)
    @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/append.jl:62
 [11] top-level scope
    @ REPL[27]:1
Some type information was truncated. Use `show(err)` to see complete types.

I understand the workaround would be to specify the column types when initializing the DataFrame:

Q = DataFrame(datmonth = String[], datstat =Bool[], qrytime = DateTime[])

However, this does not work if different data types are generated and need to be appended to the arrow file. Even if the DataFrame is explicitly initialized as

Q = DataFrame(datmonth = Any[], datstat =Any[], qrytime = Any[])

creating a new arrow file via Arrow.append automatically converts the columns to the types of the data in the initial DataFrame.

Just wondering if I am going about this the wrong way, or if there is a kwarg to turn off this behavior. Thanks.

nilshg · June 9, 2024, 7:37am

I think this is all expected. Arrow only supports a limited set of types, so you can’t write Any. (I’m not sure how this would even work in theory: the point of Arrow is to store your data as binary in the correct memory layout so that it can be mmapped on read, which conceptually seems to rule out storing Any values of unknown layout.)

There’s a way to define custom types described here if you need to extend what’s available.
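
For the case in the question where the new DataFrames hold the same kinds of values as the file but their columns are typed Any, one possible approach is to narrow the column types before each append so that the schema matches what Arrow stored on the first write. The sketch below is only an illustration under assumptions: `narrow` is a hypothetical helper name (not part of Arrow.jl or DataFrames.jl), and the scratch path from `mktempdir()` is just there to keep the example self-contained. It does not help when a later batch genuinely contains different types than the file.

```julia
using Arrow, DataFrames, Dates

# Hypothetical helper (not an Arrow.jl or DataFrames.jl function):
# rebuild every column so its element type is the narrowest one
# that fits the values it currently holds.
narrow(df::AbstractDataFrame) = mapcols(col -> identity.(col), df)

# Same setup as in the question: columns start out as Vector{Any}.
Q = DataFrame(datmonth = [], datstat = [], qrytime = [])
push!(Q, (datmonth = "2022-03", datstat = true, qrytime = now()))

# After narrowing, the columns are typed String, Bool and DateTime.
Qn = narrow(Q)

# Scratch location for this sketch (assumption; use your own path).
path = joinpath(mktempdir(), "filepath.arrow")

Arrow.append(path, Qn)  # first call creates the file, as in the question
Arrow.append(path, Qn)  # second call should now see a matching schema

DataFrame(Arrow.Table(path))
```

Narrowing at the point of writing keeps the in-memory DataFrame flexible while it is being built, and only commits to concrete types when the data goes to disk.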
