Why does appending a DataFrame to an Arrow file change the column type?

Hi,

Suppose I have the following DataFrame with unspecified column types.

julia> using Arrow, DataFrames, Dates

julia> Q = DataFrame(datmonth = [], datstat = [], qrytime = [])
0×3 DataFrame
 Row │ datmonth  datstat  qrytime 
     │ Any       Any      Any     
─────┴────────────────────────────

julia> push!(Q,(datmonth = "2022-03", datstat = true, qrytime = now()))
1×3 DataFrame
 Row │ datmonth  datstat  qrytime                 
     │ Any       Any      Any                     
─────┼────────────────────────────────────────────
   1 │ 2022-03   true     2024-06-08T21:32:32.774

Once I append this to an .arrow file, the column types change:

Arrow.append("filepath.arrow", Q)
DataFrame(Arrow.Table("filepath.arrow"))
1×3 DataFrame
 Row │ datmonth  datstat  qrytime                 
     │ String    Bool     DateTime                
─────┼────────────────────────────────────────────
   1 │ 2022-03      true  2024-06-08T21:32:32.774
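
To make the change easier to see than in the printed tables, the element types can be compared directly before and after the round trip. This is a minimal sketch that reuses the calls from the post; the scratch path built with `tempname()` is an assumption for illustration, and it relies on `Arrow.append` creating the file when it does not exist, as it did above.

```julia
using Arrow, DataFrames, Dates

# Same untyped DataFrame as in the post: every column has element type Any.
Q = DataFrame(datmonth = [], datstat = [], qrytime = [])
push!(Q, (datmonth = "2022-03", datstat = true, qrytime = now()))

# Element types of the in-memory columns.
before = eltype.(eachcol(Q))

# Hypothetical scratch file; it does not exist yet, so the first append creates it.
path = tempname() * ".arrow"
Arrow.append(path, Q)

# Element types of the columns as read back from the file.
after = eltype.(eachcol(DataFrame(Arrow.Table(path))))

println("in memory: ", before)
println("on disk:   ", after)
```

Going by the output shown above, `before` should be all `Any`, while `after` should be the narrowed types that Arrow actually stored.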

If a function then generates new DataFrames, they cannot be appended to the .arrow file, whether their column types differ from those of Q or are exactly the same.

# appending the same DataFrame a second time returns an error
Arrow.append("filepath.arrow", Q)
ERROR: ArgumentError: Table schema does not match existing arrow file schema
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:190 [inlined]
  [2] macro expansion
    @ ./task.jl:479 [inlined]
  [3] append(io::IOStream, source::DataFrame, arrow_schema::Tables.Schema{…}, compress::Nothing, largelists::Bool, denseunions::Bool, dictencode::Bool, dictencodenested::Bool, alignment::Int64, maxdepth::Int64, ntasks::Float64, meta::Nothing, colmeta::Nothing)
    @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/append.jl:179
  [4] 
    @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/append.jl:125
  [5] append
    @ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:70 [inlined]
  [6] #149
    @ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:64 [inlined]
  [7] open(::Arrow.var"#149#150"{@Kwargs{}, DataFrame}, ::String, ::Vararg{String}; kwargs::@Kwargs{})
    @ Base ./io.jl:396
  [8] open
    @ ./io.jl:393 [inlined]
  [9] #append#148
    @ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:63 [inlined]
 [10] append(file::String, tbl::DataFrame)
    @ Arrow ~/.julia/packages/Arrow/5pHqZ/src/append.jl:62
 [11] top-level scope
    @ REPL[27]:1
Some type information was truncated. Use `show(err)` to see complete types.

I understand the workaround would be to specify the column types when initializing the DataFrame:

Q = DataFrame(datmonth = String[], datstat =Bool[], qrytime = DateTime[])
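
If the DataFrame has already been built with `Any` columns, a related option is to narrow the column types just before appending. The sketch below assumes every column holds values of a single concrete type; it uses broadcasting of `identity` over the DataFrame, which rebuilds each column with the narrowest element type that fits its values. The name `Qtyped` is only illustrative.

```julia
using DataFrames, Dates

# Untyped DataFrame, as in the post.
Q = DataFrame(datmonth = [], datstat = [], qrytime = [])
push!(Q, (datmonth = "2022-03", datstat = true, qrytime = DateTime(2024, 6, 8)))

# Broadcasting identity returns a new DataFrame whose columns are
# re-collected, so Vector{Any} becomes Vector{String}, Vector{Bool}, ...
Qtyped = identity.(Q)

coltypes = eltype.(eachcol(Qtyped))
println(coltypes)
```

Passing `Qtyped` rather than `Q` to `Arrow.append` should then present the same column types that the file already stores, provided each batch really contains the same types.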

However, this does not work if different data types are generated and need to be appended to the Arrow file. That is, even if the DataFrame is initialized as:

Q = DataFrame(datmonth = Any[], datstat =Any[], qrytime = Any[])

Creating a new Arrow file via Arrow.append automatically converts the columns to the types of the data in the initial DataFrame.
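
Since the file ends up with one fixed type per column, one possible approach is to decide the target types up front and coerce every generated batch to them before appending. The sketch below is only an illustration under that assumption: `TARGET` and `coerce_to_schema` are hypothetical names, not part of Arrow.jl or DataFrames.jl, and the conversion throws an error if a value cannot be converted to the target type.

```julia
using DataFrames, Dates

# Hypothetical target schema: the one type each column should have on disk.
const TARGET = (datmonth = String, datstat = Bool, qrytime = DateTime)

# Hypothetical helper: rebuild `df` with exactly the columns and element
# types listed in `target`. convert(Vector{T}, ...) converts each element
# and throws if some value cannot be represented as T.
function coerce_to_schema(df::AbstractDataFrame, target::NamedTuple)
    out = DataFrame()
    for (name, T) in pairs(target)
        out[!, name] = convert(Vector{T}, df[!, name])
    end
    return out
end

# A batch generated with untyped columns.
batch = DataFrame(datmonth = Any["2022-04"], datstat = Any[false],
                  qrytime = Any[DateTime(2024, 6, 9)])

typed_batch = coerce_to_schema(batch, TARGET)
println(eltype.(eachcol(typed_batch)))
```

Each `typed_batch` would then be what gets passed to `Arrow.append`. Batches whose values genuinely have different types per column would still need an explicit decision about a common representation.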

Just wondering if I was going about this the wrong way or if there was a kwarg to turn off this behavior. Thanks.

I think this is all expected. Arrow only supports a limited set of types, so you can’t write Any. (I’m not sure how this would even work in theory: the point of Arrow is to store your data as binary in the correct memory layout so that it can be mmapped when read back in, which conceptually seems to rule out storing Any values of unknown layout.)
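
The layout point can be illustrated in plain Julia, without Arrow. This is a rough sketch rather than a description of how Arrow.jl works internally: a vector with a concrete bits element type can be viewed directly as raw bytes, while a `Vector{Any}` has no fixed per-element layout to view.

```julia
typed   = Bool[true, false]        # concrete element type, one byte per value
untyped = Any[true, "2022-03"]     # elements may be values of any type

# A concrete bits type has a known, fixed size per element.
println(isconcretetype(eltype(typed)), " ", isbitstype(eltype(typed)))
println(isconcretetype(eltype(untyped)), " ", isbitstype(eltype(untyped)))

# The typed vector can be reinterpreted as its underlying bytes.
bytes = collect(reinterpret(UInt8, typed))
println(bytes)

# The untyped vector cannot: reinterpret rejects non-bits element types.
untyped_ok = try
    reinterpret(UInt8, untyped)
    true
catch
    false
end
println(untyped_ok)
```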

There’s a way to define custom types described here if you need to extend what’s available.