Hi,
Suppose I have the following DataFrame with unspecified column types.
Q = DataFrame(datmonth = [], datstat =[], qrytime = [])
0×3 DataFrame
Row │ datmonth datstat qrytime
│ Any Any Any
─────┴────────────────────────────
julia> push!(Q,(datmonth = "2022-03", datstat = true, qrytime = now()))
1×3 DataFrame
Row │ datmonth datstat qrytime
│ Any Any Any
─────┼────────────────────────────────────────────
1 │ 2022-03 true 2024-06-08T21:32:32.774
Once I append this to an .arrow file, the column types change:
Arrow.append("filepath.arrow", Q)
DataFrame(Arrow.Table("filepath.arrow"))
1×3 DataFrame
Row │ datmonth datstat qrytime
│ String Bool DateTime
─────┼────────────────────────────────────────────
1 │ 2022-03 true 2024-06-08T21:32:32.774
If I have a function that generates new DataFrames with different column types, or even the same column types as Q, they cannot be appended to the .arrow file.
# appending the same DataFrame a second time returns an error
Arrow.append("filepath.arrow", Q)
ERROR: ArgumentError: Table schema does not match existing arrow file schema
Stacktrace:
[1] macro expansion
@ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:190 [inlined]
[2] macro expansion
@ ./task.jl:479 [inlined]
[3] append(io::IOStream, source::DataFrame, arrow_schema::Tables.Schema{…}, compress::Nothing, largelists::Bool, denseunions::Bool, dictencode::Bool, dictencodenested::Bool, alignment::Int64, maxdepth::Int64, ntasks::Float64, meta::Nothing, colmeta::Nothing)
@ Arrow ~/.julia/packages/Arrow/5pHqZ/src/append.jl:179
[4]
@ Arrow ~/.julia/packages/Arrow/5pHqZ/src/append.jl:125
[5] append
@ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:70 [inlined]
[6] #149
@ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:64 [inlined]
[7] open(::Arrow.var"#149#150"{@Kwargs{}, DataFrame}, ::String, ::Vararg{String}; kwargs::@Kwargs{})
@ Base ./io.jl:396
[8] open
@ ./io.jl:393 [inlined]
[9] #append#148
@ ~/.julia/packages/Arrow/5pHqZ/src/append.jl:63 [inlined]
[10] append(file::String, tbl::DataFrame)
@ Arrow ~/.julia/packages/Arrow/5pHqZ/src/append.jl:62
[11] top-level scope
@ REPL[27]:1
Some type information was truncated. Use `show(err)` to see complete types.
I understand the workaround would be to specify the column types when initializing the DataFrame:
Q = DataFrame(datmonth = String[], datstat =Bool[], qrytime = DateTime[])
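For a DataFrame that already exists with Any columns, I assume the same effect could be had by narrowing the column types just before appending. A minimal sketch of that idea follows; `narrow_columns` is a hypothetical helper name of my own, not part of Arrow.jl or DataFrames.jl, and I have not confirmed that this is the recommended approach.

```julia
using DataFrames, Dates

# Hypothetical helper (not a library function): rebuild each column so its
# element type is the tightest one that fits the values it currently holds.
narrow_columns(df::AbstractDataFrame) = mapcols(col -> identity.(col), df)

# Same setup as above: untyped columns, one pushed row.
Q = DataFrame(datmonth = [], datstat = [], qrytime = [])
push!(Q, (datmonth = "2022-03", datstat = true, qrytime = now()))

Qn = narrow_columns(Q)

eltype.(eachcol(Q))   # element types of the original columns (all Any)
eltype.(eachcol(Qn))  # element types after narrowing
```

The idea would then be to pass `Qn` rather than `Q` to Arrow.append, so that the schema of the table being appended has concrete types instead of Any.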
However, this does not work if DataFrames with different data types are generated and need to be appended to the Arrow file. That is, even if the DataFrame is initialized as
Q = DataFrame(datmonth = Any[], datstat =Any[], qrytime = Any[])
creating a new Arrow file via Arrow.append automatically converts the columns to the types of the data in the initial DataFrame.
I am wondering whether I am going about this the wrong way, or whether there is a kwarg to turn off this behavior. Thanks.