Help with appending row in read in DataFrame (weird behavior)

If I directly define a dataframe as

GMM_data = DataFrame(ror = [], rev_C = [], rev_NG = [], uu_1_C = [], uu_2_C = [], uu_3_C = [], uu_4_C = [], uu_1_NG = [], uu_2_NG = [], uu_3_NG = [], uu_4_NG = [],
        ramp_C = [], ramp_NG = [], g_h_C = [], g_h_NG = [], g_h_NGC = [], prof = [], ror_dif  = [], rev_C_dif = [], rev_NG_dif = [],
        uu_1_C_dif = [], uu_2_C_dif = [], uu_3_C_dif = [], uu_4_C_dif = [], uu_1_NG_dif = [], uu_2_NG_dif = [], uu_3_NG_dif = [], uu_4_NG_dif = [],
            ramp_C_dif = [], ramp_NG_dif = [], g_h_C_dif = [], g_h_NG_dif = [], g_h_NGC_dif = [], prof_dif = [], gmm = []);

Then I append a vector of the same size using push!:

push!(GMM_data, Any[-464547.4242273845, 3.853048934599117e8, 6.594229905162156e8, 2.3667284934715833, 2.337350615008504, 2.2234660448464014, 0.767728159659241, 0.0, -0.0255604538388607, -0.021426232084953166, -0.006821965746661444, 0.013891513438316122, 0.0006265685948713331, 0.8547474974727632, 0.06211081603260709, 0.08826851897694173, 7.006229768868772e9, 464547.39567910414, -3.853049537341995e8, -6.594231521204182e8, -1.8634861495585215, -1.008591746762811, -1.1093025904745037, 0.0839979043036051, -0.46252218106964815, -0.15331777535252566, -0.04712077936233181, 0.053266307472369115, -0.00933061823377325, 0.0004751324744484023, -0.318557425856661, 0.29173434423850525, -0.007439262327100238, -7.006228930191487e9, 2.164807258930988e14])
1×35 DataFrame
 Row │ ror        rev_C      rev_NG     uu_1_C   uu_2_C   uu_3_C   uu_4_C    ⋯
     │ Any        Any        Any        Any      Any      Any      Any       ⋯
─────┼────────────────────────────────────────────────────────────────────────
   1 │ -464547.0  3.85305e8  6.59423e8  2.36673  2.33735  2.22347  0.767728  ⋯
                                                            28 columns omitted

So that works fine. But suppose I define the empty dataframe with the same columns as before and write it out:

CSV.write("/users/miguelborrero/Desktop/Energy_Transitions/Data/GMM_data.csv", GMM_data);

Then, when I read it back in and perform the exact same push as before, it gives a weird error:

GMM_data = CSV.read("/users/miguelborrero/Desktop/Energy_Transitions/Data/GMM_data.csv", DataFrame)
0×35 DataFrame

push!(GMM_data, Any[-464547.4242273845, 3.853048934599117e8, 6.594229905162156e8, 2.3667284934715833, 2.337350615008504, 2.2234660448464014, 0.767728159659241, 0.0, -0.0255604538388607, -0.021426232084953166, -0.006821965746661444, 0.013891513438316122, 0.0006265685948713331, 0.8547474974727632, 0.06211081603260709, 0.08826851897694173, 7.006229768868772e9, 464547.39567910414, -3.853049537341995e8, -6.594231521204182e8, -1.8634861495585215, -1.008591746762811, -1.1093025904745037, 0.0839979043036051, -0.46252218106964815, -0.15331777535252566, -0.04712077936233181, 0.053266307472369115, -0.00933061823377325, 0.0004751324744484023, -0.318557425856661, 0.29173434423850525, -0.007439262327100238, -7.006228930191487e9, 2.164807258930988e14])
┌ Error: Error adding value to column :ror.
└ @ DataFrames ~/.julia/packages/DataFrames/JZ7x5/src/dataframe/dataframe.jl:1719
ERROR: StackOverflowError:
Stacktrace:
 [1] append! at /Users/miguelborrero/.julia/packages/SentinelArrays/EQtMp/src/missingvector.jl:109 [inlined]
 [2] push!(::SentinelArrays.MissingVector, ::Float64) at ./array.jl:961
 ... (the last 2 lines are repeated 79982 more times)
 [159967] append! at /Users/miguelborrero/.julia/packages/SentinelArrays/EQtMp/src/missingvector.jl:109 [inlined]

How can the two processes not be the same and therefore how can the second process fail???

Thanks a lot in advance.

I am not able to reproduce your case exactly but, while waiting for someone to explain the exact cause of the error you get, you could try using a Tuple (or a NamedTuple) instead of the Vector{Any} to add a row to the empty data frame.

Thanks for your reply, rocco. To reproduce my case, just define an empty dataframe with two columns (e.g. x = [], y = []) and save it to a CSV file. Then read that same CSV into a dataframe and try to push a row of floats (e.g. push!(df, [1.2, 2.2])), and you should get the same error. As you suggested, I will try with a tuple. Thanks again.
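
For reference, those steps can be sketched roughly as below. The temporary file path and the variable names are placeholders chosen for this example, and the exact exception may differ between CSV.jl and DataFrames.jl versions, so the sketch only prints whatever is thrown.

```julia
using CSV, DataFrames

# Placeholder path for this sketch; any writable location works.
path = tempname() * ".csv"

# An empty two-column data frame: only the header line ends up in the file.
CSV.write(path, DataFrame(x = [], y = []))

# Read the header-only file back in.
df = CSV.read(path, DataFrame)

# Inspect the element types CSV.read chose for the empty columns.
println(eltype.(eachcol(df)))

try
    push!(df, [1.2, 2.2])
catch err
    # A StackOverflowError in the report above; other versions may throw something else.
    println("push! failed with ", typeof(err))
end
```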

When reading the file back use:

GMM_data = CSV.read("/users/miguelborrero/Desktop/Energy_Transitions/Data/GMM_data.csv", DataFrame, types=Float64)

as I assume you want to store floats in the columns.

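The issue is that when reading back an empty data frame, CSV.read has no rows from which to infer the eltype of the columns, so it assumes they allow only missing values (hence the SentinelArrays.MissingVector in your stack trace). Therefore you need to pass a hint telling it which eltype you want the columns to accept.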
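
A minimal sketch of that fix on a two-column case, assuming a temporary placeholder path and made-up column names x and y (neither is from the original data):

```julia
using CSV, DataFrames

# Placeholder path for this sketch; any writable location works.
path = tempname() * ".csv"

# Header-only file, as produced by writing an empty data frame.
CSV.write(path, DataFrame(x = [], y = []))

# Same type hint as above: ask CSV.read to treat every column as Float64.
df = CSV.read(path, DataFrame; types = Float64)

# The columns now accept floats, so the row can be appended.
push!(df, [1.2, 2.2])
println(df)
```

If the columns should not all share one type, the CSV.jl documentation describes passing a vector or a dictionary of types to `types` instead of a single type.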


Thanks a lot!

Using a NamedTuple alone is not enough (I can't retrace the steps that led me to think it was), but with the cols=:union keyword argument you get past the type check. :grinning:

julia> GMM_data = DataFrame(ror = 1, rev_C = 2, rev_NG = 3, uu_1_C = 4)
1×4 DataFrame
 Row │ ror    rev_C  rev_NG  uu_1_C
     │ Int64  Int64  Int64   Int64
─────┼──────────────────────────────
   1 │     1      2       3       4

julia> gmm_empty=empty(GMM_data)
0×4 DataFrame

julia> push!(gmm_empty, (ror=1.4,rev_C=1.2,rev_NG=1.24242,uu_1_C=1.242222))
┌ Error: Error adding value to column :ror.
└ @ DataFrames C:\Users\sprmn\.julia\packages\DataFrames\hFLqf\src\dataframe\dataframe.jl:1328
ERROR: InexactError: Int64(1.4)
Stacktrace:
 [1] Int64
   @ .\float.jl:788 [inlined]
 [2] convert
   @ .\number.jl:7 [inlined]
 [3] push!(a::Vector{Int64}, item::Float64)
   @ Base .\array.jl:1057
 [4] push!(df::DataFrame, row::NamedTuple{(:ror, :rev_C, :rev_NG, :uu_1_C), NTuple{4, Float64}}; cols::Symbol, promote::Bool)
   @ DataFrames C:\Users\sprmn\.julia\packages\DataFrames\hFLqf\src\dataframe\dataframe.jl:1310
 [5] push!(df::DataFrame, row::NamedTuple{(:ror, :rev_C, :rev_NG, :uu_1_C), NTuple{4, Float64}})
   @ DataFrames C:\Users\sprmn\.julia\packages\DataFrames\hFLqf\src\dataframe\dataframe.jl:1195
 [6] top-level scope
   @ c:\Users\sprmn\.julia\v1.8\dataframes21.jl:111

julia> push!(gmm_empty, (ror=1.4,rev_C=1.2,rev_NG=1.24242,uu_1_C=1.242222), cols=:union)
1×4 DataFrame
 Row │ ror      rev_C    rev_NG   uu_1_C
     │ Float64  Float64  Float64  Float64
─────┼────────────────────────────────────
   1 │     1.4      1.2  1.24242  1.24222

On the other hand, if you use values whose types are "right", or that can be converted to the column types without loss, the new row is accepted.

julia> df=empty(DataFrame(x=1., y=2))
0×2 DataFrame

julia> push!(df, [1.1, 2.0])
1×2 DataFrame
 Row │ x        y
     │ Float64  Int64
─────┼────────────────
   1 │     1.1      2

julia> df=empty(DataFrame(x=1., y=2))
0×2 DataFrame

julia> push!(df, (1.1, 2.0))
1×2 DataFrame
 Row │ x        y
     │ Float64  Int64
─────┼────────────────
   1 │     1.1      2

julia> df=empty(DataFrame(x=1., y=2))
0×2 DataFrame

julia> push!(df, (x=1.1,y=2.1))
┌ Error: Error adding value to column :y.
└ @ DataFrames C:\Users\sprmn\.julia\packages\DataFrames\hFLqf\src\dataframe\dataframe.jl:1328
ERROR: InexactError: Int64(2.1)
Stacktrace:
 [1] Int64
   @ .\float.jl:788 [inlined]
 [2] convert
   @ .\number.jl:7 [inlined]
 [3] push!(a::Vector{Int64}, item::Float64)
   @ Base .\array.jl:1057
 [4] push!(df::DataFrame, row::NamedTuple{(:x, :y), Tuple{Float64, Float64}}; cols::Symbol, promote::Bool)
   @ DataFrames C:\Users\sprmn\.julia\packages\DataFrames\hFLqf\src\dataframe\dataframe.jl:1310
 [5] push!(df::DataFrame, row::NamedTuple{(:x, :y), Tuple{Float64, Float64}})
   @ DataFrames C:\Users\sprmn\.julia\packages\DataFrames\hFLqf\src\dataframe\dataframe.jl:1195
 [6] top-level scope
   @ c:\Users\sprmn\.julia\v1.8\dataframes21.jl:89

julia> df=empty(DataFrame(x=1., y=2))
0×2 DataFrame

julia> push!(df, (x=1.1,y=2.5), cols=:union)
1×2 DataFrame
 Row │ x        y
     │ Float64  Float64
─────┼──────────────────
   1 │     1.1      2.5

As noted in the GitHub issue "Allow `Any` type in column" (Issue #1027, JuliaData/CSV.jl), an alternative solution is to add the promote=true keyword argument:

push!(df, [ some data ...], promote=true)
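
As a small self-contained illustration of that keyword, reusing the two-column example from the earlier post (the values are arbitrary), something like the following should widen the Int64 column instead of throwing an InexactError:

```julia
using DataFrames

# Empty data frame with a Float64 column x and an Int64 column y.
df = empty(DataFrame(x = 1.0, y = 2))

# 2.5 cannot be stored in an Int64 column; promote=true lets push!
# replace the column with one of a wider element type.
push!(df, [1.1, 2.5]; promote = true)

println(eltype(df.y))
println(df)
```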