Error when combining single row with multiple row CSV file into DataFrames

Hi! I am combining several .csv files into a single DataFrame. All files have the same number of columns and the same column types. This generally works, but when I combine a .csv file that contains a single row with one that contains multiple rows, I get the following error.

CSV.read(["singlerowfilepath","multiplerowfilepath"], DataFrame)  

returns

ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String31, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String31, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Nothing, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] read(source::Vector{String}, sink::Type)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:113
 [9] top-level scope
   @ REPL[134]:1  

and

DataFrame!(CSV.File(["singlerowfilepath","multiplerowfilepath"]))

or 

DataFrame(CSV.File(["singlerowfilepath","multiplerowfilepath"]))

each return

ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String31, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String31, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Nothing, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] CSV.File(sources::Vector{String})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:901
 [7] top-level scope
   @ REPL[138]:1 

Reversing the order of the files, however, i.e.

CSV.read(["multiplerowfilepath","singlerowfilepath"], DataFrame)

returns a DataFrame with the expected columns and data types.

Just wondering if anyone had any insight on why this is happening and what I might do to fix it without having to worry about the order of the files. Thanks!

This is a pretty annoying error and I think the error message should be improved (or maybe the behaviour).

You can roughly guess from this frame that it comes from PooledArrays:

[1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String31, UInt32, Vector{UInt32}})
   @ CSV ./none:0

What’s happening here is that CSV is trying to stitch together one table from the two CSVs. If it decides to pool the values in the first file, this fails when the second file contains additional values. In your case the first file has only one row, so there’s only one value in its pool and it can’t capture the other values. If you go the other way around, the pool will already have seen the whole range of values and can accommodate the additional one.

You can turn pooling off with the pool = false kwarg.
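
For anyone landing here later, a minimal sketch of what that looks like. The file names and contents are made-up toy data, not the original poster's files, and may not trigger the original error on their own. The second option (reading each file separately and concatenating) is not from this thread; it is just one other way to avoid depending on file order.

```julia
using CSV, DataFrames

# Hypothetical toy files standing in for the real ones.
dir = mktempdir()
single = joinpath(dir, "single.csv")
multi = joinpath(dir, "multi.csv")
write(single, "id,name\n1,alpha\n")
write(multi, "id,name\n2,bravo\n3,delta\n4,gamma\n")

# Option 1: disable pooling so string columns are read as plain vectors.
df1 = CSV.read([single, multi], DataFrame; pool = false)

# Option 2 (an alternative, not suggested in the thread): read each file
# on its own and concatenate, so CSV never has to chain columns itself.
df2 = reduce(vcat, [CSV.read(f, DataFrame) for f in (single, multi)])
```

Either way the single-row file can come first or last in the list.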

got it thanks! Is there a downside to defaulting to pool = false other than memory usage?

Hopefully not! PooledVector{String} should be the exact same as Vector{String} for all intents and purposes.
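
A quick way to convince yourself, as a small sketch on toy data (the variable names are just for illustration): a pooled vector and a plain vector holding the same strings compare equal and can be passed to the same generic code.

```julia
using PooledArrays

plain = ["a", "b", "a", "a", "b"]
pooled = PooledArray(plain)

# Both are AbstractVector{String}, so generic code accepts either one.
same_interface = pooled isa AbstractVector{String} && plain isa AbstractVector{String}

# Element-wise comparison and counting give the same answers.
same_values = pooled == plain
same_count = count(==("a"), pooled) == count(==("a"), plain)
```

The difference is in storage: the pooled version keeps each distinct string once plus an integer reference per element, which is where the memory saving for repetitive columns comes from.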

got it, thanks!

Could you file a bug against CSV.jl on GitHub? If you can provide toy files to reproduce the problem that would be even better.

Sure thing. I submitted an issue with a dummy example here.
