I am using CSV.read to read multiple CSV files into a single DataFrame, and the order in which the files are listed makes a difference. @nilshg points out that this is an issue with pooled arrays and can be remedied by setting pool = false. I thought that was only an issue when single-row DataFrames are combined with multi-row DataFrames, or perhaps more generally that the files had to be listed in order of nrow(). However, it appears to be an issue even when the DataFrames have equal nrow().
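using CSV, DataFrames, Glob
df1 = DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"])
df2 = DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"])
# Placeholder file names: each frame needs its own file, and the names must
# start with "df1-" / "df2-" so the split on '-' below yields df1 and df2.
mkpath("filepath")
CSV.write(joinpath("filepath", "df1-data.csv"), df1)
CSV.write(joinpath("filepath", "df2-data.csv"), df2)
fls = glob("*.csv", "filepath")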
Reading the files in one order produces an error:
julia> CSV.read(fls, DataFrame; source = "nwcol" => first.(split.(basename.(fls),'-')))
ERROR: UndefVarError: `A` not defined
Stacktrace:
[1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String1, UInt32, Vector{UInt32}})
@ CSV ./none:0
[2] iterate
@ ./generator.jl:47 [inlined]
[3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String1, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
@ Base ./array.jl:834
[4] chaincolumns!(a::Any, b::Any)
@ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
[5] CSV.File(sources::Vector{String}; source::Pair{String, Vector{SubString{String}}}, kw::@Kwargs{})
@ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
[6] File
@ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
[7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{source::Pair{String, Vector{SubString{…}}}})
@ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
[8] top-level scope
@ REPL[259]:1
Some type information was truncated. Use `show(err)` to see complete types.
But reversing the file order works, even though both files are the same size:
CSV.read(reverse(fls), DataFrame; source = "nwcol" => first.(split.(basename.(reverse(fls)),'-')))
10×3 DataFrame
 Row │ A      B        nwcol
     │ Int64  String1  SubStrin…
─────┼──────────────────────────
   1 │    11  A        df2
   2 │    12  B        df2
   3 │    13  C        df2
   4 │    14  D        df2
   5 │    15  E        df2
   6 │     1  M        df1
   7 │     2  F        df1
   8 │     3  F        df1
   9 │     4  M        df1
  10 │     5  F        df1
I realize I might be missing something simple here, but if pool = true, how is one to determine the correct order of the files? Or is it better to just default to pool = false? For reference, this was done using CSV v0.10.14.
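For anyone who wants to try it, here is a minimal sketch of two ways to avoid depending on file order. The temporary directory and the file names df1-demo.csv / df2-demo.csv are stand-ins I made up for this example. Option 1 is the pool = false suggestion mentioned above. Option 2 is an alternative I am assuming is acceptable here: read each file separately and let DataFrames' vcat add the source column, so CSV never has to chain pooled columns across files. I have not checked either one against every CSV version.

```julia
using CSV, DataFrames

# Hypothetical stand-ins for the two files from the post
dir = mktempdir()
f1 = joinpath(dir, "df1-demo.csv")
f2 = joinpath(dir, "df2-demo.csv")
CSV.write(f1, DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"]))
CSV.write(f2, DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"]))

fls = [f1, f2]
# Label for each file: the part of the base name before the first '-'
labels = String.(first.(split.(basename.(fls), '-')))

# Option 1: turn pooling off so no pooled columns need to be chained
a = CSV.read(fls, DataFrame; pool = false, source = "nwcol" => labels)

# Option 2: read the files one at a time and concatenate with vcat,
# which accepts the same kind of `source` pair
parts = [CSV.read(f, DataFrame) for f in fls]
b = vcat(parts...; source = "nwcol" => labels)
```

With option 2 each file is parsed independently, so whether a column gets pooled in one file no longer matters for the combined result. The cost is that the files are read one after another.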