Clarification on when order matters when reading multiple files with CSV.read?

When using CSV.read to read multiple CSV files into a single DataFrame, when does the order in which the files are listed make a difference? @nilshg points out that this is an issue with pooled arrays and can be remedied by setting pool = false. I thought that was only an issue when single-row DataFrames are combined with multi-row DataFrames, or perhaps more generally that the files had to be listed in order of nrow(). However, it appears to be an issue even when the DataFrames have equal nrow()?

using CSV, DataFrames, Glob

df1 = DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"])
df2 = DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"])

isdir("filepath") || mkpath("filepath")
CSV.write("filepath/df1-test.csv", df1)
CSV.write("filepath/df2-test.csv", df2)

fls = glob("*.csv", "filepath")

Reading the files in one order produces an error.

julia> CSV.read(fls, DataFrame; source = "nwcol" => first.(split.(basename.(fls),'-')))
ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String1, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String1, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Pair{String, Vector{SubString{String}}}, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{source::Pair{String, Vector{SubString{…}}}})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] top-level scope
   @ REPL[259]:1
Some type information was truncated. Use `show(err)` to see complete types.

But reversing the file order works, even though both files have the same number of rows:

julia> CSV.read(reverse(fls), DataFrame; source = "nwcol" => first.(split.(basename.(reverse(fls)),'-')))
10×3 DataFrame
 Row │ A      B        nwcol
     │ Int64  String1  SubStrin…
─────┼───────────────────────────
   1 │    11  A        df2
   2 │    12  B        df2
   3 │    13  C        df2
   4 │    14  D        df2
   5 │    15  E        df2
   6 │     1  M        df1
   7 │     2  F        df1
   8 │     3  F        df1
   9 │     4  M        df1
  10 │     5  F        df1

I realize I might be missing something simple here. But if pool = true, how is one to determine the correct order of the files? Or is it better to just default to pool = false? For reference, this was done using CSV v0.10.14.

It’s always good to set pool = false :rofl:

When the A…E file (filepath2) is put first, read decides not to pool and has no problem. When filepath1 is first, it decides to pool. So yes, setting pool = false seems to be best:

julia> CSV.read(fls, DataFrame; pool=false)
10×2 DataFrame
 Row │ A      B
     │ Int64  String1
─────┼────────────────
   1 │     1  M
   2 │     2  F
   3 │     3  F
   4 │     4  M
   5 │     5  F
   6 │    11  A
   7 │    12  B
   8 │    13  C
   9 │    14  D
  10 │    15  E

Oddly enough, this also works:

julia> CSV.read(["filepath1.csv", "filepath2.csv"], DataFrame;
  pool=[false, true])
10×2 DataFrame
 Row │ A      B
     │ Int64  String1
─────┼────────────────
   1 │     1  M
   :
  10 │    15  E

and correctly pools the whole column.
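
For anyone who wants to double-check that claim, here is a minimal sketch (my addition, assuming the same filepath1.csv/filepath2.csv names as above) that inspects the type of the combined column:

using CSV, DataFrames, PooledArrays

df = CSV.read(["filepath1.csv", "filepath2.csv"], DataFrame; pool=[false, true])

# If the whole chained column was pooled, B should come back as a PooledVector
# whose levels cover the values from both files.
@show df.B isa PooledArrays.PooledVector
@show sort(unique(df.B))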

A possible hypothesis: read's default value for pool, which is (0.2, 500), is evaluated after the first file, the pooled type is then fixed, and the second file isn't compatible with the pooled column from the first file. The other order does not trigger pooling on the first file (values too diverse, i.e. the unique-value ratio is > 0.2), so it goes through, and the second file then triggers pooling according to its own short column, but the pooling is applied to the whole column (the short column does go below the 0.2 threshold).
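
One way to probe this hypothesis (a rough sketch on my part, using the file names from the first post) is to read each file on its own under the default pool = (0.2, 500) and check whether column B came back pooled:

using CSV, DataFrames, PooledArrays

# Per the hypothesis, the pooling decision should differ between the two files
# when each is read in isolation.
b1 = CSV.read("filepath/df1-test.csv", DataFrame).B
b2 = CSV.read("filepath/df2-test.csv", DataFrame).B

@show b1 isa PooledArrays.PooledVector
@show b2 isa PooledArrays.PooledVector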

It’s not about the single row; it’s about encountering values in the second file which don’t fit in the pool built from the first file.

But oddly enough, when pool = (0.2, 500), the decision to pool the column is made for each file based on only that file's values, yet the whole column collected up to the point of that decision is processed (i.e. the previous file's values are also pooled).
It does create a subtle dependence on the file order.

Thanks for the replies! This has been incredibly helpful. longemen3000 posted the following on GitHub with respect to the single-row case (which I now understand is not caused by the single row). Does this explain the described behavior?

found the error.

CSV.jl/src/utils.jl

Lines 234 to 242 in acd36a6
 elseif c isa Vector && b isa Vector 
     # two vectors, but we know eltype doesn't match, so try to promote 
     A = Vector{promote_types(eltype(c), eltype(b))} 
 elseif c isa SentinelVector && b isa SentinelVector 
     A = vectype(promote_types(Base.nonmissingtype(eltype(c)), Base.nonmissingtype(eltype(b)))) 
 end 
 x = ChainedVector([_promote(A, x) for x in a.arrays]) 
 y = _promote(A, b) 
 return append!(x, y)

There is a missing case where c isa PooledVector.
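
For intuition, here is a small self-contained sketch (my illustration of the kind of promotion such a branch would need, not the actual CSV.jl patch): chaining a pooled column from one file with a plain column from another by converting both to a Vector of the promoted element type.

using PooledArrays

c = PooledArray(["M", "F", "F", "M", "F"])   # pooled B column, like the one from the first file
b = ["A", "B", "C", "D", "E"]                # plain B column from the second file

# Promote to a common element type and materialize both sides as plain Vectors
# before appending, so values absent from c's pool cause no trouble.
T = promote_type(eltype(c), eltype(b))
combined = vcat(convert(Vector{T}, c), convert(Vector{T}, b))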