Clarification on when order matters when reading multiple files with CSV.read?

When using CSV.read to read multiple CSV files into a single DataFrame, when does the order in which the files are listed make a difference? @nilshg points out that this is an issue with pooled arrays and can be remedied by setting pool = false. I thought that was only an issue when single-row DataFrames are combined with multi-row DataFrames, or perhaps more generally that the files had to be listed in order of nrow(). However, it appears to be an issue even when the DataFrames have equal nrow()?

using CSV, DataFrames, Glob

df1 = DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"])
df2 = DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"])

isdir("filepath") || mkpath("filepath")
CSV.write("filepath/df1-test.csv", df1)
CSV.write("filepath/df2-test.csv", df2)

fls = glob("*.csv", "filepath")

Reading the files in one order produces an error.

julia> CSV.read(fls, DataFrame; source = "nwcol" => first.(split.(basename.(fls),'-')))
ERROR: UndefVarError: `A` not defined
Stacktrace:
 [1] (::CSV.var"#3#4")(x::PooledArrays.PooledVector{String1, UInt32, Vector{UInt32}})
   @ CSV ./none:0
 [2] iterate
   @ ./generator.jl:47 [inlined]
 [3] collect(itr::Base.Generator{Vector{PooledArrays.PooledVector{String1, UInt32, Vector{UInt32}}}, CSV.var"#3#4"})
   @ Base ./array.jl:834
 [4] chaincolumns!(a::Any, b::Any)
   @ CSV ~/.julia/packages/CSV/tmZyn/src/utils.jl:240
 [5] CSV.File(sources::Vector{String}; source::Pair{String, Vector{SubString{String}}}, kw::@Kwargs{})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:930
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{source::Pair{String, Vector{SubString{…}}}})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] top-level scope
   @ REPL[259]:1
Some type information was truncated. Use `show(err)` to see complete types.

But reversing the file order works, even though both files have the same number of rows:

julia> CSV.read(reverse(fls), DataFrame; source = "nwcol" => first.(split.(basename.(reverse(fls)),'-')))
10×3 DataFrame
 Row │ A      B        nwcol
     │ Int64  String1  SubStrin…
─────┼───────────────────────────
   1 │    11  A        df2
   2 │    12  B        df2
   3 │    13  C        df2
   4 │    14  D        df2
   5 │    15  E        df2
   6 │     1  M        df1
   7 │     2  F        df1
   8 │     3  F        df1
   9 │     4  M        df1
  10 │     5  F        df1

I realize I might be missing something simple here. But if pool = true, how is one to determine the correct order of the files? Or is it better to just default to pool = false? For reference, this was done using CSV v0.10.14.

It’s always good to set pool = false :rofl:

When the A…E file (filepath2) is put first, read decides not to pool and has no problem. When filepath1 is first, it decides to pool. So yes, setting pool = false seems to be best:

julia> CSV.read(fls, DataFrame; pool=false)
10×2 DataFrame
 Row │ A      B
     │ Int64  String1
─────┼────────────────
   1 │     1  M
   2 │     2  F
   3 │     3  F
   4 │     4  M
   5 │     5  F
   6 │    11  A
   7 │    12  B
   8 │    13  C
   9 │    14  D
  10 │    15  E

Oddly enough, this also works:

julia> CSV.read(["filepath1.csv", "filepath2.csv"], DataFrame;
  pool=[false, true])
10×2 DataFrame
 Row │ A      B
     │ Int64  String1
─────┼────────────────
   1 │     1  M
   :
  10 │    15  E

and correctly pools the whole column.
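
For anyone who wants to double-check that claim, here is a minimal sketch (my addition, assuming the same filepath1.csv/filepath2.csv names as above) that inspects the type of the combined column:

using CSV, DataFrames, PooledArrays

df = CSV.read(["filepath1.csv", "filepath2.csv"], DataFrame; pool=[false, true])

# If the whole chained column was pooled, B should come back as a PooledVector
# whose levels cover the values from both files.
@show df.B isa PooledArrays.PooledVector
@show sort(unique(df.B))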

A possible hypothesis: read's default value for pool, which is (0.2, 500), is evaluated after the first file, the pooled type is then fixed, and the second file isn't compatible with the pooled column from the first file. The other order does not trigger pooling on the first file (values too diverse, i.e. the unique-value ratio is > 0.2), so it goes through, and the second file then triggers pooling according to its own short column, but the pooling is applied to the whole column (the short column does go below the 0.2 threshold).
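
One way to probe this hypothesis (a rough sketch on my part, using the file names from the first post) is to read each file on its own under the default pool = (0.2, 500) and check whether column B came back pooled:

using CSV, DataFrames, PooledArrays

# Per the hypothesis, the pooling decision should differ between the two files
# when each is read in isolation.
b1 = CSV.read("filepath/df1-test.csv", DataFrame).B
b2 = CSV.read("filepath/df2-test.csv", DataFrame).B

@show b1 isa PooledArrays.PooledVector
@show b2 isa PooledArrays.PooledVector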

It’s not about the single row; it’s about encountering values in the second file which don’t fit in the pool built from the first file.

But oddly enough, when pool = (0.2, 500), the decision to pool the column is made for each file based on only that file's values, yet the whole column collected up to the point of that decision is processed (i.e. the previous file's values are also pooled).
It does create a subtle dependence on the file order.

Thanks for the replies! This has been incredibly helpful. longemen3000 posted the following on GitHub with respect to the single-row case (which I now understand is not caused by the single row). Does this explain the described behavior?

found the error.

CSV.jl/src/utils.jl

Lines 234 to 242 in acd36a6
 elseif c isa Vector && b isa Vector 
     # two vectors, but we know eltype doesn't match, so try to promote 
     A = Vector{promote_types(eltype(c), eltype(b))} 
 elseif c isa SentinelVector && b isa SentinelVector 
     A = vectype(promote_types(Base.nonmissingtype(eltype(c)), Base.nonmissingtype(eltype(b)))) 
 end 
 x = ChainedVector([_promote(A, x) for x in a.arrays]) 
 y = _promote(A, b) 
 return append!(x, y)

There is a missing case where c isa PooledVector.
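
For intuition, here is a small self-contained sketch (my illustration of the kind of promotion such a branch would need, not the actual CSV.jl patch): chaining a pooled column from one file with a plain column from another by converting both to a Vector of the promoted element type.

using PooledArrays

c = PooledArray(["M", "F", "F", "M", "F"])   # pooled B column, like the one from the first file
b = ["A", "B", "C", "D", "E"]                # plain B column from the second file

# Promote to a common element type and materialize both sides as plain Vectors
# before appending, so values absent from c's pool cause no trouble.
T = promote_type(eltype(c), eltype(b))
combined = vcat(convert(Vector{T}, c), convert(Vector{T}, b))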