Creating an identifier column when combing multiple DataFrames with CSV.read

Suppose I have the following CSV files.

df1 = DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"])

df2 = DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"])

df3 = DataFrame(A = 16:21, B = ["A", "B", "C", "D", "E","T"])

CSV.write("folder/df-1.csv",df1)
CSV.write("folder/df-2.csv", df2)
CSV.write("folder/df-3.csv", df3)

fls = glob("*.csv", "folder")

I want to combine them into a new DataFrame while creating a new column that contains just the first part of the filename i.e. new column would just something like

 "df"
 "df"
 "df"
 "df"
 ...   

as @nilshg points out here I can create a new column with the filenames using the source kwarg

DF1 = CSV.read(fls, DataFrame; source = "nwcol" => split.(basename.(fls),'-'), pool = false)
16Γ—3 DataFrame
 Row β”‚ A      B        nwcol                             
     β”‚ Int64  String1  Array…                            
─────┼───────────────────────────────────────────────────
   1 β”‚     1  M        SubString{String}["df", "1", "20…
   2 β”‚     2  F        SubString{String}["df", "1", "20…
   3 β”‚     3  F        SubString{String}["df", "1", "20…
   4 β”‚     4  M        SubString{String}["df", "1", "20…
   5 β”‚     5  F        SubString{String}["df", "1", "20…
   6 β”‚    11  A        SubString{String}["df", "2", "20…
   7 β”‚    12  B        SubString{String}["df", "2", "20…
   8 β”‚    13  C        SubString{String}["df", "2", "20…
   9 β”‚    14  D        SubString{String}["df", "2", "20…
  10 β”‚    15  E        SubString{String}["df", "2", "20…
  11 β”‚    16  A        SubString{String}["df", "3", "20…
  12 β”‚    17  B        SubString{String}["df", "3", "20…
  13 β”‚    18  C        SubString{String}["df", "3", "20…
  14 β”‚    19  D        SubString{String}["df", "3", "20…
  15 β”‚    20  E        SubString{String}["df", "3", "20…
  16 β”‚    21  T        SubString{String}["df", "3", "20…

However if I try to create the column by taking the first element of each substring via first.(split.(basename.(fls),'-')) . The following error is returned even though pool = false.

DF2 = CSV.read(fls, DataFrame; source = "nwcol" => first.(split.(basename.(fls),'-')), pool = false)
ERROR: BoundsError: attempt to access 1-element Vector{SubString{String}} at index [3]
Stacktrace:
 [1] setindex!
   @ ./array.jl:1021 [inlined]
 [2] setindex!
   @ ./multidimensional.jl:698 [inlined]
 [3] _invert(d::Dict{SubString{String}, UInt32})
   @ PooledArrays ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:26
 [4] PooledArray
   @ ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:87 [inlined]
 [5] CSV.File(sources::Vector{String}; source::Pair{String, Vector{SubString{String}}}, kw::@Kwargs{pool::Bool})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:941
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{source::Pair{String, Vector{…}}, pool::Bool})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] top-level scope
   @ REPL[321]:1
Some type information was truncated. Use `show(err)` to see complete types.

I realize it is trivial to just set DF1.nwcol = first.(DF1.nwcol) But just trying to understand why it returns an error with source kwarg. Is first.() calling pooled arrays and is there a way to turn it off?

That’s a really interesting corner case, worth opening an issue. It has nothing to do with first though, it’s just an error that happens when you pass as source a vector with fewer unique elements than files:

julia> CSV.read(fls, DataFrame; source = "nwcol" => [1, 1, 1])
ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [3]

julia> CSV.read(fls, DataFrame; source = "nwcol" => [1, 1, 2])
ERROR: BoundsError: attempt to access 2-element Vector{Int64} at index [3]

I think that’s a good thing, because what’s the point of the source argument if it doesn’t actually distinguish between the sources, but it’s worth discussing whether (1) this should be support (imho no) and (2) if it isn’t whether it should just raise a more informative error, similar to what you’re getting now if you’re passing an inconsistent source lenght:

julia> CSV.read(fls, DataFrame; source = "nwcol" => [1, 1])
ERROR: ArgumentError: source pair keyword argument list (2) must match length of input vector (3)
3 Likes

Thanks so much for the clarification! I’ll open an issue on github.

I defer to your judgement on whether it should be supported. I’ll just note that when combining a Dataframe of n sources where data for each source has been collected across kn or n^k .csv files - (e.g. cardiac data on patients/populations collected over time, etc.) it seems quite natural to create a new column labeling each source directly with the source kwarg.

In cases where data is collected periodically in separate files and has accumulated meaningfully over time there will naturally be fewer unique elements than files - so from a novice perspective if seems like it would be a nice option if the tradeoffs aren’t too steep.