Creating an identifier column when combing multiple DataFrames with CSV.read

phantom · April 16, 2024, 1:42pm

Suppose I have the following CSV files.

df1 = DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"])

df2 = DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"])

df3 = DataFrame(A = 16:21, B = ["A", "B", "C", "D", "E","T"])

CSV.write("folder/df-1.csv",df1)
CSV.write("folder/df-2.csv", df2)
CSV.write("folder/df-3.csv", df3)

fls = glob("*.csv", "folder")

I want to combine them into a new DataFrame while creating a new column that contains just the first part of the filename i.e. new column would just something like

 "df"
 "df"
 "df"
 "df"
 ...

as @nilshg points out here I can create a new column with the filenames using the source kwarg

DF1 = CSV.read(fls, DataFrame; source = "nwcol" => split.(basename.(fls),'-'), pool = false)
16×3 DataFrame
 Row │ A      B        nwcol                             
     │ Int64  String1  Array…                            
─────┼───────────────────────────────────────────────────
   1 │     1  M        SubString{String}["df", "1", "20…
   2 │     2  F        SubString{String}["df", "1", "20…
   3 │     3  F        SubString{String}["df", "1", "20…
   4 │     4  M        SubString{String}["df", "1", "20…
   5 │     5  F        SubString{String}["df", "1", "20…
   6 │    11  A        SubString{String}["df", "2", "20…
   7 │    12  B        SubString{String}["df", "2", "20…
   8 │    13  C        SubString{String}["df", "2", "20…
   9 │    14  D        SubString{String}["df", "2", "20…
  10 │    15  E        SubString{String}["df", "2", "20…
  11 │    16  A        SubString{String}["df", "3", "20…
  12 │    17  B        SubString{String}["df", "3", "20…
  13 │    18  C        SubString{String}["df", "3", "20…
  14 │    19  D        SubString{String}["df", "3", "20…
  15 │    20  E        SubString{String}["df", "3", "20…
  16 │    21  T        SubString{String}["df", "3", "20…

However if I try to create the column by taking the first element of each substring via first.(split.(basename.(fls),'-')) . The following error is returned even though pool = false.

DF2 = CSV.read(fls, DataFrame; source = "nwcol" => first.(split.(basename.(fls),'-')), pool = false)
ERROR: BoundsError: attempt to access 1-element Vector{SubString{String}} at index [3]
Stacktrace:
 [1] setindex!
   @ ./array.jl:1021 [inlined]
 [2] setindex!
   @ ./multidimensional.jl:698 [inlined]
 [3] _invert(d::Dict{SubString{String}, UInt32})
   @ PooledArrays ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:26
 [4] PooledArray
   @ ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:87 [inlined]
 [5] CSV.File(sources::Vector{String}; source::Pair{String, Vector{SubString{String}}}, kw::@Kwargs{pool::Bool})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:941
 [6] File
   @ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
 [7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{source::Pair{String, Vector{…}}, pool::Bool})
   @ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
 [8] top-level scope
   @ REPL[321]:1
Some type information was truncated. Use `show(err)` to see complete types.

I realize it is trivial to just set DF1.nwcol = first.(DF1.nwcol) But just trying to understand why it returns an error with source kwarg. Is first.() calling pooled arrays and is there a way to turn it off?

nilshg · April 16, 2024, 4:22pm

That’s a really interesting corner case, worth opening an issue. It has nothing to do with first though, it’s just an error that happens when you pass as source a vector with fewer unique elements than files:

julia> CSV.read(fls, DataFrame; source = "nwcol" => [1, 1, 1])
ERROR: BoundsError: attempt to access 1-element Vector{Int64} at index [3]

julia> CSV.read(fls, DataFrame; source = "nwcol" => [1, 1, 2])
ERROR: BoundsError: attempt to access 2-element Vector{Int64} at index [3]

I think that’s a good thing, because what’s the point of the source argument if it doesn’t actually distinguish between the sources, but it’s worth discussing whether (1) this should be support (imho no) and (2) if it isn’t whether it should just raise a more informative error, similar to what you’re getting now if you’re passing an inconsistent source lenght:

julia> CSV.read(fls, DataFrame; source = "nwcol" => [1, 1])
ERROR: ArgumentError: source pair keyword argument list (2) must match length of input vector (3)

phantom · April 16, 2024, 5:59pm

Thanks so much for the clarification! I’ll open an issue on github.

I defer to your judgement on whether it should be supported. I’ll just note that when combining a Dataframe of n sources where data for each source has been collected across kn or n^k .csv files - (e.g. cardiac data on patients/populations collected over time, etc.) it seems quite natural to create a new column labeling each source directly with the source kwarg.

In cases where data is collected periodically in separate files and has accumulated meaningfully over time there will naturally be fewer unique elements than files - so from a novice perspective if seems like it would be a nice option if the tradeoffs aren’t too steep.

Topic		Replies	Views
Creating a grouped dataframe out of multiple csv files New to Julia dataframes , csv , grouping	5	1088	January 4, 2022
Reading multiple CSVs into one DataFrame New to Julia	12	1192	February 11, 2021
Read CSVs. Build DataFrame with column identifying the original CSV New to Julia question	9	1080	June 2, 2020
Extra several columns from a dataframe to create a new dataframe in julia General Usage	5	571	April 12, 2019
Combine(Merge) Columns New to Julia dataframes	2	2553	February 28, 2021

Creating an identifier column when combing multiple DataFrames with CSV.read

Related topics