Suppose I have the following CSV files.
df1 = DataFrame(A = 1:5, B = ["M", "F", "F", "M", "F"])
df2 = DataFrame(A = 11:15, B = ["A", "B", "C", "D", "E"])
df3 = DataFrame(A = 16:21, B = ["A", "B", "C", "D", "E","T"])
CSV.write("folder/df-1.csv",df1)
CSV.write("folder/df-2.csv", df2)
CSV.write("folder/df-3.csv", df3)
fls = glob("*.csv", "folder")
I want to combine them into a new DataFrame while creating a new column that contains just the first part of the filename i.e. new column would just something like
"df"
"df"
"df"
"df"
...
as @nilshg points out here I can create a new column with the filenames using the source
kwarg
DF1 = CSV.read(fls, DataFrame; source = "nwcol" => split.(basename.(fls),'-'), pool = false)
16Γ3 DataFrame
Row β A B nwcol
β Int64 String1 Arrayβ¦
ββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β 1 M SubString{String}["df", "1", "20β¦
2 β 2 F SubString{String}["df", "1", "20β¦
3 β 3 F SubString{String}["df", "1", "20β¦
4 β 4 M SubString{String}["df", "1", "20β¦
5 β 5 F SubString{String}["df", "1", "20β¦
6 β 11 A SubString{String}["df", "2", "20β¦
7 β 12 B SubString{String}["df", "2", "20β¦
8 β 13 C SubString{String}["df", "2", "20β¦
9 β 14 D SubString{String}["df", "2", "20β¦
10 β 15 E SubString{String}["df", "2", "20β¦
11 β 16 A SubString{String}["df", "3", "20β¦
12 β 17 B SubString{String}["df", "3", "20β¦
13 β 18 C SubString{String}["df", "3", "20β¦
14 β 19 D SubString{String}["df", "3", "20β¦
15 β 20 E SubString{String}["df", "3", "20β¦
16 β 21 T SubString{String}["df", "3", "20β¦
However if I try to create the column by taking the first element of each substring via first.(split.(basename.(fls),'-'))
. The following error is returned even though pool = false
.
DF2 = CSV.read(fls, DataFrame; source = "nwcol" => first.(split.(basename.(fls),'-')), pool = false)
ERROR: BoundsError: attempt to access 1-element Vector{SubString{String}} at index [3]
Stacktrace:
[1] setindex!
@ ./array.jl:1021 [inlined]
[2] setindex!
@ ./multidimensional.jl:698 [inlined]
[3] _invert(d::Dict{SubString{String}, UInt32})
@ PooledArrays ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:26
[4] PooledArray
@ ~/.julia/packages/PooledArrays/Vy2X0/src/PooledArrays.jl:87 [inlined]
[5] CSV.File(sources::Vector{String}; source::Pair{String, Vector{SubString{String}}}, kw::@Kwargs{pool::Bool})
@ CSV ~/.julia/packages/CSV/tmZyn/src/file.jl:941
[6] File
@ ~/.julia/packages/CSV/tmZyn/src/file.jl:901 [inlined]
[7] read(source::Vector{String}, sink::Type; copycols::Bool, kwargs::@Kwargs{source::Pair{String, Vector{β¦}}, pool::Bool})
@ CSV ~/.julia/packages/CSV/tmZyn/src/CSV.jl:117
[8] top-level scope
@ REPL[321]:1
Some type information was truncated. Use `show(err)` to see complete types.
I realize it is trivial to just set DF1.nwcol = first.(DF1.nwcol)
But just trying to understand why it returns an error with source
kwarg. Is first.()
calling pooled arrays and is there a way to turn it off?