When converting multiple .csv files to a DataFrame, what is the most effective way to identify the data entries in the DataFrame with the .csv filename?

Sorry, I was on my phone, so I didn’t expand on my answer.

Here’s a full example:

julia> using CSV, DataFrames

julia> CSV.write("df1.csv", DataFrame(rand(2, 2), :auto));

julia> CSV.write("df2.csv", DataFrame(5 .* rand(2, 2), :auto));

julia> csv_files = filter(endswith(".csv"), readdir(; join = true))
2-element Vector{String}:
 "C:\\Users\\ngudat\\Documents\\df1.csv"
 "C:\\Users\\ngudat\\Documents\\df2.csv"

julia> CSV.read(csv_files, DataFrame; source = "source")
4×3 DataFrame
 Row │ x1         x2         source
     │ Float64    Float64    String
─────┼─────────────────────────────────────────────────────────
   1 │ 0.0292249  0.281584   C:\\Users\\ngudat\\Documents\\df…
   2 │ 0.220783   0.717647   C:\\Users\\ngudat\\Documents\\df…
   3 │ 3.48909    0.0183668  C:\\Users\\ngudat\\Documents\\df…
   4 │ 4.00998    3.90755    C:\\Users\\ngudat\\Documents\\df…

julia> CSV.read(csv_files, DataFrame; source = "source" => basename.(csv_files))
4×3 DataFrame
 Row │ x1         x2         source
     │ Float64    Float64    String
─────┼───────────────────────────────
   1 │ 0.0292249  0.281584   df1.csv
   2 │ 0.220783   0.717647   df1.csv
   3 │ 3.48909    0.0183668  df2.csv
   4 │ 4.00998    3.90755    df2.csv

I prefer to pass a Pair to the source keyword argument (column name => vector of labels, one per file), as I generally find the full paths less useful. That said, if you need to re-parse part of the data, it can of course be useful to have the full paths in the resulting DataFrame.
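If you'd rather not rely on the source keyword (for example, if you want to do some per-file processing before combining), here is a rough sketch of doing the same thing by hand: read each file, add the label column, then concatenate. The temporary directory, the fixed example values, and the names dfs and combined are just assumptions for illustration, not anything from the question.

```julia
using CSV, DataFrames

# Write two small example files into a temporary directory (illustrative data only)
dir = mktempdir()
CSV.write(joinpath(dir, "df1.csv"), DataFrame(x1 = [1.0, 2.0], x2 = [3.0, 4.0]))
CSV.write(joinpath(dir, "df2.csv"), DataFrame(x1 = [5.0, 6.0], x2 = [7.0, 8.0]))

# Collect the full paths of all .csv files in that directory
csv_files = filter(endswith(".csv"), readdir(dir; join = true))

# Read each file separately and tag every row with the file's basename
dfs = map(csv_files) do file
    df = CSV.read(file, DataFrame)
    df[!, :source] .= basename(file)  # creates the column, broadcasting the name to all rows
    df
end

# Stack the per-file DataFrames into a single one
combined = reduce(vcat, dfs)
```

This should give the same kind of result as the source = "source" => basename.(csv_files) call above, just with more typing, so I'd only reach for it when the files need individual treatment before being combined.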
