When converting multiple .csv files to a DataFrame, what is the most effective way to identify the data entries in the DataFrame with the .csv filename?

Suppose there is a Folder with the following files all with a different number of data entries:

File1 = Data1.csv
File2 = Data2.csv
:
:
Filen = Datan.csv

where n >> 100. I am converting these files to a single DataFrame, however I need to identify which file each entry originally came from. I was just wondering whether there was a direct way to do this in DataFrames?

Currently I am using the following approach which is not great and may be prone to mislabelling error.

First I created a vector of DataFrames with all the Data.

Files = glob("*.csv", Folder)
CombinedData = DataFrame.(CSV.File.(Files))

Then I gathered the names of all the files in a vector and pushed them with the data onto a new DataFrame.

Names = readdir("Users/Folder")
LabledData = DataFrame( Name = Srtring[], Data = Float64)
for k = 1:size(CombinedData)[1]
     for i = 1:size(CombinedData[k])[1]
          push!(LabledData, [ Names[ k ], CombinedData[k][ i , : ] ])
     end
end 

My main concern here is whether there would be a relatively high mislabelling risk if for some reason the files in the folder and Names are not aligned. (e.g. a file is not loaded, added, or removed). Given that n is large I wouldn’t be able to verify whether the data matches the label. I saw another post on the matter but it also seems to take a similar approach with readdir(). Just wondering if there was a better way to go about adding a column with the file name or identifying the data in general? Thanks!

CSV.read takes a vector of filepaths which will read all files into a single table, and a source kwarg which is a vector of source names (to which you can pass the vector of full file paths as well or just the file names).

1 Like
  • source: … As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input β€œname” (usually file name) of the input from whence the value was parsed.

This seems to be saying that if you pass in just the column name for the source keyword, the values will automatically be the filenames where the row came from. That makes it even more convenient, and more robust too.

1 Like

Sorry I was on the phone so didn’t expand on my answer.

Here’s a full example:

julia> using CSV, DataFrames

julia> CSV.write("df1.csv", DataFrame(rand(2, 2), :auto));

julia> CSV.write("df2.csv", DataFrame(5 .* rand(2, 2), :auto));

julia> csv_files = filter(endswith(".csv"), readdir(; join = true))
2-element Vector{String}:
 "C:\\Users\\ngudat\\Documents\\df1.csv"
 "C:\\Users\\ngudat\\Documents\\df2.csv"

julia> CSV.read(csv_files, DataFrame; source = "source")
4Γ—3 DataFrame
 Row β”‚ x1         x2         source
     β”‚ Float64    Float64    String
─────┼─────────────────────────────────────────────────────────
   1 β”‚ 0.0292249  0.281584   C:\\Users\\ngudat\\Documents\\df…
   2 β”‚ 0.220783   0.717647   C:\\Users\\ngudat\\Documents\\df…
   3 β”‚ 3.48909    0.0183668  C:\\Users\\ngudat\\Documents\\df…
   4 β”‚ 4.00998    3.90755    C:\\Users\\ngudat\\Documents\\df…

julia> CSV.read(csv_files, DataFrame; source = "source" => basename.(csv_files))
4Γ—3 DataFrame
 Row β”‚ x1         x2         source
     β”‚ Float64    Float64    String
─────┼───────────────────────────────
   1 β”‚ 0.0292249  0.281584   df1.csv
   2 β”‚ 0.220783   0.717647   df1.csv
   3 β”‚ 3.48909    0.0183668  df2.csv
   4 β”‚ 4.00998    3.90755    df2.csv

I prefer to pass a pair as kwarg as I generally find the full paths less useful (although if one needs to re-parse part of the data it can of course sometimes be useful to have them in the resulting DataFrame).

5 Likes

awesome thanks for saving the day again! ( I was just going over one of your posts about splitting columns). In one of your earlier posts you mentioned that CSV.read() was being deprecated in favor of DataFrame(CSV.File()) so will this still work in the future or is there a planned equivalent in DataFrames?

Sorry for the confusion my comment is outdated and we basically came full circle on this.

Originally, CSV was depending on DataFrames to provide CSV.read() functionality which returned a DataFrame by default. As DataFrames is a pretty heavy dependency, CSV.read was at some point deprecated in favour of just having the basic parsing into a generic table structure provided by CSV.File, dropping the DataFrames dependency. This required users to do DataFrame(CSV.File()) to get back the old functionality. After some requests from users to bring CSV.read back and with advances in the generic Tables packages, the read function was introduced with the requirement to provide a sink argument.

So in summary we went from

using CSV
CSV.read("myfile.csv") # returns DataFrame by default

via

using CSV, DataFrames
DataFrame(CSV.File("myfile.csv"))

to

using CSV, DataFrames
CSV.read("myfile.csv", DataFrame)

while CSV is still not at 1.0, as far as I’m aware it’s been tantalizingly close for a while now and I wouldn’t expect the API for CSV.read to change again once 1.0 is released, so you probably don’t have to worry about future deprecations.

4 Likes

awesome, really appreciate the detailed explanation.

1 Like