where n >> 100. I am converting these files to a single DataFrame; however, I need to identify which file each entry originally came from. I was just wondering whether there is a direct way to do this in DataFrames?
Currently I am using the following approach, which is not great and may be prone to mislabelling errors.
First I created a vector of DataFrames with all the data.
Then I gathered the names of all the files in a vector and pushed them, together with the data, onto a new DataFrame.
Names = readdir("Users/Folder")
LabeledData = DataFrame(Name = String[], Data = Vector{Float64}[])
for k in 1:length(CombinedData)        # one DataFrame per file
    for i in 1:nrow(CombinedData[k])   # one row per entry in that file
        push!(LabeledData, (Names[k], Vector(CombinedData[k][i, :])))
    end
end
My main concern here is that there would be a relatively high mislabelling risk if for some reason the files in the folder and Names are not aligned (e.g. a file is not loaded, or is added or removed). Given that n is large, I wouldn't be able to verify whether the data matches the label. I saw another post on the matter, but it also takes a similar approach with readdir(). Just wondering if there is a better way to go about adding a column with the file name, or identifying the data in general? Thanks!
CSV.read takes a vector of file paths, which will read all the files into a single table, and a source kwarg which names the source of each row (you can pass the vector of full file paths, or just the file names).
source: … As a single Symbol or String, provides the column name that will be added to the parsed columns, the values of the column will be the input "name" (usually file name) of the input from whence the value was parsed.
This seems to be saying that if you pass just the column name for the source keyword, the values will automatically be the file names where each row came from. That makes it even more convenient, and more robust too.
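To illustrate, a minimal sketch (the folder path is carried over from the question, the column name :Source is a placeholder, and this assumes a CSV.jl version recent enough to support the source keyword for vector inputs):

```julia
using CSV, DataFrames

# Build full paths so CSV can find each file
files = joinpath.("Users/Folder", readdir("Users/Folder"))

# One call reads every file into a single DataFrame and adds a
# :Source column recording which file each row was parsed from
df = CSV.read(files, DataFrame; source = :Source)
```

Because the labels are attached by the parser itself, there is no separate loop where the file order could drift out of sync with the data.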
I prefer to pass a pair as the kwarg, as I generally find the full paths less useful (although if one needs to re-parse part of the data, it can of course sometimes be useful to have them in the resulting DataFrame).
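For instance (again a sketch; :Source and the files vector are assumptions carried over from above), the pair form lets you supply your own labels, such as just the base file names:

```julia
using CSV, DataFrames

files = joinpath.("Users/Folder", readdir("Users/Folder"))

# Pair form: column name => vector of labels, one label per input file;
# basename strips the directory part, leaving only the file name
df = CSV.read(files, DataFrame; source = :Source => basename.(files))
```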
Awesome, thanks for saving the day again! (I was just going over one of your posts about splitting columns.) In one of your earlier posts you mentioned that CSV.read() was being deprecated in favor of DataFrame(CSV.File()), so will this still work in the future, or is there a planned equivalent in DataFrames?
Sorry for the confusion; my comment is outdated, and we basically came full circle on this.
Originally, CSV depended on DataFrames to provide the CSV.read() functionality, which returned a DataFrame by default. As DataFrames is a pretty heavy dependency, CSV.read was at some point deprecated in favour of just the basic parsing into a generic table structure provided by CSV.File, dropping the DataFrames dependency. This required users to do DataFrame(CSV.File()) to get back the old functionality. After requests from users to bring CSV.read back, and with advances in the generic Tables packages, the read function was reintroduced with the requirement to provide a sink argument.
So in summary we went from
using CSV
CSV.read("myfile.csv") # returns DataFrame by default
via
using CSV, DataFrames
DataFrame(CSV.File("myfile.csv"))
to
using CSV, DataFrames
CSV.read("myfile.csv", DataFrame)
While CSV is still not at 1.0, as far as I'm aware it's been tantalizingly close for a while now, and I wouldn't expect the API for CSV.read to change again once 1.0 is released, so you probably don't have to worry about future deprecations.