I have multiple csv files with the same column formatting and naming and I need to create a new grouped dataframe with all the filesβ content, but each group of the dataframe consists of the data of each separate file. In other words, as I am creating the new dataframe with the methods suggested here, I would like to also create on the fly new groups of data within the dataframe.
The other solution I thought was to simply add a new column to each csv file with a separate id for each file so that when the new dataframe is created I could create groups out of this new column. However, this could add hundreds of thousands of lines and was wondering whether there is a shorter way of doing this.
The solution of adding a file ID column is probably the easiest. If you make it a Categorical Array then it wonβt take hardly any additional memory. If you are reading the whole thing into memory anyway, Iβd probably just write a function that reads a file and adds the ID column, then map that function over the list of files, then vcat the returned vector of DataFrames.
Note that CSV.read does this automatically already if you pass a vector of file names to parse:
help?> CSV.read
CSV.read(source, sink::T; kwargs...) => T
Read and parses a delimited file or files, materializing directly using the sink function. Allows avoiding excessive copies of columns for certain sinks like DataFrame.
(...)
source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values
of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each
value will be used instead of the input name for that inputs values in the auto-added column.
Hi! Thanks @nilshg ! I edited my previous message. It works as expected @tbeason by adding the source keyword that adds a separate column to the newly-created df with the fileid! Thanks both!