Creating a grouped dataframe out of multiple csv files

Hi there!

I have multiple csv files with the same column formatting and naming and I need to create a new grouped dataframe with all the files’ content, but each group of the dataframe consists of the data of each separate file. In other words, as I am creating the new dataframe with the methods suggested here, I would like to also create on the fly new groups of data within the dataframe.
The other solution I thought was to simply add a new column to each csv file with a separate id for each file so that when the new dataframe is created I could create groups out of this new column. However, this could add hundreds of thousands of lines and was wondering whether there is a shorter way of doing this.

Thanks!

The solution of adding a file ID column is probably the easiest. If you make it a Categorical Array then it won’t take hardly any additional memory. If you are reading the whole thing into memory anyway, I’d probably just write a function that reads a file and adds the ID column, then map that function over the list of files, then vcat the returned vector of DataFrames.

1 Like

Note that CSV.read does this automatically already if you pass a vector of file names to parse:

help?> CSV.read
  CSV.read(source, sink::T; kwargs...) => T

  Read and parses a delimited file or files, materializing directly using the sink function. Allows avoiding excessive copies of columns for certain sinks like DataFrame.

(...)

 source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values
       of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each
       value will be used instead of the input name for that inputs values in the auto-added column.
3 Likes

Thanks for the response! That is what I thought too…

Hi! Thanks @nilshg ! I edited my previous message. It works as expected @tbeason by adding the source keyword that adds a separate column to the newly-created df with the fileid! Thanks both!

julia> lexicaDf = CSV.read(["/Users/atantos/Documents/Papers/GrWord2Vec/lexiko.csv", "/Users/atantos/Documents/Papers/GrWord2Vec/lexiko2.csv"], DataFrame; source="fileid")
julia>  names(lexicaDf)
3-element Vector{String}:
 "WORD"
 "COUNT"
 "fileid"
julia> first(fileDf)
DataFrameRow
 Row β”‚ WORD      COUNT      fileid                            
     β”‚ String63  Int64      String                            
─────┼────────────────────────────────────────────────────────
   1 β”‚ ΞΊΞ±ΞΉ       288842330  /Users/atantos/Documents/Papers/…

@nilshg very fancy! I did not know this feature existed.

@Alex_Tantos you need to supply the source keyword argument. I’ve not used it but try something like

CSV.read(["/Users/atantos/Documents/Papers/GrWord2Vec/lexiko.csv", "/Users/atantos/Documents/Papers/GrWord2Vec/lexiko2.csv"], DataFrame; source="fileid")
1 Like