Creating a grouped dataframe out of multiple csv files

Alex_Tantos · January 4, 2022, 2:39pm

Hi there!

I have multiple csv files with the same column formatting and naming and I need to create a new grouped dataframe with all the files’ content, but each group of the dataframe consists of the data of each separate file. In other words, as I am creating the new dataframe with the methods suggested here, I would like to also create on the fly new groups of data within the dataframe.
The other solution I thought was to simply add a new column to each csv file with a separate id for each file so that when the new dataframe is created I could create groups out of this new column. However, this could add hundreds of thousands of lines and was wondering whether there is a shorter way of doing this.

Thanks!

tbeason · January 4, 2022, 2:48pm

The solution of adding a file ID column is probably the easiest. If you make it a Categorical Array then it won’t take hardly any additional memory. If you are reading the whole thing into memory anyway, I’d probably just write a function that reads a file and adds the ID column, then map that function over the list of files, then vcat the returned vector of DataFrames.

nilshg · January 4, 2022, 3:00pm

Note that CSV.read does this automatically already if you pass a vector of file names to parse:

help?> CSV.read
  CSV.read(source, sink::T; kwargs...) => T

  Read and parses a delimited file or files, materializing directly using the sink function. Allows avoiding excessive copies of columns for certain sinks like DataFrame.

(...)

 source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns, the values
       of the column will be the input "name" (usually file name) of the input from whence the value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the length of the # of inputs, where each
       value will be used instead of the input name for that inputs values in the auto-added column.

Alex_Tantos · January 4, 2022, 3:19pm

Thanks for the response! That is what I thought too…

Alex_Tantos · January 4, 2022, 3:25pm

Hi! Thanks @nilshg ! I edited my previous message. It works as expected @tbeason by adding the source keyword that adds a separate column to the newly-created df with the fileid! Thanks both!

julia> lexicaDf = CSV.read(["/Users/atantos/Documents/Papers/GrWord2Vec/lexiko.csv", "/Users/atantos/Documents/Papers/GrWord2Vec/lexiko2.csv"], DataFrame; source="fileid")
julia>  names(lexicaDf)
3-element Vector{String}:
 "WORD"
 "COUNT"
 "fileid"
julia> first(fileDf)
DataFrameRow
 Row │ WORD      COUNT      fileid                            
     │ String63  Int64      String                            
─────┼────────────────────────────────────────────────────────
   1 │ και       288842330  /Users/atantos/Documents/Papers/…

tbeason · January 4, 2022, 3:57pm

@nilshg very fancy! I did not know this feature existed.

@Alex_Tantos you need to supply the source keyword argument. I’ve not used it but try something like

CSV.read(["/Users/atantos/Documents/Papers/GrWord2Vec/lexiko.csv", "/Users/atantos/Documents/Papers/GrWord2Vec/lexiko2.csv"], DataFrame; source="fileid")

Topic		Replies	Views
Creating an identifier column when combing multiple DataFrames with CSV.read New to Julia question , csv	2	139	April 16, 2024
When converting multiple .csv files to a DataFrame, what is the most effective way to identify the data entries in the DataFrame with the .csv filename? General Usage question , dataframes , csv	6	844	October 12, 2022
Read CSVs. Build DataFrame with column identifying the original CSV New to Julia question	9	1081	June 2, 2020
Grouped Data Frames General Usage dataframes , juliadb , grouped-data	8	418	February 7, 2023
Reading multiple CSVs into one DataFrame New to Julia	12	1195	February 11, 2021

Creating a grouped dataframe out of multiple csv files

Related topics