Reading multiple CSVs into one DataFrame

Nathan_Boyer · February 11, 2021, 3:39pm

I have a set of CSV files, each with two columns where I don’t know the column names ahead of time. I don’t need the first columns because they are essentially the same as the row number. What is the simplest way to combine the second column of each CSV file into a DataFrame?

I also would like to check that there are no missing/skipped/duplicate values in the dataset; checking the column lengths may be good enough.

I do have working code below, but it seems way more complex than it should have to be.

using CSV, Glob, DataFrames
# dir = "../path/to/files/"
# files = glob( "*.csv", dir)
# tempdfs = DataFrame.(CSV.File.(files))
files = ["file1", "file2", "file3"] # Replacement for MWE
tempdfs = [DataFrame(n = 1:5, x = rand(5)), DataFrame(n = 1:5, y = rand(5)), DataFrame(n = 1:5, z = rand(5))]  # Replacement for MWE

firstcolumns = [tempdfs[i][!,1] for i in 1:length(files)]
firstcolumnsmatch = [firstcolumns[i] == first(firstcolumns) for i in 1:length(files)]
if all(firstcolumnsmatch)
    df = outerjoin(tempdfs..., on=first(names(first(tempdfs))), matchmissing = :error)
    disallowmissing!(df)
    sort!(df)
else
    throw(ErrorException("First columns in $(files[.!firstcolumnsmatch]) do not match first column in $(first(files))."))
end

Here is what the data looks like. Maybe a DataFrame isn’t even the right tool?

julia> firstcolumns
4-element Array{Array{Int64,1},1}:
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]

julia> firstcolumnsmatch
4-element Array{Bool,1}:
 1
 1
 1
 1

julia> df
3469138×5 DataFrame
     Row │ Node Number  Equivalent Elastic Strain (in/in)  Equivalent Plastic Strain (in/in)  Equivalent (von-Mises) Stress (psi)  Equivalent Total Strain (in/in) 
         │ Int64        Float64                            Float64                            Float64                              Float64
─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       1 │           1                         0.0016728                                 0.0                             46100.0                        0.0016728
       2 │           2                         0.0017646                                 0.0                             48274.0                        0.0017646
       3 │           3                         0.00062663                                0.0                             17371.0                        0.00062663
       4 │           4                         0.0008819                                 0.0                             24052.0                        0.0008819
    ⋮    │      ⋮                       ⋮                                  ⋮                                   ⋮                                  ⋮
 3469135 │     3469135                         3.4294e-5                                 0.0                               951.59                       3.4294e-5
 3469136 │     3469136                         3.794e-5                                  0.0                              1053.1                        3.794e-5
 3469137 │     3469137                         0.0012972                                 0.0                             35699.0                        0.0012972
 3469138 │     3469138                         0.00073033                                0.0                             19259.0                        0.00073033
                                                                                                                                               3469130 rows omitted

TDLR: Is there a simpler way to 1.) combine data from CSVs and to 2.) make sure the data is complete and unique?

pdeffebach · February 11, 2021, 3:56pm

Wait do you want to vcat them together? It’s not clear what outerjoin is doing here.

Nathan_Boyer · February 11, 2021, 3:59pm

Essentially yes. outerjoin is a built-in function in the DataFrames package. It is joining the DataFrames based on the data in the first column, but that data should already be the same across all the CSVs and I don’t really need it.

pdeffebach · February 11, 2021, 4:00pm

Okay so the number of rows should be constant across time as you add more data frames?

An MWE would be nice. It’s hard to know exactly what you are tryng to do

Nathan_Boyer · February 11, 2021, 4:06pm

Yes.

Replace these two lines and it will be a reproducible MWE:

files = ["file1", "file2", "file3"]
tempdfs = [DataFrame(n = 1:5, x = rand(5)), DataFrame(n = 1:5, y = rand(5)), DataFrame(n = 1:5, z = rand(5))]

I will edit the question.

pdeffebach · February 11, 2021, 4:08pm

Thanks!

I would say just sort the DataFrames by the first column and then use hcat. It will probably be cheaper than outerjoin.

reduce(hcat, dfs)

Your method of error checking seems fine.

Nathan_Boyer · February 11, 2021, 4:20pm

How do I avoid duplicating the first column?

This destroys my column names that I need to preserve:

julia> df = DataFrame(reduce(hcat, [tempdfs[i][!,2] for i in 1:length(files)]))
5×3 DataFrame
 Row │ x1        x2         x3       
     │ Float64   Float64    Float64  
─────┼───────────────────────────────
   1 │ 0.284214  0.0289615  0.397072
   2 │ 0.433917  0.359595   0.477026
   3 │ 0.320301  0.254339   0.374253
   4 │ 0.312021  0.593508   0.455454
   5 │ 0.039323  0.508584   0.467351

pdeffebach · February 11, 2021, 4:25pm

ohhh I see. Sorry about overlooking that. I would probably do

df = mapreduce(t -> select!(t, Not(1)), (x, y) -> hcat(x, y; copycols = false), dfs)
insertcols!(df, 1, :id => first(dfs)[:, 1])

This should be faster, since you aren’t copying the columns in hcat. You get rid of the first column each time you hcat, then add it at the end.

Nathan_Boyer · February 11, 2021, 4:32pm

haha, well that isn’t any simpler, but I’ll take the speed boost! I’ll need to stare at that for a bit.

pdeffebach · February 11, 2021, 4:37pm

Yeah it definitely isn’t simpler, but likely has a speed boost.

Nathan_Boyer · February 11, 2021, 5:07pm

I came up with this, but I’m not sure if it is better.

df = tempdfs[1]
for tempdf in tempdfs[2:end]
    df = hcat(df, tempdf[!, Not(first(names(df)))], copycols = false)
end

pdeffebach · February 11, 2021, 5:12pm

should be the same speed. It’s not like reduce is optimized to be different from the above loop.

Nathan_Boyer · February 11, 2021, 5:18pm

Okay, thanks!

Topic		Replies	Views
How to merge 2 dataframes (DataFrames.jl) General Usage question , dataframes	5	5189	July 9, 2021
Vcat DataFrame columns based on multiple columns in Julia General Usage dataframes	6	236	August 20, 2023
Creating an identifier column when combing multiple DataFrames with CSV.read New to Julia question , csv	2	139	April 16, 2024
Combine more DataFrames with some columns missing New to Julia json , dataframes	4	2337	May 13, 2020
Read CSVs. Build DataFrame with column identifying the original CSV New to Julia question	9	1081	June 2, 2020

Reading multiple CSVs into one DataFrame

Related topics