Reading multiple CSVs into one DataFrame

I have a set of CSV files, each with two columns where I don’t know the column names ahead of time. I don’t need the first columns because they are essentially the same as the row number. What is the simplest way to combine the second column of each CSV file into a DataFrame?

I also would like to check that there are no missing/skipped/duplicate values in the dataset; checking the column lengths may be good enough.

I do have working code below, but it seems way more complex than it should have to be.

using CSV, Glob, DataFrames
# dir = "../path/to/files/"
# files = glob( "*.csv", dir)
# tempdfs = DataFrame.(CSV.File.(files))
files = ["file1", "file2", "file3"] # Replacement for MWE
tempdfs = [DataFrame(n = 1:5, x = rand(5)), DataFrame(n = 1:5, y = rand(5)), DataFrame(n = 1:5, z = rand(5))]  # Replacement for MWE

firstcolumns = [tempdfs[i][!,1] for i in 1:length(files)]
firstcolumnsmatch = [firstcolumns[i] == first(firstcolumns) for i in 1:length(files)]
if all(firstcolumnsmatch)
    df = outerjoin(tempdfs..., on=first(names(first(tempdfs))), matchmissing = :error)
    disallowmissing!(df)
    sort!(df)
else
    throw(ErrorException("First columns in $(files[.!firstcolumnsmatch]) do not match first column in $(first(files))."))
end

Here is what the data looks like. Maybe a DataFrame isn’t even the right tool?

julia> firstcolumns
4-element Array{Array{Int64,1},1}:
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]

julia> firstcolumnsmatch
4-element Array{Bool,1}:
 1
 1
 1
 1

julia> df
3469138×5 DataFrame
     Row │ Node Number  Equivalent Elastic Strain (in/in)  Equivalent Plastic Strain (in/in)  Equivalent (von-Mises) Stress (psi)  Equivalent Total Strain (in/in) 
         │ Int64        Float64                            Float64                            Float64                              Float64
─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       1 │           1                         0.0016728                                 0.0                             46100.0                        0.0016728
       2 │           2                         0.0017646                                 0.0                             48274.0                        0.0017646
       3 │           3                         0.00062663                                0.0                             17371.0                        0.00062663
       4 │           4                         0.0008819                                 0.0                             24052.0                        0.0008819
    ⋮    │      ⋮                       ⋮                                  ⋮                                   ⋮                                  ⋮
 3469135 │     3469135                         3.4294e-5                                 0.0                               951.59                       3.4294e-5
 3469136 │     3469136                         3.794e-5                                  0.0                              1053.1                        3.794e-5
 3469137 │     3469137                         0.0012972                                 0.0                             35699.0                        0.0012972
 3469138 │     3469138                         0.00073033                                0.0                             19259.0                        0.00073033
                                                                                                                                               3469130 rows omitted

TDLR: Is there a simpler way to 1.) combine data from CSVs and to 2.) make sure the data is complete and unique?

Wait do you want to vcat them together? It’s not clear what outerjoin is doing here.

Essentially yes. outerjoin is a built-in function in the DataFrames package. It is joining the DataFrames based on the data in the first column, but that data should already be the same across all the CSVs and I don’t really need it.

Okay so the number of rows should be constant across time as you add more data frames?

An MWE would be nice. It’s hard to know exactly what you are tryng to do

Yes.

Replace these two lines and it will be a reproducible MWE:

files = ["file1", "file2", "file3"]
tempdfs = [DataFrame(n = 1:5, x = rand(5)), DataFrame(n = 1:5, y = rand(5)), DataFrame(n = 1:5, z = rand(5))]

I will edit the question.

Thanks!

I would say just sort the DataFrames by the first column and then use hcat. It will probably be cheaper than outerjoin.

reduce(hcat, dfs)

Your method of error checking seems fine.

How do I avoid duplicating the first column?

This destroys my column names that I need to preserve:

julia> df = DataFrame(reduce(hcat, [tempdfs[i][!,2] for i in 1:length(files)]))
5×3 DataFrame
 Row │ x1        x2         x3       
     │ Float64   Float64    Float64  
─────┼───────────────────────────────
   1 │ 0.284214  0.0289615  0.397072
   2 │ 0.433917  0.359595   0.477026
   3 │ 0.320301  0.254339   0.374253
   4 │ 0.312021  0.593508   0.455454
   5 │ 0.039323  0.508584   0.467351

ohhh I see. Sorry about overlooking that. I would probably do

df = mapreduce(t -> select!(t, Not(1)), (x, y) -> hcat(x, y; copycols = false), dfs)
insertcols!(df, 1, :id => first(dfs)[:, 1])

This should be faster, since you aren’t copying the columns in hcat. You get rid of the first column each time you hcat, then add it at the end.

1 Like

haha, well that isn’t any simpler, but I’ll take the speed boost! I’ll need to stare at that for a bit.

Yeah it definitely isn’t simpler, but likely has a speed boost.

I came up with this, but I’m not sure if it is better.

df = tempdfs[1]
for tempdf in tempdfs[2:end]
    df = hcat(df, tempdf[!, Not(first(names(df)))], copycols = false)
end
1 Like

should be the same speed. It’s not like reduce is optimized to be different from the above loop.

Okay, thanks!