I have a set of CSV files, each with two columns where I don’t know the column names ahead of time. I don’t need the first columns because they are essentially the same as the row number. What is the simplest way to combine the second column of each CSV file into a DataFrame?
I also would like to check that there are no missing/skipped/duplicate values in the dataset; checking the column lengths may be good enough.
I do have working code below, but it seems way more complex than it should have to be.
using CSV, Glob, DataFrames
# dir = "../path/to/files/"
# files = glob( "*.csv", dir)
# tempdfs = DataFrame.(CSV.File.(files))
files = ["file1", "file2", "file3"] # Replacement for MWE
tempdfs = [DataFrame(n = 1:5, x = rand(5)), DataFrame(n = 1:5, y = rand(5)), DataFrame(n = 1:5, z = rand(5))] # Replacement for MWE
firstcolumns = [tempdfs[i][!,1] for i in 1:length(files)]
firstcolumnsmatch = [firstcolumns[i] == first(firstcolumns) for i in 1:length(files)]
if all(firstcolumnsmatch)
df = outerjoin(tempdfs..., on=first(names(first(tempdfs))), matchmissing = :error)
disallowmissing!(df)
sort!(df)
else
throw(ErrorException("First columns in $(files[.!firstcolumnsmatch]) do not match first column in $(first(files))."))
end
Here is what the data looks like. Maybe a DataFrame isn’t even the right tool?
julia> firstcolumns
4-element Array{Array{Int64,1},1}:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 3469129, 3469130, 3469131, 3469132, 3469133, 3469134, 3469135, 3469136, 3469137, 3469138]
julia> firstcolumnsmatch
4-element Array{Bool,1}:
1
1
1
1
julia> df
3469138×5 DataFrame
Row │ Node Number Equivalent Elastic Strain (in/in) Equivalent Plastic Strain (in/in) Equivalent (von-Mises) Stress (psi) Equivalent Total Strain (in/in)
│ Int64 Float64 Float64 Float64 Float64
─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1 0.0016728 0.0 46100.0 0.0016728
2 │ 2 0.0017646 0.0 48274.0 0.0017646
3 │ 3 0.00062663 0.0 17371.0 0.00062663
4 │ 4 0.0008819 0.0 24052.0 0.0008819
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
3469135 │ 3469135 3.4294e-5 0.0 951.59 3.4294e-5
3469136 │ 3469136 3.794e-5 0.0 1053.1 3.794e-5
3469137 │ 3469137 0.0012972 0.0 35699.0 0.0012972
3469138 │ 3469138 0.00073033 0.0 19259.0 0.00073033
3469130 rows omitted
TDLR: Is there a simpler way to 1.) combine data from CSVs and to 2.) make sure the data is complete and unique?