How to merge 2 dataframes (DataFrames.jl)

I have 2 dataframes. Each contains independent datapoints in different rows with different IDs. They share most of the same columns, except for one. I want to combine them into one data frame. I feel like there should be a function that does it, but I can’t find it. So far as I can tell outerjoin should do this, but it doesn’t work:

using DataFrames                                                                                                     
test1 = DataFrame(ID=1:3, Name=["A", "B", "C"], Argh=[5.0, 62.1, 89.0])
test2 = DataFrame(ID=4:6, Name=["D", "E", "F"])
testmerge = outerjoin(test1, test2, on=:ID)

result

ERROR: LoadError: ArgumentError: Duplicate variable names: :Name. Pass makeunique=true to make them unique using a suffix automatically.

Have you tried to do, what they say in error message? Pass makeunique=true?

1 Like

I agree with the above, although it seems to me like you are just trying to vertically concatenate the DataFrames rather than join them?

julia> vcat(test1, test2, cols = :union)
6Γ—3 DataFrame
 Row β”‚ ID     Name    Argh      
     β”‚ Int64  String  Float64?  
─────┼──────────────────────────
   1 β”‚     1  A             5.0
   2 β”‚     2  B            62.1
   3 β”‚     3  C            89.0
   4 β”‚     4  D       missing   
   5 β”‚     5  E       missing   
   6 β”‚     6  F       missing   

(note the cols = :union argument which fills in the missing Argh column in test2 with missing)

4 Likes

makeunique=true creates a bizarre dataframe with duplicate columns:

julia> testmerge
6Γ—4 DataFrame
 Row β”‚ ID     Name     Argh       Name_1  
     β”‚ Int64  String?  Float64?   String? 
─────┼────────────────────────────────────
   1 β”‚     1  A              5.0  missing 
   2 β”‚     2  B             62.1  missing 
   3 β”‚     3  C             89.0  missing 
   4 β”‚     4  missing  missing    D
   5 β”‚     5  missing  missing    E
   6 β”‚     6  missing  missing    F

That’s not bizarre at all, that’s just what you asked for - you are joining two DataFrames, both with a Name column, but you’re not joining on Name, so the new DataFrame will have two Name columns. As I said above, I think you’re just not looking for a join here…

2 Likes

If you feel confortable with SQL you can also do:

using SQLdf

test1 = DataFrame(ID=1:3, Name=["A", "B", "C"], Argh=[5.0, 62.1, 89.0])
test2 = DataFrame(ID=4:6, Name=["D", "E", "F"])

testmerge = sqldf("""
    select * from test1
    union
    select *, 0.0 from test2
    """)

6Γ—3 DataFrame
 Row β”‚ ID     Name    Argh    
     β”‚ Int64  String  Float64 
─────┼────────────────────────
   1 β”‚     1  A           5.0
   2 β”‚     2  B          62.1
   3 β”‚     3  C          89.0
   4 β”‚     4  D           0.0
   5 β”‚     5  E           0.0
   6 β”‚     6  F           0.0
1 Like