Intersection of two dataframes based on columns

Hi all-

I have two dataframes in which one is a subset of the other based on the values in some columns. For example,

using DataFrames

df1 = DataFrame(a=[1,1,2,2,3,3], b=[1,2,1,2,1,2], c=rand(6))
df2 = DataFrame(a=[1,1,2,2], b=[1,2,1,2], c=rand(4))

df2 is a subset of df1 based on shared values in columns a and b:

julia> df1
6×3 DataFrame
│ Row │ a     │ b     │ c        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.448961 │
│ 2   │ 1     │ 2     │ 0.858127 │
│ 3   │ 2     │ 1     │ 0.97272  │
│ 4   │ 2     │ 2     │ 0.655589 │
│ 5   │ 3     │ 1     │ 0.723655 │
│ 6   │ 3     │ 2     │ 0.426203 │

julia> df2
4×3 DataFrame
│ Row │ a     │ b     │ c          │
│     │ Int64 │ Int64 │ Float64    │
├─────┼───────┼───────┼────────────┤
│ 1   │ 1     │ 1     │ 0.498987   │
│ 2   │ 1     │ 2     │ 0.813332   │
│ 3   │ 2     │ 1     │ 0.566679   │
│ 4   │ 2     │ 2     │ 0.00879591 │

What I would like to do is extract the subset of df1 that intersects with d2 on columns a and b, yielding the following new dataframe:

julia> df3
4×3 DataFrame
│ Row │ a     │ b     │ c        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.448961 │
│ 2   │ 1     │ 2     │ 0.858127 │
│ 3   │ 2     │ 1     │ 0.97272  │
│ 4   │ 2     │ 2     │ 0.655589 │

Is there a general way to achieve this result when given two dataframes and a vector of columns, e.g. extract(df1, df2, [:a, :b])

2 Likes

Short answer as I am on mobile, but what you want is “join”:

fullDf = join(df1, df2, on = :commonCol) : Join dataframes horizontally. The on parameter can also be an array of common columns. There are many possible types of join , the most common ones are:

  • :inner (default, only rows with keys on both sides are returned),
  • :left (additionally, rows on the left df with keys not present on the df on the right are also returned),
  • :right (opposite of :left ),
  • :outer (rows with elements missing in any of the two df are also returned);

From my https://julia-book.com

3 Likes

I think you’re just looking for an inner join:

julia> join(df1, df2[!, [:a, :b]], on = [:a, :b], kind = :inner)
4×3 DataFrame
│ Row │ a     │ b     │ c        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.069781 │
│ 2   │ 1     │ 2     │ 0.317998 │
│ 3   │ 2     │ 1     │ 0.384543 │
│ 4   │ 2     │ 2     │ 0.445616 │

Edit: beaten by a second :slight_smile:

2 Likes

Thank you guys for the solution and thank you syvitcus for the in depth explanation.

I see now where I went wrong. Among my various failed attempts, I tried

df4 = join(df1, df2, on = [:a, :b], kind = :inner, makeunique=true)

which produced an unnecessary column because I didn’t think of using df2[!, [:a, :b]]

kind is no longer a supported key word argument. Updated solution is:

julia> innerjoin(df1, df2[!, [:a, :b]], on = [:a, :b])
4×3 DataFrame
 Row │ a      b      c         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.142809
   2 │     1      2  0.0343224
   3 │     2      1  0.692197
   4 │     2      2  0.454196

see: Joins · DataFrames.jl

1 Like