Set-like operations on DataFrames?

How to merge two dataframes with the same columns, similar to set operations?
For example, whether there are functions like

df = union(df1, df2, on=:ID)
df = intersect(df1, df2, on=:ID)
df = diff(df1, df2, on=:ID)

Can you please explain what you mean by β€œsimilar to set operations”? Can you please give some example and an expected output?

You can look up sql-styled joins in julia which will answer some of your questions.

https://juliadata.github.io/DataFrames.jl/stable/man/joins/

An intersect is basically an inner-join.
A union can be accomplished by hcat().

What are you trying to accomplish by diff?

Treat each row as an element in a set. This is viable because we assume the data has unique keys. Then perform set like operations like

julia> a = Set([1 2 3])
Set([2, 3, 1])

julia> b = Set([2 3 4])
Set([4, 2, 3])

julia> union(a, b)
Set([4, 2, 3, 1])

julia> intersect(a, b)
Set([2, 3])

julia> setdiff(a, b)
Set([1])

Note: replace the elements to rows of a dataframe.

Thanks but its not sql-styled joins. It’s set-styled operations. I gave examples in the above reply :slight_smile:

This is what I assumed but your definition has on keyword argument which I do not understand in the context of your answer.

Without on argument you can do e.g.:

DataFrame(union(eachrow.([df1, df2])...))

or

DataFrame(union(Tables.rowtable.([df1, df2])...))

(and the same with intersect and diff)

2 Likes

This is clever!

Didn’t know the union can operate on iterators. Cool!!

This seems not to be working anymore:

df_err = DataFrame(diff(eachrow.([df_diff_dest, df_same_dest])))

(where df_diff_dest and df_same_dest are grouped and filtered DataFrames) seems to throw

ERROR: MethodError: no method matching -(::DataFrameRow{DataFrame, DataFrames.Index}, ::DataFrameRow{DataFrame, DataFrames.Index})
Closest candidates are:
  -(::ChainRulesCore.AbstractThunk, ::Any) at ~/.julia/packages/ChainRulesCore/a4mIA/src/tangent_types/thunks.jl:34
  -(::DataValues.DataValue{T1}, ::T2) where {T1, T2} at ~/.julia/packages/DataValues/N7oeL/src/scalar/operations.jl:65
  -(::ChainRulesCore.ZeroTangent, ::Any) at ~/.julia/packages/ChainRulesCore/a4mIA/src/tangent_arithmetic.jl:101
  ...
Stacktrace:
  [1] _broadcast_getindex_evalf
    @ ./broadcast.jl:670 [inlined]
  [2] _broadcast_getindex
    @ ./broadcast.jl:643 [inlined]
  [3] getindex
    @ ./broadcast.jl:597 [inlined]
  [4] copy
    @ ./broadcast.jl:899 [inlined]
  [5] materialize
    @ ./broadcast.jl:860 [inlined]
  [6] broadcast_preserving_zero_d
    @ ./broadcast.jl:849 [inlined]
  [7] -(A::DataFrames.DataFrameRows{DataFrame}, B::DataFrames.DataFrameRows{DataFrame})
    @ Base ./arraymath.jl:8
  [8] _broadcast_getindex_evalf
    @ ./broadcast.jl:670 [inlined]
  [9] _broadcast_getindex
    @ ./broadcast.jl:643 [inlined]
 [10] getindex
    @ ./broadcast.jl:597 [inlined]
 [11] macro expansion
    @ ./broadcast.jl:961 [inlined]
 [12] macro expansion
    @ ./simdloop.jl:77 [inlined]
 [13] copyto!
    @ ./broadcast.jl:960 [inlined]
 [14] copyto!
    @ ./broadcast.jl:913 [inlined]
 [15] copy(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(-), Tuple{SubArray{DataFrames.DataFrameRows{DataFrame}, 1, Vector{DataFrames.DataFrameRows{DataFrame}}, Tuple{UnitRange{Int64}}, true}, SubArray{DataFrames.DataFrameRows{DataFrame}, 1, Vector{DataFrames.DataFrameRows{DataFrame}}, Tuple{UnitRange{Int64}}, true}}})
    @ Base.Broadcast ./broadcast.jl:885
 [16] materialize
    @ ./broadcast.jl:860 [inlined]
 [17] diff(a::Vector{DataFrames.DataFrameRows{DataFrame}}; dims::Int64)
    @ Base ./multidimensional.jl:1005
 [18] diff(a::Vector{DataFrames.DataFrameRows{DataFrame}})
    @ Base ./multidimensional.jl:965
 [19] top-level scope
    @ ./REPL[36]:29

Has the API changed or did I do something wrong?

you want setdiff, not diff.

It asked me to add the keyword :auto as second argument to the constructor DataFrame, but even like that I got errors (and moreover it doesn’t show what are the rows):

Adc instruction
53Γ—2 DataFrame
 Row β”‚ x1                      x2                     
     β”‚ DataFram…               DataFram…              
─────┼────────────────────────────────────────────────
   1 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   2 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   3 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   4 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   5 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   6 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   7 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   8 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   9 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  10 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  11 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  12 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  13 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  14 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  15 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  16 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  17 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  18 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  19 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  20 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  21 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  22 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  23 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  24 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  25 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  26 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  27 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  28 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  29 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  30 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  31 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  32 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  33 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  34 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  35 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  36 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  37 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  38 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  39 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  40 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  41 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  42 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  43 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  44 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  45 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  46 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  47 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  48 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  49 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  50 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  51 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  52 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  53 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
Adcs instruction
53Γ—2 DataFrame
 Row β”‚ x1                      x2                     
     β”‚ DataFram…               DataFram…              
─────┼────────────────────────────────────────────────
   1 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   2 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   3 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   4 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   5 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   6 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   7 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   8 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
   9 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  10 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  11 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  12 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  13 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  14 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  15 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  16 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  17 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  18 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  19 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  20 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  21 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  22 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  23 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  24 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  25 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  26 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  27 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  28 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  29 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  30 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  31 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  32 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  33 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  34 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  35 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  36 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  37 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  38 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  39 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  40 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  41 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  42 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  43 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  44 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  45 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  46 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  47 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  48 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  49 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  50 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  51 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  52 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
  53 β”‚ 3-element DataFrameRow  3-element DataFrameRow 
Add instruction
ERROR: DimensionMismatch: column :x1 has length 661 and column :x2 has length 541Stacktrace:
 [1] DataFrame(columns::Vector{AbstractVector}, colindex::DataFrames.Index; copycols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/dataframe/dataframe.jl:206 [2] DataFrame(columns::Vector{DataFrames.DataFrameRows{DataFrame}}, cnames::Vector{Symbol}; makeunique::Bool, copycols::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/dataframe/dataframe.jl:352 [3] #DataFrame#218
   @ ~/.julia/packages/DataFrames/dgZn3/src/dataframe/dataframe.jl:368 [inlined]
 [4] DataFrame(columns::Vector{DataFrames.DataFrameRows{DataFrame}}, cnames::Symbol)
   @ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/dataframe/dataframe.jl:361 [5] top-level scope
   @ ./REPL[8]:29

Can you show a full reproducible example. The code works for me:

julia> df1 = DataFrame(a=[1,2], b=[3,4])
2Γ—2 DataFrame
 Row β”‚ a      b
     β”‚ Int64  Int64
─────┼──────────────
   1 β”‚     1      3
   2 β”‚     2      4

julia> df2 = DataFrame(a=[1,3], b=[3,5])
2Γ—2 DataFrame
 Row β”‚ a      b
     β”‚ Int64  Int64
─────┼──────────────
   1 β”‚     1      3
   2 β”‚     3      5

julia> DataFrame(union(eachrow.([df1, df2])...))
3Γ—2 DataFrame
 Row β”‚ a      b
     β”‚ Int64  Int64
─────┼──────────────
   1 β”‚     1      3
   2 β”‚     2      4
   3 β”‚     3      5

julia> DataFrame(setdiff(eachrow.([df1, df2])...))
1Γ—2 DataFrame
 Row β”‚ a      b
     β”‚ Int64  Int64
─────┼──────────────
   1 β”‚     2      4

(most likely you are not passing correct values)

1 Like

You’re right, I didn’t put the ellipsis and instead trusted to REPL to add :auto. Now it works, sorry for the bother.

Anyway, please confirm me something (that I couldn’t guess from the docs): when calling
DataFrame(setdiff(eachrow.([df1, df2, df3, df4, df5])...)), what is computed is (set-like)
DF1 \setminus DF2 \setminus DF3 \setminus DF4 \setminus DF5 = DF1 \setminus (DF2 \cup DF3 \cup DF4 \cup DF5)
?

This is unrelated to DataFrames.jl, but is part of setdiff from Base Julia, where you have:

setdiff(s, itrs…)

Construct the set of elements in s but not in any of the iterables in itrs.

So you are right :smile::

julia> setdiff(1:6, 2, 4, 6:10)
3-element Vector{Int64}:
 1
 3
 5
1 Like

Many thanks! :smile: :pray: :vulcan_salute: