How to merge two dataframes with the same columns, similar to set operations?
For example, whether there are functions like
df = union(df1, df2, on=:ID)
df = intersect(df1, df2, on=:ID)
df = diff(df1, df2, on=:ID)
How to merge two dataframes with the same columns, similar to set operations?
For example, whether there are functions like
df = union(df1, df2, on=:ID)
df = intersect(df1, df2, on=:ID)
df = diff(df1, df2, on=:ID)
Can you please explain what you mean by βsimilar to set operationsβ? Can you please give some example and an expected output?
You can look up sql-styled joins in julia which will answer some of your questions.
https://juliadata.github.io/DataFrames.jl/stable/man/joins/
An intersect is basically an inner-join.
A union can be accomplished by hcat().
What are you trying to accomplish by diff?
Treat each row as an element in a set. This is viable because we assume the data has unique keys. Then perform set like operations like
julia> a = Set([1 2 3])
Set([2, 3, 1])
julia> b = Set([2 3 4])
Set([4, 2, 3])
julia> union(a, b)
Set([4, 2, 3, 1])
julia> intersect(a, b)
Set([2, 3])
julia> setdiff(a, b)
Set([1])
Note: replace the elements to rows of a dataframe.
Thanks but its not sql-styled joins. Itβs set-styled operations. I gave examples in the above reply
This is what I assumed but your definition has on
keyword argument which I do not understand in the context of your answer.
Without on
argument you can do e.g.:
DataFrame(union(eachrow.([df1, df2])...))
or
DataFrame(union(Tables.rowtable.([df1, df2])...))
(and the same with intersect
and diff
)
This is clever!
Didnβt know the union
can operate on iterators. Cool!!
This seems not to be working anymore:
df_err = DataFrame(diff(eachrow.([df_diff_dest, df_same_dest])))
(where df_diff_dest
and df_same_dest
are grouped and filtered DataFrames) seems to throw
ERROR: MethodError: no method matching -(::DataFrameRow{DataFrame, DataFrames.Index}, ::DataFrameRow{DataFrame, DataFrames.Index})
Closest candidates are:
-(::ChainRulesCore.AbstractThunk, ::Any) at ~/.julia/packages/ChainRulesCore/a4mIA/src/tangent_types/thunks.jl:34
-(::DataValues.DataValue{T1}, ::T2) where {T1, T2} at ~/.julia/packages/DataValues/N7oeL/src/scalar/operations.jl:65
-(::ChainRulesCore.ZeroTangent, ::Any) at ~/.julia/packages/ChainRulesCore/a4mIA/src/tangent_arithmetic.jl:101
...
Stacktrace:
[1] _broadcast_getindex_evalf
@ ./broadcast.jl:670 [inlined]
[2] _broadcast_getindex
@ ./broadcast.jl:643 [inlined]
[3] getindex
@ ./broadcast.jl:597 [inlined]
[4] copy
@ ./broadcast.jl:899 [inlined]
[5] materialize
@ ./broadcast.jl:860 [inlined]
[6] broadcast_preserving_zero_d
@ ./broadcast.jl:849 [inlined]
[7] -(A::DataFrames.DataFrameRows{DataFrame}, B::DataFrames.DataFrameRows{DataFrame})
@ Base ./arraymath.jl:8
[8] _broadcast_getindex_evalf
@ ./broadcast.jl:670 [inlined]
[9] _broadcast_getindex
@ ./broadcast.jl:643 [inlined]
[10] getindex
@ ./broadcast.jl:597 [inlined]
[11] macro expansion
@ ./broadcast.jl:961 [inlined]
[12] macro expansion
@ ./simdloop.jl:77 [inlined]
[13] copyto!
@ ./broadcast.jl:960 [inlined]
[14] copyto!
@ ./broadcast.jl:913 [inlined]
[15] copy(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(-), Tuple{SubArray{DataFrames.DataFrameRows{DataFrame}, 1, Vector{DataFrames.DataFrameRows{DataFrame}}, Tuple{UnitRange{Int64}}, true}, SubArray{DataFrames.DataFrameRows{DataFrame}, 1, Vector{DataFrames.DataFrameRows{DataFrame}}, Tuple{UnitRange{Int64}}, true}}})
@ Base.Broadcast ./broadcast.jl:885
[16] materialize
@ ./broadcast.jl:860 [inlined]
[17] diff(a::Vector{DataFrames.DataFrameRows{DataFrame}}; dims::Int64)
@ Base ./multidimensional.jl:1005
[18] diff(a::Vector{DataFrames.DataFrameRows{DataFrame}})
@ Base ./multidimensional.jl:965
[19] top-level scope
@ ./REPL[36]:29
Has the API changed or did I do something wrong?
you want setdiff
, not diff
.
It asked me to add the keyword :auto
as second argument to the constructor DataFrame
, but even like that I got errors (and moreover it doesnβt show what are the rows):
Adc instruction
53Γ2 DataFrame
Row β x1 x2
β DataFramβ¦ DataFramβ¦
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ
1 β 3-element DataFrameRow 3-element DataFrameRow
2 β 3-element DataFrameRow 3-element DataFrameRow
3 β 3-element DataFrameRow 3-element DataFrameRow
4 β 3-element DataFrameRow 3-element DataFrameRow
5 β 3-element DataFrameRow 3-element DataFrameRow
6 β 3-element DataFrameRow 3-element DataFrameRow
7 β 3-element DataFrameRow 3-element DataFrameRow
8 β 3-element DataFrameRow 3-element DataFrameRow
9 β 3-element DataFrameRow 3-element DataFrameRow
10 β 3-element DataFrameRow 3-element DataFrameRow
11 β 3-element DataFrameRow 3-element DataFrameRow
12 β 3-element DataFrameRow 3-element DataFrameRow
13 β 3-element DataFrameRow 3-element DataFrameRow
14 β 3-element DataFrameRow 3-element DataFrameRow
15 β 3-element DataFrameRow 3-element DataFrameRow
16 β 3-element DataFrameRow 3-element DataFrameRow
17 β 3-element DataFrameRow 3-element DataFrameRow
18 β 3-element DataFrameRow 3-element DataFrameRow
19 β 3-element DataFrameRow 3-element DataFrameRow
20 β 3-element DataFrameRow 3-element DataFrameRow
21 β 3-element DataFrameRow 3-element DataFrameRow
22 β 3-element DataFrameRow 3-element DataFrameRow
23 β 3-element DataFrameRow 3-element DataFrameRow
24 β 3-element DataFrameRow 3-element DataFrameRow
25 β 3-element DataFrameRow 3-element DataFrameRow
26 β 3-element DataFrameRow 3-element DataFrameRow
27 β 3-element DataFrameRow 3-element DataFrameRow
28 β 3-element DataFrameRow 3-element DataFrameRow
29 β 3-element DataFrameRow 3-element DataFrameRow
30 β 3-element DataFrameRow 3-element DataFrameRow
31 β 3-element DataFrameRow 3-element DataFrameRow
32 β 3-element DataFrameRow 3-element DataFrameRow
33 β 3-element DataFrameRow 3-element DataFrameRow
34 β 3-element DataFrameRow 3-element DataFrameRow
35 β 3-element DataFrameRow 3-element DataFrameRow
36 β 3-element DataFrameRow 3-element DataFrameRow
37 β 3-element DataFrameRow 3-element DataFrameRow
38 β 3-element DataFrameRow 3-element DataFrameRow
39 β 3-element DataFrameRow 3-element DataFrameRow
40 β 3-element DataFrameRow 3-element DataFrameRow
41 β 3-element DataFrameRow 3-element DataFrameRow
42 β 3-element DataFrameRow 3-element DataFrameRow
43 β 3-element DataFrameRow 3-element DataFrameRow
44 β 3-element DataFrameRow 3-element DataFrameRow
45 β 3-element DataFrameRow 3-element DataFrameRow
46 β 3-element DataFrameRow 3-element DataFrameRow
47 β 3-element DataFrameRow 3-element DataFrameRow
48 β 3-element DataFrameRow 3-element DataFrameRow
49 β 3-element DataFrameRow 3-element DataFrameRow
50 β 3-element DataFrameRow 3-element DataFrameRow
51 β 3-element DataFrameRow 3-element DataFrameRow
52 β 3-element DataFrameRow 3-element DataFrameRow
53 β 3-element DataFrameRow 3-element DataFrameRow
Adcs instruction
53Γ2 DataFrame
Row β x1 x2
β DataFramβ¦ DataFramβ¦
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ
1 β 3-element DataFrameRow 3-element DataFrameRow
2 β 3-element DataFrameRow 3-element DataFrameRow
3 β 3-element DataFrameRow 3-element DataFrameRow
4 β 3-element DataFrameRow 3-element DataFrameRow
5 β 3-element DataFrameRow 3-element DataFrameRow
6 β 3-element DataFrameRow 3-element DataFrameRow
7 β 3-element DataFrameRow 3-element DataFrameRow
8 β 3-element DataFrameRow 3-element DataFrameRow
9 β 3-element DataFrameRow 3-element DataFrameRow
10 β 3-element DataFrameRow 3-element DataFrameRow
11 β 3-element DataFrameRow 3-element DataFrameRow
12 β 3-element DataFrameRow 3-element DataFrameRow
13 β 3-element DataFrameRow 3-element DataFrameRow
14 β 3-element DataFrameRow 3-element DataFrameRow
15 β 3-element DataFrameRow 3-element DataFrameRow
16 β 3-element DataFrameRow 3-element DataFrameRow
17 β 3-element DataFrameRow 3-element DataFrameRow
18 β 3-element DataFrameRow 3-element DataFrameRow
19 β 3-element DataFrameRow 3-element DataFrameRow
20 β 3-element DataFrameRow 3-element DataFrameRow
21 β 3-element DataFrameRow 3-element DataFrameRow
22 β 3-element DataFrameRow 3-element DataFrameRow
23 β 3-element DataFrameRow 3-element DataFrameRow
24 β 3-element DataFrameRow 3-element DataFrameRow
25 β 3-element DataFrameRow 3-element DataFrameRow
26 β 3-element DataFrameRow 3-element DataFrameRow
27 β 3-element DataFrameRow 3-element DataFrameRow
28 β 3-element DataFrameRow 3-element DataFrameRow
29 β 3-element DataFrameRow 3-element DataFrameRow
30 β 3-element DataFrameRow 3-element DataFrameRow
31 β 3-element DataFrameRow 3-element DataFrameRow
32 β 3-element DataFrameRow 3-element DataFrameRow
33 β 3-element DataFrameRow 3-element DataFrameRow
34 β 3-element DataFrameRow 3-element DataFrameRow
35 β 3-element DataFrameRow 3-element DataFrameRow
36 β 3-element DataFrameRow 3-element DataFrameRow
37 β 3-element DataFrameRow 3-element DataFrameRow
38 β 3-element DataFrameRow 3-element DataFrameRow
39 β 3-element DataFrameRow 3-element DataFrameRow
40 β 3-element DataFrameRow 3-element DataFrameRow
41 β 3-element DataFrameRow 3-element DataFrameRow
42 β 3-element DataFrameRow 3-element DataFrameRow
43 β 3-element DataFrameRow 3-element DataFrameRow
44 β 3-element DataFrameRow 3-element DataFrameRow
45 β 3-element DataFrameRow 3-element DataFrameRow
46 β 3-element DataFrameRow 3-element DataFrameRow
47 β 3-element DataFrameRow 3-element DataFrameRow
48 β 3-element DataFrameRow 3-element DataFrameRow
49 β 3-element DataFrameRow 3-element DataFrameRow
50 β 3-element DataFrameRow 3-element DataFrameRow
51 β 3-element DataFrameRow 3-element DataFrameRow
52 β 3-element DataFrameRow 3-element DataFrameRow
53 β 3-element DataFrameRow 3-element DataFrameRow
Add instruction
ERROR: DimensionMismatch: column :x1 has length 661 and column :x2 has length 541Stacktrace:
[1] DataFrame(columns::Vector{AbstractVector}, colindex::DataFrames.Index; copycols::Bool)
@ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/dataframe/dataframe.jl:206 [2] DataFrame(columns::Vector{DataFrames.DataFrameRows{DataFrame}}, cnames::Vector{Symbol}; makeunique::Bool, copycols::Bool)
@ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/dataframe/dataframe.jl:352 [3] #DataFrame#218
@ ~/.julia/packages/DataFrames/dgZn3/src/dataframe/dataframe.jl:368 [inlined]
[4] DataFrame(columns::Vector{DataFrames.DataFrameRows{DataFrame}}, cnames::Symbol)
@ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/dataframe/dataframe.jl:361 [5] top-level scope
@ ./REPL[8]:29
Can you show a full reproducible example. The code works for me:
julia> df1 = DataFrame(a=[1,2], b=[3,4])
2Γ2 DataFrame
Row β a b
β Int64 Int64
ββββββΌββββββββββββββ
1 β 1 3
2 β 2 4
julia> df2 = DataFrame(a=[1,3], b=[3,5])
2Γ2 DataFrame
Row β a b
β Int64 Int64
ββββββΌββββββββββββββ
1 β 1 3
2 β 3 5
julia> DataFrame(union(eachrow.([df1, df2])...))
3Γ2 DataFrame
Row β a b
β Int64 Int64
ββββββΌββββββββββββββ
1 β 1 3
2 β 2 4
3 β 3 5
julia> DataFrame(setdiff(eachrow.([df1, df2])...))
1Γ2 DataFrame
Row β a b
β Int64 Int64
ββββββΌββββββββββββββ
1 β 2 4
(most likely you are not passing correct values)
Youβre right, I didnβt put the ellipsis and instead trusted to REPL to add :auto
. Now it works, sorry for the bother.
Anyway, please confirm me something (that I couldnβt guess from the docs): when calling
DataFrame(setdiff(eachrow.([df1, df2, df3, df4, df5])...))
, what is computed is (set-like)
DF1 \setminus DF2 \setminus DF3 \setminus DF4 \setminus DF5 = DF1 \setminus (DF2 \cup DF3 \cup DF4 \cup DF5)
?
This is unrelated to DataFrames.jl, but is part of setdiff
from Base Julia, where you have:
setdiff(s, itrsβ¦)
Construct the set of elements in s but not in any of the iterables in itrs.
So you are right :
julia> setdiff(1:6, 2, 4, 6:10)
3-element Vector{Int64}:
1
3
5
Many thanks!