How to filter data conditional on two columns

I’ve a dataframe which has the information about money transactions between customers in each region. I like to filter customers in each region which both receive and send money to each other.

suppose from and to are customer id:

df = DataFrame(
               branch = [1,1,1,1,1,2,2], 
               from = [1,2,3,4,5,1,6],
               to = [4,7,1,1,2,3,9]
               )

the result should only include row 1 and 4.

UPDated

This should work:

cpairs = Set(map(r -> (r.branch, r.from, r.to), eachrow(df)))
filter(r -> (r.branch, r.to, r.from) in cpairs, df)
4 Likes

With the quite specific task

julia> filter(row->row.from in [1,4] && row.to in [1,4], df)
2Γ—3 DataFrame
 Row β”‚ branch  from   to    
     β”‚ Int64   Int64  Int64 
─────┼──────────────────────
   1 β”‚      1      1      4
   2 β”‚      1      4      1

You probably have some more general idea in mind but why solve the harder general problem when you only want a specific answer :slight_smile:

1 Like

I would probably do something like this:

First group by to and from to only get unique pairwise transfers (in this case this doesn’t do much, as all your transfers are unique, but in your real data I presume there are multiple transfers between the same customers):

julia> grouped = combine(groupby(df, [:from, :to], sort = true), nrow)
7Γ—3 DataFrame
 Row β”‚ from   to     nrow  
     β”‚ Int64  Int64  Int64 
─────┼─────────────────────
   1 β”‚     1      3      1
   2 β”‚     1      4      1
   3 β”‚     2      7      1
   4 β”‚     3      1      1
   5 β”‚     4      1      1
   6 β”‚     5      2      1
   7 β”‚     6      9      1

then add a column which indicates the transfer start and end point:

julia> grouped[!, :transfer] = string.(grouped.from) .* string.(grouped.to)
7-element Vector{String}:
 "13"
 "14"
 "27"
 "31"
 "41"
 "52"
 "69"

now your question boils down to finding the rows for which the reverse of this row is also in the data:

julia> in(reverse.(grouped.transfer)).(grouped.transfer)
7-element BitVector:
 1
 1
 0
 1
 1
 0
 0

(note that here each row is kept twice, once for each direction of transfer)

(note also that going via String is probably not the most efficient way, but a quick and simple illustration)

2 Likes

With DataFrameMacros.jl (and Chain):

julia> @chain df begin
           groupby(:branch)
           @subset @c tuple.(:to, :from) .∈ Ref(Set(tuple.(:from, :to)))
       end
2Γ—3 DataFrame
 Row β”‚ branch  from   to    
     β”‚ Int64   Int64  Int64 
─────┼──────────────────────
   1 β”‚      1      1      4
   2 β”‚      1      4      1
3 Likes
tdf=transform(df, [:from,:to]=> ((x,y)->Set.(zip(x,y)))=>:ft)
g=groupby(tdf,[:branch,:ft])
filter(x->nrow(x)==2 ,g)

1 Like

DataFramesMeta.jl is another transformation library. Unlike DataFrameMacros.jl it operates by columns as the default.

julia> @chain df begin
           groupby(:branch)
           @subset tuple.(:to, :from) .∈ Ref(Set(tuple.(:from, :to)))
       end
2Γ—3 DataFrame
 Row β”‚ branch  from   to
     β”‚ Int64   Int64  Int64
─────┼──────────────────────
   1 β”‚      1      1      4
   2 β”‚      1      4      1

1 Like

Thanks all for helpful answers, I just select the first one since it was the first :smiley:
Maybe I should compare their performance :thinking:

@aplavin, your updated magic solution seems to also work if Set() is removed. Is it really needed?

Without Set it works, but has a quadratic complexity: walk through the whole cpairs array for each row. Set makes it O(n).

1 Like

I don’t see where the information on the :branch is used, which, in this case, excludes the pair (1,4) (4,1) from the result.

1 Like

my solution does not solve the following case correctly

df = DataFrame(
               branch = [1,1,1,1,1,2,2], 
               from = [1,2,3,1,5,1,6],
               to = [4,7,1,4,2,3,9]
               )

use

semijoin(df, df, on = [:branch=>:branch, :from=>:to, :to=>:from])
4 Likes

I had also thought of a solution that made use of the join functions, although not as elegantly as yours.
In fact, what I was looking for was an alternative route that was competitive in speed of execution.
To tell the truth, I didn’t even know the existence of the semijoin function.
If I have not misunderstood this take, in the case of many correspondences only one (the first?).
But is that what @xinchin is asking?
Perhaps an example a little richer than the one provided would be useful, with the expected result in the case of many occurrences if this is a case existing within the same branch (and how they would be distinguished in that case)

df = DataFrame(
               branch = [1,1,2,1,1,2,2], 
               from = [1,2,3,4,5,1,6],
               to = [4,7,1,1,2,3,9]
               )
SplitApplyCombine.innerjoin(l->(l.branch,l.from,l.to),r->(r.branch,r.to,r.from),(l,r)->[(l.branch,l.from,l.to),(r.branch,r.from,r.to)],eachrow(df),eachrow(df))
 
 
grp=groupby(df, :branch)
[SplitApplyCombine.innerjoin(l->(l.branch,l.from,l.to),r->(r.branch,r.to,r.from),(l,r)->(l.branch,l.from,l.to),eachrow(g),eachrow(g)) for g in grp]

wow! like it, I never looked at semijoin from this perspective :smiley:

I want all rows taht fit to the conditions.

could there be a case like the following?
if so, what would the expected result be?

df = DataFrame(
               branch = [1,1,2,1,1,2,2], 
               from = [1,2,3,4,1,1,6],
               to = [4,7,1,1,4,3,9]
               )

it would be all 5 rows that meet the conditions, duplicated rows are ok. (hypothetically if they shouldn’t be there, I can use unique to remove them)