How to filter data conditional on two columns

xinchin · January 7, 2022, 3:07am

I have a DataFrame with information about money transactions between customers in each region. Within each region, I would like to keep only the customers who both send money to and receive money from each other.

Suppose from and to are customer IDs:

using DataFrames

df = DataFrame(
               branch = [1,1,1,1,1,2,2], 
               from = [1,2,3,4,5,1,6],
               to = [4,7,1,1,2,3,9]
               )

The result should include only rows 1 and 4.

aplavin · January 7, 2022, 7:07am

Updated

This should work:

cpairs = Set(map(r -> (r.branch, r.from, r.to), eachrow(df)))
filter(r -> (r.branch, r.to, r.from) in cpairs, df)

lawless-m · January 7, 2022, 2:47pm

With the quite specific task

julia> filter(row->row.from in [1,4] && row.to in [1,4], df)
2×3 DataFrame
 Row │ branch  from   to    
     │ Int64   Int64  Int64 
─────┼──────────────────────
   1 │      1      1      4
   2 │      1      4      1

You probably have some more general idea in mind, but why solve the harder general problem when you only want a specific answer?

nilshg · January 7, 2022, 4:09pm

I would probably do something like this:

First, group by from and to so that only the unique pairwise transfers remain (in this case this doesn’t do much, as all your transfers are unique, but in your real data I presume there are multiple transfers between the same customers):

julia> grouped = combine(groupby(df, [:from, :to], sort = true), nrow)
7×3 DataFrame
 Row │ from   to     nrow  
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      3      1
   2 │     1      4      1
   3 │     2      7      1
   4 │     3      1      1
   5 │     4      1      1
   6 │     5      2      1
   7 │     6      9      1

then add a column which indicates the transfer start and end point:

julia> grouped[!, :transfer] = string.(grouped.from) .* string.(grouped.to)
7-element Vector{String}:
 "13"
 "14"
 "27"
 "31"
 "41"
 "52"
 "69"

now your question boils down to finding the rows for which the reverse of this row is also in the data:

julia> in(reverse.(grouped.transfer)).(grouped.transfer)
7-element BitVector:
 1
 1
 0
 1
 1
 0
 0

(note that here each row is kept twice, once for each direction of transfer)

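(note also that going via String is a quick and simple illustration rather than the most efficient way; it is only safe for single-digit IDs, because reverse flips the individual characters, so a transfer from 12 to 3 becomes "123" and its reverse "321" does not match "312")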
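A minimal sketch of the same idea with tuples instead of strings. It also adds :branch to the grouping key, which is an assumption based on the question asking for matches within each region; the names forward, backward, mask and reciprocal are only illustrative.

```julia
using DataFrames

df = DataFrame(
    branch = [1, 1, 1, 1, 1, 2, 2],
    from   = [1, 2, 3, 4, 5, 1, 6],
    to     = [4, 7, 1, 1, 2, 3, 9],
)

# One row per distinct (branch, from, to) transfer.
grouped = combine(groupby(df, [:branch, :from, :to], sort = true), nrow)

# Tuples keep the two IDs separate, so multi-digit IDs cannot run together
# the way they do in a concatenated string.
forward  = Set(zip(grouped.branch, grouped.from, grouped.to))
backward = collect(zip(grouped.branch, grouped.to, grouped.from))

# true where the reverse transfer exists in the same branch
mask = in(forward).(backward)
reciprocal = grouped[mask, :]
```

With :branch in the key, the 3 → 1 transfer in branch 1 and the 1 → 3 transfer in branch 2 no longer count as a pair.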

jules · January 7, 2022, 9:05pm

With DataFrameMacros.jl (and Chain):

julia> @chain df begin
           groupby(:branch)
           @subset @c tuple.(:to, :from) .∈ Ref(Set(tuple.(:from, :to)))
       end
2×3 DataFrame
 Row │ branch  from   to    
     │ Int64   Int64  Int64 
─────┼──────────────────────
   1 │      1      1      4
   2 │      1      4      1

rocco_sprmnt21 · January 7, 2022, 9:36pm

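# Use the unordered pair {from, to} as a key, then keep groups with exactly two rows
tdf = transform(df, [:from, :to] => ((x, y) -> Set.(zip(x, y))) => :ft)
g = groupby(tdf, [:branch, :ft])
filter(x -> nrow(x) == 2, g)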

pdeffebach · January 8, 2022, 12:42am

DataFramesMeta.jl is another transformation library. Unlike DataFrameMacros.jl, it operates on whole columns by default.

julia> @chain df begin
           groupby(:branch)
           @subset tuple.(:to, :from) .∈ Ref(Set(tuple.(:from, :to)))
       end
2×3 DataFrame
 Row │ branch  from   to
     │ Int64   Int64  Int64
─────┼──────────────────────
   1 │      1      1      4
   2 │      1      4      1

xinchin · January 8, 2022, 7:42am

Thanks, all, for the helpful answers. I just selected the first one as the solution since it came first.
Maybe I should compare their performance.

rafael.guerra · January 8, 2022, 7:57am

@aplavin, your updated magic solution seems to also work if Set() is removed. Is it really needed?

aplavin · January 8, 2022, 9:47am

Without Set it works, but with quadratic complexity: in has to walk through the whole cpairs array for each row. With Set, each lookup takes constant time on average, so the whole filter is O(n).
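A small sketch of the two variants side by side, assuming the example data from the first post; the names triples, cpairs_vec, cpairs_set, slow and fast are only illustrative. Both give the same rows, and only the cost of each in lookup differs.

```julia
using DataFrames

df = DataFrame(
    branch = [1, 1, 1, 1, 1, 2, 2],
    from   = [1, 2, 3, 4, 5, 1, 6],
    to     = [4, 7, 1, 1, 2, 3, 9],
)

# Vector of (branch, from, to) tuples, one per row.
triples = map(r -> (r.branch, r.from, r.to), eachrow(df))

cpairs_vec = triples        # `in` scans the vector: O(n) per lookup
cpairs_set = Set(triples)   # `in` hashes the tuple: O(1) on average

slow = filter(r -> (r.branch, r.to, r.from) in cpairs_vec, df)
fast = filter(r -> (r.branch, r.to, r.from) in cpairs_set, df)
```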

rocco_sprmnt21 · January 8, 2022, 10:40pm

I don’t see where the information on the :branch is used, which, in this case, excludes the pair (1,4) (4,1) from the result.

rocco_sprmnt21 · January 8, 2022, 10:51pm

My solution does not handle the following case correctly, where the transfer 1 → 4 appears twice in branch 1 but 4 → 1 never does:

df = DataFrame(
               branch = [1,1,1,1,1,2,2], 
               from = [1,2,3,1,5,1,6],
               to = [4,7,1,4,2,3,9]
               )
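One possible patch, as a sketch rather than a tested replacement: count the distinct directions inside each group instead of the number of rows. It uses ByRow to build the key column, and the names by_row_count and by_direction are only illustrative. Note that a self-transfer (from equal to to) would be dropped by this check.

```julia
using DataFrames

df = DataFrame(
    branch = [1, 1, 1, 1, 1, 2, 2],
    from   = [1, 2, 3, 1, 5, 1, 6],
    to     = [4, 7, 1, 4, 2, 3, 9],
)

# Unordered pair of customers as the grouping key.
tdf = transform(df, [:from, :to] => ByRow((f, t) -> Set((f, t))) => :ft)
g = groupby(tdf, [:branch, :ft])

# Counting rows accepts the repeated 1 -> 4 transfer although 4 -> 1 is absent.
by_row_count = filter(sub -> nrow(sub) == 2, g)

# Counting distinct directions accepts only groups with both a -> b and b -> a.
by_direction = filter(sub -> length(Set(zip(sub.from, sub.to))) == 2, g)
```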

DataFrames · January 9, 2022, 5:41am

use

semijoin(df, df, on = [:branch=>:branch, :from=>:to, :to=>:from])

rocco_sprmnt21 · January 9, 2022, 4:53pm

I had also thought of a solution that used the join functions, although not as elegantly as yours.
In fact, what I was looking for was an alternative route that was competitive in execution speed.
To tell the truth, I didn’t even know that the semijoin function existed.
If I have not misunderstood, when there are several matches semijoin takes only one of them (the first?).
But is that what @xinchin is asking for?
Perhaps a slightly richer example than the one provided would be useful, showing the expected result when a pair occurs many times within the same branch (and how those occurrences should be distinguished).

df = DataFrame(
               branch = [1,1,2,1,1,2,2], 
               from = [1,2,3,4,5,1,6],
               to = [4,7,1,1,2,3,9]
               )
import SplitApplyCombine  # the qualified calls below need the package loaded

SplitApplyCombine.innerjoin(l->(l.branch,l.from,l.to),r->(r.branch,r.to,r.from),(l,r)->[(l.branch,l.from,l.to),(r.branch,r.from,r.to)],eachrow(df),eachrow(df))

grp=groupby(df, :branch)
[SplitApplyCombine.innerjoin(l->(l.branch,l.from,l.to),r->(r.branch,r.to,r.from),(l,r)->(l.branch,l.from,l.to),eachrow(g),eachrow(g)) for g in grp]

xinchin · January 10, 2022, 2:40am

wow! like it, I never looked at semijoin from this perspective

xinchin · January 10, 2022, 2:46am

I want all rows that meet the condition.

rocco_sprmnt21 · January 10, 2022, 9:37pm

could there be a case like the following?
if so, what would the expected result be?

df = DataFrame(
               branch = [1,1,2,1,1,2,2], 
               from = [1,2,3,4,1,1,6],
               to = [4,7,1,1,4,3,9]
               )

xinchin · January 11, 2022, 1:01am

It would be all 5 rows that meet the condition; duplicated rows are OK. (Hypothetically, if they shouldn’t be there, I can use unique to remove them.)
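For reference, a quick check of the semijoin answer against this example, as a sketch; the name matched is only illustrative. semijoin keeps every left-hand row that has at least one match, so the repeated 1 → 4 transfer is kept both times.

```julia
using DataFrames

df = DataFrame(
    branch = [1, 1, 2, 1, 1, 2, 2],
    from   = [1, 2, 3, 4, 1, 1, 6],
    to     = [4, 7, 1, 1, 4, 3, 9],
)

# Keep the rows whose reverse transfer exists in the same branch.
matched = semijoin(df, df, on = [:branch => :branch, :from => :to, :to => :from])

# unique(matched) would drop the repeated row if duplicates are not wanted.
nrow(matched)
```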
