How would I use the any function to find the unique members of an arbitrary number of DataFrames?

Hi how’s it going,

I’m looking to write something where I don’t have to hard code how many groups I’m searching through, and find the unique members of each group. For example:

group1 = DataFrame(:member=>["phil","bob","larry","dustin"])
group2 = DataFrame(:member=>["phil","bob",",mike","george"])
group3 = DataFrame(:member=>["larry","kevin",",phil","george"])

Now obviously here I could do a for loop and say something like

for val in unique(group1[!,:member])
    if !(val in group2[!,:member]) && !(val in group3[!,:member])
        # val is unique to group1
    end
end

But how would I do this if I didn’t know how many groups there were, and just wanted to return the members of each group that aren’t in any of the others?


How about this?

julia> using DataFrames

julia> group1 = DataFrame(:member=>["phil","bob","larry","dustin"]);

julia> group2 = DataFrame(:member=>["phil","bob",",mike","george"]);

julia> group3 = DataFrame(:member=>["larry","kevin",",phil","george"]);

julia> dfs = [group1, group2, group3];

julia> uniques = Dict();

julia> for i in 1:length(dfs)
           this_one = dfs[i].member
           other_ones = reduce(vcat, map(t -> t.member, dfs[setdiff(1:length(dfs), i)]))
           uniques[i] = setdiff(this_one, other_ones)
       end

julia> uniques
Dict{Any,Any} with 3 entries:
  2 => [",mike"]
  3 => ["kevin", ",phil"]
  1 => ["dustin"]
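The same idea can also be wrapped in a function that counts how many groups each member appears in and keeps the ones seen exactly once. This is only a sketch: `unique_members` is a made-up name, and it works on plain vectors of members rather than on the DataFrames themselves (with the `dfs` above you would pass something like `[df.member for df in dfs]`). The data is copied from the thread as posted, stray commas included.

```julia
# Hypothetical helper: `groups` is a vector of member vectors,
# e.g. [df.member for df in dfs] when starting from DataFrames.
function unique_members(groups)
    # Count how many groups each member appears in
    # (duplicates inside a single group count once).
    counts = Dict{eltype(first(groups)),Int}()
    for g in groups, m in unique(g)
        counts[m] = get(counts, m, 0) + 1
    end
    # For each group, keep the members that appear in exactly one group.
    return [filter(m -> counts[m] == 1, unique(g)) for g in groups]
end

groups = [
    ["phil", "bob", "larry", "dustin"],
    ["phil", "bob", ",mike", "george"],
    ["larry", "kevin", ",phil", "george"],
]

unique_members(groups)
```

Each member is counted once per group, so this avoids rebuilding the "other groups" vector on every iteration, and the result keeps the groups in their original order instead of going through a Dict.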

thanks man!

What if I wanted to make it a threshold, so that instead of an absolute set difference, I say, “if any value shows up in greater than 75% of the groups, then it’s non unique, other than that it’s unique.”

So in this case, larry and bob would be considered unique since they show up in 2/3 of the groups, which is below the 75% threshold.

Remember that mean (from the Statistics standard library) can take a function as its first argument, so you would do something like

using Statistics

function amount(name, dfs) # dfs is the groups to check, e.g. dfs[Not(i)] to leave out group i
    mean(t -> name in t.member, dfs) # fraction of those groups that contain name
end
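To make the threshold idea concrete, here is one way it might look end to end. It is a sketch under a few assumptions: `share` and `unique_below` are made-up names, the groups are plain vectors of members (e.g. `[df.member for df in dfs]`), and the fraction is taken over all groups, including the member's own, which is how the 2/3 figure in the question is computed.

```julia
using Statistics  # provides mean

# Hypothetical helper: fraction of groups that contain `name`.
share(name, groups) = mean(g -> name in g, groups)

# Hypothetical helper: for each group, keep the members whose share
# across all groups does not exceed `threshold`.
function unique_below(groups; threshold = 0.75)
    return [filter(m -> share(m, groups) <= threshold, unique(g)) for g in groups]
end

groups = [
    ["phil", "bob", "larry", "dustin"],
    ["phil", "bob", ",mike", "george"],
    ["larry", "kevin", ",phil", "george"],
]

unique_below(groups)                   # 75% threshold
unique_below(groups; threshold = 0.5)  # stricter threshold
```

With the data as posted, no member appears in more than 2 of the 3 groups, so at the 75% threshold every member counts as unique. Lowering the threshold to 0.5 leaves only the members that appear in a single group.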