Efficiently finding the frequency of patterns in DataFrame columns

Hi all! I’m looking for help with what might be an easy task.

Say we have two vectors; A = [0,1] and B = [0,1,2]. Matrix C contains every possible combination of A and B such that it has dimensions 6x2.

Now consider a dataset containing N observations again with features A and B. I want to find
the frequency of each pattern and store number in the same order as matrix C.

Below, I show that I can do this with nested for loops, however, in practice, this becomes time-intensive as the dimensions of matrix C increase (e.g. add another vector, q = [0,1,2]). I’m wondering if there is a more efficient way to accomplish this without nested for loops?

I’ve provided some reproducible code below:


using DataFrames
a = [0,1]
b = [0,1,2]

c = zeros(6,2)
k = 1
for i in 0:1
    for j in 0:2
        c[k,:] = [i,j]
        k+=1
    end
end

# all possible combinations of A,B
@show c


# new data
simple = hcat(rand(a,10), rand(b,10))
data = simple |>
    x->DataFrame(x, :auto)


counter = zeros(6,1);
for i in 0:1
    for j in 0:2
        row = 1 + 3*i + j
        counter[row] = sum( @. (data.x1==i && data.x2 == j))    
    end
end



@show counter


1 Like

Not sure I understand you correctly but if a and b are columns in your DataFrame you might be looking for

combine(groupby(df, [:a, :b]), nrow)

Which will give a count of the number of rows for each distinct a/b combination

Thanks for the reply! I thought about this, but my desired output would be something like:

final = hcat(c,counter)

My problem with combine(groupby(df, [:a, :b]), nrow) is that it will likely result in a number of distinct groups that is less than the number of possible a/b combinations accounted for in matrix c.

It’s like I want to merge the result of the groupby/combine with c and fill the nrow with 0 if a certain a/b combo isn’t apparent in data

1 Like

I think I found it…

y = combine(groupby(df, [:a, :b]), nrow)
final = outerjoin(c,y, on = [:x1,:x2])

I see - in that case you probably want to create a new DataFrame from the result of IterTools.product(df.x, df.y) first and then leftjoin the result of the groupby operation onto that. Finally coalesce the nrow column in the joined DataFrame to replace missing with zero if desired.

Using sort!() ensures the same order but there must be a better way:

using DataFrames
a, b = [0,1], [0,1,2]
df = DataFrame(Iterators.product(a, b))
rename!(df,[:a,:b])
sort!(df, [:a, :b])

data = DataFrame(a = rand(a,10), b = rand(b,10))     # new data
dg = combine(groupby(data, [:a, :b]), nrow => :count)
dfc = outerjoin(dg, df, on = [:a, :b])
sort!(dfc, [:a, :b])

julia> dfc   # dfc sorted by [:a, :b] as df

 Row β”‚ a      b      count   
     β”‚ Int64  Int64  Int64?
─────┼───────────────────────
   1 β”‚     0      0        4
   2 β”‚     0      1        1
   3 β”‚     0      2  missing
   4 β”‚     1      0        4
   5 β”‚     1      1  missing
   6 β”‚     1      2        1
1 Like

Yes, that :point_up: is what I meant when answering from my phone this morning :slight_smile:

To me leftjoin is conceptually the more correct join here, as df should have all combinations, i.e. the right table should not add new rows.

Also if you want to preserve the row order on the left hand side you can use leftjoin!:

julia> df = sort!(rename!(DataFrame(Iterators.product(a, b)), [:a, :b]), [:a, :b])
6Γ—2 DataFrame
 Row β”‚ a      b     
     β”‚ Int64  Int64 
─────┼──────────────
   1 β”‚     0      0
   2 β”‚     0      1
   3 β”‚     0      2
   4 β”‚     1      0
   5 β”‚     1      1
   6 β”‚     1      2

julia> leftjoin!(df, 
    combine(groupby(DataFrame(a = rand(a, 10), b = rand(b, 10)), [:a, :b]), nrow => :count), 
    on = [:a, :b])

6Γ—3 DataFrame
 Row β”‚ a      b      count   
     β”‚ Int64  Int64  Int64?  
─────┼───────────────────────
   1 β”‚     0      0        2
   2 β”‚     0      1        2
   3 β”‚     0      2        3
   4 β”‚     1      0        1
   5 β”‚     1      1        2
   6 β”‚     1      2  missing 
2 Likes

The target counts are easy to compute without going through dataframes. Starting with your arrays a, b, c, simple:

julia> using SplitApplyCombine
julia> get.(Ref(groupcount(eachrow(simple))), eachrow(c), 0)
6-element Vector{Int64}:
 2
 4
 0
 2
 1
 1

Further, it can be more convenient to work with c of the 2x3 shape instead of flat 6 length. Counts can then be obtained with the same shape:

julia> c = collect.(Iterators.product(a, b))
2Γ—3 Matrix{Vector{Int64}}:
 [0, 0]  [0, 1]  [0, 2]
 [1, 0]  [1, 1]  [1, 2]

julia> get.(Ref(groupcount(eachrow(simple))), c, 0)
2Γ—3 Matrix{Int64}:
 2  4  0
 2  1  1
2 Likes

@aplavin, using SplitApplyCombine could we get a Dict like below without going through get.(Ref(...), c, 0)?

using SplitApplyCombine
a, b = [0,1], [0,1,2]
data = hcat(rand(a,10), rand(b,10))
c = collect.(Iterators.product(a, b))
d = get.(Ref(groupcount(eachrow(data))), c, 0)
Dict(c .=> d)

Dict{Vector{Int64}, Int64} with 6 entries:
  [0, 0] => 0
  [1, 1] => 1
  [0, 2] => 2
  [1, 0] => 0
  [1, 2] => 4
  [0, 1] => 3

groupcount(eachrow(data)) is already a dictionary like the one you need. It only has values that are actually present in the dataset though, without zero counts - of course, how would the function know the set of all potentially possible values by itself.

1 Like

Thanks for this approach! I’m finding it to be marginally faster than the dataframes approach with dimensions of c =7776x6 and simple = 30274x6

@aplavin, would it be a good idea to initialize the dictionary with all counters at zero, and then update it? For example:

using Dictionaries, SplitApplyCombine
a, b = [0,1], [0,1,2]
c = collect.(Iterators.product(a, b))
data = hcat(rand(a,10), rand(b,10))
dic = merge!(dictionary(c .=> 0), groupcount(eachrow(data)))

6-element Dictionary{Vector{Int64}, Int64}
 [0, 0] β”‚ 4
 [1, 0] β”‚ 4
 [0, 1] β”‚ 0
 [1, 1] β”‚ 0
 [0, 2] β”‚ 0
 [1, 2] β”‚ 2

Well, it can clearly be useful if you need the final counts as a dictionary with zero entries present. But you may want to keep the result β€œsparse”, especially if a small fraction of possible values is actually present. It’s still easy to access counts with get(dict, key, 0).

1 Like