# Efficiently finding the frequency of patterns in DataFrame columns

Hi all! I’m looking for help with what might be an easy task.

Say we have two vectors; A = [0,1] and B = [0,1,2]. Matrix C contains every possible combination of A and B such that it has dimensions 6x2.

Now consider a dataset containing N observations again with features A and B. I want to find
the frequency of each pattern and store number in the same order as matrix C.

Below, I show that I can do this with nested for loops, however, in practice, this becomes time-intensive as the dimensions of matrix C increase (e.g. add another vector, `q = [0,1,2]`). I’m wondering if there is a more efficient way to accomplish this without nested for loops?

I’ve provided some reproducible code below:

``````
using DataFrames
a = [0,1]
b = [0,1,2]

c = zeros(6,2)
k = 1
for i in 0:1
for j in 0:2
c[k,:] = [i,j]
k+=1
end
end

# all possible combinations of A,B
@show c

# new data
simple = hcat(rand(a,10), rand(b,10))
data = simple |>
x->DataFrame(x, :auto)

counter = zeros(6,1);
for i in 0:1
for j in 0:2
row = 1 + 3*i + j
counter[row] = sum( @. (data.x1==i && data.x2 == j))
end
end

@show counter

``````
1 Like

Not sure I understand you correctly but if a and b are columns in your DataFrame you might be looking for

``````combine(groupby(df, [:a, :b]), nrow)
``````

Which will give a count of the number of rows for each distinct a/b combination

`final = hcat(c,counter)`

My problem with `combine(groupby(df, [:a, :b]), nrow)` is that it will likely result in a number of distinct groups that is less than the number of possible a/b combinations accounted for in matrix `c`.

It’s like I want to merge the result of the groupby/combine with c and fill the nrow with 0 if a certain a/b combo isn’t apparent in `data`

1 Like

I think I found it…

``````y = combine(groupby(df, [:a, :b]), nrow)
final = outerjoin(c,y, on = [:x1,:x2])

``````

I see - in that case you probably want to create a new DataFrame from the result of `IterTools.product(df.x, df.y)` first and then `leftjoin` the result of the groupby operation onto that. Finally `coalesce` the nrow column in the joined DataFrame to replace missing with zero if desired.

Using `sort!()` ensures the same order but there must be a better way:

``````using DataFrames
a, b = [0,1], [0,1,2]
df = DataFrame(Iterators.product(a, b))
rename!(df,[:a,:b])
sort!(df, [:a, :b])

data = DataFrame(a = rand(a,10), b = rand(b,10))     # new data
dg = combine(groupby(data, [:a, :b]), nrow => :count)
dfc = outerjoin(dg, df, on = [:a, :b])
sort!(dfc, [:a, :b])

julia> dfc   # dfc sorted by [:a, :b] as df

Row │ a      b      count
│ Int64  Int64  Int64?
─────┼───────────────────────
1 │     0      0        4
2 │     0      1        1
3 │     0      2  missing
4 │     1      0        4
5 │     1      1  missing
6 │     1      2        1
``````
1 Like

Yes, that is what I meant when answering from my phone this morning To me `leftjoin` is conceptually the more correct join here, as `df` should have all combinations, i.e. the right table should not add new rows.

Also if you want to preserve the row order on the left hand side you can use `leftjoin!`:

``````julia> df = sort!(rename!(DataFrame(Iterators.product(a, b)), [:a, :b]), [:a, :b])
6×2 DataFrame
Row │ a      b
│ Int64  Int64
─────┼──────────────
1 │     0      0
2 │     0      1
3 │     0      2
4 │     1      0
5 │     1      1
6 │     1      2

julia> leftjoin!(df,
combine(groupby(DataFrame(a = rand(a, 10), b = rand(b, 10)), [:a, :b]), nrow => :count),
on = [:a, :b])

6×3 DataFrame
Row │ a      b      count
│ Int64  Int64  Int64?
─────┼───────────────────────
1 │     0      0        2
2 │     0      1        2
3 │     0      2        3
4 │     1      0        1
5 │     1      1        2
6 │     1      2  missing
``````
2 Likes

The target counts are easy to compute without going through dataframes. Starting with your arrays `a, b, c, simple`:

``````julia> using SplitApplyCombine
julia> get.(Ref(groupcount(eachrow(simple))), eachrow(c), 0)
6-element Vector{Int64}:
2
4
0
2
1
1
``````

Further, it can be more convenient to work with `c` of the 2x3 shape instead of flat 6 length. Counts can then be obtained with the same shape:

``````julia> c = collect.(Iterators.product(a, b))
2×3 Matrix{Vector{Int64}}:
[0, 0]  [0, 1]  [0, 2]
[1, 0]  [1, 1]  [1, 2]

julia> get.(Ref(groupcount(eachrow(simple))), c, 0)
2×3 Matrix{Int64}:
2  4  0
2  1  1
``````
2 Likes

@aplavin, using SplitApplyCombine could we get a `Dict` like below without going through `get.(Ref(...), c, 0)`?

``````using SplitApplyCombine
a, b = [0,1], [0,1,2]
data = hcat(rand(a,10), rand(b,10))
c = collect.(Iterators.product(a, b))
d = get.(Ref(groupcount(eachrow(data))), c, 0)
Dict(c .=> d)

Dict{Vector{Int64}, Int64} with 6 entries:
[0, 0] => 0
[1, 1] => 1
[0, 2] => 2
[1, 0] => 0
[1, 2] => 4
[0, 1] => 3
``````

`groupcount(eachrow(data))` is already a dictionary like the one you need. It only has values that are actually present in the dataset though, without zero counts - of course, how would the function know the set of all potentially possible values by itself.

1 Like

Thanks for this approach! I’m finding it to be marginally faster than the dataframes approach with dimensions of `c =7776x6` and `simple = 30274x6`

@aplavin, would it be a good idea to initialize the dictionary with all counters at zero, and then update it? For example:

``````using Dictionaries, SplitApplyCombine
a, b = [0,1], [0,1,2]
c = collect.(Iterators.product(a, b))
data = hcat(rand(a,10), rand(b,10))
dic = merge!(dictionary(c .=> 0), groupcount(eachrow(data)))

6-element Dictionary{Vector{Int64}, Int64}
[0, 0] │ 4
[1, 0] │ 4
[0, 1] │ 0
[1, 1] │ 0
[0, 2] │ 0
[1, 2] │ 2
``````

Well, it can clearly be useful if you need the final counts as a dictionary with zero entries present. But you may want to keep the result “sparse”, especially if a small fraction of possible values is actually present. It’s still easy to access counts with `get(dict, key, 0)`.

1 Like