Efficiently finding the frequency of patterns in DataFrame columns

Matthew · December 30, 2021, 10:05pm

Hi all! I’m looking for help with what might be an easy task.

Say we have two vectors; A = [0,1] and B = [0,1,2]. Matrix C contains every possible combination of A and B such that it has dimensions 6x2.

Now consider a dataset containing N observations again with features A and B. I want to find
the frequency of each pattern and store number in the same order as matrix C.

Below, I show that I can do this with nested for loops, however, in practice, this becomes time-intensive as the dimensions of matrix C increase (e.g. add another vector, q = [0,1,2]). I’m wondering if there is a more efficient way to accomplish this without nested for loops?

I’ve provided some reproducible code below:


using DataFrames
a = [0,1]
b = [0,1,2]

c = zeros(6,2)
k = 1
for i in 0:1
    for j in 0:2
        c[k,:] = [i,j]
        k+=1
    end
end

# all possible combinations of A,B
@show c


# new data
simple = hcat(rand(a,10), rand(b,10))
data = simple |>
    x->DataFrame(x, :auto)


counter = zeros(6,1);
for i in 0:1
    for j in 0:2
        row = 1 + 3*i + j
        counter[row] = sum( @. (data.x1==i && data.x2 == j))    
    end
end



@show counter

nilshg · December 30, 2021, 11:04pm

Not sure I understand you correctly but if a and b are columns in your DataFrame you might be looking for

combine(groupby(df, [:a, :b]), nrow)

Which will give a count of the number of rows for each distinct a/b combination

Matthew · December 30, 2021, 11:37pm

Thanks for the reply! I thought about this, but my desired output would be something like:

final = hcat(c,counter)

My problem with combine(groupby(df, [:a, :b]), nrow) is that it will likely result in a number of distinct groups that is less than the number of possible a/b combinations accounted for in matrix c.

It’s like I want to merge the result of the groupby/combine with c and fill the nrow with 0 if a certain a/b combo isn’t apparent in data

Matthew · December 30, 2021, 11:41pm

I think I found it…

y = combine(groupby(df, [:a, :b]), nrow)
final = outerjoin(c,y, on = [:x1,:x2])

nilshg · December 31, 2021, 6:14am

I see - in that case you probably want to create a new DataFrame from the result of IterTools.product(df.x, df.y) first and then leftjoin the result of the groupby operation onto that. Finally coalesce the nrow column in the joined DataFrame to replace missing with zero if desired.

rafael.guerra · December 31, 2021, 9:28am

Using sort!() ensures the same order but there must be a better way:

using DataFrames
a, b = [0,1], [0,1,2]
df = DataFrame(Iterators.product(a, b))
rename!(df,[:a,:b])
sort!(df, [:a, :b])

data = DataFrame(a = rand(a,10), b = rand(b,10))     # new data
dg = combine(groupby(data, [:a, :b]), nrow => :count)
dfc = outerjoin(dg, df, on = [:a, :b])
sort!(dfc, [:a, :b])

julia> dfc   # dfc sorted by [:a, :b] as df

 Row │ a      b      count   
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     0      0        4
   2 │     0      1        1
   3 │     0      2  missing
   4 │     1      0        4
   5 │     1      1  missing
   6 │     1      2        1

nilshg · December 31, 2021, 10:47am

Yes, that is what I meant when answering from my phone this morning

To me leftjoin is conceptually the more correct join here, as df should have all combinations, i.e. the right table should not add new rows.

Also if you want to preserve the row order on the left hand side you can use leftjoin!:

julia> df = sort!(rename!(DataFrame(Iterators.product(a, b)), [:a, :b]), [:a, :b])
6×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     0      0
   2 │     0      1
   3 │     0      2
   4 │     1      0
   5 │     1      1
   6 │     1      2

julia> leftjoin!(df, 
    combine(groupby(DataFrame(a = rand(a, 10), b = rand(b, 10)), [:a, :b]), nrow => :count), 
    on = [:a, :b])

6×3 DataFrame
 Row │ a      b      count   
     │ Int64  Int64  Int64?  
─────┼───────────────────────
   1 │     0      0        2
   2 │     0      1        2
   3 │     0      2        3
   4 │     1      0        1
   5 │     1      1        2
   6 │     1      2  missing

aplavin · December 31, 2021, 11:23am

The target counts are easy to compute without going through dataframes. Starting with your arrays a, b, c, simple:

julia> using SplitApplyCombine
julia> get.(Ref(groupcount(eachrow(simple))), eachrow(c), 0)
6-element Vector{Int64}:
 2
 4
 0
 2
 1
 1

Further, it can be more convenient to work with c of the 2x3 shape instead of flat 6 length. Counts can then be obtained with the same shape:

julia> c = collect.(Iterators.product(a, b))
2×3 Matrix{Vector{Int64}}:
 [0, 0]  [0, 1]  [0, 2]
 [1, 0]  [1, 1]  [1, 2]

julia> get.(Ref(groupcount(eachrow(simple))), c, 0)
2×3 Matrix{Int64}:
 2  4  0
 2  1  1

rafael.guerra · December 31, 2021, 12:02pm

@aplavin, using SplitApplyCombine could we get a Dict like below without going through get.(Ref(...), c, 0)?

using SplitApplyCombine
a, b = [0,1], [0,1,2]
data = hcat(rand(a,10), rand(b,10))
c = collect.(Iterators.product(a, b))
d = get.(Ref(groupcount(eachrow(data))), c, 0)
Dict(c .=> d)

Dict{Vector{Int64}, Int64} with 6 entries:
  [0, 0] => 0
  [1, 1] => 1
  [0, 2] => 2
  [1, 0] => 0
  [1, 2] => 4
  [0, 1] => 3

aplavin · December 31, 2021, 12:19pm

groupcount(eachrow(data)) is already a dictionary like the one you need. It only has values that are actually present in the dataset though, without zero counts - of course, how would the function know the set of all potentially possible values by itself.

Matthew · December 31, 2021, 4:14pm

Thanks for this approach! I’m finding it to be marginally faster than the dataframes approach with dimensions of c =7776x6 and simple = 30274x6

rafael.guerra · January 1, 2022, 11:25am

@aplavin, would it be a good idea to initialize the dictionary with all counters at zero, and then update it? For example:

using Dictionaries, SplitApplyCombine
a, b = [0,1], [0,1,2]
c = collect.(Iterators.product(a, b))
data = hcat(rand(a,10), rand(b,10))
dic = merge!(dictionary(c .=> 0), groupcount(eachrow(data)))

6-element Dictionary{Vector{Int64}, Int64}
 [0, 0] │ 4
 [1, 0] │ 4
 [0, 1] │ 0
 [1, 1] │ 0
 [0, 2] │ 0
 [1, 2] │ 2

aplavin · January 1, 2022, 9:27pm

Well, it can clearly be useful if you need the final counts as a dictionary with zero entries present. But you may want to keep the result “sparse”, especially if a small fraction of possible values is actually present. It’s still easy to access counts with get(dict, key, 0).

Topic		Replies	Views
Creating a 3d frequency array for categorical variables from a dataframe New to Julia dataframes	3	682	February 12, 2021
Counts of unique values per group in a DataFrame Data question , dataframes	3	10215	May 25, 2020
Compute frequency or proportions on grouped dataframes Data dataframes	10	1165	June 14, 2021
How to find frequency count for consecutive actions? Data dataframes , inmemorydatasets	10	616	May 1, 2022
Dataframes: Split combined result to different columns General Usage dataframes	3	310	December 13, 2021

Efficiently finding the frequency of patterns in DataFrame columns

Related topics