Hi all! I'm looking for help with what might be an easy task.
Say we have two vectors, A = [0,1] and B = [0,1,2]. Matrix C contains every possible combination of A and B, so it has dimensions 6x2.
Now consider a dataset containing N observations, again with features A and B. I want to find the frequency of each pattern and store the counts in the same order as the rows of matrix C.
Below, I show that I can do this with nested for loops. In practice, however, this becomes time-intensive as the dimensions of matrix C increase (e.g. after adding another vector, q = [0,1,2]). Is there a more efficient way to accomplish this without nested for loops?
I've provided some reproducible code below:
using DataFrames
a = [0,1]
b = [0,1,2]
c = zeros(6,2)
k = 1
for i in 0:1
    for j in 0:2
        c[k,:] = [i,j]
        k+=1
    end
end
# all possible combinations of A,B
@show c
# new data
simple = hcat(rand(a,10), rand(b,10))
data = DataFrame(simple, :auto)
counter = zeros(6,1);
for i in 0:1
    for j in 0:2
        # row of c that holds the combination (i, j)
        row = 1 + 3*i + j
        counter[row] = sum( @. (data.x1==i && data.x2 == j))
    end
end
@show counter
Thanks for the reply! I thought about this, but my desired output would be something like:
final = hcat(c,counter)
My problem with combine(groupby(df, [:a, :b]), nrow) is that it will likely result in a number of distinct groups that is less than the number of possible a/b combinations accounted for in matrix c.
It's like I want to merge the result of the groupby/combine with c and fill nrow with 0 if a certain a/b combination isn't present in the data.
I see - in that case you probably want to create a new DataFrame from the result of Iterators.product(a, b) (the vectors of possible values, not the data columns, which contain repeats) first and then leftjoin the result of the groupby operation onto that. Finally, coalesce the nrow column in the joined DataFrame to replace missing with zero if desired.
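A minimal sketch of that join approach, assuming the grid is built from the known value vectors a and b and that the data columns are named x1 and x2 as in the question. The names grid, counts and joined are only illustrative. The final sort! is there because I would not rely on leftjoin preserving the row order of the grid.

```julia
using DataFrames

a = [0, 1]
b = [0, 1, 2]

# Example data with the same column names as in the question
data = DataFrame(x1 = rand(a, 10), x2 = rand(b, 10))

# Every possible combination. product(b, a) varies b fastest,
# so each tuple is (value from b, value from a).
combos = vec(collect(Iterators.product(b, a)))
grid = DataFrame(x1 = last.(combos), x2 = first.(combos))

# Counts for the combinations that actually occur in the data
counts = combine(groupby(data, [:x1, :x2]), nrow => :count)

# Keep every row of the grid; combinations absent from the data get `missing`
joined = leftjoin(grid, counts, on = [:x1, :x2])
joined.count = coalesce.(joined.count, 0)

# Restore the row order of matrix c (x1 slowest, x2 fastest)
sort!(joined, [:x1, :x2])

final = Matrix(joined)   # 6x3: the two columns of c plus the counts
```

Adding a third feature should only require one more argument to Iterators.product and one more column in the grid and in the groupby/on lists.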
groupcount(eachrow(data)) (from SplitApplyCombine.jl) already gives you a dictionary like the one you need. It only has entries for values that are actually present in the dataset, though, without zero counts. After all, the function has no way of knowing the full set of possible values by itself.
Well, it can clearly be useful if you need the final counts as a dictionary with zero entries present. But you may want to keep the result "sparse", especially if only a small fraction of the possible values is actually present. It's still easy to access counts with get(dict, key, 0).
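To illustrate the sparse idea, here is a small sketch that uses a plain Base Dict keyed by tuples as a stand-in for the groupcount result (this is an assumption to keep the example dependency-free, not what groupcount returns). The names obs, counts and dense are made up for the example.

```julia
a = [0, 1]
b = [0, 1, 2]

# Example observations as (A, B) tuples
obs = [(rand(a), rand(b)) for _ in 1:10]

# Sparse counts: only combinations that occur get an entry
counts = Dict{Tuple{Int,Int},Int}()
for key in obs
    counts[key] = get(counts, key, 0) + 1
end

# Looking up any combination falls back to 0 when it never occurred
n01 = get(counts, (0, 1), 0)

# If the dense vector is needed after all, fill it in the row order of c
dense = [get(counts, (i, j), 0) for i in a for j in b]
```

With the DataFrame from the question, the same loop could run over zip(data.x1, data.x2) instead of obs.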