Creating a 3d frequency array for categorical variables from a dataframe

I have a dataframe with three categorical variables a, b, c, taking values in 1:2, 1:5, and 1:7 respectively. I then calculate the number of rows with each of the possible 2*5*7=70 combinations of a, b, c:

df = transform(groupby(df, [:a,:b,:c]), nrow => :count)
df = unique(df[:, [:a, :b, :c, :count]])

Now I want to create a 3D 2*5*7 array called frequency such that frequency[i,j,k] is equal to the value of count in the row corresponding to a=i, b=j, c=k.

How can I do this? I suspect the unstack function may help but I don’t understand that function well.

Here’s what I would do using DataFramesMeta.jl and Chain.jl:

julia> using Chain, DataFramesMeta;

julia> N = 1000;  # sample size; not given in the original post, any positive integer works

julia> df = DataFrame(a = rand(1:2, N), b = rand(1:5, N), c = rand(1:7, N));

julia> df_count = @chain df begin 
           groupby([:a, :b, :c])
           @combine(:count = length(:a))
       end;

julia> combinations = reshape(collect(Iterators.product(1:2, 1:5, 1:7)), :);

julia> df_full = @chain combinations begin 
           DataFrame
           rename!([:a, :b, :c])
       end;

julia> df_fraction = @chain df_full begin 
           leftjoin(df_count, on = [:a, :b, :c])
           @transform(:count = coalesce.(:count, 0))  # broadcast: replace each missing count with 0
           @transform(:fraction = :count ./ sum(:count))
       end;

Thanks, but I’m not seeing how to access the frequency at particular indices. Suppose I want to know the frequency with which a=1, b=2, c=3 occurs. I guess I could convert the dataframe into a matrix, loop over every row, and say: if matrix[_, 1] == 1 && matrix[_, 2] == 2 && matrix[_, 3] == 3, then give me matrix[_, 4]. But I figured there would be a more systematic way.

Ah yes I see now.

This is a commonly requested feature in DataFrames. There is no way to index columns for an easy lookup like that.

You can make a GroupedDataFrame grouping on [:a, :b, :c] and then index like

gd[(a = 1, b = 2, c = 3)]

This will be fast since a GroupedDataFrame creates a hash for lookup just like a Dict.

However this will return a SubDataFrame, which is itself not the nicest object.
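To make the return type concrete, here is a minimal sketch on a made-up table (the names toy, toy_gd and the column x are only for illustration, not part of the data above). It shows that indexing by a NamedTuple gives back a one-row SubDataFrame, so you still have to pull the scalar out yourself:

```julia
using DataFrames

# Made-up table, only for illustration
toy = DataFrame(a = [1, 1, 2], b = [1, 2, 1], x = [10, 20, 30])
toy_gd = groupby(toy, [:a, :b])

sub = toy_gd[(a = 1, b = 2)]   # a one-row SubDataFrame, not a number
value = only(sub.x)            # extract the scalar from the single row

# haskey reports whether a group exists; indexing with an absent key would throw
missing_group = haskey(toy_gd, (a = 2, b = 2))
```

Checking with haskey first is one way to handle combinations that never occur in the data.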

I would do the following

julia> gd = groupby(df_fraction, [:a, :b, :c]);

julia> function get_fraction(gd; a = nothing, b = nothing, c = nothing)
           only(gd[(;a, b, c)]).fraction
       end
get_fraction (generic function with 1 method)

julia> get_fraction(gd; a = 1, b = 2, c = 3)
0.018
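
If what you ultimately need is the actual 2*5*7 array from the original question, one option is to allocate an array of zeros and fill it from the counts table. This is only a sketch using plain DataFrames.jl, not something taken from the replies above; the seed and N = 1000 are arbitrary assumptions.

```julia
using DataFrames, Random

Random.seed!(1)   # arbitrary seed, only for reproducibility
N = 1000          # arbitrary sample size (assumption)
df = DataFrame(a = rand(1:2, N), b = rand(1:5, N), c = rand(1:7, N))

# One row per (a, b, c) combination that actually occurs in df
df_count = combine(groupby(df, [:a, :b, :c]), nrow => :count)

# Combinations that never occur keep the initial value 0
frequency = zeros(Int, 2, 5, 7)
for row in eachrow(df_count)
    frequency[row.a, row.b, row.c] = row.count
end

frequency[1, 2, 3]                     # count for a = 1, b = 2, c = 3
fraction = frequency ./ sum(frequency) # same shape, as fractions of all rows
```

This relies on a, b, c being integers starting at 1, so they can be used directly as array indices. For other category labels you would first need to map each label to an index.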