I have a dataframe with three categorical variables a, b, c taking values in 1:2, 1:5, and 1:7 respectively. I then calculate the number of rows with each of the possible 2*5*7=70 combinations of a, b, c:
df = transform(groupby(df, [:a,:b,:c]), nrow => :count)
df = unique(df[:, [:a, :b, :c, :count]])
Now I want to create a 3d 2*5*7 array called frequency such that frequency[i,j,k] is equal to the value of count in the row corresponding to a=i, b=j, c=k.
How can I do this? I suspect the unstack function may help, but I don’t understand that function well.
Here’s what I would do using DataFramesMeta and Chain.jl:
julia> using Chain, DataFramesMeta;
julia> N = 1_000;  # sample size; an assumed value, not given in the post
julia> df = DataFrame(a = rand(1:2, N), b = rand(1:5, N), c = rand(1:7, N));
julia> df_count = @chain df begin
           groupby([:a, :b, :c])
           @combine(:count = length(:a))   # rows per observed combination
       end;
julia> combinations = reshape(collect(Iterators.product(1:2, 1:5, 1:7)), :);
julia> df_full = @chain combinations begin
           DataFrame                       # one row per (a, b, c) tuple
           rename!([:a, :b, :c])
       end;
julia> df_fraction = @chain df_full begin
           leftjoin(df_count, on = [:a, :b, :c])
           @transform(:count = coalesce.(:count, 0))   # unobserved combinations get 0
           @transform(:fraction = :count ./ sum(:count))
       end;
Thanks, but I’m not seeing how to access the frequency at particular indices. Suppose I want to know the frequency with which a=1, b=2, c=3 occurs. I guess I could convert the dataframe into a matrix, loop over every row, and say: if matrix[_,1] == 1 && matrix[_,2] == 2 && matrix[_,3] == 3, then give me matrix[_,4]. But I figured there would be a more systematic way.
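The row-by-row check described above can be written without an explicit loop by combining element-wise comparisons into a Boolean mask. This is only a minimal sketch: the small df_count table and the names mask and lookup are hypothetical stand-ins, not part of the original thread.

```julia
using DataFrames

# Hypothetical counts table with one row per (a, b, c) combination.
df_count = DataFrame(a = [1, 1, 2], b = [2, 3, 2], c = [3, 3, 7], count = [4, 1, 6])

# Element-wise comparisons combined into a single Boolean mask over the rows.
mask = (df_count.a .== 1) .& (df_count.b .== 2) .& (df_count.c .== 3)

# only() errors unless exactly one row matches, which guards against duplicates.
lookup = only(df_count.count[mask])
```

This scans every row on each lookup, so it is fine for a 70-row table but is not an indexed lookup.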
Ah yes I see now.
This is a commonly requested feature in DataFrames. There is no way to index columns for an easy lookup like that.
You can make a GroupedDataFrame grouping on [:a, :b, :c] and then index like
gd[(a = 1, b = 2, c = 3)]
This will be fast since a GroupedDataFrame creates a hash for lookup, just like a dictionary. However, this will return a SubDataFrame, which is itself not the nicest object.
I would do the following:
julia> gd = groupby(df_fraction, [:a, :b, :c]);
julia> function get_fraction(gd; a = nothing, b = nothing, c = nothing)
           # gd[(; a, b, c)] is a one-row SubDataFrame; only() extracts that row
           only(gd[(; a, b, c)]).fraction
       end
get_fraction (generic function with 1 method)
julia> get_fraction(gd; a = 1, b = 2, c = 3)
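For the original goal of a 3d array, one option is to fill a preallocated array directly from the counts. The following is a minimal sketch rather than a definitive approach: it assumes a, b, c are integers in 1:2, 1:5, and 1:7 so they can be used as array indices, and the sample size N and the random data are hypothetical.

```julia
using DataFrames

# Hypothetical example data; N and the random values are assumptions.
N = 1_000
df = DataFrame(a = rand(1:2, N), b = rand(1:5, N), c = rand(1:7, N))

# One row per observed (a, b, c) combination, with its row count.
df_count = combine(groupby(df, [:a, :b, :c]), nrow => :count)

# Start from zeros so combinations that never occur keep a count of 0.
frequency = zeros(Int, 2, 5, 7)
for row in eachrow(df_count)
    frequency[row.a, row.b, row.c] = row.count
end

frequency[1, 2, 3]  # number of rows with a = 1, b = 2, c = 3
```

Dividing by the total, as in frequency ./ sum(frequency), gives the fractions instead of raw counts.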