Writing Multidimensional FreqTable to CSV

mrufsvold · March 27, 2024, 4:56pm

I need to take a cross-tab of a DataFrame and write it out to CSV. This cross-tab has 3 dimensions. If I were to use pandas for this, it would create a multi-index, but FreqTables.jl creates a multidimensional NamedArray. Is there a Julian way to horizontally concatenate the 3rd dimension and note “col_3 = val” for each slice before writing it out to a table?

pdeffebach · March 27, 2024, 7:25pm

Here is an absolutely indecipherable method I just came up with. It will get the job done no matter the number of dimensions you are tabbing over

julia> function iteratenames(t)
           dns = (:count, dimnames(t)...)
           nms = names(t)
           out = map(vec(CartesianIndices(t))) do i
               val = t[i]
               indices = ntuple(length(nms)) do j
                  nms[j][i[j]]
               end
               vals = (val, indices...)
               NamedTuple{dns}(vals)
           end
           DataFrame(out)
       end

pdeffebach · March 27, 2024, 7:30pm

Honestly, It’s probably best to just use a split-apply-combine operation

julia> using DataFramesMeta

julia> @by df [:x, :y, :z] :count = length($1)

mrufsvold · March 27, 2024, 7:35pm

I’m currently rewriting a Python/R project into Julia to demonstrate some of its advantages. I was hoping to maximize the amount of one-to-one translation by using Tidier.jl to emulate the R code and generally reduce the amount of Julia-lookin’ code like DataFrames DSL or DataFramesMeta when translating the Python. I’m hoping to optimize for readability for someone who knows the previous project well.

All this to say, I wish that FreqTables or DataFrames had a crosstab method just so it is clear what I’m doing, but you’re absolutely write that split-apply-combine is more elegant!

pdeffebach · March 27, 2024, 7:40pm

Some might say the DataFramesMeta.jl DSL has some advantages! An opportunity to improve on R, if you will.

But yeah, at the very least something like

DataFrame(tablelike(freqtable(df))

should be easy to implement.

mrufsvold · March 27, 2024, 7:45pm

And by “some” you mean me! I always use DataFramesMeta (and now a little Tidier.jl, it’s really quite nice).

I just want to lean a little bit more towards “you can have your familiar syntax and eat some yummy performance too!” for this round. If I get buy in for folks, then there’s the second push of “Oh, and we have better ways of doing things around here”

mrufsvold · March 27, 2024, 8:37pm

Here is what I ended up going with

using DataFrames
using TidierData

function cross_tab(df, var1, vars...)
    make_col_name = (row) -> begin
        return join(["$var = $(row[var])" for var in vars], ",")
    end
    
    counts = @chain df begin
        @select(!!var1, !!vars...)
        @group_by(!!vars..., !!var1)
        @mutate(n = n())
        @slice(1)
        @ungroup
        transform(
            AsTable(:) => ByRow(make_col_name) => :column_name
            )
        select(Not(vars...))
        @pivot_wider(
            names_from = :column_name,
            values_from = :n
        )
    end
end