Bad performance of group_by of DataFrames - updated -

Hello,

I am working on a small dataset with 30 rows and 19 columns.

I do a group_by on 3 colums and aggregate between one and all
colums, using sum or maximum as aggregation functions.

The number of colums that are queried varies, also their
order.

As a result the DataFrame.by function is getting recompiled
for each query, which takes about 50 to 80 ms on my laptop.

While this doesn’t sound much, in reality we might have queries
with 200 columns and we might need four calls to group_by per
web request, in the result the response time gets way too
slow.

I think improving the performance of my function “aggregate”
by a factor of 10 should be possible by avoiding any
recompilation.

I already tried to use the macro “@nospecialize”, but it
did not help.

Any suggestions how to get a better performance?

The test code:

# Purpose of this test:
# Reproduce the bad performance of aggregate, when the number and the order of the named_tuple argument changes.
using DataFrames

# TODO: find an implementation that doesn't recompile when the number of colums or their order change
function aggregate(table, agg_clause, (named_tuple), prn=false)
    # println(named_tuple)
    result, t, bytes, gctime, memallocs = @timed by(table, agg_clause, named_tuple, sort=true)
    if prn
        println("time: $(round(t, digits=3))")
    end
    return result
end

# Testdata
ALL = Symbol[:CAT_UTRECHT_CNT, :CAT_MOUSE, :WW_OP, :QWE_GRP, :HWR_KT, :CAT_WEIGHT, :HIGH_GRP, :CAT_WEIGHT_C, :CAT_CMI_C, :CAT_MOUSE_C, :CAT_FREIG_WGT_CNT, :CAT_FREIG_WGT_CNT_C, :AMS_CMI_TOT, :CAT_UTRECHT_DIS_CNT, :AMS_CMI_HRV, 
:GVZ_WXK, :AMS_CMI_DIFF, :CAT_UTRECHT_CNT_C]

agg_clause = Symbol[:HIGH_GRP, :QWE_GRP, :WW_OP]
NAMED_TUPLE = (CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_CMI_C_T = :CAT_CMI_C => maximum, 
                 CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, 
                 CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, 
                 CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, WW_OP = :WW_OP => maximum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, 
                 HWR_KT = :HWR_KT => maximum, CAT_WEIGHT = :CAT_WEIGHT => sum, CAT_MOUSE = :CAT_MOUSE => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, 
                 CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_CMI_C = :CAT_CMI_C => sum, 
                 CAT_MOUSE_C = :CAT_MOUSE_C => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, GVZ_WXK = :GVZ_WXK => sum, 
                 CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum)

# add a column with random Float64 values and the name col to the table
function add_column(table, col::Symbol)
    COL = rand(Float64, nrow(table))
    table[!, col] = COL 
end

# build a named tuple with the given length, using NAMED_TUPLE as input
function make_named_tuple(len)
    nt = NAMED_TUPLE
    @assert len <= length(nt)
    t_names = propertynames(nt)
    pairs = Vector{Pair}(undef, len)
    indices = []
    new_names = []
    for i in 1:len
        index = rand(setdiff(1:length(nt), indices))
        push!(indices, index)
        pairs[i] = values(nt)[index]
        push!(new_names, t_names[index])
    end
    return NamedTuple{Tuple(new_names)}(Tuple(pairs))
end

# create data frame with 30 rows
jtab = DataFrame([1:30])

# add columns, using random values
for col in ALL
    global jtab
    add_column(jtab, col)
end

# create random aggregates of this dataframe and determine the time
for i = 1:length(NAMED_TUPLE)
    nt = make_named_tuple(i)
    res = aggregate(jtab, agg_clause, nt, true)
end
nothing

Example output, second run:

julia> include("local_tests/test_aggregate2.jl")
time: 0.046
time: 0.054
time: 0.062
time: 0.057
time: 0.057
time: 0.061
time: 0.06
time: 0.063
time: 0.061
time: 0.066
time: 0.065
time: 0.074
time: 0.072
time: 0.075
time: 0.072
time: 0.069
time: 0.069
time: 0.071
time: 0.074
time: 0.075
time: 0.076
time: 0.078
time: 0.076

Any help appreciated!

Why are you profiling when add_column! creates a random values? How is that helping your aggregation problem?

Well, the data that is queried might change, and if you query always the same data the aggregation function might use some caching that gives misleading benchmark results.

I understand now.

I think you are doing everything right in terms of by, and I don’t think there is any particular DataFrames solution that will make this faster.

A lot of work has been put into making the concatenation step and the groupby step very optimized,so you are probably right that function compilation is the bottleneck here.

I’m sure you know this despite a NamedTuple being typed, DataFrames inputting a named tuple doesn’t provide any performance benefit, since the NamedTuple just stores the functions used, and none of the information about how the functions will be called.

It would be great if you could benchmark this using JuliaDB.

Using legacy API of by avoids recompilation. Here is how you can do it (this is a quick hack and I did not try to squeeze out maximum performance but I hope it is clear what is going on):

julia> function agg1(table, agg_clause, x, prn=false)
           # println(named_tuple)
           result, t, bytes, gctime, memallocs = @timed by(table, agg_clause, sort=true) do sdf
               r = DataFrame()
               for (a,b,c) in x
                   r[!, a] = [c(sdf[!, b])]
               end
               r
           end
           if prn
               println("time: $(round(t, digits=3))")
           end
           return result
       end
agg1 (generic function with 2 methods)

julia> for i = 1:length(NAMED_TUPLE)
           nt = make_named_tuple(i)
           x = []
           for (a,(b, c)) in pairs(nt)
               push!(x, [a, b, c])
           end
           res = agg1(jtab, agg_clause, x, true)
       end
time: 0.012
time: 0.0
time: 0.0
time: 0.0
time: 0.0
time: 0.001
time: 0.001
time: 0.0
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.001
time: 0.002
time: 0.002
1 Like

Very good! Thank you very much!
I tried a slightly different version:

function aggregate1(table, agg_clause, @nospecialize(named_tuple), prn=false)
    # println(named_tuple)
    x = []
    for (a,(b, c)) in pairs(named_tuple)
        push!(x, [a, b, c])
    end
    result, t, bytes, gctime, memallocs = @timed by(table, agg_clause, sort=true) do sdf
        r = DataFrame()
        for (a,b,c) in x
            r[!, a] = [c(sdf[!, b])]
        end
        r
    end
    if prn
        println("time: $(round(t, digits=3))")
    end
    return result
end

Still need to check if the result is correct, but it is very fast (about 40 times faster than the original version) :slight_smile:

JuliaDB is much worse. At least the last time I tried.

In JuliaDB you cannot avoid recompilation. Your use case is one of the key scenarios where type-instability of DataFrames.jl pays off :smile:.

Regarding the fix - yes - this will also work, but if you do not want to specialize on this NamedTuple then equally you could just avoid creating it in the first place (I left creation of the named tuple, as removing it required to much refactoring of your original code so I introduced this “NamedTuple deconstructing” loop).

Any intuition for why the anonymous function syntax is faster? I thought it wasn’t supposed to be that way

It is slower than type-stable case, but it does not require recompilation. As the processed data frame is very small the cost of compilation is of orders of magnitude larger than the cost of function execution (even if it is type unstable).

It would be a different case if the data frame had like 10,000,000 rows, as then recompilation cost would be negligible.

1 Like

Unluckily the problem re-appeared after fixing a bug in the test-data generation. I have three columns now of type String (they are also used for group_by), and now the original solution does not work anymore. Well, it helps a little bit, but not always.
New code:

# Purpose of this test:
# Reproduce the bad performance of aggregate, when the number and the order of the named_tuple argument changes.

using DataFrames

# This function works fine if all colums are Float64, but not so good if it is a mix
# of String and Float64
function aggregate1(table, agg_clause, @nospecialize(named_tuple), prn=false)
    # println(named_tuple)
    x = []
    for (a,(b, c)) in pairs(named_tuple)
        push!(x, [a, b, c])
    end
    result, t, bytes, gctime, memallocs = @timed by(table, agg_clause, sort=true) do sdf
        r = DataFrame()
        for (a,b,c) in x
            r[!, a] = [c(sdf[!, b])]
        end
        r
    end
    if prn
        println("time: $(round(t, digits=3))")
    end
    return result
end

# Testdata
ALL = Symbol[:CAT_UTRECHT_CNT, :CAT_MOUSE, :WW_OP, :QWE_GRP, :HWR_KT, :CAT_WEIGHT, :HIGH_GRP, :CAT_WEIGHT_C, :CAT_CMI_C, :CAT_MOUSE_C, :CAT_FREIG_WGT_CNT, :CAT_FREIG_WGT_CNT_C, :AMS_CMI_TOT, :CAT_UTRECHT_DIS_CNT, :AMS_CMI_HRV, 
:GVZ_WXK, :AMS_CMI_DIFF, :CAT_UTRECHT_CNT_C]

agg_clause = Symbol[:HIGH_GRP, :QWE_GRP, :WW_OP]
NAMED_TUPLE = (CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_CMI_C_T = :CAT_CMI_C => maximum, 
                 CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, 
                 CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, 
                 CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, WW_OP = :WW_OP => maximum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, 
                 HWR_KT = :HWR_KT => maximum, CAT_WEIGHT = :CAT_WEIGHT => sum, CAT_MOUSE = :CAT_MOUSE => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, 
                 CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_CMI_C = :CAT_CMI_C => sum, 
                 CAT_MOUSE_C = :CAT_MOUSE_C => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, GVZ_WXK = :GVZ_WXK => sum, 
                 CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum)

# add a column with random Float64 values and the name col to the table
function add_column(table, col::Symbol)
    COL = rand(Float64, nrow(table))
    table[!, col] = COL 
end

# add a column with random Float64 values and the name col to the table
function add_string_column(table, col::Symbol)
    COL = String[]
    for i in 1:nrow(table)
        if i % 3 == 0 
            push!(COL, "HUND")
        elseif i % 3 == 1
            push!(COL, "HASE")
        elseif i % 3 == 2
            push!(COL, "IGEL")
        end
    end
    table[!, col] = COL 
end

# build a named tuple with the given length, using NAMED_TUPLE as input
function make_named_tuple(len)
    nt = NAMED_TUPLE
    @assert len <= length(nt)
    t_names = propertynames(nt)
    pairs = Vector{Pair}(undef, len)
    indices = []
    new_names = []
    for i in 1:len
        index = rand(setdiff(1:length(nt), indices))
        push!(indices, index)
        pairs[i] = values(nt)[index]
        push!(new_names, t_names[index])
    end
    return NamedTuple{Tuple(new_names)}(Tuple(pairs))
end

# create data frame with 30 rows
jtab = DataFrame([1:30])

# add columns, using random values
for col in ALL
    global jtab
    if col in agg_clause
        add_string_column(jtab, col)
    else
        add_column(jtab, col)
    end
end

# create random aggregates of this dataframe and determine the time
for i = 1:length(NAMED_TUPLE)
    nt = make_named_tuple(i)
    res = aggregate1(jtab, agg_clause, nt, true)
end
nothing

Test result, second run:

time: 0.011
time: 0.0
time: 0.0
time: 0.051
time: 0.0
time: 0.0
time: 0.0
time: 0.0
time: 0.0
time: 0.0
time: 0.0
time: 0.05
time: 0.053
time: 0.001
time: 0.062
time: 0.051
time: 0.055
time: 0.058
time: 0.049
time: 0.049
time: 0.058
time: 0.049
time: 0.05

Now it is sometimes recompiling and sometimes not. Any idea how to fix this?

Not sure if it’s recompiling or GCing.

There’s way too much code for me to consider looking into it. Can you maybe describe succinctly what you are trying to do and give an MWE?

Maybe something like show how your input looks like and how you want your output?

Well, I construct a dataframe with colums of type String and type Float64. Than I do a group_by operation on the three string colums, aggregating the Float64 numbers with either sum or maximum.

That’s it.

The tricky part is that the columns selected for the group_by operation is random. And this is sometimes causing a recompilation.

So your function selects at random one, two, or three String columns as the group-by column and compute either sum or max chosen at random as well?

Not exactly. The named tuples, used for the aggregation look like this:

julia> include("local_tests/test_aggregate2.jl")
(CAT_CMI_C_T = :CAT_CMI_C => maximum,)
time: 0.011
(CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, GVZ_WXK = :GVZ_WXK => sum)
time: 0.0
(CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum)
time: 0.0
(CAT_MOUSE_C = :CAT_MOUSE_C => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, GVZ_WXK = :GVZ_WXK => sum)
time: 0.046
(CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum)
time: 0.0
(CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, CAT_WEIGHT = :CAT_WEIGHT => sum, GVZ_WXK = :GVZ_WXK => sum)
time: 0.0
(CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, CAT_CMI_C = :CAT_CMI_C => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, WW_OP = :WW_OP => maximum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum)
time: 0.048
(WW_OP = :WW_OP => maximum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum)
time: 0.0
(CAT_MOUSE = :CAT_MOUSE => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, GVZ_WXK = :GVZ_WXK => sum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum)
time: 0.0
(CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, CAT_WEIGHT = :CAT_WEIGHT => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum)
time: 0.0
(CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, GVZ_WXK = :GVZ_WXK => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_MOUSE = :CAT_MOUSE => sum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum)
time: 0.0
(CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, GVZ_WXK = :GVZ_WXK => sum, CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, CAT_WEIGHT = :CAT_WEIGHT => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_CMI_C = :CAT_CMI_C => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_MOUSE = :CAT_MOUSE => sum, WW_OP = :WW_OP => maximum)
time: 0.0
(AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, WW_OP = :WW_OP => maximum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, HWR_KT = :HWR_KT => maximum, GVZ_WXK = :GVZ_WXK => sum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum)
time: 0.0
(CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_CMI_C = :CAT_CMI_C => sum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, GVZ_WXK = :GVZ_WXK => sum)
time: 0.046
(CAT_MOUSE = :CAT_MOUSE => sum, GVZ_WXK = :GVZ_WXK => sum, WW_OP = :WW_OP => maximum, CAT_WEIGHT = :CAT_WEIGHT => sum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, HWR_KT = :HWR_KT => maximum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_MOUSE_C = :CAT_MOUSE_C => sum)
time: 0.001
(CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_CMI_C = :CAT_CMI_C => sum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, WW_OP = :WW_OP => maximum, CAT_MOUSE = :CAT_MOUSE => sum, CAT_MOUSE_T = :CAT_MOUSE => maximum, GVZ_WXK = :GVZ_WXK => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum)
time: 0.0
(CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_CMI_C = :CAT_CMI_C => sum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, HWR_KT = :HWR_KT => maximum, CAT_WEIGHT = :CAT_WEIGHT => sum, GVZ_WXK = :GVZ_WXK => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum)
time: 0.046
(CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, GVZ_WXK = :GVZ_WXK => sum, CAT_WEIGHT = :CAT_WEIGHT => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, HWR_KT = :HWR_KT => maximum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_CMI_C = :CAT_CMI_C => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_MOUSE = :CAT_MOUSE => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_MOUSE_T = :CAT_MOUSE => maximum)
time: 0.049
(CAT_WEIGHT = :CAT_WEIGHT => sum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, CAT_CMI_C = :CAT_CMI_C => sum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, WW_OP = :WW_OP => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, GVZ_WXK = :GVZ_WXK => sum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, HWR_KT = :HWR_KT => maximum, CAT_MOUSE = :CAT_MOUSE => sum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum)
time: 0.048
(HWR_KT = :HWR_KT => maximum, CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, CAT_WEIGHT = :CAT_WEIGHT => sum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, GVZ_WXK = :GVZ_WXK => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, CAT_MOUSE = :CAT_MOUSE => sum, CAT_CMI_C = :CAT_CMI_C => sum, WW_OP = :WW_OP => maximum, CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum)
time: 0.0
(CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, WW_OP = :WW_OP => maximum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, HWR_KT = :HWR_KT => maximum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_CMI_C = :CAT_CMI_C => sum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, GVZ_WXK = :GVZ_WXK => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_WEIGHT = :CAT_WEIGHT => sum)
time: 0.048
(HWR_KT = :HWR_KT => maximum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, CAT_CMI_C = :CAT_CMI_C => sum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_WEIGHT = :CAT_WEIGHT => sum, GVZ_WXK = :GVZ_WXK => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, WW_OP = :WW_OP => maximum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum)
time: 0.0
(HWR_KT = :HWR_KT => maximum, AMS_CMI_HRV = :AMS_CMI_HRV => sum, CAT_WEIGHT_C = :CAT_WEIGHT_C => sum, AMS_CMI_TOT = :AMS_CMI_TOT => sum, CAT_FREIG_WGT_CNT_T = :CAT_FREIG_WGT_CNT => maximum, CAT_MOUSE_C = :CAT_MOUSE_C => sum, CAT_FREIG_WGT_CNT = :CAT_FREIG_WGT_CNT => sum, CAT_CMI_C = :CAT_CMI_C => sum, CAT_UTRECHT_CNT_T = :CAT_UTRECHT_CNT => maximum, CAT_UTRECHT_CNT_C = :CAT_UTRECHT_CNT_C => sum, WW_OP = :WW_OP => maximum, CAT_FREIG_WGT_CNT_C_T = :CAT_FREIG_WGT_CNT_C => maximum, CAT_UTRECHT_CNT = :CAT_UTRECHT_CNT => sum, AMS_CMI_DIFF = :AMS_CMI_DIFF => sum, CAT_FREIG_WGT_CNT_C = :CAT_FREIG_WGT_CNT_C => sum, CAT_WEIGHT_C_T = :CAT_WEIGHT_C => maximum, CAT_MOUSE = :CAT_MOUSE => sum, CAT_CMI_C_T = :CAT_CMI_C => maximum, CAT_MOUSE_C_T = :CAT_MOUSE_C => maximum, CAT_WEIGHT = :CAT_WEIGHT => sum, GVZ_WXK = :GVZ_WXK => sum, CAT_MOUSE_T = :CAT_MOUSE => maximum, CAT_UTRECHT_DIS_CNT = :CAT_UTRECHT_DIS_CNT => sum)
time: 0.0

This is artificial data, in reality this are user queries, but what the user queries is kind of random. And this is really recompilation that is sometimes triggered. I turned of the GC to be sure. And the time needed becomes shorter if I use Julia -O0, this is also a hint that the time is needed for recompilation.

You seem to be doing lots of group-by the same set of variables so perhap you can try to group them just once. But it sounds like your group-by variabels can change as well.

Suppose it’s the former, perhaps you can to group by once and then do all those operations to speed things up e.g. (probably not helping)

jtab_grouped = groupby(jtab, agg_clause)

now each [i] is a subdataframe

jtab_grouped[1]

and you can use map(fn, grouped_data_frame) to apply the same function to each sub_data_frame

Another small thing is that

function add_string_column!(table, col::Symbol) 
  table[!, col] = [("HUND", "HASE", "IGEL")[(i % 3)+1] for i in 1:nrow(table)]
end

is more succint.

If u mutate the content then you should add ! to your function name

I feel that this is more what you want

function add_string_column!(table, col::Symbol)
  table[!, col] = rand(["HUND", "HASE", "IGEL"], nrow(table))
end

I am trying to use groupby and map instead of by. But I did not find out how to do that correctly (yet). My current code:

function aggregate2(table, agg_clause, @nospecialize(named_tuple), prn=false)
    x = []
    for (a,(b, c)) in pairs(named_tuple)
        push!(x, [a, b, c])
    end
    
    table_grouped = groupby(table, agg_clause)
    r = DataFrame()
    for sub_df in table_grouped
        # println(sub_df)
        for (a,b,c) in x
            map(b => c, table_grouped)
        end
    end
    return r
end

But how can I combine the results of the map operation to a resulting dataframe?

Created an issue: https://github.com/JuliaData/DataFrames.jl/issues/1988