DataFrames operation scales badly

I’m finally making a proper effort to replace my pandas workflows with DataFrames, but currently hitting a massive performance issue which makes me think I’m missing something fundamental.

The following snippet of code calculates some statistics of a DataFrame which is grouped by two columns (organisation and either service level or subservice level). There are c. 200 services and 3,000 subservices in the data set.

When running it at service level, the code takes 0.6 seconds to run, slightly faster than an equivalent pandas code. When moving to the more granular subservice level though, runtime goes up by a factor of almost 100 (code takes 58 seconds), while in pandas the increase is only a factor of 4 (0.8 to 3.5 seconds).

Any ideas why this code scales so badly to a larger number of groups? Apologies that this isn’t really an MWE, I’m hoping someone spots something that’s obviously off - although I could make the data and full code available if that’s helpful.

using DataFrames, Statistics

function opportunity2(rc::DataFrame; level=:service_code)
    
    # Group dataframe by level
    gr = by(rc, [level, :org_code]) do df
              DataFrame(activity = sum(df.activity), 
                        actual_cost = sum(df.actual_cost))
        end

    # Merge on group-level average for activity
    gr = join(gr, by(gr, level) do df
        DataFrame(mean_activity = mean(skipmissing(df.activity)))
            end, on = level);
    
    # Calculate unit cost at aggregation level
    gr[:unit_cost] = gr.actual_cost ./ gr.activity
                
    # Calculate opportunities 
    l = size(gr, 1)
    gr = [gr DataFrame(uc_mean=rand(l), uc_median=rand(l), uc_lq=rand(l), uc_ld=rand(l))]
    
    for s in Set(gr[level])
        cond = (gr[level].==s) 
        gr[cond, :uc_mean] = mean(skipmissing(gr[cond, :].unit_cost))
        gr[cond, :uc_median] = median(skipmissing(gr[cond, :].unit_cost))
        gr[cond, :uc_lq] = quantile(skipmissing(gr[cond, :].unit_cost), 0.25)
        gr[cond, :uc_ld] = quantile(skipmissing(gr[cond, :].unit_cost), 0.1)
    end
    
    gr[:cost_opp_lq] = gr.activity .* max.(gr.unit_cost .- gr.uc_lq, 0.0)
    gr[:cost_opp_ld] = gr.activity .* max.(gr.unit_cost .- gr.uc_ld, 0.0)
    
    return gr
end

I don’t have an idea yet, but my first thought was: why not take those comments as function names, put the code below them into the function bodies, and time them to find the bottlenecks? :wink:
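
Something like this quick sketch, for instance (using BenchmarkTools; the helper name is made up and the service-level column name is taken from the code above):

using BenchmarkTools, DataFrames

# hypothetical wrapper around just the grouping step
group_step(rc, level) = by(rc, [level, :org_code]) do df
    DataFrame(activity = sum(df.activity), actual_cost = sum(df.actual_cost))
end

@btime group_step($rc, :service_code)   # then repeat with the subservice-level column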


That’s actually a good point - I was going to try some profiling in Juno, but wrapping individual steps in functions might make it easier!

DataFrames group-by operations are not optimised. See Group-by performance benchmarks and recommendations for some background.

Thanks, will have a read!

I have followed @tamasgal’s suggestion and benchmarked the different steps separately. Results:

# Group dataframe by level
403.786 ms --> 7.872 s  (~20x)

# Merge on group-level average for activity
  4.607 ms --> 146.446 ms  (~30x)

# Calculate unit cost at aggregation level
  572.308 µs --> 17.174 ms (~30x)

# Calculate opportunities
  135.559 ms --> 44.528 s (~328x)

# Total opportunity (the bit after the loop in code above)
  1.409 ms --> 46.898 ms (~33x) 

So it appears there are two things:

Firstly, the plain groupby operation takes 20 times longer, which is in line with the other parts of the code, but that increase alone (+7s here) is already larger than the total increase in the pandas code (where total time only goes up by c. 2.5 seconds).

Secondly, the loop runtime completely explodes - and I’m not sure why. In pandas I’m doing this step with a groupby operation, but I can’t do the same in DataFrames due to the way missings are handled (in my real code I end up with subgroups which don’t contain any data, causing mean/median/quantile to fail, so I’m testing for that in the loop).

Any ideas why the loop scales so badly?

The gr[cond, :] indexing inside the loop allocates a new dataframe every time. Consider

    for s in Set(gr[level])
        cond = (gr[level].==s)
        subframe = view(gr, cond, :) 
        uc = collect(skipmissing(subframe.unit_cost))
        subframe.uc_mean .= mean(uc)
        subframe.uc_median .= median(uc)
        subframe.uc_lq .= quantile(uc, 0.25)
        subframe.uc_ld .= quantile(uc, 0.1)
    end

Thanks, that’s super helpful - cut down execution time of the loop by roughly 60%!

Given that I need the results to be saved in gr down the line, I’ve amended the lines in your loop from:

subframe.uc_mean .= mean(uc)

to

gr[cond, :uc_mean] .= mean(uc)

as I’m assuming the problem with the allocation happens on the right hand side rather than the left?

Well, meanuc = mean(uc); subframe.uc_mean .= meanuc will iterate over the subframe, i.e. length(findall(cond)) iterations, while gr[cond, :uc_mean] .= meanuc will iterate over the entire frame, i.e. length(gr.uc_mean) iterations. In other words, there is no reason to repeat the work spent on findall(cond).

I’m really just counting how often you run over your data (presumably you are bound by memory bandwidth because your data doesn’t fit into cache, your computations are trivial and CPUs are damn good at hiding memory latency), and proposing to reuse / fuse things, whenever there is a one-line way of doing so.

Hmm, that makes sense (and I can save a further c. 20% with the first approach), but ultimately I will need to write the results to gr rather than to subframe, no? I might be missing something, but in your original proposal the mean/median etc. values written to subframe are not available outside the loop?

Writing to a view mutates the original object. The assignment gr[cond, :uc_mean] .= mean_uc is effectively constructing a view to write to. Constructing a view is not free when using logical indexing (it needs to iterate over cond), so it’s better to reuse it.
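
As a tiny sketch of the difference (same names as in the loop above; mean_uc stands for the precomputed mean):

cond = gr[level] .== s          # one full scan of the column
sub  = view(gr, cond, :)        # logical indexing resolved once, here
sub.uc_mean .= mean_uc          # writes land directly in gr, no further scan of cond
gr[cond, :uc_mean] .= mean_uc   # whereas this line re-scans cond all over again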

You can also play with quantile! and median!: first, this avoids temporary copies of uc, and second, it is silly to do three partial sorts on the same uc instead of exploiting the fact that the last call already performed a partial sort. But I’m not sure which partial sorts are run internally for which invocation, and partial quicksort can degenerate into a worse complexity class for inputs that are partially sorted in just the wrong way.
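
E.g. a rough sketch (note that quantile! reorders uc in place):

uc = collect(skipmissing(subframe.unit_cost))   # one materialised copy per group
qs = quantile!(uc, [0.1, 0.25, 0.5])            # in-place partial sorts, no extra copy of uc
subframe.uc_ld     .= qs[1]
subframe.uc_lq     .= qs[2]
subframe.uc_median .= qs[3]
subframe.uc_mean   .= mean(uc)                  # mean doesn't care about the reordering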

Out of curiosity, how well / poorly does the performance critical bit fare in IndexedTables?

    gr = by(rc, [level, :org_code]) do df
              DataFrame(activity = sum(df.activity), 
                        actual_cost = sum(df.actual_cost))
        end

would correspond to:

using IndexedTables
gr = summarize(sum, rc, (level, :org_code), select = (:activity, :actual_cost))

And you can translate from DataFrames to IndexedTables with:

using DataFrames, IndexedTables, IterableTables
t = table(df)

Gives me a MethodError in the groupby function that summarize calls:

MethodError: no method matching groupby(::ApplyColwise{typeof(sum)}, ::DataFrame, ::Tuple{Symbol,Symbol}; select=(:activity, :actual_cost), flatten=false)

Not 100% sure what’s going on here - rc is just a plain DataFrame read in through CSV.read

Sorry, I was a bit unclear, you need to give it the converted version already:

using DataFrames, IndexedTables, IterableTables
t = table(rc)
gr = summarize(sum, t, (level, :org_code), select = (:activity, :actual_cost))

EDIT: just to clarify, IndexedTables is a distinct framework from DataFrames (though things can be converted easily from one to the other). I’m not necessarily recommending that you use it instead, but given that it uses a very different algorithm from DataFrames for grouping, I was curious to see some performance comparison in real life use.

Seems to be doing worse - just the summarize vs by is c. 1.5 seconds vs. 0.5 seconds, but for some reason the conversion via table takes extremely long, so any speed difference would be immaterial (although I suppose I could work with IndexedTables throughout so the conversion cost wouldn’t matter).

So, to clarify, my takeaway from this thread is that there is currently no fast way of doing large-ish groupby operations with a few thousand groups in DataFrames?

If you want to be extra thorough, you could always check out PR #1520 by @nalimilan. It provides a lot of performance improvements for this kind of operation, and is pretty far along, I think.


See: https://github.com/JuliaData/DataFrames.jl/pull/1520 and https://github.com/JuliaData/DataFramesMeta.jl/pull/101 for relevant PRs.


You could also try to use Query.jl for this. I’m not sure it would do well on this kind of problem, but who knows, would be great if you could run this code on your dataset and report back :slight_smile:

For the first group step, the simplest code would look like this:

gr = rc |>
    @groupby({_.subservice, _.org_code}) |>
    @map({key(_).subservice, key(_).org_code, activity = sum(_.activity), actual_cost = sum(_.actual_cost)}) |>
    DataFrame

A more efficient version would only put the two columns that you actually need into the groups:

gr = rc |>
    @groupby({_.subservice, _.org_code}, {_.activity, _.actual_cost}) |>
    @map({key(_).subservice, key(_).org_code, activity = sum(_.activity), actual_cost = sum(_.actual_cost)}) |>
    DataFrame

Note that you’ll have to hard-code whether you want to group by the subservice or the service level column.


Can you try with my nl/grouping2 branch (the one @bkamins mentioned)? ]add DataFrames#nl/grouping2 should do the trick. Then in code like DataFrame(activity = sum(df.activity), actual_cost = sum(df.actual_cost)) you can just drop the DataFrame part to return a NamedTuple, which is much faster to create. That PR currently triggers a crash due to a bug in Julia, but since that’s a corner case maybe we can merge it anyway.
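
That is, roughly:

gr = by(rc, [level, :org_code]) do df
    (activity = sum(df.activity), actual_cost = sum(df.actual_cost))
end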

But that won’t help with the majority of the time, which is spent in the loop (almost completely outside of DataFrames code, using vector operations). It’s too bad that you have to write this manually. Can’t you use by instead, and rather than calling quantile directly, use an anonymous function wrapper which returns NaN when all(ismissing, x)? Also, you can use quantile(x, [0.1, 0.25, 0.5]) instead of doing the work three times.
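
For example, something along these lines might work (an untested sketch, reusing the column names from the code above):

# hypothetical guard: fall back to NaN for groups where unit_cost is entirely missing
function safe_stats(x)
    all(ismissing, x) && return DataFrame(uc_ld = NaN, uc_lq = NaN, uc_median = NaN, uc_mean = NaN)
    v = collect(skipmissing(x))
    q = quantile(v, [0.1, 0.25, 0.5])
    DataFrame(uc_ld = q[1], uc_lq = q[2], uc_median = q[3], uc_mean = mean(v))
end

gr = join(gr, by(gr, level) do df
    safe_stats(df.unit_cost)
end, on = level)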

You haven’t mentioned what the type of the grouping column is. If it’s a string, doing gr[level] .== s for each group is going to be really slow. If it’s an integer, it should be less of a problem. DataFrames’ by uses a dict for this, which avoids going over the full vector for each group.

The fact that median and quantile throw an error in this case is annoying. We should do something about it, maybe just by adding an argument specifying which value should be returned when it’s empty. The problem has been discussed in this issue.


Thanks all, that’s a ton of helpful additional things to try out, which I’ll definitely do and report back - I’m really committed to finally ditching pandas, which is pretty much the only thing that ties me to Python at this stage, so I’ll try what I can to get this to an acceptable level of performance.

Just one question - how does the workflow with different versions of the same package work in practice? Is the tagged version of DataFrames removed when I move to the grouping branch? Or do I then have different versions installed simultaneously and somehow need to switch between them?

Also @nalimilan, I’m indeed grouping on strings, so I suppose one way to improve things here would be to create a lookup table, replace the strings with Ints for the groupby, and then map them back at some point later.
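
(Rough sketch of what I have in mind, with a made-up :level_code column name:)

codes = Dict(s => i for (i, s) in enumerate(unique(rc[level])))   # string -> Int lookup
rc[:level_code] = [codes[s] for s in rc[level]]                   # group on this column instead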

If you have a limited set of strings that are repeated very often, you could probably use CategoricalArrays which gives you an automated way of doing what you suggest (you don’t need to manually perform any replacement).
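
Something like this (sketch):

using CategoricalArrays
rc[level] = categorical(rc[level])   # strings are pooled; grouping then works on integer codes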
