I’m finally making a proper effort to replace my pandas workflows with DataFrames.jl, but I’m currently hitting a massive performance issue which makes me think I’m missing something fundamental.
The following snippet of code calculates some statistics on a DataFrame grouped by two columns: organisation, and either service or subservice. There are roughly 200 services and 3,000 subservices in the data set.
When run at service level, the code takes 0.6 seconds, slightly faster than the equivalent pandas code. When moving to the more granular subservice level, though, the runtime goes up by a factor of almost 100 (to 58 seconds), while in pandas the increase is only a factor of about 4 (from 0.8 to 3.5 seconds).
Any ideas why this code scales so badly to a larger number of groups? Apologies that this isn’t really an MWE; I’m hoping someone spots something that’s obviously off, although I can make the data and full code available if that’s helpful.
using DataFrames, Statistics

function opportunity2(df::DataFrame; level = :service_code)
    # Aggregate activity and cost by level and organisation
    gr = by(df, [level, :org_code]) do sdf
        DataFrame(activity = sum(sdf.activity),
                  actual_cost = sum(sdf.actual_cost))
    end
    # Merge on group-level average for activity
    gr = join(gr, by(gr, level) do sdf
        DataFrame(mean_activity = mean(skipmissing(sdf.activity)))
    end, on = level)
    # Calculate unit cost at aggregation level
    gr[:unit_cost] = gr.actual_cost ./ gr.activity
    # Add placeholder columns for the per-level unit-cost statistics
    l = size(gr, 1)
    gr = [gr DataFrame(uc_mean = rand(l), uc_median = rand(l), uc_lq = rand(l), uc_ld = rand(l))]
    # Fill the statistics one level group at a time
    for s in Set(gr[level])
        cond = (gr[level] .== s)
        gr[cond, :uc_mean] = mean(skipmissing(gr[cond, :].unit_cost))
        gr[cond, :uc_median] = median(skipmissing(gr[cond, :].unit_cost))
        gr[cond, :uc_lq] = quantile(skipmissing(gr[cond, :].unit_cost), 0.25)
        gr[cond, :uc_ld] = quantile(skipmissing(gr[cond, :].unit_cost), 0.1)
    end
    # Calculate cost opportunities relative to the lower quartile and lower decile
    gr[:cost_opp_lq] = gr.activity .* max.(gr.unit_cost .- gr.uc_lq, 0.0)
    gr[:cost_opp_ld] = gr.activity .* max.(gr.unit_cost .- gr.uc_ld, 0.0)
    return gr
end
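
For reference, the loop at the end re-filters the whole DataFrame once for every group, so the work grows roughly with the number of groups times the number of rows, which is where the jump from 200 to 3,000 groups could be hurting. Below is an untested sketch of computing the same per-level unit-cost statistics in a single grouped pass plus a join, using the same (older) by/join API as above; the helper name group_stats is just for illustration and not part of the original code.

using DataFrames, Statistics

# Untested sketch: compute the per-level statistics once per group
# instead of filtering the full table for every group.
function group_stats(gr::DataFrame, level::Symbol)
    stats = by(gr, level) do sdf
        uc = collect(skipmissing(sdf.unit_cost))
        DataFrame(uc_mean   = mean(uc),
                  uc_median = median(uc),
                  uc_lq     = quantile(uc, 0.25),
                  uc_ld     = quantile(uc, 0.1))
    end
    # Attach the statistics back onto every row of the original table
    return join(gr, stats, on = level)
end

Something like this would replace the placeholder columns and the for loop in opportunity2, while the cost opportunity columns at the end stay the same.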