DataFrames operation scales badly

nalimilan · October 13, 2018, 7:51pm

You only have a single version at a given time in a given environment (but you could have multiple environments if you want). Anyway, switching between them is very fast and easy.

Yes, but at that point you’ll have reimplemented by/groupby, so I’m not sure it makes sense. @piever is right that in theory using CategoricalArrays should do exactly that, but unfortunately we don’t currently implement optimized methods for them. EDIT: this PR already improves things a lot.

nilshg · December 10, 2018, 8:30pm

EDIT: The timings posted originally were largely driven by deprecation warnings on the old syntax, so I’ve re-run everything with --depwarn=no. Updated timings below:

The picture has changed quite dramatically with the release of DataFrames 0.15:

First grouping by service_code:

function group_by_level(rc::DataFrame; level = :service_code)
    # Old do-block syntax: build a one-row DataFrame of sums for each group
    by(rc, [level, :org_code]) do df
        DataFrame(activity = sum(df.activity),
                  actual_cost = sum(df.actual_cost))
    end
end

julia> @btime group_by_level(rc);
  385.627 ms (1037981 allocations: 199.55 MiB)

julia> @btime by(rc, [:org_code, :service_code], activity = :activity => sum,
                 mffd_actual_cost = :mffd_actual_cost => sum);
  221.555 ms (497680 allocations: 87.92 MiB)   # 57% of runtime of the old syntax

Second, grouping by currency code:

julia> @btime group_by_level(rc, level = :currency_code);
  4.141 s (22915651 allocations: 1.51 GiB)  # c. 11-fold increase vs service level groupby

julia> @btime by(rc, [:org_code, :currency_code], activity = :activity => sum,
                 mffd_actual_cost = :mffd_actual_cost => sum);
  1.484 s (10998813 allocations: 394.77 MiB)  # c. 6-fold increase vs service level groupby

So the old syntax has gotten faster compared to my post from Oct 12 (note that those timings were run on Julia 0.7 while these are run on 1.0.1, so there might be changes in base Julia), but more importantly the new syntax, while being a lot easier on the eye, provides a significant speedup.
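For anyone reading this on a later DataFrames release, where `by` is no longer available: the same aggregation can be written with `groupby` plus `combine`. This is only a minimal sketch on a made-up four-row table; `rc` here is a toy stand-in for the real data, which is not shown in this thread, and the column values are invented for illustration.

```julia
using DataFrames

# Toy stand-in for `rc` (hypothetical values; the real table is much larger)
rc = DataFrame(org_code     = ["A", "A", "B", "B"],
               service_code = ["s1", "s1", "s1", "s2"],
               activity     = [1.0, 2.0, 3.0, 4.0],
               actual_cost  = [10.0, 20.0, 30.0, 40.0])

# source column => function => output column, one pair chain per aggregate
res = combine(groupby(rc, [:org_code, :service_code]),
              :activity    => sum => :activity,
              :actual_cost => sum => :actual_cost)

# Sort so the row order does not depend on the grouping implementation
res = sort(res, [:org_code, :service_code])
```

The pair form passes whole columns to `sum` rather than building a small DataFrame per group, which is the same idea as the keyword form timed above.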

Most importantly for me, the “scaling” of the operation with the number of groups seems to work a lot better now, with an increase in groups from 13k to 300k leading to a 6-fold increase in the groupby timing with the new syntax, compared to an 11-fold increase with the old one.
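The fold increases quoted here can be re-derived from the `@btime` figures above. A quick sketch of the arithmetic, with the timings copied by hand into seconds and the group counts taken as the rounded 13k and 300k mentioned in the post (the variable names are just for illustration):

```julia
# Timings copied from the @btime output above, converted to seconds
old_service,  new_service  = 0.385627, 0.221555   # grouping by service_code
old_currency, new_currency = 4.141,    1.484      # grouping by currency_code

# How much slower each syntax gets when moving to the larger number of groups
old_scaling = old_currency / old_service
new_scaling = new_currency / new_service

# New syntax runtime as a fraction of the old syntax runtime
service_fraction  = new_service / old_service
currency_fraction = new_currency / old_currency

# Approximate growth in the number of groups (13k to 300k)
groups_ratio = 300_000 / 13_000

println("old syntax scaling: ", round(old_scaling, digits = 1))
println("new syntax scaling: ", round(new_scaling, digits = 1))
println("new/old, service:   ", round(service_fraction, digits = 2))
println("new/old, currency:  ", round(currency_fraction, digits = 2))
println("groups ratio:       ", round(groups_ratio, digits = 1))
```

The ratios come out at roughly 10.7 and 6.7 for a roughly 23-fold increase in groups, so both syntaxes scale sublinearly in the group count on this data, and the new syntax takes a bit over a third of the old runtime at the currency level.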
