DataFrames operation scales badly


#21

You only have a single version at a given time in a given environment (but you could have multiple environments if you want). Anyway, switching between them is very fast and easy.

Yes, but at that point you’ll have reimplemented by/groupby, so I’m not sure it makes sense. @piever is right that in theory using CategoricalArrays should do exactly that, but unfortunately we don’t currently implement optimized methods for them. EDIT: this PR already improves things a lot.


#22

EDIT: The timings posted originally were largely driven by deprecation warnings on the old syntax, so I’ve re-run everything with --depwarn=no. Updated timings below:

The picture has changed quite dramatically with the release of DataFrames 0.15:

First grouping by service_code:

# Old do-block syntax: builds a one-row DataFrame of sums for each group.
function group_by_level(rc::DataFrame; level = :service_code)
    by(rc, [level, :org_code]) do df
        DataFrame(activity = sum(df.activity),
                  actual_cost = sum(df.actual_cost))
    end
end

julia> @btime group_by_level(rc);
  385.627 ms (1037981 allocations: 199.55 MiB)

julia> @btime by(rc, [:org_code, :service_code], activity = :activity => sum,
                 mffd_actual_cost = :mffd_actual_cost => sum);
  221.555 ms (497680 allocations: 87.92 MiB)   # 57% of runtime of the old syntax

Second, grouping by currency code:

julia> @btime group_by_level(rc, level = :currency_code);
  4.141 s (22915651 allocations: 1.51 GiB) # c.11-fold increase vs service level groupby

julia> @btime by(rc, [:org_code, :currency_code], activity = :activity => sum,
                 mffd_actual_cost = :mffd_actual_cost => sum);
  1.484 s (10998813 allocations: 394.77 MiB)  # c.6-fold increase vs service level groupby

So the old syntax has gotten faster compared to my post from Oct 12 (note that those timings were run on 0.7 while these are run on 1.0.1, so changes in base Julia might also play a part), but more importantly the new syntax, while being a lot easier on the eye, provides a significant speedup.

Most importantly for me, the “scaling” of the operation with the number of groups seems to work a lot better now: going from 13k to 300k groups leads to a 6- to 7-fold increase in the groupby timing with the new syntax (221.555 ms to 1.484 s), compared to the c.11-fold increase with the old syntax.
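For anyone who wants to check the ratios quoted above, here is a small sketch that recomputes them from the @btime numbers in this post. It only uses base Julia; the variable names are mine, and the group counts are the rounded 13k and 300k figures mentioned above rather than exact counts.

```julia
# Timings copied from the @btime output above, converted to milliseconds.
old_service  = 385.627   # do-block syntax, grouped by service_code
new_service  = 221.555   # keyword syntax, grouped by service_code
old_currency = 4141.0    # do-block syntax, grouped by currency_code (4.141 s)
new_currency = 1484.0    # keyword syntax, grouped by currency_code (1.484 s)

# Rounded group counts quoted in the post (assumed, not exact).
groups_service, groups_currency = 13_000, 300_000

new_vs_old     = new_service / old_service        # share of old runtime
old_slowdown   = old_currency / old_service       # old syntax, more groups
new_slowdown   = new_currency / new_service       # new syntax, more groups
group_increase = groups_currency / groups_service # growth in group count

println("new/old runtime at service level: ", round(new_vs_old; digits = 2))
println("old syntax slowdown:              ", round(old_slowdown; digits = 2))
println("new syntax slowdown:              ", round(new_slowdown; digits = 2))
println("increase in number of groups:     ", round(group_increase; digits = 2))
```

On these numbers both slowdowns are well below the increase in the number of groups, so on this dataset the runtime grows more slowly than the group count for either syntax.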