Various by-group strategies compared

xiaodai · October 21, 2017, 1:29pm

Update 12th Nov 2017
2nd Update Implemented a multi-threaded that is almost 2x faster than data.table!

Implemented a new radixgroup that is even faster than the previous method by 18% on 250 million element arrays.

Update 29th Oct 2017

I managed to repurpose Radixsort to make Julia’s groupby performance be on par with data.table and can beat it in some cases!

https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell

Original post

I am preparing to release some new group by benchmarks of Julia vs R data.table group by performance. I am now investigating a number of strategies to compute the mean of a value by group.

I have contained my MWE in the gist below. I basically came up with 4 approaches, please do suggest improvements to the code as I am new to Julia.

I am trying to find the mean value of v1 by distinct values of id4. Normally v1 and id4 would be part a dataframe or table of some sort, but for now I have included them as vectors.

Update (2017-10-22)
Thanks to @nalimilan’s helpful tips and comments I managed to get a meany_best approach = Approach

Approach 1 (`meany_best`) - zip with Dict

In this approach I use one Dicts that maps to the values and the count and use Base internal ht_keyindex to access values in a Dict quickly. This approach is general and works for any data type so it superior to Approach 4. The only approach that still beats this is Approach 3 which requires large amounts of prior knowledge.

timing ~10 seconds

Approach 2 (`meany2`) - enumerate with Dict

Similar to approach 1 instead of zipping the two I simply enumerate on one of them.

timing ~60 seconds

Approach 3 (`meany3`) - what if I know the number unique values in id4 beforehand?

I can just use two arrays res and wt, enumerate through id4 and and keep adding to it and then summarise to get the mean

timing ~0.865 seconds

Approach 4 (`meany4`) - what if I work out the unique values in id4 first?

I use countmap to get the unique values in id4 and then follow approach 3

timing ~18 seconds

some comments & questions

Clearly approach 3 only works in limited situations, but approach 4 is more generalisable. I understand approach 1 & 2 to be slow because there were lots of mem allocation there? How can I reduce the number of allocations?

Are there even faster methods I have missed? Multithreading?

https://gist.github.com/xiaodaigh/11e748609edc860a2b3896b27c7f5a7b#file-mwe

nalimilan · October 21, 2017, 1:47pm

You should type your Dicts to reduce (dramatically) the number of allocations. Then you can add @inbounds around the loops.

It’s worth noting that if you have CategoricalArray data, then you’re in the meany3 case even with originally string data, since you can work directly with the integer codes.

ChrisRackauckas · October 21, 2017, 2:56pm

Dicts are also just a bad choice for things which have integer ids. There are much more common structures for doing A[1] = 1.0 that are much faster. Is there a reason why those aren’t just Array{Float64}(N) or sparse arrays?

nalimilan · October 21, 2017, 7:35pm

That doesn’t work if you don’t know in advance the values you can get. For example, if there are only two different values but one of them is typemax(Int64), your attempt to allocate a vector will fail badly. Ideally the data will be stored in a CategoricalArray or PooledArray, and you can use the consecutive integer codes which have been computed when constructing the vector (using a Dict, but only once for all).

xiaodai · October 21, 2017, 10:08pm

I am trying to add Dict types to the function meany. However the elapsed time on my computer is still 30 seconds which still slower than meany4. I assume this Dict approach will never be as fast? Which is perplexing as I looked into the source of countmap which is using Dict behind the scenes to do the counting, so am wondering why my code is much slower?

function meany(id4::Vector{T2} where T2 <: Integer, v1::Vector{T} where T <: Real)
  T2 = typeof(id4[1])
  T = typeof(v1[1])
  res = Dict{T2, T}()
  wt = Dict{T2, Int64}()
  for iv in zip(id4,v1)
    @inbounds res[iv[1]] = get(res,iv[1], 0) + iv[2]
    @inbounds wt[iv[1]] = get(wt,iv[1], 0) + 1
  end
  return Dict(k => res[k]/wt[k] for k in keys(res))
end

nalimilan · October 21, 2017, 10:48pm

One problem is that things like res[iv[1]] = get(res,iv[1], 0) + iv[2] are unnecessarily slow because the dict lookup is performed once by get and once by setindex!. Unfortunately, even if that’s under discussion, there’s no official API to avoid this for now. You can use internal functions to work around it if you want, see what FreqTables does.

Since you have two dicts with the same keys, it would also help to store (count, sum) tuples inside a single dict to avoid another lookup. Overall you could get an almost 4× faster code.

xiaodai · October 21, 2017, 11:20pm

@nalimilan Thanks for the awesome tips!! This is my code now.

import Base.ht_keyindex

const N = Int64(2e9/8);
const K = 100;
srand(1);
@time const id4 = rand(1:K, N);# large groups (int)
@time const v1 =  rand(1:5, N); # int in range [1,5]

timings = Dict();
function meany_best{S,T}(id4::Vector{T}, v1::Vector{S})::Dict{T,Real}
  res = Dict{T, Tuple{S, Int64}}()
  szero = zero(S)
  for (id, val) in zip(id4,v1)
    index = ht_keyindex(res, id)
    if index > 0
      @inbounds vw = res.vals[index]
      new_vw = (vw[1] + val, vw[2] + 1)
      @inbounds res.vals[index] = new_vw
    else
      @inbounds res[id] = (val, 1)
    end

  end
  return Dict(k => res[k][1]/res[k][2] for k in keys(res))
end

@time res = meany_best(rand(1:100,2), rand(1:100,2)) #warmup
timings[:zip_dict] = @elapsed res = meany_best(id4, v1) #timings - compilation time

xiaodai · October 21, 2017, 11:41pm

What @nalimilan said + trying to do a more general algorithm that can deal with all types. As approach 3 shows, under certain conditions, using arrays is 10x faster.

nalimilan · October 22, 2017, 10:38am

How fast is the new algorithm?

FWIW, @andyferris has proposed general APIs for split-apply-combine operations which seem to cover you case. See SAC.jl.

xiaodai · October 22, 2017, 11:02am

I have updated the original post. It’s sub 10 seconds now, which is closer to R’s data.table (7~8 seconds). This is fast enough, altough I would love to see a general algorithm in Julia that can beat data.table.

andyferris · October 22, 2017, 9:43pm

I hope I understood the gist. This is what I get (2.95 seconds):

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0-DEV.2244 (2017-10-20 22:35 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 4253e56* (1 day old master)
|__/                   |  x86_64-linux-gnu

julia> using SAC

julia> const N = Int64(2e9/8);

julia> const K = 100;

julia> srand(1);

julia> @time const id4 = rand(1:K, N);# large groups (int)
  3.613966 seconds (3.51 k allocations: 1.863 GiB, 0.11% gc time)

julia> @time const v1 =  rand(1:5, N); # int in range [1,5]
  4.032845 seconds (8 allocations: 1.863 GiB, 0.11% gc time)

julia> function f()
           tmp = groupreduce(x->x[1], x->(x[2],1), (x,y)->(x[1]+y[1], x[2]+y[2]), zip(id4, v1))
           return Dict{Int,Float64}(kv.first=>(kv.second[1]/kv.second[2]) for kv in tmp)
       end
f (generic function with 1 method)

julia> f();

julia> @time f()
  2.958926 seconds (26 allocations: 16.141 KiB)
Dict{Int64,Float64} with 100 entries:
  68 => 3.00046
  2  => 2.99942
  89 => 3.00203
  11 => 3.0003
  39 => 3.00009
  46 => 2.99894
  85 => 3.00164
  25 => 3.00016
  55 => 2.9998
  42 => 3.00069
  66 => 3.00036
  58 => 2.99921
  29 => 3.00007
  59 => 2.99829
  8  => 3.00209
  74 => 3.00189
  95 => 3.00184
  57 => 2.99979
  90 => 2.99987
  20 => 3.00071
  14 => 2.99942
  31 => 2.99912
  78 => 3.00164
  70 => 3.00076
  52 => 2.99969
  18 => 3.00145
  33 => 3.00011
  69 => 2.99885
  96 => 2.99956
  26 => 2.99968
  35 => 3.00108
  83 => 3.00045
  65 => 2.99963
  64 => 2.99874
  17 => 2.99962
  49 => 3.0001
  84 => 3.00116
  44 => 2.99967
  4  => 2.99946
  37 => 3.00003
  45 => 3.00138
  13 => 2.99993
  86 => 2.99958
  67 => 3.00017
  99 => 3.00074
  93 => 3.00098
  94 => 3.00132
  30 => 2.99852
  1  => 2.99979
  47 => 2.9982
  54 => 3.00061
  50 => 2.99983
  77 => 2.99953
  80 => 3.00003
  40 => 3.00009
  32 => 2.99897
  82 => 2.99933
  91 => 3.0002
  ⋮  => ⋮

It’s worth running this timing on the same machine as the other examples.

Of course there is a goal in SAC to have the reduce functions apply a generic “final” aggregator so that e.g. mean can be done in a single reduction.

xiaodai · October 22, 2017, 10:01pm

This line of code feels like magic. I understsnd first arg is the group by, 2nd is map, third is reduce with accumulator.

I can’t find the groupreduce function in the SAC.jl repo.

andyferris · October 22, 2017, 10:33pm

I just saw there is another SAC.jl for siesmic data! Haha. I was referring to my private, unregistered reopositiry at https://github.com/andyferris/SAC.jl. I’ll probably have to rename it…

xiaodai · October 23, 2017, 1:55am

@andyferris I just tested your code vs FastGroupBy.jl’s meanby function. Given our common interests/goals do you want to collaborate together? We can beat R’s data.table together!

On next.juliabox.com for small data set (1 million rows) meanby is faster and for large dataset (10 million rows) both are about the same (see gist below). But of course SAC is more general.

I generated some test data using my DataBench.jl package

https://gist.github.com/xiaodaigh/6696a94f1b2a4ffd8ade8960f5fd3ac6

andyferris · October 23, 2017, 3:15am

Cool, sure.

I see you are just starting out with FastGroupBy.jl - would you like to contribute to SAC.jl? (We should rename it…). PRs and comments are always very welcome. In this case a groupmean function would fit in well. I haven’t tuned the methods that much for speed, so any improvements there are obviously welcome.

So far, the work has focused on getting the semantics of the functions correct (so that they may be applied to generic data structures like AbstractArray and Associative (and nestings thereof) as well as for tables/dataframes, and that have a possibility of being fast (i.e. no design flaws that would make it impossible to optimize)). There is a bit of a difficulty here where the function signature for a table might naturally include the column name or something, while for generic containers it seems natural to use a function. (I’d like to move in the direction of groupreduce(containers...; by = ..., f = ..., op = ..., v0 = ...) or groupmean(v1, by = id4) or something (that’s last one’s not quite right), but we need to wait for fast keyword arguments first).

xiaodai · October 23, 2017, 3:43am

Sure happy to contribute and close down the FastGroupBy.jl and work on SAC.jl. First task is to choose a name. We will continue this discussion on github then https://github.com/andyferris/SAC.jl

mkborregaard · October 23, 2017, 6:49am

@andyferris not really my business but would SplitApplyCombines.jl not be a better name, and better in line with the julia convention of naming packages by what they do rather than as small unintelligible acronyms?

xiaodai · October 23, 2017, 7:34am

@mkborregaard That’s exactly what’s been suggested in the github repo already.

andyferris · October 23, 2017, 7:43am

Yes of course. The package began as my playground so I gave it a stupid name (I was also doing MVTs - minimally viable tables, at the time).

mkborregaard · October 23, 2017, 8:05am

Right Sorry for the noise!

Topic		Replies	Views
Julia performs poorly on group-by benchmarks Data performance	48	5804	January 23, 2019
Who does "better" than DataFrames? Performance dataframes	43	2021	April 6, 2023
Bad performance of group_by of DataFrames - updated - General Usage performance	21	1249	October 23, 2019
Group-by performance benchmarks and recommendations Data	12	3518	September 2, 2019
How is the data ecosystem right now for large datasets? Data	35	6723	July 13, 2017

Various by-group strategies compared

Approach 1 (meany_best) - zip with Dict

timing ~10 seconds

Approach 2 (meany2) - enumerate with Dict