# Various by-group strategies compared

#1

Update 12th Nov 2017
2nd Update: I implemented a multi-threaded version that is almost 2x faster than data.table!

I implemented a new radixgroup method that is a further 18% faster than the previous method on 250-million-element arrays.

Update 29th Oct 2017

I managed to repurpose radix sort to make Julia’s groupby performance on par with data.table, and it can even beat it in some cases!

Original post

I am preparing to release some new benchmarks of Julia vs R data.table group-by performance. I am now investigating a number of strategies for computing the mean of a value by group.

I have included my MWE in the gist below. I came up with 4 approaches; please do suggest improvements to the code, as I am new to Julia.

I am trying to find the mean value of `v1` by distinct values of `id4`. Normally `v1` and `id4` would be part of a dataframe or table of some sort, but for now I have included them as vectors.

Update (2017-10-22)
Thanks to @nalimilan’s helpful tips and comments, I managed to get a `meany_best` approach; see Approach 1 below.

# Approach 1 (`meany_best`) - zip with Dict

In this approach I use a single Dict that maps each group to its running sum and count, and use Base’s internal `ht_keyindex` to access values in the Dict quickly. This approach is general and works for any data type, so it is superior to Approach 4. The only approach that still beats it is Approach 3, which requires a large amount of prior knowledge.

# Approach 2 (`meany2`) - enumerate with Dict

Similar to Approach 1, but instead of zipping the two vectors I simply enumerate over one of them.

# Approach 3 (`meany3`) - what if I know the number of unique values in id4 beforehand?

I can just use two arrays `res` and `wt`, enumerate through `id4`, keep adding to them, and then summarise to get the mean.
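A minimal sketch of this two-array idea (assuming the ids are integers in `1:K` with `K` known; the function name is my own, not from the gist):

```julia
# Sketch of Approach 3: accumulate sums and counts into plain arrays.
# Assumes ids are integers in 1:K and K is known in advance.
function mean_by_group_arrays(id4::Vector{Int}, v1::Vector{<:Real}, K::Int)
    res = zeros(Float64, K)  # running sum per group
    wt  = zeros(Int, K)      # running count per group
    @inbounds for (i, id) in enumerate(id4)
        res[id] += v1[i]
        wt[id]  += 1
    end
    return res ./ wt         # elementwise means; NaN for empty groups
end
```

Array indexing here replaces hashing entirely, which is why this approach is so much faster than the Dict-based ones.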

# Approach 4 (`meany4`) - what if I work out the unique values in id4 first?

I use `countmap` to get the unique values in `id4` and then follow Approach 3.
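One hedged way to sketch this idea (I label ids inline with `get!` instead of `countmap`, and the function name is hypothetical): first assign each distinct id a consecutive code, then reuse the array accumulation from Approach 3.

```julia
# Sketch of Approach 4: discover the distinct ids first (here via a Dict of
# consecutive codes rather than countmap), then accumulate into arrays.
function mean_by_label(id4::AbstractVector, v1::AbstractVector{<:Real})
    labels = Dict{eltype(id4), Int}()   # group value -> code in 1:K
    codes = [get!(labels, id, length(labels) + 1) for id in id4]
    K = length(labels)
    res = zeros(Float64, K)             # running sum per code
    wt  = zeros(Int, K)                 # running count per code
    @inbounds for (i, c) in enumerate(codes)
        res[c] += v1[i]
        wt[c]  += 1
    end
    return Dict(id => res[c] / wt[c] for (id, c) in labels)
end
```

This pays the Dict cost only once per element in the labelling pass, and works for non-integer ids such as strings.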

## Timing: ~18 seconds

Clearly Approach 3 only works in limited situations, but Approach 4 is more generalisable. I understand Approaches 1 & 2 are slow because of the large amount of memory allocation there. How can I reduce the number of allocations?

Are there even faster methods I have missed? Multithreading?
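For what it’s worth, here is a hedged sketch of one multithreading strategy built on Approach 3: per-chunk accumulators merged at the end (assumes ids in `1:K`; the name is hypothetical; run with `julia -t N` to see a speedup):

```julia
# Sketch: split the input into one chunk per thread, accumulate each chunk
# into its own (sums, counts) pair, then merge. Assumes ids are in 1:K.
function mean_by_group_threaded(id4::Vector{Int}, v1::Vector{<:Real}, K::Int)
    n = length(id4)
    chunks = collect(Iterators.partition(1:n, cld(n, Threads.nthreads())))
    sums = [zeros(Float64, K) for _ in chunks]  # one accumulator per chunk
    cnts = [zeros(Int, K) for _ in chunks]
    Threads.@threads for c in 1:length(chunks)
        for i in chunks[c]
            @inbounds sums[c][id4[i]] += v1[i]
            @inbounds cnts[c][id4[i]] += 1
        end
    end
    return reduce(+, sums) ./ reduce(+, cnts)   # merge, then take means
end
```

Giving each chunk its own accumulator avoids any locking or atomic updates; the merge at the end is cheap when `K` is small relative to `n`.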

#2

You should type your `Dict`s to reduce (dramatically) the number of allocations. Then you can add `@inbounds` around the loops.

It’s worth noting that if you have `CategoricalArray` data, then you’re in the `meany3` case even with originally string data, since you can work directly with the integer codes.

#3

`Dict`s are also just a bad choice for things that have integer ids. There are much more common structures for doing `A[1] = 1.0` that are far faster. Is there a reason why those aren’t just `Array{Float64}(N)` or sparse arrays?

#4

That doesn’t work if you don’t know in advance the values you can get. For example, if there are only two different values but one of them is `typemax(Int64)`, your attempt to allocate a vector will fail badly. Ideally the data will be stored in a `CategoricalArray` or `PooledArray`, and you can use the consecutive integer codes which have been computed when constructing the vector (using a `Dict`, but only once for all).

#5

I tried adding Dict types to the function `meany`. However, the elapsed time on my computer is still 30 seconds, which is still slower than `meany4`. I assume this Dict approach will never be as fast? Which is perplexing, as I looked into the source of `countmap`, which uses a Dict behind the scenes to do the counting, so I am wondering why my code is so much slower.

``````
function meany(id4::Vector{T2}, v1::Vector{T}) where {T2<:Integer, T<:Real}
    res = Dict{T2, T}()
    wt = Dict{T2, Int64}()
    for (id, val) in zip(id4, v1)
        @inbounds res[id] = get(res, id, zero(T)) + val
        @inbounds wt[id] = get(wt, id, 0) + 1
    end
    return Dict(k => res[k] / wt[k] for k in keys(res))
end
``````

#6

One problem is that expressions like `res[iv[1]] = get(res,iv[1], 0) + iv[2]` are unnecessarily slow because the dict lookup is performed once by `get` and once by `setindex!`. Unfortunately, even though a fix is under discussion, there’s no official API to avoid this for now. You can use internal functions to work around it if you want; see what FreqTables does.

Since you have two dicts with the same keys, it would also help to store `(count, sum)` tuples inside a single dict to avoid another lookup. Overall you could get an almost 4× faster code.
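The tuple-in-one-Dict suggestion can be sketched like this (a minimal illustration with a hypothetical name; it still pays the `get`/`setindex!` double lookup mentioned above, but halves the number of hash tables):

```julia
# Sketch of the single-Dict variant: store (sum, count) tuples so only one
# hash table is maintained. Note that get + setindex! still hash the key twice.
function mean_by_tuple_dict(id4::Vector{T}, v1::Vector{S}) where {T, S<:Real}
    acc = Dict{T, Tuple{S, Int}}()
    for (id, val) in zip(id4, v1)
        s, n = get(acc, id, (zero(S), 0))
        acc[id] = (s + val, n + 1)
    end
    return Dict(k => s / n for (k, (s, n)) in acc)
end
```

Typing the Dict fully (`Dict{T, Tuple{S, Int}}`) is what eliminates the boxing allocations the thread discusses.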

#7

@nalimilan Thanks for the awesome tips!! This is my code now.

``````
import Base.ht_keyindex

const N = Int64(2e9/8);
const K = 100;
srand(1);
@time const id4 = rand(1:K, N); # large groups (int)
@time const v1 = rand(1:5, N);  # int in range [1,5]

timings = Dict();
function meany_best{S,T}(id4::Vector{T}, v1::Vector{S})::Dict{T,Real}
    res = Dict{T, Tuple{S, Int64}}()
    for (id, val) in zip(id4, v1)
        index = ht_keyindex(res, id)
        if index > 0
            @inbounds vw = res.vals[index]
            @inbounds res.vals[index] = (vw[1] + val, vw[2] + 1)
        else
            @inbounds res[id] = (val, 1)
        end
    end
    return Dict(k => res[k][1]/res[k][2] for k in keys(res))
end

@time res = meany_best(rand(1:100,2), rand(1:100,2)) # warmup
timings[:zip_dict] = @elapsed res = meany_best(id4, v1) # timing minus compilation time
``````

#8

What @nalimilan said, plus trying to find a more general algorithm that can deal with all types. As Approach 3 shows, under certain conditions using arrays is 10x faster.

#9

How fast is the new algorithm?

FWIW, @andyferris has proposed general APIs for split-apply-combine operations which seem to cover your case. See SAC.jl.

#10

I have updated the original post. It’s sub 10 seconds now, which is closer to R’s data.table (7-8 seconds). This is fast enough, although I would love to see a general algorithm in Julia that can beat data.table.

#11

I hope I understood the gist. This is what I get (2.95 seconds):

``````
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.7.0-DEV.2244 (2017-10-20 22:35 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 4253e56* (1 day old master)
|__/                   |  x86_64-linux-gnu

julia> using SAC

julia> const N = Int64(2e9/8);

julia> const K = 100;

julia> srand(1);

julia> @time const id4 = rand(1:K, N);# large groups (int)
3.613966 seconds (3.51 k allocations: 1.863 GiB, 0.11% gc time)

julia> @time const v1 =  rand(1:5, N); # int in range [1,5]
4.032845 seconds (8 allocations: 1.863 GiB, 0.11% gc time)

julia> function f()
           tmp = groupreduce(x->x[1], x->(x[2],1), (x,y)->(x[1]+y[1], x[2]+y[2]), zip(id4, v1))
           return Dict{Int,Float64}(kv.first=>(kv.second[1]/kv.second[2]) for kv in tmp)
       end
f (generic function with 1 method)

julia> f();

julia> @time f()
2.958926 seconds (26 allocations: 16.141 KiB)
Dict{Int64,Float64} with 100 entries:
68 => 3.00046
2  => 2.99942
89 => 3.00203
11 => 3.0003
39 => 3.00009
46 => 2.99894
85 => 3.00164
25 => 3.00016
55 => 2.9998
42 => 3.00069
66 => 3.00036
58 => 2.99921
29 => 3.00007
59 => 2.99829
8  => 3.00209
74 => 3.00189
95 => 3.00184
57 => 2.99979
90 => 2.99987
20 => 3.00071
14 => 2.99942
31 => 2.99912
78 => 3.00164
70 => 3.00076
52 => 2.99969
18 => 3.00145
33 => 3.00011
69 => 2.99885
96 => 2.99956
26 => 2.99968
35 => 3.00108
83 => 3.00045
65 => 2.99963
64 => 2.99874
17 => 2.99962
49 => 3.0001
84 => 3.00116
44 => 2.99967
4  => 2.99946
37 => 3.00003
45 => 3.00138
13 => 2.99993
86 => 2.99958
67 => 3.00017
99 => 3.00074
93 => 3.00098
94 => 3.00132
30 => 2.99852
1  => 2.99979
47 => 2.9982
54 => 3.00061
50 => 2.99983
77 => 2.99953
80 => 3.00003
40 => 3.00009
32 => 2.99897
82 => 2.99933
91 => 3.0002
⋮  => ⋮
``````

It’s worth running this timing on the same machine as the other examples.

Of course there is a goal in SAC to have the `reduce` functions apply a generic “final” aggregator so that e.g. `mean` can be done in a single reduction.

#12

This line of code feels like magic. I understand the first argument is the group-by key, the 2nd is a map, and the third is a reduce with an accumulator.

I can’t find the `groupreduce` function in the SAC.jl repo.
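For intuition, here is a hypothetical, untuned sketch of what a `groupreduce(by, f, op, itr)` with that calling convention could do (my own illustration, not SAC.jl’s actual implementation, hence the `_sketch` suffix):

```julia
# Hypothetical sketch matching the groupreduce(by, f, op, itr) call above:
# `by` extracts the group key, `f` maps each element, `op` combines the
# running accumulator with each newly mapped element.
function groupreduce_sketch(by, f, op, itr)
    out = Dict{Any, Any}()   # untyped for simplicity; a real version would infer
    for x in itr
        k = by(x)
        out[k] = haskey(out, k) ? op(out[k], f(x)) : f(x)
    end
    return out
end
```

With `f = x->(x[2], 1)` and a tuple-summing `op`, each group ends up with a `(sum, count)` pair, which is exactly what the transcript divides through at the end.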

#13

I just saw there is another SAC.jl, for seismic data! Haha. I was referring to my private, unregistered repository at https://github.com/andyferris/SAC.jl. I’ll probably have to rename it…

#14

@andyferris I just tested your code vs FastGroupBy.jl’s `meanby` function. Given our common interests/goals, do you want to collaborate? We can beat R’s data.table together!

On next.juliabox.com, for a small dataset (1 million rows) `meanby` is faster, and for a large dataset (10 million rows) both are about the same (see gist below). But of course SAC is more general.

I generated some test data using my DataBench.jl package.

#15

Cool, sure.

I see you are just starting out with FastGroupBy.jl - would you like to contribute to SAC.jl? (We should rename it…). PRs and comments are always very welcome. In this case a `groupmean` function would fit in well. I haven’t tuned the methods that much for speed, so any improvements there are obviously welcome.

So far, the work has focused on getting the semantics of the functions correct, so that they can be applied to generic data structures like `AbstractArray` and `Associative` (and nestings thereof) as well as to tables/dataframes, and on ensuring the design allows them to be fast (i.e. no design flaws that would make them impossible to optimize). There is a bit of a difficulty here in that the function signature for a table might naturally include a column name, while for generic containers it seems natural to use a function. I’d like to move in the direction of `groupreduce(containers...; by = ..., f = ..., op = ..., v0 = ...)` or `groupmean(v1, by = id4)` or something (that last one’s not quite right), but we need to wait for fast keyword arguments first.

#16

Sure, happy to contribute, close down FastGroupBy.jl, and work on SAC.jl. The first task is to choose a name. We will continue this discussion on GitHub then: https://github.com/andyferris/SAC.jl

#17

@andyferris, not really my business, but would SplitApplyCombines.jl not be a better name, more in line with the Julia convention of naming packages by what they do rather than as small unintelligible acronyms?

#18

@mkborregaard That’s exactly what’s been suggested in the github repo already.

#19

Yes, of course. The package began as my playground, so I gave it a stupid name (I was also doing MVTs, minimally viable tables, at the time).

#20

Right. Sorry for the noise!