Various by-group strategies compared

No probs. I slightly worry that SplitApplyCombine is too long and that something like DataOps might be briefer and more to the point. Any opinions?

(And sorry for derailing this thread! The new SAC.jl repo is https://github.com/JuliaData/SplitApplyCombine.jl)

Both are great IMHO :slight_smile:


Update 29th Oct 2017

I managed to repurpose RadixSort to make Julia's groupby performance on par with data.table, and it can beat it in some cases!
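To give a flavour of the approach, here is a simplified sketch (not the actual FastGroupBy.jl code): order the keys, carry the values along, then sum each contiguous run of equal keys. The radix part is that the plain `sortperm` stand-in below would be replaced by radix-sort passes that permute the values together with the keys, which is where the speed-up over hash-based grouping comes from.

# Sketch of the sort-then-run-sum idea (simplified illustration only).
function sumby_sorted_sketch(by::AbstractVector, val::AbstractVector)
    o = sortperm(by)                    # stand-in for a radix sort of the keys
    byo, valo = by[o], val[o]           # keys and values in key order
    ks, vs = similar(byo, 0), similar(valo, 0)
    i = 1
    @inbounds while i <= length(byo)
        j, s = i, zero(eltype(valo))
        while j <= length(byo) && isequal(byo[j], byo[i])
            s += valo[j]                # accumulate one contiguous group
            j += 1
        end
        push!(ks, byo[i]); push!(vs, s)
        i = j
    end
    ks, vs
end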

Original Post
I've written a new post to compare sumby strategies. Please comment and let me know how to improve the code within.

Thanks

https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell


That is a very interesting comparison.
Is the code you used available somewhere?
I would like to look into it (I have not found it on your blog).


@bernhard I have put the code below. Most of the underlying code is in FastGroupBy.jl

Also, I am working on DataBench.jl to collate all these tests into easy-to-rerun packages. I have put all my experiments in a folder. Progress will be slow, as I can only work on it when I have spare time.

https://gist.github.com/xiaodaigh/fbdbbfd32062e33a20b1895af24e1542

I have now added it to my post. My oversight.

Update: Implemented a multi-threaded version that is almost 2x faster than data.table.

Original post
Implemented a new radixgroup method that is 18% faster than [the previous method](https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell) on 250-million-element arrays.

In general the algorithm is faster starting from around the 100-million-element mark, but it uses slightly more memory. This is clearly beating data.table now for bits types!
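For the multi-threaded version mentioned in the update, the rough idea is shown in the hedged sketch below (one possible implementation, not necessarily what FastGroupBy.jl does): each thread accumulates its own slice of the rows into a private accumulator, and the per-thread accumulators are merged at the end, so no locking is needed.

# Hedged sketch of a lock-free multi-threaded sum-by for small integer keys.
function threaded_sumby_sketch(by::Vector{UInt8}, val::Vector{Int})
    nt  = Threads.nthreads()
    acc = zeros(Int, 256, nt)                    # one accumulator column per thread
    Threads.@threads for t in 1:nt
        lo = div((t - 1) * length(by), nt) + 1   # this thread's slice of the rows
        hi = div(t * length(by), nt)
        @inbounds for i in lo:hi
            acc[Int(by[i]) + 1, t] += val[i]
        end
    end
    vec(sum(acc, 2))   # merge the per-thread results; sums for keys 0x00:0xff (Julia 0.6 syntax)
end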


I have been playing around with group-by for two string vectors and couldn't get much better than about 3x slower than R (which is fine given R has string interning). So I have started playing around with CategoricalArrays, as many people have suggested. I have two string-turned-categorical vectors, y and z; each has 100 unique values and both are of length 100 million, and I am summing another vector based on the unique groups of y and z. In SQL-speak, I am doing

select
  y, z, sum(val)
from
  df
group by
  y, z;

And the performance is really good! It looks like R's data.table hasn't implemented an optimization for factors/categoricals yet, but the technique I have employed can be implemented in R via C/C++. For now, though, Julia is beating R comprehensively here!
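For the curious, the trick that makes this fast (my reading of it; the sketch below is illustrative rather than the actual fgroupreduce code) is that the refs of the two CategoricalVectors are small integers, so each (y, z) pair can be fused into one dense integer key and the sums accumulated into a flat array, with no hashing at all.

using CategoricalArrays

# Illustrative sketch only: fuse the two categorical ref codes into one dense key.
# `refs` is the internal code vector of a CategoricalArray; `levels` is the public API.
function groupreduce_2cat_sketch(y::CategoricalVector, z::CategoricalVector, val::Vector)
    ny, nz = length(levels(y)), length(levels(z))
    acc = zeros(eltype(val), ny * nz)
    @inbounds for i in eachindex(val)
        k = (Int(y.refs[i]) - 1) * nz + Int(z.refs[i])   # dense key in 1:(ny*nz)
        acc[k] += val[i]
    end
    acc   # acc[(iy - 1) * nz + iz] holds sum(val) for the level pair (iy, iz)
end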

The results are pretty crazily good, and it's 1 am in Australia so I had better sleep before checking further, but so far the results look legit to me. I am salivating over the prospect of doing serious data manipulation in Julia in the near future, especially once there is a faster string toolbox in pure Julia, e.g. fast string comparison and hashing.

See the benchmarking code below; it relies on the unpublished (as of 31 Jan 2018) master branch of FastGroupBy.jl.

#######################################################################
# setting up
#######################################################################
using Revise
using FastGroupBy, BenchmarkTools, SortingLab, CategoricalArrays, Base.Test
tic()
# import Base: getindex, similar, setindex!, size
N = 100_000_000; K = 100
srand(1);
# val = rand(round.(rand(K)*100,4), N);
val = rand(1:5, N);
pool = "id".*dec.(1:100,3);   # 100 unique ids: "id001" ... "id100"
fn = sum;

#######################################################################
# convert to CategoricalVector
#######################################################################
# build a CategoricalVector with the 100 string levels from random integer ref codes
y = CategoricalArray{String, 1}(rand(UInt32(1):UInt32(length(pool)), N), CategoricalPool(pool, true));
y = compress(y);   # compress the refs to the smallest integer type that fits (UInt8 for 100 levels)
# @benchmark sort($y)

z = CategoricalArray{String, 1}(rand(UInt32(1):UInt32(length(pool)), N), CategoricalPool(pool, true));
z = compress(z);
byveccv = (y, z);
toc() # 2mins for 2b length vectors

using DataFrames

df = DataFrame(y = y, z = z, val = val)

@benchmark aggregate($df, $[:y,:z], $sum)
# BenchmarkTools.Trial:
#   memory estimate:  6.28 GiB
#   allocs estimate:  101028306
#   --------------
#   minimum time:     20.904 s (4.59% GC)
#   median time:      20.904 s (4.59% GC)
#   mean time:        20.904 s (4.59% GC)
#   maximum time:     20.904 s (4.59% GC)
#   --------------
#   samples:          1
#   evals/sample:     1

@benchmark by($df, $[:y,:z], df1->sum(df1[:val]))
# BenchmarkTools.Trial:
# memory estimate:  6.25 GiB
# allocs estimate:  100628265
# --------------
# minimum time:     29.894 s (3.82% GC)
# median time:      29.894 s (3.82% GC)
# mean time:        29.894 s (3.82% GC)
# maximum time:     29.894 s (3.82% GC)
# --------------
# samples:          1
# evals/sample:     1

@benchmark fgroupreduce($+, $byveccv, $val) samples = 5 seconds = 20 # 7.5 seconds for 2billion
# BenchmarkTools.Trial:
# memory estimate:  311.27 KiB
# allocs estimate:  1338
# --------------
# minimum time:     366.854 ms (0.00% GC)
# median time:      387.955 ms (0.00% GC)
# mean time:        401.369 ms (0.00% GC)
# maximum time:     443.034 ms (0.00% GC)
# --------------
# samples:          13
# evals/sample:     1

using IndexedTables, IterableTables
t = table(df)
ti = reindex(t,(:y,:z))

@benchmark IndexedTables.groupreduce(+, ti, (:y,:z), select = :val) samples = 5 seconds = 20
# BenchmarkTools.Trial:
# memory estimate:  1.07 MiB
# allocs estimate:  10243
# --------------
# minimum time:     2.465 s (0.00% GC)
# median time:      2.561 s (0.00% GC)
# mean time:        2.555 s (0.00% GC)
# maximum time:     2.660 s (0.00% GC)
# --------------
# samples:          5
# evals/sample:     1

using RCall

r_timing = R"""
memory.limit(2^31-1)
N=1e8; K=100
set.seed(1)
library(data.table)
DT <- data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
v1 =  sample(5, N, TRUE)                          # int in range [1,5]
)

system.time( DT[, sum(v1), keyby="id1,id2"] )
"""
# RCall.RObject{RCall.RealSxp}
# user  system elapsed
# 4.75    0.29    5.12

r_timing2 = R"""
setkey(DT, id1, id2)
system.time( DT[, sum(v1), keyby="id1,id2"])
"""
# user  system elapsed 
# 1.24    0.07    1.33 

using Plots
bar(
    ["FastGroupBy.jl", "DataFrames.jl","IndexedTables.jl (indexed)","R data.table", "R data.table (indexed)"],
    [0.401, 20.9, 2.555, r_timing[3], r_timing2[3]],   # elapsed times used for both data.table runs
    title = "Group-by 2 factors (uniques: 100; len: 100m)",
    label = "seconds",
    ylabel = "mean run time"
)
savefig("Group_by2_cate_perf.png")

We should really consider how to get DataFrames to default to use the "better" types. Or maybe someone should write a readdlm_fast that builds the "fast types table", since all of the pieces pretty much exist but most people probably don't know what to use where.
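For what it's worth, here is a rough sketch of the kind of helper being suggested (the name categorise_lowcard! and the 10% threshold are made up for illustration): after reading a table, swap any String column with a small proportion of unique values for a CategoricalVector so the faster categorical paths can be used.

using DataFrames, CategoricalArrays

# Hypothetical helper: convert low-cardinality String columns to CategoricalVector.
function categorise_lowcard!(df::DataFrame; threshold = 0.1)
    for name in names(df)
        col = df[name]
        if eltype(col) <: AbstractString && length(unique(col)) <= threshold * length(col)
            df[name] = compress(CategoricalArray(col))   # swap in the "better" type
        end
    end
    df
end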

As I've noted elsewhere, CSV.jl already uses CategoricalArray columns by default when the proportion of unique values is small. What is lacking is an optimized grouping method for that type, and more generally a faster grouping API (probably based on NamedTuple, cf. this issue).


A comparison with IndexedTables' grouping could be interesting both in terms of performance/implementation and in terms of API; from the bar plot above, it seems it's already doing a pretty good job.

My understanding (I could very well be wrong) is that a slowish sortperm in Julia is the limiting factor for IndexedTables groupby performance and any optimization to sortperm would lead to a faster groupby.
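One quick (and admittedly rough) way to get a feel for that, using only Base and BenchmarkTools, is to time sortperm on its own against sorting the raw keys; the gap suggests how much the permutation step alone costs. Numbers will vary by machine, and this is only a proxy for the grouping work itself.

using BenchmarkTools

srand(1)
v = rand(1:100, 100_000_000)   # keys shaped like the benchmark above

@benchmark sortperm($v) samples = 5 seconds = 60   # the permutation step on its own
@benchmark sort($v)     samples = 5 seconds = 60   # sorting the raw keys, for contrast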


Yep. Julia has a flexible type system, so we can implement many optimizations. I think it would be nice to introduce a DataFrameVector. It's basically a vector, but it can store useful metadata, e.g. whether it is sorted, the number of unique values, whether it is grouped, the frequencies of distinct values (stored on disk), and the range (see the sketch at the end of this post). Anything that might allow faster algorithms to be applied should be stored. For group-by and merging, if we know the vector is already sorted then there are faster algorithms. Chris's post about using algorithms based on knowledge of the data/domain absolutely applies here.

Yep, IndexedTables/JuliaDB is pretty fast if the data is indexed, but it takes a long time to create the index (again, I am comparing to R's data.table here). Still, it's very promising.

I can see why a general group-by algorithm for arbitrary data types might rely on sortperm, but sometimes there are much faster algorithms (as in the example above) that don't need it. I think it just takes time to implement all these "fast paths" in the ecosystem.
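Here is a hedged sketch of the DataFrameVector idea above (the type name and fields are made up; this is just to show the shape of it): a thin AbstractVector wrapper carrying metadata that group-by or merge code could dispatch on, e.g. skipping the sort entirely when the vector is known to be sorted.

# Hypothetical metadata-carrying vector; name and fields are illustrative only.
struct MetaVector{T, V <: AbstractVector{T}} <: AbstractVector{T}
    data::V
    issorted::Bool   # known to be sorted?
    nunique::Int     # number of distinct values, or -1 if unknown
end

Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]

# A group-by could then pick its algorithm from the metadata, e.g.
#   sorted      -> sum contiguous runs in one pass
#   few uniques -> dense counting accumulator
#   otherwise   -> radix/hash based grouping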

If we are including a DataFrameVector type, perhaps this is also a place to consider attributes like labels and notes along with unique values etc.

I'd rather store information either in CategoricalArray when it makes sense (a lot is already there; we could add longer value labels if needed), or in the DataFrame (by marking columns as being indexes).

Yeah, I have always wanted a DataFrame that stores metadata, e.g. a description of the column, etc.

Some of the metadata should be kept after merging with another DataFrame, so it feels better to keep it with the vector.