Various by-group strategies compared

No probs. I slightly worry that SplitApplyCombine is too long and that something like DataOps might be briefer and more to the point. Any opinions?

(And sorry for derailing this thread! The new SAC.jl repo is https://github.com/JuliaData/SplitApplyCombine.jl)

Both are great IMHO :slight_smile:


Update 29th Oct 2017

I managed to repurpose RadixSort to make Julia's groupby performance on par with data.table, and it can beat it in some cases!
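To give a flavour of the approach, here is a simplified sketch (not the actual FastGroupBy.jl code): order the keys, carry the values along, then sum each contiguous run of equal keys. The radix part is that the plain `sortperm` stand-in below would be replaced by radix-sort passes that permute the values together with the keys, which is where the speed-up over hash-based grouping comes from.

# Sketch of the sort-then-run-sum idea (simplified illustration only).
function sumby_sorted_sketch(by::AbstractVector, val::AbstractVector)
    o = sortperm(by)                    # stand-in for a radix sort of the keys
    byo, valo = by[o], val[o]           # keys and values in key order
    ks, vs = similar(byo, 0), similar(valo, 0)
    i = 1
    @inbounds while i <= length(byo)
        j, s = i, zero(eltype(valo))
        while j <= length(byo) && isequal(byo[j], byo[i])
            s += valo[j]                # accumulate one contiguous group
            j += 1
        end
        push!(ks, byo[i]); push!(vs, s)
        i = j
    end
    ks, vs
end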

Original Post
I've written a new post to compare sumby strategies. Please comment and let me know how to improve the code within.

Thanks

https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell


That is a very interesting comparison.
Is the code you used available somewhere?
I would like to look into it (I have not found it on your blog).


@bernhard I have put the code below. Most of the underlying code is in FastGroupBy.jl

Also, I am working on DataBench.jl to collate all these tests into easy-to-rerun packages. I have put all my experiments in a folder. Progress will be slow, as I can only work on it when I have spare time.

https://gist.github.com/xiaodaigh/fbdbbfd32062e33a20b1895af24e1542

I have now added it to my post. My oversight.

Update: Implemented a multi-threaded version that is almost 2x faster than data.table.

Original post
Implemented a new radixgroup method that is 18% faster than [the previous method](https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell) on 250-million-element arrays.

In general the algorithm is faster starting from around the 100-million-element mark, but it uses slightly more memory. This is clearly beating data.table now for bits types!
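For the multi-threaded version mentioned in the update, the rough idea is shown in the hedged sketch below (one possible implementation, not necessarily what FastGroupBy.jl does): each thread accumulates its own slice of the rows into a private accumulator, and the per-thread accumulators are merged at the end, so no locking is needed.

# Hedged sketch of a lock-free multi-threaded sum-by for small integer keys.
function threaded_sumby_sketch(by::Vector{UInt8}, val::Vector{Int})
    nt  = Threads.nthreads()
    acc = zeros(Int, 256, nt)                    # one accumulator column per thread
    Threads.@threads for t in 1:nt
        lo = div((t - 1) * length(by), nt) + 1   # this thread's slice of the rows
        hi = div(t * length(by), nt)
        @inbounds for i in lo:hi
            acc[Int(by[i]) + 1, t] += val[i]
        end
    end
    vec(sum(acc, 2))   # merge the per-thread results; sums for keys 0x00:0xff (Julia 0.6 syntax)
end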


I have been playing around with group-by for two string vectors and couldn't get much better than about 3x slower than R (which is fine given R has string interning). So I have started playing around with CategoricalArrays, as many people have suggested. I have two string-turned-categorical vectors, y and z; each has 100 unique values and both are of length 100 million, and I am summing another vector based on the unique groups of y and z. In SQL-speak, I am doing

select
  y, z, sum(val)
from
  df
group by
  y, z;

And the performance is really good! It looks like R's data.table hasn't implemented an optimization for factors/categoricals yet, but the technique I have employed can be implemented in R via C/C++. For now, though, Julia is beating R comprehensively here!
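For the curious, the trick that makes this fast (my reading of it; the sketch below is illustrative rather than the actual fgroupreduce code) is that the refs of the two CategoricalVectors are small integers, so each (y, z) pair can be fused into one dense integer key and the sums accumulated into a flat array, with no hashing at all.

using CategoricalArrays

# Illustrative sketch only: fuse the two categorical ref codes into one dense key.
# `refs` is the internal code vector of a CategoricalArray; `levels` is the public API.
function groupreduce_2cat_sketch(y::CategoricalVector, z::CategoricalVector, val::Vector)
    ny, nz = length(levels(y)), length(levels(z))
    acc = zeros(eltype(val), ny * nz)
    @inbounds for i in eachindex(val)
        k = (Int(y.refs[i]) - 1) * nz + Int(z.refs[i])   # dense key in 1:(ny*nz)
        acc[k] += val[i]
    end
    acc   # acc[(iy - 1) * nz + iz] holds sum(val) for the level pair (iy, iz)
end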

The results are pretty crazily good, and it's 1 am in Australia so I had better sleep before checking further, but so far the results look legit to me. I am salivating over the prospect of doing serious data manipulation in Julia in the near future, especially once there is a faster string toolbox in pure Julia, e.g. fast string comparison and hashing.

See the benchmarking code below; it relies on the unpublished (as of 31 Jan 2018) master branch of FastGroupBy.jl.

#######################################################################
# setting up
#######################################################################
using Revise
using FastGroupBy, BenchmarkTools, SortingLab, CategoricalArrays, Base.Test
tic()
# import Base: getindex, similar, setindex!, size
N = 100_000_000; K = 100
srand(1);
# val = rand(round.(rand(K)*100,4), N);
val = rand(1:5, N);
pool = "id".*dec.(1:100,3);   # 100 unique ids: "id001" ... "id100"
fn = sum;

#######################################################################
# convert to CategoricalVector
#######################################################################
# build a CategoricalVector with the 100 string levels from random integer ref codes
y = CategoricalArray{String, 1}(rand(UInt32(1):UInt32(length(pool)), N), CategoricalPool(pool, true));
y = compress(y);   # compress the refs to the smallest integer type that fits (UInt8 for 100 levels)
# @benchmark sort($y)

z = CategoricalArray{String, 1}(rand(UInt32(1):UInt32(length(pool)), N), CategoricalPool(pool, true));
z = compress(z);
byveccv = (y, z);
toc() # 2mins for 2b length vectors

using DataFrames

df = DataFrame(y = y, z = z, val = val)

@benchmark aggregate($df, $[:y,:z], $sum)
# BenchmarkTools.Trial:
#   memory estimate:  6.28 GiB
#   allocs estimate:  101028306
#   --------------
#   minimum time:     20.904 s (4.59% GC)
#   median time:      20.904 s (4.59% GC)
#   mean time:        20.904 s (4.59% GC)
#   maximum time:     20.904 s (4.59% GC)
#   --------------
#   samples:          1
#   evals/sample:     1

@benchmark by($df, $[:y,:z], df1->sum(df1[:val]))
# BenchmarkTools.Trial:
# memory estimate:  6.25 GiB
# allocs estimate:  100628265
# --------------
# minimum time:     29.894 s (3.82% GC)
# median time:      29.894 s (3.82% GC)
# mean time:        29.894 s (3.82% GC)
# maximum time:     29.894 s (3.82% GC)
# --------------
# samples:          1
# evals/sample:     1

@benchmark fgroupreduce($+, $byveccv, $val) samples = 5 seconds = 20 # 7.5 seconds for 2billion
# BenchmarkTools.Trial:
# memory estimate:  311.27 KiB
# allocs estimate:  1338
# --------------
# minimum time:     366.854 ms (0.00% GC)
# median time:      387.955 ms (0.00% GC)
# mean time:        401.369 ms (0.00% GC)
# maximum time:     443.034 ms (0.00% GC)
# --------------
# samples:          13
# evals/sample:     1

using IndexedTables, IterableTables
t = table(df)
ti = reindex(t,(:y,:z))

@benchmark IndexedTables.groupreduce(+, ti, (:y,:z), select = :val) samples = 5 seconds = 20
# BenchmarkTools.Trial:
# memory estimate:  1.07 MiB
# allocs estimate:  10243
# --------------
# minimum time:     2.465 s (0.00% GC)
# median time:      2.561 s (0.00% GC)
# mean time:        2.555 s (0.00% GC)
# maximum time:     2.660 s (0.00% GC)
# --------------
# samples:          5
# evals/sample:     1

using RCall

r_timing = R"""
memory.limit(2^31-1)
N=1e8; K=100
set.seed(1)
library(data.table)
DT <- data.table(
id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
v1 =  sample(5, N, TRUE)                          # int in range [1,5]
)

system.time( DT[, sum(v1), keyby="id1,id2"] )
"""
# RCall.RObject{RCall.RealSxp}
# user  system elapsed
# 4.75    0.29    5.12

r_timing2 = R"""
setkey(DT, id1, id2)
system.time( DT[, sum(v1), keyby="id1,id2"])
"""
# user  system elapsed 
# 1.24    0.07    1.33 

using Plots
bar(
    ["FastGroupBy.jl", "DataFrames.jl","IndexedTables.jl (indexed)","R data.table", "R data.table (indexed)"],
    [0.401, 20.9, 2.555, r_timing[3], r_timing2[3]],   # elapsed times used for both data.table runs
    title = "Group-by 2 factors (uniques: 100; len: 100m)",
    label = "seconds",
    ylabel = "mean run time"
)
savefig("Group_by2_cate_perf.png")

We should really consider how to get DataFrames to default to use the "better" types. Or maybe someone should write a readdlm_fast that builds the "fast types table", since all of the pieces pretty much exist but most people probably don't know what to use where.
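For what it's worth, here is a rough sketch of the kind of helper being suggested (the name categorise_lowcard! and the 10% threshold are made up for illustration): after reading a table, swap any String column with a small proportion of unique values for a CategoricalVector so the faster categorical paths can be used.

using DataFrames, CategoricalArrays

# Hypothetical helper: convert low-cardinality String columns to CategoricalVector.
function categorise_lowcard!(df::DataFrame; threshold = 0.1)
    for name in names(df)
        col = df[name]
        if eltype(col) <: AbstractString && length(unique(col)) <= threshold * length(col)
            df[name] = compress(CategoricalArray(col))   # swap in the "better" type
        end
    end
    df
end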

As I've noted elsewhere, CSV.jl already uses CategoricalArray columns by default when the proportion of unique values is small. What is lacking is an optimized grouping method for that type, and more generally a faster grouping API (probably based on NamedTuple, cf. this issue).


A comparison with IndexedTables' grouping could be interesting both in terms of performance/implementation and in terms of API; from the bar plot above, it seems it's already doing a pretty good job.

My understanding (I could very well be wrong) is that a slowish sortperm in Julia is the limiting factor for IndexedTables groupby performance and any optimization to sortperm would lead to a faster groupby.
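One quick (and admittedly rough) way to get a feel for that, using only Base and BenchmarkTools, is to time sortperm on its own against sorting the raw keys; the gap suggests how much the permutation step alone costs. Numbers will vary by machine, and this is only a proxy for the grouping work itself.

using BenchmarkTools

srand(1)
v = rand(1:100, 100_000_000)   # keys shaped like the benchmark above

@benchmark sortperm($v) samples = 5 seconds = 60   # the permutation step on its own
@benchmark sort($v)     samples = 5 seconds = 60   # sorting the raw keys, for contrast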


Yep. Julia has a flexible type system, so we can implement many optimizations. I think it would be nice to introduce a DataFrameVector. It's basically a vector, but it can store useful metadata, e.g. whether it is sorted, the number of unique values, whether it is grouped, the frequencies of distinct values (stored on disk), and the range (see the sketch at the end of this post). Anything that might allow faster algorithms to be applied should be stored. For group-by and merging, if we know the vector is already sorted then there are faster algorithms. Chris's post about using algorithms based on knowledge of the data/domain absolutely applies here.

Yep, IndexedTables/JuliaDB is pretty fast if the data is indexed, but it takes a long time to create the index (again, I am comparing to R's data.table here). Still, it's very promising.

I can see why a general group-by algorithm for arbitrary data types might rely on sortperm, but sometimes there are much faster algorithms (as in the example above) that don't need it. I think it just takes time to implement all these "fast paths" in the ecosystem.
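Here is a hedged sketch of the DataFrameVector idea above (the type name and fields are made up; this is just to show the shape of it): a thin AbstractVector wrapper carrying metadata that group-by or merge code could dispatch on, e.g. skipping the sort entirely when the vector is known to be sorted.

# Hypothetical metadata-carrying vector; name and fields are illustrative only.
struct MetaVector{T, V <: AbstractVector{T}} <: AbstractVector{T}
    data::V
    issorted::Bool   # known to be sorted?
    nunique::Int     # number of distinct values, or -1 if unknown
end

Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]

# A group-by could then pick its algorithm from the metadata, e.g.
#   sorted      -> sum contiguous runs in one pass
#   few uniques -> dense counting accumulator
#   otherwise   -> radix/hash based grouping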

If we are including a DataFrameVector type, perhaps this is also a place to consider attributes like labels and notes along with unique values etc.

I'd rather store information either in CategoricalArray when it makes sense (a lot is already there; we could add longer value labels if needed), or in the DataFrame (by marking columns as being indexes).

Yeah, I have always wanted a DataFrame that stores metadata, e.g. a description of the column, etc.

Some of the metadata should be kept after merging with another DataFrame, so it feels better to keep it with the vector.