Tables package for fast grouping and filtering?

I already use Julia for much of my simulation and model calibration work, but am considering using it for some of the more basic data preparation I do.

This work is primarily filtering, grouping, and applying time series joins to large tables of financial market data. I currently use kdb+/q, which is a highly refined in-memory database and query language.

I’ve started by looking at the popular DataFrames.jl package, which seems to support most of the features I would like, but a simple aggregation seems to take 6.7x longer.

Am I doing something wrong? Should I try a different Tables package?

julia> using DataFrames, BenchmarkTools
julia> n=10^7; t=DataFrame(x=rand(n), y=rand(Bool,n))
julia> @btime by($t, :y, :x => sum)
  147.204 ms (130 allocations: 356.89 MiB)
2×2 DataFrame
│ Row │ y    │ x_sum     │ 
│     │ Bool │ Float64   │
│ 1   │ 1    │ 2.49808e6 │
│ 2   │ 0    │ 2.50187e6 │

(I’m using a nightly Julia started with --optimize=3 --inline=yes --check-bounds=no --math-mode=fast)

q)n:prd 7#10; t:([]x:n?1f; y:n?0b)
q)\ts select sum x by y from t
22 134218464

kdb+/q is also using 64-bit floats; with units, the timing results are 22 ms and 134 MB of allocations.

RCall + data.table


If I wanted to go that route, I’d just call kdb+/q from julia. I used R’s data.table many years ago before I moved to kdb+/q. I’d rather not go back. :wink:

The main reason I’d like to write everything in Julia is the simplicity of having everything in one language, especially for running functions in parallel. I’d imagine using RCall isn’t compatible with Julia’s multithreading.
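For what it’s worth, staying in Julia does make the threading story simple. Here is a minimal sketch (my own, not from FastGroupBy or DataFrames) of a task-parallel group-sum over the same Bool-keyed table: each task sums its own chunk, and the per-task partials are combined at the end.

```julia
function tgroupsum(by::Vector{Bool}, x::Vector{Float64})
    nchunks = max(Threads.nthreads(), 1)
    chunks = Iterators.partition(eachindex(x), cld(length(x), nchunks))
    tasks = map(chunks) do idxs
        Threads.@spawn begin
            local res = zeros(2)          # false → res[1], true → res[2]
            @inbounds for i in idxs
                res[by[i] + 1] += x[i]
            end
            res
        end
    end
    sum(fetch.(tasks))                    # elementwise sum of the partials
end

by = rand(Bool, 10^6); x = rand(10^6)
tgroupsum(by, x)
```

Each chunk is an independent task, so no locks or atomics are needed; the only shared step is the final reduction over the per-task partial sums.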


That’s an improvement, but at 75ms, it’s still 3.4x slower.


Just curious: could you benchmark kdb+/q against data.table?

It’s been a while, so there may be a faster way:

> library(data.table)
data.table 1.12.6 using 4 threads (see ?getDTthreads).  Latest news:
> n=10^7; t=data.table(x=runif(n),y=runif(n)>0.5)
> system.time(t[,sum(x),by=y])
   user  system elapsed
  0.532   0.084   0.161
> setDTthreads(1)
> system.time(t[,sum(x),by=y])
   user  system elapsed
  0.203   0.050   0.252

Could you change `by` to `keyby` in your data.table example?


Same results +/- 5 ms.

kdb+ really fast

Wins what? The results are:

R data.table 1-thread:  252 ms
R data.table 4-threads: 161 ms
Julia DataFrames:       147 ms
Julia FastGroupBy:       75 ms
kdb+/q:                  22 ms

@xiaodai, maybe this is a new benchmark for you to use in the pursuit of a faster groupby?


Are you able to test it on a larger example? Or do you need to do these small group-bys millions of times? If you only need it 20 times, then I consider the gap to be in the “who cares” range. But it’s interesting to see what can be done to match it. Can kdb+/q support strings? Since y is a Boolean, I think you can use one of the internal functions of FastGroupBy to make it faster. I will look at it when I get home.

With n=10^9:

Julia DataFrames:  13.57 seconds
Julia FastGroupBy:  7.48 seconds
kdb+/q:             2.30 seconds

You’re right that in this trivial example I don’t care about performance. A more realistic example where things get slow, even in kdb+/q, is a rolling time window join by symbol between two tables while applying custom aggregation functions to multiple columns.
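For illustration, the core of such a join can be sketched in a few lines of plain Julia. This is a hypothetical simplification (integer timestamps, a single symbol, no aggregation, not any library’s API) that assumes both tables are already sorted by time: for each left-table time, find the index of the last right-table row at or before it.

```julia
# As-of join core: idx[i] is the index of the last right_times entry
# at or before left_times[i], or 0 if there is none yet.
function asof_indices(left_times::Vector{Int}, right_times::Vector{Int})
    idx = zeros(Int, length(left_times))
    j = 0
    for (i, t) in enumerate(left_times)
        while j < length(right_times) && right_times[j + 1] <= t
            j += 1
        end
        idx[i] = j
    end
    idx
end

trades = [1, 3, 7, 10]       # trade timestamps
quotes = [2, 3, 8, 9, 12]    # quote timestamps
asof_indices(quotes, trades) # last trade at or before each quote
```

Because both sides are sorted, the two-pointer scan is O(n + m); the real version would additionally partition by symbol and apply the custom aggregations over each matched window.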

kdb+/q does support strings, as well as symbols (enumerated strings).

How slow is it compared to data.table::foverlaps?


function fgroupreduce(by::Vector{Bool}, x::Vector{S}) where S
    s0 = zero(S); s1 = zero(S)   # scalar accumulators for the false/true groups
    @inbounds for (b, x1) in zip(by, x)
        b ? (s1 += x1) : (s0 += x1)
    end
    (s0, s1)
end

@time fgroupreduce(t[!, :y], t[!, :x])

This is the fastest way to do the group-by in Julia: it’s really a group-reduce. I was in the process of implementing this in FastGroupBy.jl before I got sidetracked by other projects. I might prioritise this feature if you want to use it.


Currently DataFrames doesn’t have an optimized method for Bool grouping keys, nor for integers, unfortunately. But you can try converting the column to a PooledArray or CategoricalArray, which should be quite a bit faster. I’m still planning to improve this by avoiding some unnecessary computations when you just want to compute the sum.
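To illustrate why a pooled key helps, here is a rough sketch of the fast path it enables: the key column becomes a vector of small integer ref codes, so the group-sum is a single pass writing into a tiny result vector. The names here are illustrative, not DataFrames or PooledArrays internals.

```julia
# Group-sum over integer ref codes in 1:ngroups, the representation a
# PooledArray/CategoricalArray stores internally for its key column.
function groupsum(refs::Vector{Int}, x::Vector{Float64}, ngroups::Int)
    res = zeros(ngroups)
    @inbounds for i in eachindex(refs)
        res[refs[i]] += x[i]
    end
    res
end

refs = rand(1:2, 10^6)   # what a pooled Bool column would look like
x = rand(10^6)
groupsum(refs, x, 2)
```

With a plain Bool or Int column the grouping code first has to discover and hash the distinct keys; with a pooled column the ref codes are already dense integers, so that whole step disappears.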


Wow, that is indeed faster. 11.6 ms and 0 allocations for n=10^7, twice as fast as kdb+/q!

I don’t have a specific need for this feature, but good to know that Julia can compete with kdb+/q’s black magic.

Using a CategoricalArray with DataFrames’ `by` also produces a decent speed-up: 81 ms and 152 MB of allocations, so almost 2x faster than with a Bool array.


For something simple like this we can expect Julia to be as fast as possible. We really should have that in DataFrames at some point.