Tables package for fast grouping and filtering?

I already use Julia for much of my simulation and model calibration work, but am considering using it for some of the more basic data preparation I do.

This work is primarily filtering, grouping, and applying time series joins to large tables of financial market data. I currently use kdb+/q, which is a highly refined in-memory database and query language.

I’ve started by looking at the popular DataFrames.jl package, which seems to support most of the features I would like, but a simple aggregation seems to take 6.7x longer.

Am I doing something wrong? Should I try a different Tables package?

julia> using DataFrames, BenchmarkTools
julia> n=10^7; t=DataFrame(x=rand(n), y=rand(Bool,n))
julia> @btime by($t, :y, :x => sum)
  147.204 ms (130 allocations: 356.89 MiB)
2×2 DataFrame
│ Row │ y    │ x_sum     │ 
│     │ Bool │ Float64   │
│ 1   │ 1    │ 2.49808e6 │
│ 2   │ 0    │ 2.50187e6 │

(I’m using a nightly Julia started with --optimize=3 --inline=yes --check-bounds=no --math-mode=fast)

q)n:prd 7#10; t:([]x:n?1f; y:n?0b)
q)\ts select sum x by y from t
22 134218464

kdb+/q is also using 64-bit floats; with units, the timing results are 22 ms and 134 MB of allocations.

RCall + data.table


If I wanted to go that route, I’d just call kdb+/q from julia. I used R’s data.table many years ago before I moved to kdb+/q. I’d rather not go back. :wink:

The main reason I’d like to write everything in Julia is the simplicity of having everything in one language, especially for running functions in parallel. I’d imagine using RCall isn’t compatible with Julia’s multithreading.
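For what it’s worth, staying in Julia does make the threading story simple. Here is a minimal sketch (my own, not from FastGroupBy or DataFrames) of a task-parallel group-sum over the same Bool-keyed table: each task sums its own chunk, and the per-task partials are combined at the end.

```julia
function tgroupsum(by::Vector{Bool}, x::Vector{Float64})
    nchunks = max(Threads.nthreads(), 1)
    chunks = Iterators.partition(eachindex(x), cld(length(x), nchunks))
    tasks = map(chunks) do idxs
        Threads.@spawn begin
            local res = zeros(2)          # false → res[1], true → res[2]
            @inbounds for i in idxs
                res[by[i] + 1] += x[i]
            end
            res
        end
    end
    sum(fetch.(tasks))                    # elementwise sum of the partials
end

by = rand(Bool, 10^6); x = rand(10^6)
tgroupsum(by, x)
```

Each chunk is an independent task, so no locks or atomics are needed; the only shared step is the final reduction over the per-task partial sums.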


That’s an improvement, but at 75ms, it’s still 3.4x slower.


Just curious: could you benchmark kdb+/q against data.table?

It’s been a while, so there may be a faster way:

> library(data.table)
data.table 1.12.6 using 4 threads (see ?getDTthreads).  Latest news:
> n=10^7; t=data.table(x=runif(n),y=runif(n)>0.5)
> system.time(t[,sum(x),by=y])
   user  system elapsed
  0.532   0.084   0.161
> setDTthreads(1)
> system.time(t[,sum(x),by=y])
   user  system elapsed
  0.203   0.050   0.252

Could you change `by` to `keyby` in your data.table example?


Same results +/- 5 ms.

kdb+ really fast

Wins what? The results are:

R data.table 1-thread:  252 ms
R data.table 4-threads: 161 ms
Julia DataFrames:       147 ms
Julia FastGroupBy:       75 ms
kdb+/q:                  22 ms

@xiaodai, maybe this is a new benchmark for you to use in the pursuit of a faster groupby?


Are you able to test it on a larger example? Or do you need to do these small group-bys millions of times? If you only need it 20 times, then I consider the gap to be in the “who cares” range. But it’s interesting to see what can be done to match it. Can kdb+/q support strings? Since y is a Boolean, I think you can use one of the internal functions of FastGroupBy to make it faster. I will look at it when I get home.

With n=10^9:

Julia DataFrames:  13.57 seconds
Julia FastGroupBy:  7.48 seconds
kdb+/q:             2.30 seconds

You’re right that in this trivial example I don’t care about performance. A more realistic example where things get slow, even in kdb+/q, is a rolling time window join by symbol between two tables while applying custom aggregation functions to multiple columns.
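For illustration, the core of such a join can be sketched in a few lines of plain Julia. This is a hypothetical simplification (integer timestamps, a single symbol, no aggregation, not any library’s API) that assumes both tables are already sorted by time: for each left-table time, find the index of the last right-table row at or before it.

```julia
# As-of join core: idx[i] is the index of the last right_times entry
# at or before left_times[i], or 0 if there is none yet.
function asof_indices(left_times::Vector{Int}, right_times::Vector{Int})
    idx = zeros(Int, length(left_times))
    j = 0
    for (i, t) in enumerate(left_times)
        while j < length(right_times) && right_times[j + 1] <= t
            j += 1
        end
        idx[i] = j
    end
    idx
end

trades = [1, 3, 7, 10]       # trade timestamps
quotes = [2, 3, 8, 9, 12]    # quote timestamps
asof_indices(quotes, trades) # last trade at or before each quote
```

Because both sides are sorted, the two-pointer scan is O(n + m); the real version would additionally partition by symbol and apply the custom aggregations over each matched window.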

kdb+/q does support strings, as well as symbols (enumerated strings).

How slow is it compared to data.table::foverlaps?


function fgroupreduce(by::Vector{Bool}, x::Vector{S}) where S
    s0 = zero(S); s1 = zero(S)   # scalar accumulators for the false/true groups
    @inbounds for (b, x1) in zip(by, x)
        b ? (s1 += x1) : (s0 += x1)
    end
    (s0, s1)
end

@time fgroupreduce(t[!, :y], t[!, :x])

This is the fastest way to do the group-by in Julia: it’s really a group-reduce. I was in the process of implementing this in FastGroupBy.jl before I got sidetracked by other projects. I might prioritise this feature if you want to use it.


Currently DataFrames doesn’t have an optimized method for Bool grouping keys, nor for integers, unfortunately. But you can try converting the column to a PooledArray or CategoricalArray, which should be quite a bit faster. I’m still planning to improve this by avoiding some unnecessary computations when you just want to compute the sum.
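To illustrate why a pooled key helps, here is a rough sketch of the fast path it enables: the key column becomes a vector of small integer ref codes, so the group-sum is a single pass writing into a tiny result vector. The names here are illustrative, not DataFrames or PooledArrays internals.

```julia
# Group-sum over integer ref codes in 1:ngroups, the representation a
# PooledArray/CategoricalArray stores internally for its key column.
function groupsum(refs::Vector{Int}, x::Vector{Float64}, ngroups::Int)
    res = zeros(ngroups)
    @inbounds for i in eachindex(refs)
        res[refs[i]] += x[i]
    end
    res
end

refs = rand(1:2, 10^6)   # what a pooled Bool column would look like
x = rand(10^6)
groupsum(refs, x, 2)
```

With a plain Bool or Int column the grouping code first has to discover and hash the distinct keys; with a pooled column the ref codes are already dense integers, so that whole step disappears.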


Wow, that is indeed faster. 11.6 ms and 0 allocations for n=10^7, twice as fast as kdb+/q!

I don’t have a specific need for this feature, but good to know that Julia can compete with kdb+/q’s black magic.

Using a CategoricalArray with DataFrames’ `by` also produces a decent speed-up: 81 ms and 152 MB of allocations, so almost 2x faster than with a Bool array.


For something simple like this we can expect Julia to be as fast as possible. We really should have that in DataFrames at some point.