I already use Julia for much of my simulation and model calibration work, but am considering using it for some of the more basic data preparation I do.
This work is primarily filtering, grouping, and applying time series joins to large tables of financial market data. I currently use kdb+/q, which is a highly refined in-memory database and query language.
I’ve started by looking at the popular DataFrames.jl package, which seems to support most of the features I would like, but a simple aggregation seems to take 6.7x longer than in kdb+/q.
Am I doing something wrong? Should I try a different Tables package?
If I wanted to go that route, I’d just call kdb+/q from Julia. I used R’s data.table many years ago before I moved to kdb+/q. I’d rather not go back.
The main reason I’d like to write everything in Julia is the simplicity of having everything in one language, especially for running functions in parallel. I’d imagine using RCall isn’t compatible with Julia’s multithreading.
Are you able to test it out on a large example? Or do you need to do these small group-bys millions of times? If you only need it 20 times, then I consider the gap to be in the “who cares” range. It’s still interesting to see what can be done to match it, though. Does kdb+/q support strings? y is a Boolean, so I think you can use one of the internal functions of FastGroupBy to make it faster. I will look at it when I get home.
Julia DataFrames: 13.57 seconds
Julia FastGroupBy: 7.48 seconds
kdb+/q: 2.30 seconds
You’re right that in this trivial example I don’t care about performance. A more realistic example where things get slow, even in kdb+/q, is a rolling time window join by symbol between two tables while applying custom aggregation functions to multiple columns.
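To make that harder case concrete, here is a rough sketch of a rolling window join in plain Julia. The name `window_join`, the column layout, and the closed window `[time - window, time]` are all assumptions made for illustration. It is not a port of kdb+/q's window join and it is not tuned for speed. It assumes the right-hand rows are sorted by time within each symbol and that the aggregation returns a `Float64`.

```julia
# Hypothetical helper: for each (sym, time) row on the left, apply `f` to the
# right-hand values with the same symbol and a timestamp in [time - window, time].
# Assumes the right-hand rows are sorted by time within each symbol.
function window_join(f, lsym, ltime, rsym, rtime, rval, window)
    # Index the right-hand rows by symbol once, preserving their time order.
    idx = Dict{eltype(rsym), Vector{Int}}()
    for (i, s) in enumerate(rsym)
        push!(get!(() -> Int[], idx, s), i)
    end
    # Left rows with no matching right rows stay `missing`.
    out = Vector{Union{Missing, Float64}}(missing, length(lsym))
    for i in eachindex(lsym)
        rows = get(idx, lsym[i], nothing)
        rows === nothing && continue
        times = view(rtime, rows)
        # Binary search for the first and last right-hand row inside the window.
        lo = searchsortedfirst(times, ltime[i] - window)
        hi = searchsortedlast(times, ltime[i])
        if lo <= hi
            out[i] = f(view(rval, rows[lo:hi]))
        end
    end
    return out
end

# Toy data: right-hand rows (sorted by time within each symbol) and left-hand rows.
rsym  = [:a, :a, :a, :b, :b]
rtime = [1, 2, 5, 1, 4]
rval  = [10.0, 20.0, 30.0, 1.0, 2.0]
lsym  = [:a, :b, :c]
ltime = [3, 4, 3]

window_sum   = window_join(sum, lsym, ltime, rsym, rtime, rval, 2)
window_range = window_join(v -> maximum(v) - minimum(v), lsym, ltime, rsym, rtime, rval, 2)
```

Any function of a vector can be passed as `f`, which is where the custom aggregations come in. Aggregating several columns would mean calling it once per column or passing views of several columns to `f`.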
kdb+/q does support strings, as well as symbols (enumerated strings):
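function fgroupreduce(by::Vector{Bool}, x::Vector{S}) where {S}
    # res[1] accumulates the rows where `by` is false, res[2] where it is true
    res = zeros(S, 2)
    @inbounds for (b, x1) in zip(by, x)
        res[b + 1] += x1
    end
    return res
end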
@time fgroupreduce(t[!, :y], t[!, :x])
This is the fastest way to do the group-by in Julia, which is a groupreduce. I was in the process of implementing this in FastGroupBy.jl before I got sidetracked by other projects. I might prioritise these features if you want to use them.
Currently DataFrames doesn’t have an optimized method for Bool grouping keys, nor for integers, unfortunately. But you can try converting the column to a PooledArray or CategoricalArray, and it should be considerably faster. I’m still planning to improve this by avoiding some unnecessary computations when you just want to compute the sum.
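The same single-pass idea should carry over to any key that can be mapped to group indices `1:k`, not just Bool. The following is only a sketch of that pattern. The name `groupreduce_codes` and its signature are made up for illustration and are not FastGroupBy.jl's API.

```julia
# Hypothetical generalisation of the Bool version: `codes` holds a group index
# in 1:ngroups for every row, and `op` is the reduction applied within a group.
function groupreduce_codes(op, codes::AbstractVector{<:Integer},
                           x::AbstractVector{T}, ngroups::Integer, init::T) where {T}
    length(codes) == length(x) || throw(DimensionMismatch("codes and x differ in length"))
    res = fill(init, ngroups)
    for i in eachindex(codes, x)
        c = codes[i]
        res[c] = op(res[c], x[i])  # bounds-checked, so an out-of-range code raises an error
    end
    return res
end

codes = [1, 3, 2, 3, 1]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
sums = groupreduce_codes(+, codes, x, 3, 0.0)
maxs = groupreduce_codes(max, codes, x, 3, -Inf)
```

The output is one preallocated vector with a slot per group, so the work is a single pass over the data with no hashing.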
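A minimal sketch of the CategoricalArray conversion, assuming a table `t` with a numeric column `x` and a Bool column `y` as in the call above. The data here is synthetic, and the column names `yc` and `x_sum` are chosen only for illustration. It uses `groupby`/`combine`, which is how grouping is written in current DataFrames.jl releases in place of the older `by`. How much faster it is will depend on the DataFrames version.

```julia
using DataFrames, CategoricalArrays

# Synthetic stand-in for the table discussed in this thread (assumed layout).
n = 10^6
t = DataFrame(x = Float64.(1:n), y = [isodd(i) for i in 1:n])

# Grouping directly on the Bool column.
res_bool = combine(groupby(t, :y), :x => sum => :x_sum)

# Grouping on a categorical copy of the same column.
t.yc = categorical(t.y)
res_cat = combine(groupby(t, :yc), :x => sum => :x_sum)
```

Wrapping each `combine` call in `@time` (after a warm-up run) is a simple way to compare the two on your own data.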
Wow, that is indeed faster. 11.6 ms and 0 allocations for n=10^7, twice as fast as kdb+/q!
I don’t have a specific need for this feature, but good to know that Julia can compete with kdb+/q’s black magic.
@nalimilan
Using CategoricalArray with DataFrames’ by also produces a decent speed-up: 81 ms and 152 MB of allocations, so almost 2x faster than with a Bool array.