I already use Julia for much of my simulation and model calibration work, but am considering using it for some of the more basic data preparation I do.
This work is primarily filtering, grouping, and applying time-series joins to large tables of financial market data. I currently use kdb+/q, which is a highly refined in-memory database and query language.
I’ve started by looking at the popular DataFrames.jl package, which seems to support most of the features I would like, but a simple aggregation seems to take 6.7× longer.
Am I doing something wrong? Should I try a different Tables package?
julia> n=10^7; t=DataFrame(x=rand(10^7), y=rand(Bool,10^7));

julia> @btime by($t, :y, :x => sum)
  147.204 ms (130 allocations: 356.89 MiB)
2×2 DataFrame
│ Row │ y    │ x_sum     │
│     │ Bool │ Float64   │
├─────┼──────┼───────────┤
│ 1   │ 1    │ 2.49808e6 │
│ 2   │ 0    │ 2.50187e6 │
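For reference, recent DataFrames.jl releases replaced `by` with the `groupby`/`combine` pair; the equivalent call is below. I haven't re-benchmarked this variant, so I can't say whether it changes the timing:

```julia
using DataFrames

t = DataFrame(x = rand(10^7), y = rand(Bool, 10^7))

# Equivalent of by(t, :y, :x => sum) on newer DataFrames.jl:
# group rows by :y, then sum :x within each group.
combine(groupby(t, :y), :x => sum)
```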
(I’m using a nightly Julia build started with --optimize=3 --inline=yes --check-bounds=no --math-mode=fast.)
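To get a sense of the lower bound for this aggregation in Julia, I also tried a hand-rolled grouped sum over the same arrays, bypassing DataFrames.jl entirely (a minimal sketch; `grouped_sum` is my own helper, not a DataFrames.jl function):

```julia
# Hand-rolled grouped sum over a Bool key: accumulates the two
# group totals in a single pass, with no intermediate allocations.
function grouped_sum(x::Vector{Float64}, y::Vector{Bool})
    s_true = 0.0
    s_false = 0.0
    @inbounds for i in eachindex(x, y)
        if y[i]
            s_true += x[i]
        else
            s_false += x[i]
        end
    end
    return (true => s_true, false => s_false)
end

x = rand(10^7); y = rand(Bool, 10^7)
grouped_sum(x, y)
```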
q)n:prd 7#10; t:([]x:n?1f; y:n?0b)
q)\ts select sum x by y from t
22 134218464
kdb+/q is also using 64-bit floats; \ts reports elapsed time and space used, so the results with units are 22 ms and 134 MB of allocations.