Wanted to raise a performance issue at Query.jl but got directed here. I think Query.jl group-by is really inefficient and that’s why I never warmed to it. I use FileIO.jl but never Query.jl due to performance issues. E.g.
using Query, DataFrames
@time qa = a |>
@groupby(_.Column1) |>
@map({Count = length(_)}) |>
DataFrame;
This took 71 seconds, while the same operation in DataFrames.jl takes only 1.3 seconds; see below:
using DataFramesMeta
@time by(a, :Column1, :Column1 => length)
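For readers on a current DataFrames.jl release, where `by` is no longer available, the same count can be written with `groupby` and `combine`. This is only a sketch: the small table `a` below is a hypothetical stand-in for the real data, and the output column name `Count` is chosen to match the Query.jl version above.

```julia
using DataFrames

# Hypothetical stand-in for the real table `a`; only the grouping column matters here.
a = DataFrame(Column1 = [1, 1, 2, 3, 3, 3], Column2 = collect(1.0:6.0))

# Split-apply-combine: one row per distinct Column1 value, with the group size in Count.
counts = combine(groupby(a, :Column1), nrow => :Count)
```

Wrapping the last line in `@time` (after one warm-up call, so compilation is not measured) gives a number comparable to the timings quoted above.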
Can you still tell me which file you tried that generated the results you showed above and on which column you tried to group? I’m happy to help you write that query in a better way (the way it is written right now is far from ideal), and also explain where we are in general with performance. But I’d like to try it out before I write a response here.
Download the 7-year dataset. Unzip it and you will find a performance folder. Inside there are many files; choose the largest one, which is 2004Q3.
For a start, just download the 1-year dataset, go to the performance folder once unzipped, and read any of the files in without a header and with delim = '|'. Then run my code as presented. It groups by Column1, which is the actual name of the column.
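For anyone trying to reproduce this, here is a minimal sketch of that read step. It assumes CSV.jl rather than the FileIO.jl loader mentioned earlier, and it writes a tiny made-up pipe-delimited file instead of using the real download, so the values are purely illustrative.

```julia
using CSV, DataFrames

# Hypothetical miniature of a pipe-delimited performance file (no header row).
path = tempname()
write(path, "100|2004-07|3.5\n100|2004-08|3.5\n200|2004-07|4.0\n")

# With header = false, CSV.jl auto-names the columns Column1, Column2, ...
a = CSV.read(path, DataFrame; header = false, delim = '|')
```

That auto-naming is where the `Column1` in the group-by comes from.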
Also, I don’t get LightQuery. The syntax feels verbose, e.g. Rows, make_columns.
Also, the GitHub README is the first thing I look at, but there is no information about LightQuery.jl there. It doesn’t tell me what the package is for or how to use it.
In general, I think dplyr/LINQ-like packages should really focus on performance, especially for group-by.
@xiaodai I checked with the real data and you’re right. Do you have any idea what’s going on? I’m surprised that LightQuery can be so much faster in my benchmarks but not on real data. I’d expect increasing the number of columns to have no effect on performance, because the only work that needs to get done is counting the length of repeated values in the first column, but it seems to make a huge difference:
But given that LightQuery is so much faster for one column, and I’ve taken meticulous care to make sure that all column-wise operations are type-stable, it seems like Base is just not making the optimizations I’d hoped it would make. Some combination of not inlining and allocating views, maybe?
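To make the expectation above concrete: if the rows are already sorted by the grouping column, the count is a single pass over that one column, independent of how many other columns the table has. The plain-Julia sketch below shows that baseline. `run_lengths` is a hypothetical helper written for illustration; it is not how LightQuery, Query.jl, or DataFrames.jl implement grouping.

```julia
# Count the length of each run of equal consecutive values in one column.
# Assumes `col` is already sorted (or at least grouped) by value.
function run_lengths(col::AbstractVector{T}) where {T}
    keys = T[]
    counts = Int[]
    for v in col
        if !isempty(keys) && isequal(keys[end], v)
            counts[end] += 1      # same value as the previous row: extend the run
        else
            push!(keys, v)        # new value: start a new run
            push!(counts, 1)
        end
    end
    return keys, counts
end

keys, counts = run_lengths([1, 1, 2, 2, 2, 3])
```

Timing a loop like this on the first column alone gives a rough lower bound to compare any of the query packages against.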