Serious group-by performance issue with Query.jl

Wanted to raise a performance issue at Query.jl but got directed here. I think Query.jl group-by is really inefficient and that’s why I never warmed to it. I use FileIO.jl but never Query.jl due to performance issues. E.g.

using Query
@time qa = a |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

to 71 seconds and the same operation in DataFrames.jl is only 1.3 seconds, see below

using DataFramesMeta

@time by(a, :Column1, :Column1 => length)

The dataset I used is from here.

https://docs.rapids.ai/datasets/mortgage-data

Can I understand, if Query.jl’s group-by performance can be improved? I am asking for possibility. I wrote this post more than 1 year ago

1 Like

Which dataset exactly are you using from that site?

Any. Query.jl will have poor performance relative to DataFrames.jl on large datasets

Can you still tell me which file you tried that generated the results you showed above and on which column you tried to group? I’m happy to help you write that query in a better way (the way it is written right now is far from ideal), and also explain where we are in general with performance. But I’d like to try it out before I write a response here.

Download the 7 year dataset. Unzip it and you will a performance folder. Inside there are many files, choose the largest one which is 2004Q3.

For a start, just download the 1 year dataset, go to the performance folder once unzipped, read any of the files in without header and with delim = ‘|’. Then run my code as presented. It’s running group-by Column1, which is the actual name of the column

Could you try LightQuery? The corresponding code would be something like:

using LightQuery
@name @> a |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)))) |>
    make_columns

if pre-sorted and

using LightQuery
@name @> a |>
    named_tuple |>
    Rows |>
    order(_, :Column1)) |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)))) |>
    make_columns

if not

Well, it’s also very slow. And failed

Also, I don’t get LightQuery. The syntax feels verbose, e.g. Rows, make_columns.

Also, github README is the first thing I look at. But there is no information on LightQuery.jl there. It doesn’t tell me anything about what it’s for and how to use it.

In general, I think dplyr(LINQ)-like should really focus on performance, especially group-by.

There’s a link to the documentation from the readme…I’ll take a look and try work out the bug

Sorry missing a comma. Could you try this?

using DataFrames
a = DataFrame(Column1 = [1, 1, 1, 2, 2, 3, 3], Column2 = [1, 1, 1, 2, 2, 3, 3])

using LightQuery 
@name @> a |> 
    named_tuple |> 
    Rows |> 
    Group(By(_, :Column1)) |> 
    over(_, @_ (Count = length(value(_)),)) |> 
    make_columns

you are on 150s, twice as slow as Query.jl

Ok, thanks

@xiaodai here’s what my benchmarks show:

using DataFrames
using Query
using LightQuery
using DataFramesMeta
using Pkg
using BenchmarkTools

Column1 = vcat(
    repeat([1], 1000000),
    repeat([2], 1000000)
)

a = DataFrame(Column1 = Column1, Column2 = Column1)

println("LightQuery")
@btime @name @> a |>
  named_tuple |>
  Rows |>
  Group(By(_, :Column1)) |>
  over(_, @_ (Count = length(value(_)),)) |>
  make_columns

println("Query")
@btime a |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta")
@btime by(a, :Column1, :Column1 => length)
LightQuery
  1.329 ms (33 allocations: 1.16 KiB)
Query
  51.768 ms (134 allocations: 34.01 MiB)
DataFramesMeta
  24.013 ms (150 allocations: 61.79 MiB)

Maybe try it on real-world data? https://docs.rapids.ai/datasets/mortgage-data

There are more columns in the DataFrame and there are many more groups. Like 1 million different groups

@xiaodai I checked with the real data and you’re right. Do you have any idea what’s going on? I’m surprised that LightQuery can be so much faster for my benchmarks but not in real data. I’d expect that increasing the number of columns to have no effect on performance, because the only work that needs to get done is counting the length of repeated values in the first column, but it seems to make a huge difference:

using DataFrames
using Query
using LightQuery
using DataFramesMeta
using CSV
using BenchmarkTools

cd("/home/brandon")
run(`wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000_1gb.tgz`)
run(`tar xzvf mortgage_2000_1gb.tgz`)
cd("perf")

result = CSV.read("Performance_2000Q1.txt_0", 
  delim = '|', 
  header = Symbol.(string.("Column", 1:31)), 
  missingstrings = ["NULL", ""],
  dateformat = "mm/dd/yyyy",
  truestrings = ["Y"],
  falsestrings = ["N"]
)

result2 = result[:, 1:1]

println("LightQuery")
@btime @name @> result2 |>
  named_tuple |>
  Rows |>
  Group(By(_, :Column1)) |>
  over(_, @_ (Count = length(value(_)),)) |>
  make_columns

println("Query")
@btime result |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta")
@btime by(result, :Column1, :Column1 => length)

gives results

LightQuery
  10.740 ms (35 allocations: 3.00 MiB)
Query
  523.135 ms (1562450 allocations: 250.19 MiB)
DataFramesMeta
  273.793 ms (163 allocations: 349.35 MiB)

But given that for one column LightQuery is so much faster, and I’ve made meticulous care to make sure that all column-wise operations are type-stable, it seems like Base is just not making the optimizations I’d hope it would make. Some combination of not inlining and allocating views, maybe?

group-by is generally quite hard. I wrote a few articles about them.

Have you tried R’s dplyr on this data? If so how does that fare?

Good question. Just tried.

DataFrames.jl is similar to dplyr but slower than data.table which has some optimization

image

1 Like

That’s why I’ve been saying Julia’s data ecosystem now is pretty good.

1 Like

If you are iterating named tuples, I suspect there can be two possible issues:

  • too many fields and the compiler gives up
  • named tuple mixed with missing don’t behave well at the moment
1 Like