Serious group-by performance issue with Query.jl

xiaodai · September 23, 2019, 12:38pm

Wanted to raise a performance issue at Query.jl but got directed here. I think Query.jl group-by is really inefficient and that’s why I never warmed to it. I use FileIO.jl but never Query.jl due to performance issues. E.g.

using Query
@time qa = a |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

to 71 seconds and the same operation in DataFrames.jl is only 1.3 seconds, see below

using DataFramesMeta

@time by(a, :Column1, :Column1 => length)

The dataset I used is from here.

https://docs.rapids.ai/datasets/mortgage-data

Can I understand, if Query.jl’s group-by performance can be improved? I am asking for possibility. I wrote this post more than 1 year ago

davidanthoff · September 23, 2019, 4:07pm

Which dataset exactly are you using from that site?

xiaodai · September 23, 2019, 9:40pm

Any. Query.jl will have poor performance relative to DataFrames.jl on large datasets

davidanthoff · September 23, 2019, 9:53pm

Can you still tell me which file you tried that generated the results you showed above and on which column you tried to group? I’m happy to help you write that query in a better way (the way it is written right now is far from ideal), and also explain where we are in general with performance. But I’d like to try it out before I write a response here.

xiaodai · September 23, 2019, 10:01pm

Download the 7 year dataset. Unzip it and you will a performance folder. Inside there are many files, choose the largest one which is 2004Q3.

For a start, just download the 1 year dataset, go to the performance folder once unzipped, read any of the files in without header and with delim = ‘|’. Then run my code as presented. It’s running group-by Column1, which is the actual name of the column

bramtayl · September 23, 2019, 11:36pm

Could you try LightQuery? The corresponding code would be something like:

using LightQuery
@name @> a |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)))) |>
    make_columns

if pre-sorted and

using LightQuery
@name @> a |>
    named_tuple |>
    Rows |>
    order(_, :Column1)) |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)))) |>
    make_columns

if not

xiaodai · September 23, 2019, 11:45pm

Well, it’s also very slow. And failed

Also, I don’t get LightQuery. The syntax feels verbose, e.g. Rows, make_columns.

Also, github README is the first thing I look at. But there is no information on LightQuery.jl there. It doesn’t tell me anything about what it’s for and how to use it.

In general, I think dplyr(LINQ)-like should really focus on performance, especially group-by.

bramtayl · September 24, 2019, 1:29pm

There’s a link to the documentation from the readme…I’ll take a look and try work out the bug

bramtayl · September 24, 2019, 1:39pm

Sorry missing a comma. Could you try this?

using DataFrames
a = DataFrame(Column1 = [1, 1, 1, 2, 2, 3, 3], Column2 = [1, 1, 1, 2, 2, 3, 3])

using LightQuery 
@name @> a |> 
    named_tuple |> 
    Rows |> 
    Group(By(_, :Column1)) |> 
    over(_, @_ (Count = length(value(_)),)) |> 
    make_columns

xiaodai · September 25, 2019, 10:36am

you are on 150s, twice as slow as Query.jl

bramtayl · September 25, 2019, 8:47pm

Ok, thanks

bramtayl · September 26, 2019, 7:07am

@xiaodai here’s what my benchmarks show:

using DataFrames
using Query
using LightQuery
using DataFramesMeta
using Pkg
using BenchmarkTools

Column1 = vcat(
    repeat([1], 1000000),
    repeat([2], 1000000)
)

a = DataFrame(Column1 = Column1, Column2 = Column1)

println("LightQuery")
@btime @name @> a |>
  named_tuple |>
  Rows |>
  Group(By(_, :Column1)) |>
  over(_, @_ (Count = length(value(_)),)) |>
  make_columns

println("Query")
@btime a |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta")
@btime by(a, :Column1, :Column1 => length)

LightQuery
  1.329 ms (33 allocations: 1.16 KiB)
Query
  51.768 ms (134 allocations: 34.01 MiB)
DataFramesMeta
  24.013 ms (150 allocations: 61.79 MiB)

xiaodai · September 26, 2019, 7:14am

Maybe try it on real-world data? https://docs.rapids.ai/datasets/mortgage-data

There are more columns in the DataFrame and there are many more groups. Like 1 million different groups

bramtayl · September 26, 2019, 10:04am

@xiaodai I checked with the real data and you’re right. Do you have any idea what’s going on? I’m surprised that LightQuery can be so much faster for my benchmarks but not in real data. I’d expect that increasing the number of columns to have no effect on performance, because the only work that needs to get done is counting the length of repeated values in the first column, but it seems to make a huge difference:

using DataFrames
using Query
using LightQuery
using DataFramesMeta
using CSV
using BenchmarkTools

cd("/home/brandon")
run(`wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000_1gb.tgz`)
run(`tar xzvf mortgage_2000_1gb.tgz`)
cd("perf")

result = CSV.read("Performance_2000Q1.txt_0", 
  delim = '|', 
  header = Symbol.(string.("Column", 1:31)), 
  missingstrings = ["NULL", ""],
  dateformat = "mm/dd/yyyy",
  truestrings = ["Y"],
  falsestrings = ["N"]
)

result2 = result[:, 1:1]

println("LightQuery")
@btime @name @> result2 |>
  named_tuple |>
  Rows |>
  Group(By(_, :Column1)) |>
  over(_, @_ (Count = length(value(_)),)) |>
  make_columns

println("Query")
@btime result |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta")
@btime by(result, :Column1, :Column1 => length)

gives results

LightQuery
  10.740 ms (35 allocations: 3.00 MiB)
Query
  523.135 ms (1562450 allocations: 250.19 MiB)
DataFramesMeta
  273.793 ms (163 allocations: 349.35 MiB)

bramtayl · September 26, 2019, 10:14am

But given that for one column LightQuery is so much faster, and I’ve made meticulous care to make sure that all column-wise operations are type-stable, it seems like Base is just not making the optimizations I’d hope it would make. Some combination of not inlining and allocating views, maybe?

xiaodai · September 26, 2019, 10:54am

group-by is generally quite hard. I wrote a few articles about them.

DoktorMike · September 26, 2019, 3:06pm

Have you tried R’s dplyr on this data? If so how does that fare?

xiaodai · September 26, 2019, 3:13pm

Good question. Just tried.

DataFrames.jl is similar to dplyr but slower than data.table which has some optimization

xiaodai · September 26, 2019, 3:13pm

That’s why I’ve been saying Julia’s data ecosystem now is pretty good.

piever · September 26, 2019, 5:11pm

If you are iterating named tuples, I suspect there can be two possible issues:

too many fields and the compiler gives up
named tuple mixed with missing don’t behave well at the moment

Topic		Replies	Views
ANN: LightQuery tutorial Package Announcements tutorials	43	3140	July 11, 2019
Julia performs poorly on group-by benchmarks Data performance	48	5800	January 23, 2019
A minor group-by benchmark - DataFrames.jl plenty fast General Usage	5	466	August 27, 2020
How is the data ecosystem right now for large datasets? Data	35	6696	July 13, 2017
The state of DataFrames.jl H2O benchmark Package Announcements dataframes	53	9367	January 1, 2025

Serious group-by performance issue with Query.jl

Related topics