Serious group-by performance issue with Query.jl

There’s a link to the documentation from the README… I’ll take a look and try to work out the bug

Sorry, I was missing a comma. Could you try this?

using DataFrames
a = DataFrame(Column1 = [1, 1, 1, 2, 2, 3, 3], Column2 = [1, 1, 1, 2, 2, 3, 3])

using LightQuery 
@name @> a |> 
    named_tuple |> 
    Rows |> 
    Group(By(_, :Column1)) |> 
    over(_, @_ (Count = length(value(_)),)) |> 
    make_columns

With this, LightQuery is at 150 s for me, twice as slow as Query.jl

Ok, thanks

@xiaodai here’s what my benchmarks show:

using DataFrames
using Query
using LightQuery
using DataFramesMeta
using Pkg
using BenchmarkTools

Column1 = vcat(
    repeat([1], 1000000),
    repeat([2], 1000000)
)

a = DataFrame(Column1 = Column1, Column2 = Column1)

println("LightQuery")
@btime @name @> a |>
  named_tuple |>
  Rows |>
  Group(By(_, :Column1)) |>
  over(_, @_ (Count = length(value(_)),)) |>
  make_columns

println("Query")
@btime a |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta")
@btime by(a, :Column1, :Column1 => length)

gives results

LightQuery
  1.329 ms (33 allocations: 1.16 KiB)
Query
  51.768 ms (134 allocations: 34.01 MiB)
DataFramesMeta
  24.013 ms (150 allocations: 61.79 MiB)

Maybe try it on real-world data? https://docs.rapids.ai/datasets/mortgage-data

There are more columns in the DataFrame, and there are many more groups: around 1 million distinct groups

@xiaodai I checked with the real data and you’re right. Do you have any idea what’s going on? I’m surprised that LightQuery can be so much faster for my benchmarks but not on real data. I’d expect increasing the number of columns to have no effect on performance, because the only work that needs to be done is counting the length of repeated values in the first column, but it seems to make a huge difference:

using DataFrames
using Query
using LightQuery
using DataFramesMeta
using CSV
using BenchmarkTools

cd("/home/brandon")
run(`wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000_1gb.tgz`)
run(`tar xzvf mortgage_2000_1gb.tgz`)
cd("perf")

result = CSV.read("Performance_2000Q1.txt_0", 
  delim = '|', 
  header = Symbol.(string.("Column", 1:31)), 
  missingstrings = ["NULL", ""],
  dateformat = "mm/dd/yyyy",
  truestrings = ["Y"],
  falsestrings = ["N"]
)

result2 = result[:, 1:1]

println("LightQuery")
@btime @name @> result2 |>
  named_tuple |>
  Rows |>
  Group(By(_, :Column1)) |>
  over(_, @_ (Count = length(value(_)),)) |>
  make_columns

println("Query")
@btime result |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta")
@btime by(result, :Column1, :Column1 => length)

gives results

LightQuery
  10.740 ms (35 allocations: 3.00 MiB)
Query
  523.135 ms (1562450 allocations: 250.19 MiB)
DataFramesMeta
  273.793 ms (163 allocations: 349.35 MiB)

But given that LightQuery is so much faster for one column, and I’ve taken meticulous care to make sure that all column-wise operations are type-stable, it seems like Base is just not making the optimizations I’d hoped it would make. Some combination of not inlining and allocating views, maybe?
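One generic way to check where inference gives up (a sketch using only Base, not LightQuery internals; `f` here is a hypothetical per-row function, not from the package):

```julia
# Base.return_types reports the compiler's inferred return type for a call.
# A concrete type means inference succeeded; `Any` means it gave up.
f(row) = row.Column1 + 1

Base.return_types(f, (NamedTuple{(:Column1,), Tuple{Int}},))
# a one-element vector; Int64 if inference succeeds

# @code_warntype f((Column1 = 1,)) highlights non-concrete types in red
```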

Group-by is generally quite hard. I’ve written a few articles about it.
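To illustrate the point, here is a rough sketch (hypothetical helper functions, not from any of the packages above) of the two standard counting strategies a group-by can use:

```julia
# Hash-based: a single pass over unsorted keys, at the cost of a Dict.
function count_groups_hash(keys)
    counts = Dict{eltype(keys), Int}()
    for k in keys
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

# Run-based: assumes the keys are already sorted (as LightQuery's
# Group(By(...)) requires), so it only measures the length of each run.
function count_groups_runs(keys)
    counts = Pair{eltype(keys), Int}[]
    for k in keys
        if !isempty(counts) && first(counts[end]) == k
            counts[end] = k => last(counts[end]) + 1
        else
            push!(counts, k => 1)
        end
    end
    return counts
end
```

The run-based version does no hashing and almost no allocation, which is one reason sorted-input group-bys can be so much faster, but it only works when the data arrives grouped.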

Have you tried R’s dplyr on this data? If so, how does it fare?

Good question. Just tried.

DataFrames.jl is similar in speed to dplyr, but slower than data.table, which has some optimizations

[image: benchmark results comparing dplyr, data.table, and DataFrames.jl]


That’s why I’ve been saying Julia’s data ecosystem is now pretty good.


If you are iterating named tuples, I suspect there can be two possible issues:

  • too many fields and the compiler gives up
  • named tuples mixed with missing values don’t behave well at the moment

@piever I tried adding @inline annotations to every function in LightQuery and it didn’t seem to help. I’m just going to give up and punt the problem to Base. I don’t think there’s anything in the code I’ve written that requires paying attention to any column except the first one. It seems like I fall off a performance cliff at 5 columns or more.

@xiaodai @piever Through careful inlining I’ve been able to push the performance cliff out to 30 columns. I’m guessing that’s hitting the limits of inference’s ability to infer tuples.

using DataFrames
using Query
using LightQuery
using DataFramesMeta
using CSV
using BenchmarkTools

cd("/home/brandon")
run(`wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000_1gb.tgz`)
run(`tar xzvf mortgage_2000_1gb.tgz`)
cd("perf")

result = CSV.read("Performance_2000Q1.txt_0",
  delim = '|',
  header = Symbol.(string.("Column", 1:31)),
  missingstrings = ["NULL", ""],
  dateformat = "mm/dd/yyyy",
  truestrings = ["Y"],
  falsestrings = ["N"]
)

println("LightQuery")
@btime @name @> result |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns

println("Query")
@btime result |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta")
@btime by(result, :Column1, :Column1 => length)

result2 = result[:, 1:30]

println("LightQuery 30")
@btime @name @> result2 |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns

println("Query 30")
@btime result2 |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta 30")
@btime by(result2, :Column1, :Column1 => length)

gives results

LightQuery
  63.396 s (171558858 allocations: 3.65 GiB)
Query
  8.062 s (47035866 allocations: 5.12 GiB)
DataFramesMeta
  405.624 ms (193 allocations: 349.36 MiB)
LightQuery 30
  168.517 ms (73 allocations: 3.00 MiB)
Query 30
  8.197 s (47035866 allocations: 4.72 GiB)
DataFramesMeta 30
  316.825 ms (192 allocations: 349.36 MiB)

@xiaodai I’ve discovered the problem disappears when I copy the data. When I copy, the columns turn from CSV.Column into a mix of Arrays and PooledArrays. So I’m now thinking the problem might be related to CSV? @quinnj

result2 = copy(result)

@btime @name @> result2 |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns

@xiaodai I’ve updated LightQuery with a test/benchmarks.jl file. I think the benchmarks are fair, and they don’t require copying. The results are below:

LightQuery
  484.950 ms (9527587 allocations: 366.02 MiB)
DataFrames
  3.443 s (27283614 allocations: 2.75 GiB)

The fastest option in R is the package data.table

Here are some random comments on the Query.jl situation:

  • One thing that will help is to not put all columns into the groups, in particular if all one is doing is counting the number of rows per group. You would modify the group clause to read @groupby(_.Column1, _.Column2) for example. So in this case the groups are now just arrays of the content of column 2, rather than arrays of named tuples of all columns.
  • I had the same experience that @bramtayl had with LightQuery.jl: if you read the file with CSV.read, you get a custom column type in your DataFrame, and then queries become quite a bit slower than with a “normal” DataFrame. You can either make a copy of the DataFrame, as @bramtayl suggested, read the data with CSV.jl and the copycols=true argument, or use CSVFiles.jl for reading if you want to avoid that issue. The latter always returns standard array types for the columns.
  • I did some tests on whether things go off a cliff beyond a certain number of columns. At some point I thought they did, then I couldn’t really replicate it, and then I ran out of time to dig in further :) So I’d say I’m just not sure right now.
  • The long-term plan for Query.jl is to augment the current row-based processing backend with a column-based processing backend that also includes a full query optimizer, more like what you have in a database. Don’t expect any movement on that anytime soon, though; my best guess is that I might be able to start working on it next summer. But the core design of Query.jl has been carefully done from the get-go to eventually allow something like that, because the current row-iterator-based processing approach is obviously not always the best option (and for a query like the one you showed, really quite bad).
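As a concrete sketch of the first point, the count query from earlier could pass an element selector so each group holds only Column2 values rather than full rows (a small made-up DataFrame, following the Query.jl pipe syntax used above):

```julia
using DataFrames
using Query

a = DataFrame(Column1 = [1, 1, 2, 2, 2], Column2 = [10, 20, 30, 40, 50])

# The second argument to @groupby selects what goes *into* each group,
# so the groups are plain arrays of Column2 values, not named tuples
# carrying every column.
a |>
  @groupby(_.Column1, _.Column2) |>
  @map({Count = length(_)}) |>
  DataFrame
```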

Hmmm, row-based designs feel anachronistic for modern data workloads (most new data systems are column-based). It seems a curious design decision.

I do think that a row-based logical data model is the right conceptual choice, and then the easiest and most obvious implementation is a Volcano-style implementation. That is where we are right now, but I designed Query.jl from the beginning so that I can eventually add a column-based processing engine. Building that kind of stuff takes time, though, so at least for Query.jl we’ll just have to be patient.

There are also benefits to the row-based story: for some queries it is the best strategy, and it allows a very natural extension from queries over a purely tabular data model to queries over arbitrary data structures, a core design goal that Query.jl copied from LINQ and that I think is a natural match with Julia’s type system.