Serious group-by performance issue with Query.jl

@piever I tried adding @inline annotations to every function in LightQuery and it didn’t seem to help. I’m just going to give up and punt the problem to base. I don’t think there’s anything in the code I’ve written that requires paying attention to any column except the first one. Seems like I fall off a performance cliff with 5 columns or more.

@xiaodai @piever Through careful inlining I’ve been able to push the performance cliff out to 30 columns. I’m guessing that’s reaching the limits of inference’s ability to handle wide tuples.
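For a rough sense of the shapes involved (a hypothetical sketch, not LightQuery’s internals): each row becomes a named tuple with one field per column, so inference has to track all 31 field types even though only the first column matters for the grouping.

```julia
# Hypothetical sketch: one row as a wide NamedTuple, one field per column.
# Inference must track all 31 field types, though grouping needs only one.
names = Tuple(Symbol.("Column", 1:31))
row = NamedTuple{names}(ntuple(identity, 31))

row.Column1  # the only field the group-by key actually touches
```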

using DataFrames
using Query
using LightQuery
using DataFramesMeta
using CSV
using BenchmarkTools

cd("/home/brandon")
run(`wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000_1gb.tgz`)
run(`tar xzvf mortgage_2000_1gb.tgz`)
cd("perf")

result = CSV.read("Performance_2000Q1.txt_0",
  delim = '|',
  header = Symbol.(string.("Column", 1:31)),
  missingstrings = ["NULL", ""],
  dateformat = "mm/dd/yyyy",
  truestrings = ["Y"],
  falsestrings = ["N"]
)

println("LightQuery")
@btime @name @> result |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns

println("Query")
@btime result |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta")
@btime by(result, :Column1, :Column1 => length)

result2 = result[:, 1:30]

println("LightQuery 30")
@btime @name @> result2 |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns

println("Query 30")
@btime result2 |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta 30")
@btime by(result2, :Column1, :Column1 => length)
LightQuery
  63.396 s (171558858 allocations: 3.65 GiB)
Query
  8.062 s (47035866 allocations: 5.12 GiB)
DataFramesMeta
  405.624 ms (193 allocations: 349.36 MiB)
LightQuery 30
  168.517 ms (73 allocations: 3.00 MiB)
Query 30
  8.197 s (47035866 allocations: 4.72 GiB)
DataFramesMeta 30
  316.825 ms (192 allocations: 349.36 MiB)

@xiaodai I’ve discovered the problem disappears when I copy the data. When I copy the data, the columns turn from CSV.Columns into a mix of Arrays and PooledArrays. So I’m thinking now that the problem might be related to CSV? @quinnj

result2 = copy(result)

@btime @name @> result2 |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns

@xiaodai I’ve updated LightQuery with a /test/benchmarks.jl file. I think the benchmarks are fair, and they don’t require copying. The results are below:

LightQuery
  484.950 ms (9527587 allocations: 366.02 MiB)
DataFrames
  3.443 s (27283614 allocations: 2.75 GiB)

The fastest option in R is the data.table package.

Here are some random comments on the Query.jl situation:

  • One thing that will help is to not put all columns into the groups, in particular if all one is doing is counting the number of rows per group. You would modify the group clause to read @groupby(_.Column1, _.Column2) for example. So in this case the groups are now just arrays of the content of column 2, rather than arrays of named tuples of all columns.
  • I had the same experience that @bramtayl had with LightQuery.jl: if you read the file with CSV.read, you get a custom column type in your DataFrame, and then queries become quite a bit slower than with a “normal” DataFrame. You can either make a copy of the DataFrame, like @bramtayl suggested, or read the data with CSV.jl and the copycols=true argument, or use CSVFiles.jl to read if you want to avoid that issue. The latter always returns just standard array types for the columns.
  • I did some tests to see whether things go off a cliff beyond a certain number of columns. At some point I thought they did, then I couldn’t really replicate it, and then I ran out of time to dig in further :) So, I’d say I’m just not sure right now.
  • The long-term plan for Query.jl is to augment the current row based processing backend with a column based processing backend that also includes a full query optimizer, more like what you have in a database. Don’t expect any movement on that anytime soon, though; my best guess is that I might be able to start working on it next summer. But the core design of Query.jl has been carefully done from the get-go to eventually allow something like that, because the current row iterator based processing approach is obviously not always the best option (and for a query like the one you showed, it is really quite bad).
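The first bullet’s point can be sketched in plain Julia (hypothetical data, not Query.jl’s internals): when all you need is a count per group, collecting only one selected column per group is much lighter than collecting whole-row named tuples.

```julia
# Hedged sketch: group-and-count on hypothetical data.
rows = [(Column1 = c1, Column2 = c2) for (c1, c2) in
        zip(["x", "y", "x", "x"], 1:4)]

# Grouping whole rows keeps every column alive inside each group...
full_groups = Dict{String,Vector{eltype(rows)}}()
for r in rows
    push!(get!(full_groups, r.Column1, eltype(rows)[]), r)
end

# ...whereas grouping just one selected column (as in
# @groupby(_.Column1, _.Column2)) stores far less per group.
slim_groups = Dict{String,Vector{Int}}()
for r in rows
    push!(get!(slim_groups, r.Column1, Int[]), r.Column2)
end

counts = Dict(k => length(v) for (k, v) in slim_groups)
```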

Hmmm, row based designs feel anachronistic for modern data workloads (most new data systems are column based). Seems a curious design decision.

I do think that a row based logical data model is the right conceptual choice. And then the easiest and most obvious implementation is a volcano style implementation. That is where we are right now, but I designed Query from the beginning such that I can eventually add a column based processing engine. But building that kind of stuff takes time, and so at least for Query.jl we’ll just have to be patient.

There are also benefits to a row based story: for some queries it is the best strategy, and it allows a very natural extension from queries over a purely tabular data model to queries over arbitrary data structures, a core design goal that Query.jl copied from LINQ and that I think is a natural match with Julia’s type system.
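The row based versus column based distinction can be illustrated with a minimal Base-only sketch (hypothetical data; neither snippet is Query.jl’s actual engine): the same group-count computed by iterating rows volcano-style, and by operating on the key column as a whole.

```julia
# Hedged sketch: the same group-count in two processing styles.
rows = [(key = k, val = v) for (k, v) in zip(repeat(["a", "b"], 3), 1:6)]

# Row based (volcano-style): pull one row at a time, update state.
row_counts = Dict{String,Int}()
for r in rows
    row_counts[r.key] = get(row_counts, r.key, 0) + 1
end

# Column based: extract the key column, then aggregate it wholesale.
key_col = getfield.(rows, :key)
col_counts = Dict(k => count(==(k), key_col) for k in unique(key_col))
```

Both yield the same answer; the difference is in memory traffic and how well each style suits a given query, which is the trade-off discussed above.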