Serious group-by performance issue with Query.jl

@piever I tried adding @inline annotations to every function in LightQuery and it didn’t seem to help. I’m just going to give up and punt the problem to base. I don’t think there’s anything in the code I’ve written that requires paying attention to any column except the first one. Seems like I fall off a performance cliff with 5 columns or more.

@xiaodai @piever Through careful inlining I’ve been able to push the performance cliff out to 30 columns. I’m guessing that’s reaching the limits of inference’s ability to handle wide tuples.
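For a rough sense of the shapes involved (a hypothetical sketch, not LightQuery’s internals): each row becomes a named tuple with one field per column, so inference has to track all 31 field types even though only the first column matters for the grouping.

```julia
# Hypothetical sketch: one row as a wide NamedTuple, one field per column.
# Inference must track all 31 field types, though grouping needs only one.
names = Tuple(Symbol.("Column", 1:31))
row = NamedTuple{names}(ntuple(identity, 31))

row.Column1  # the only field the group-by key actually touches
```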

using DataFrames
using Query
using LightQuery
using DataFramesMeta
using CSV
using BenchmarkTools

cd("/home/brandon")
run(`wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000_1gb.tgz`)
run(`tar xzvf mortgage_2000_1gb.tgz`)
cd("perf")

result = CSV.read("Performance_2000Q1.txt_0",
  delim = '|',
  header = Symbol.(string.("Column", 1:31)),
  missingstrings = ["NULL", ""],
  dateformat = "mm/dd/yyyy",
  truestrings = ["Y"],
  falsestrings = ["N"]
)

println("LightQuery")
@btime @name @> result |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns

println("Query")
@btime result |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta")
@btime by(result, :Column1, :Column1 => length)

result2 = result[:, 1:30]

println("LightQuery 30")
@btime @name @> result2 |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns

println("Query 30")
@btime result2 |>
  @groupby(_.Column1) |>
  @map({Count = length(_)}) |>
  DataFrame;

println("DataFramesMeta 30")
@btime by(result2, :Column1, :Column1 => length)
LightQuery
  63.396 s (171558858 allocations: 3.65 GiB)
Query
  8.062 s (47035866 allocations: 5.12 GiB)
DataFramesMeta
  405.624 ms (193 allocations: 349.36 MiB)
LightQuery 30
  168.517 ms (73 allocations: 3.00 MiB)
Query 30
  8.197 s (47035866 allocations: 4.72 GiB)
DataFramesMeta 30
  316.825 ms (192 allocations: 349.36 MiB)

@xiaodai I’ve discovered the problem disappears when I copy the data. When I copy the data, the columns turn from CSV.Columns into a mix of Arrays and PooledArrays. So I’m thinking now that the problem might be related to CSV? @quinnj

result2 = copy(result)

@btime @name @> result2 |>
    named_tuple |>
    Rows |>
    Group(By(_, :Column1)) |>
    over(_, @_ (Count = length(value(_)),)) |>
    make_columns

@xiaodai I’ve updated LightQuery with a /test/benchmarks.jl file. I think the benchmarks are fair, and they don’t require copying. The results are below:

LightQuery
  484.950 ms (9527587 allocations: 366.02 MiB)
DataFrames
  3.443 s (27283614 allocations: 2.75 GiB)

The fastest option in R is the data.table package.

Here are some random comments on the Query.jl situation:

  • One thing that will help is to not put all columns into the groups, in particular if all one is doing is counting the number of rows per group. You would modify the group clause to read @groupby(_.Column1, _.Column2) for example. So in this case the groups are now just arrays of the content of column 2, rather than arrays of named tuples of all columns.
  • I had the same experience that @bramtayl had with LightQuery.jl: if you read the file with CSV.read, you get a custom column type in your DataFrame, and then queries become quite a bit slower than with a “normal” DataFrame. You can either make a copy of the DataFrame, like @bramtayl suggested, or read the data with CSV.jl and the copycols=true argument, or use CSVFiles.jl to read if you want to avoid that issue. The latter always returns just standard array types for the columns.
  • I did some tests to see whether things go off a cliff beyond a certain number of columns. At some point I thought they did, then I couldn’t really replicate it, and then I ran out of time to dig in further :) So, I’d say I’m just not sure right now.
  • The long-term plan for Query.jl is to augment the current row based processing backend with a column based processing backend that also includes a full query optimizer, more like what you have in a database. Don’t expect any movement on that anytime soon, though; my best guess is that I might be able to start working on it next summer. But the core design of Query.jl has been carefully done from the get-go to eventually allow something like that, because the current row iterator based processing approach is obviously not always the best option (and for a query like the one you showed, it is really quite bad).
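The first bullet’s point can be sketched in plain Julia (hypothetical data, not Query.jl’s internals): when all you need is a count per group, collecting only one selected column per group is much lighter than collecting whole-row named tuples.

```julia
# Hedged sketch: group-and-count on hypothetical data.
rows = [(Column1 = c1, Column2 = c2) for (c1, c2) in
        zip(["x", "y", "x", "x"], 1:4)]

# Grouping whole rows keeps every column alive inside each group...
full_groups = Dict{String,Vector{eltype(rows)}}()
for r in rows
    push!(get!(full_groups, r.Column1, eltype(rows)[]), r)
end

# ...whereas grouping just one selected column (as in
# @groupby(_.Column1, _.Column2)) stores far less per group.
slim_groups = Dict{String,Vector{Int}}()
for r in rows
    push!(get!(slim_groups, r.Column1, Int[]), r.Column2)
end

counts = Dict(k => length(v) for (k, v) in slim_groups)
```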

Hmmm, row based designs feel anachronistic for modern data workloads (most new data systems are column based). Seems a curious design decision.

I do think that a row based logical data model is the right conceptual choice. And then the easiest and most obvious implementation is a volcano style implementation. That is where we are right now, but I designed Query from the beginning such that I can eventually add a column based processing engine. But building that kind of stuff takes time, and so at least for Query.jl we’ll just have to be patient.

There are also benefits to a row based story: for some queries it is the best strategy, and it allows a very natural extension from queries over a purely tabular data model to queries over arbitrary data structures, a core design goal that Query.jl copied from LINQ and that I think is a natural match with Julia’s type system.
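The row based versus column based distinction can be illustrated with a minimal Base-only sketch (hypothetical data; neither snippet is Query.jl’s actual engine): the same group-count computed by iterating rows volcano-style, and by operating on the key column as a whole.

```julia
# Hedged sketch: the same group-count in two processing styles.
rows = [(key = k, val = v) for (k, v) in zip(repeat(["a", "b"], 3), 1:6)]

# Row based (volcano-style): pull one row at a time, update state.
row_counts = Dict{String,Int}()
for r in rows
    row_counts[r.key] = get(row_counts, r.key, 0) + 1
end

# Column based: extract the key column, then aggregate it wholesale.
key_col = getfield.(rows, :key)
col_counts = Dict(k => count(==(k), key_col) for k in unique(key_col))
```

Both yield the same answer; the difference is in memory traffic and how well each style suits a given query, which is the trade-off discussed above.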