I’m trying to use Query.jl for some basic filtering, but I couldn’t get a result even after a very long wait. After reducing the number of columns I do get output, but it is still much slower than plain DataFrame filtering and scales badly with the number of columns (exponentially?). The issue is easily reproducible:
using DataFrames, Query
a = Vector{Union{Missing, Float64}}(rand(1_000_000))
kws = Dict(Symbol("a$i") => copy(a) for i in 1:XX) # <<---- here
df = DataFrame(; kws...)
display(df |> size)
df = df |> @filter(_.a1.hasvalue)
display(df)
display(df |> DataFrame |> size)
With 20 in place of XX at the marked line, I get the result after ~40 seconds; with 30 or more, I get no result even after a long time. Memory usage also gets very high, in the tens of gigabytes for larger column counts.
The issue is possibly related to the handling of Missing, as I don’t experience the slowdown with _.a1 > 0 instead of _.a1.hasvalue, for example. However, I’m not sure; there may be other, more general reasons.
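For comparison, here is a minimal sketch of the "plain DataFrame filtering" baseline mentioned above. It assumes a recent DataFrames.jl (1.x). The row count is reduced, and a few missing values are added by hand so that the filter has something to drop (both are my assumptions, not part of the original reproduction).

```julia
using DataFrames

# Smaller than the original 1_000_000 rows so the sketch runs quickly.
nrows, ncols = 1_000, 30

a = Vector{Union{Missing, Float64}}(rand(nrows))
a[1:10] .= missing  # assumption: the original data contains no actual missings

# A vector of pairs keeps the column order, unlike the Dict in the first post.
df = DataFrame([Symbol("a$i") => copy(a) for i in 1:ncols])

# Two equivalent ways to keep only the rows where a1 is present.
kept = dropmissing(df, :a1)            # also narrows a1 to Float64
same = df[.!ismissing.(df.a1), :]      # keeps the Union{Missing, Float64} eltype
```

Wrapping either line in @time with the original sizes should give a reference point for how long the Query.jl version ought to take.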
So is there a way to test this fix now, before the final release of that version? The bug and fix descriptions don’t look related to my issue: it seems that an additional copy should introduce only a relatively minor slowdown, whereas the code in the first post is slower than expected by many orders of magnitude.
You can get the master branch of Tables.jl via pkg> add Tables#master. Just make sure you free Tables once a new version has been released.
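The same two steps can also be done from a script through the Pkg API instead of the pkg> prompt. This is only a sketch: the helper names track_tables_master and release_tables are made up for illustration, and wrapping the calls in functions means nothing in your environment changes until you call them.

```julia
using Pkg

# Hypothetical helper names; only Pkg.add, PackageSpec and Pkg.free come from Pkg.

# Equivalent of `pkg> add Tables#master`: track the master branch.
track_tables_master() = Pkg.add(PackageSpec(name = "Tables", rev = "master"))

# Equivalent of `pkg> free Tables`: go back to registered releases.
release_tables() = Pkg.free("Tables")
```

Call track_tables_master() to test the fix now, and release_tables() once the new version has been tagged.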
I found the unnecessary copy while trying to diagnose a slowdown as extreme as the one you found. Removing the copy got rid of that slowdown, and on my system it also fixes your problem. There is clearly something else broken somewhere, but removing the copy seems to be enough to avoid the problematic code path for Query (and the copy shouldn’t happen in any case).