Large dataframe: fast row selection

I have a very large dataframe and would like to know the fastest way to select a subset of rows. I tried the Query package; it is elegant but really slow compared to an ugly brute-force solution. Any hints? Thank you, much appreciated.

Example

using DataFrames, Query

const N = 100_000_000;
const min_age, max_age = 20, 60;
const min_year, max_year = 1980, 2010;
const J = 9;
df = DataFrame(age=rand(min_age:max_age, N),
               year=rand(min_year:max_year, N),
               jx=rand(1:J, N));

const min_age2, max_age2 = 30, 40;
const min_year2, max_year2 = 1990, 2000;

@time df1 = df[(df[:age] .>= min_age2) .*
               (df[:age] .<= max_age2) .*
               (df[:year] .>= min_year2) .*
               (df[:year] .<= max_year2) .*
               (df[:jx] .< J), :];

1.372216 seconds (20.77 k allocations: 206.762 MiB)

@time ds1 = @from i in df begin
    @where i.age >= min_age2 && i.age <= max_age2 && i.year >= min_year2 && i.year <= max_year2 && i.jx < J
    @select {i.year, i.age, i.jx}
    @collect DataFrame
end

14.201888 seconds (100.02 M allocations: 1.869 GiB, 5.57% gc time)

DataFramesMeta is as fast as the plain DataFrames version, but more compact:

using DataFramesMeta
df1 = @where(df, min_age2 .<= :age .<= max_age2, min_year2 .<= :year .<= max_year2, :jx .< J)

(BTW, note that you can combine minimum and maximum conditions even without DataFramesMeta.)
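For instance, with the same columns and constants as above, something along these lines should work with plain DataFrames indexing (just a sketch; the chained .<= comparisons broadcast, and .& combines the element-wise masks):

df1 = df[(min_age2 .<= df[:age] .<= max_age2) .&
         (min_year2 .<= df[:year] .<= max_year2) .&
         (df[:jx] .< J), :];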

One thing that you can do to speed things up is to index the dataframe using row numbers instead of logical indices. I’m not sure why, but I find that if you have a BitArray of the rows you want to keep, you get better performance by applying findall to it before indexing the dataframe. In your example, the comparison would be:

using BenchmarkTools

@btime df1 = @where($df, min_age2 .<= :age .<= max_age2,
                    min_year2 .<= :year .<= max_year2,
                    :jx .< J);
# 1.508 s (59 allocations: 238.46 MiB)

@btime begin
    idx = ($df[:age] .>= min_age2) .*
          ($df[:age] .<= max_age2) .*
          ($df[:year] .>= min_year2) .*
          ($df[:year] .<= max_year2) .*
          ($df[:jx] .< J);
    df1 = $df[findall(idx), :]
end;
# 963.606 ms (66 allocations: 279.15 MiB)

That still leaves you with the problem of the ugly syntax. I like to use map for problems like this:

@btime df1 = $df[map($df[:age], $df[:year], $df[:jx]) do age, year, jx
    min_age2 <= age <= max_age2    || return false
    min_year2 <= year <= max_year2 || return false
    jx < J                         || return false
    return true
end |> findall, :];
# 947.216 ms (49 allocations: 362.59 MiB)

It’s not necessarily less code to type, but I think that handling each condition with an explicit early return makes it a lot more readable.

Good catch. This is probably because the getindex method for DataFrame calls getindex on each column, which needs to compute the number and position of true entries repeatedly (via the internal LogicalIndex type). We should try two alternative solutions:

  • create a LogicalIndex once and pass it to getindex to avoid recomputing the number of entries
  • call findall and pass a vector of indices to getindex to also avoid recomputing the position of entries

The second approach is equivalent to what you are doing. It has the drawback that a new vector needs to be allocated, but that’s not a big deal given that we already have to allocate new columns with the same length.
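To illustrate the difference outside of DataFrames, here is a minimal sketch with plain vectors (not the internal LogicalIndex machinery): indexing every column with the same Boolean mask repeats the work of locating the true entries, whereas findall computes the integer positions once and reuses them.

using BenchmarkTools

mask = rand(Bool, 10^7);
cols = [rand(10^7) for _ in 1:3];

@btime [c[$mask] for c in $cols];   # each logical getindex scans the mask again
@btime begin
    rows = findall($mask)           # integer positions computed once
    [c[rows] for c in $cols]
end;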

Would you be interested in making a pull request?

Sure. I can try to work on it later this week.

Yes, Query.jl’s current row-based implementation won’t be able to compete with the kind of column-based implementation you are seeing better performance from. The design for adding a column-based backend to Query.jl is all there (well, in my head :wink: ), but it is a lot of work. We are starting on that, but it won’t be done within a few weeks/months…
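Roughly, and only as a sketch (the helper names below are made up for illustration, not Query.jl internals): a row-based implementation evaluates the predicate once per row, while a column-based one operates on whole columns at once, which is what the DataFrames-style code above does.

# Hypothetical helpers, for illustration only; not Query.jl code.
function row_based(df)
    keep = Int[]
    for i in 1:size(df, 1)
        # predicate evaluated one row at a time
        if min_age2 <= df[i, :age] <= max_age2 &&
           min_year2 <= df[i, :year] <= max_year2 &&
           df[i, :jx] < J
            push!(keep, i)
        end
    end
    return keep
end

# predicate evaluated on whole columns at once
column_based(df) = findall((min_age2 .<= df[:age] .<= max_age2) .&
                           (min_year2 .<= df[:year] .<= max_year2) .&
                           (df[:jx] .< J))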

Having said that, I’m seeing a smaller difference on Julia 1.0 than on Julia 0.6 on my system:

Julia 0.6:
Query:     10.895567 seconds (100.02 M allocations: 1.533 GiB, 7.64% gc time)
Your code:  1.527942 seconds (20.77 k allocations: 206.659 MiB, 0.42% gc time)

Julia 1.0:
Query:      5.751088 seconds (390.53 k allocations: 3.022 GiB, 9.96% gc time)
Your code:  1.121577 seconds (60 allocations: 205.535 MiB)

There might also still be things we can do to improve the performance of the row-based implementation; I actually haven’t run a profiler on this kind of code in a long time…