I have a very large dataframe and would like to know the fastest way to select a subset of rows. I tried Query package. It is elegant but really slow compared to ugly/brute force solution. Any hints? Thank you. Much appreciated.

Example

using DataFrames, Query

const N = 100000000 ;

const min_age, max_age = 20, 60 ;

const min_year, max_year = 1980, 2010 ;

const J = 9

df = DataFrame(age=rand(min_age:max_age,N),

year=rand(min_year:max_year,N),

jx = rand(1:J,N)) ;

const min_age2, max_age2 = 30, 40 ;

const min_year2, max_year2 = 1990, 2000 ;

@time df1 = df[(df[:age].>=min_age2).*

(df[:age].<=max_age2).*

(df[:year].>=min_year2).*

(df[:year].<=max_year2).*

(df[:jx].<J),:] ;

1.372216 seconds (20.77 k allocations: 206.762 MiB)

@time ds1 = @from i in df begin

@where i.age >= min_age && i.age <= max_age && i.year >= min_year && i.year <= max_year && i.jx < J

@select {i.year,i.age,i.jx}

@collect DataFrame

end

14.201888 seconds (100.02 M allocations: 1.869 GiB, 5.57% gc time)