Large dataframe: fast row selection

I have a very large dataframe and would like to know the fastest way to select a subset of rows. I tried the Query package; it is elegant but really slow compared to an ugly brute-force solution. Any hints? Thank you, much appreciated.

Example

using DataFrames, Query

const N = 100_000_000;
const min_age, max_age = 20, 60;
const min_year, max_year = 1980, 2010;
const J = 9;
df = DataFrame(age=rand(min_age:max_age, N),
               year=rand(min_year:max_year, N),
               jx=rand(1:J, N));

const min_age2, max_age2 = 30, 40;
const min_year2, max_year2 = 1990, 2000;

@time df1 = df[(df[:age] .>= min_age2) .*
               (df[:age] .<= max_age2) .*
               (df[:year] .>= min_year2) .*
               (df[:year] .<= max_year2) .*
               (df[:jx] .< J), :];

1.372216 seconds (20.77 k allocations: 206.762 MiB)

@time ds1 = @from i in df begin
    @where i.age >= min_age2 && i.age <= max_age2 && i.year >= min_year2 && i.year <= max_year2 && i.jx < J
    @select {i.year, i.age, i.jx}
    @collect DataFrame
end

14.201888 seconds (100.02 M allocations: 1.869 GiB, 5.57% gc time)

DataFramesMeta is as fast as the plain DataFrames version, but more compact:

using DataFramesMeta
df1 = @where(df, min_age2 .<= :age .<= max_age2, min_year2 .<= :year .<= max_year2, :jx .< J)

(BTW, note that you can combine minimum and maximum conditions even without DataFramesMeta.)
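For instance, with the same columns and constants as above, something along these lines should work with plain DataFrames indexing (just a sketch; the chained .<= comparisons broadcast, and .& combines the element-wise masks):

df1 = df[(min_age2 .<= df[:age] .<= max_age2) .&
         (min_year2 .<= df[:year] .<= max_year2) .&
         (df[:jx] .< J), :];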

One thing that you can do to speed things up is to index the dataframe using row numbers instead of logical indices. I’m not sure why, but I find that if you have a BitArray of the rows you want to keep, you get better performance by applying findall to it before indexing the dataframe. In your example, the comparison would be:

using BenchmarkTools

@btime df1 = @where($df, min_age2 .<= :age .<= max_age2,
                    min_year2 .<= :year .<= max_year2,
                    :jx .< J);
# 1.508 s (59 allocations: 238.46 MiB)

@btime begin
    idx = ($df[:age] .>= min_age2) .*
          ($df[:age] .<= max_age2) .*
          ($df[:year] .>= min_year2) .*
          ($df[:year] .<= max_year2) .*
          ($df[:jx] .< J);
    df1 = $df[findall(idx), :]
end;
# 963.606 ms (66 allocations: 279.15 MiB)

That still leaves you with the problem of the ugly syntax. I like to use map for problems like this:

@btime df1 = $df[map($df[:age], $df[:year], $df[:jx]) do age, year, jx
    min_age2 <= age <= max_age2    || return false
    min_year2 <= year <= max_year2 || return false
    jx < J                         || return false
    return true
end |> findall, :];
# 947.216 ms (49 allocations: 362.59 MiB)

It’s not necessarily less code to type, but I think that handling each condition with an explicit early return makes it a lot more readable.

Good catch. This is probably because the getindex method for DataFrame calls getindex on each column, which needs to compute the number and position of true entries repeatedly (via the internal LogicalIndex type). We should try two alternative solutions:

  • create a LogicalIndex once and pass it to getindex to avoid recomputing the number of entries
  • call findall and pass a vector of indices to getindex to also avoid recomputing the position of entries

The second approach is equivalent to what you are doing. It has the drawback that a new vector needs to be allocated, but that’s not a big deal given that we already have to allocate new columns with the same length.
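To illustrate the difference outside of DataFrames, here is a minimal sketch with plain vectors (not the internal LogicalIndex machinery): indexing every column with the same Boolean mask repeats the work of locating the true entries, whereas findall computes the integer positions once and reuses them.

using BenchmarkTools

mask = rand(Bool, 10^7);
cols = [rand(10^7) for _ in 1:3];

@btime [c[$mask] for c in $cols];   # each logical getindex scans the mask again
@btime begin
    rows = findall($mask)           # integer positions computed once
    [c[rows] for c in $cols]
end;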

Would you be interested in making a pull request?

Sure. I can try to work on it later this week.

Yes, Query.jl’s current row-based implementation won’t be able to compete with the kind of column-based implementation you are seeing better performance from. The design for adding a column-based backend to Query.jl is all there (well, in my head :wink: ), but it is a lot of work. We are starting on that, but it won’t be done within a few weeks/months…
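Roughly, and only as a sketch (the helper names below are made up for illustration, not Query.jl internals): a row-based implementation evaluates the predicate once per row, while a column-based one operates on whole columns at once, which is what the DataFrames-style code above does.

# Hypothetical helpers, for illustration only; not Query.jl code.
function row_based(df)
    keep = Int[]
    for i in 1:size(df, 1)
        # predicate evaluated one row at a time
        if min_age2 <= df[i, :age] <= max_age2 &&
           min_year2 <= df[i, :year] <= max_year2 &&
           df[i, :jx] < J
            push!(keep, i)
        end
    end
    return keep
end

# predicate evaluated on whole columns at once
column_based(df) = findall((min_age2 .<= df[:age] .<= max_age2) .&
                           (min_year2 .<= df[:year] .<= max_year2) .&
                           (df[:jx] .< J))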

Having said that, I’m seeing a smaller difference on Julia 1.0 than on Julia 0.6 on my system:

Julia 0.6:
Query:     10.895567 seconds (100.02 M allocations: 1.533 GiB, 7.64% gc time)
Your code:  1.527942 seconds (20.77 k allocations: 206.659 MiB, 0.42% gc time)

Julia 1.0:
Query:      5.751088 seconds (390.53 k allocations: 3.022 GiB, 9.96% gc time)
Your code:  1.121577 seconds (60 allocations: 205.535 MiB)

There might also still be things we can do to improve the performance of the row-based implementation; I actually haven’t run a profiler on this kind of code in a long time…