Can this be made faster?

I have two DataFrames. The size of the first is (7390, 2), represented below by a small (3, 2) array. The size of the second is quite a bit larger, at (482244, 8), represented below by a larger (7, 4) array.

I need to compare the values of EACH row in the small array with the values of ALL rows of the large array.

Obviously, it doesn't take long in this example, but using my very large DataFrames (and comparing against 8 values in the large one), it takes about 10 minutes. Unfortunately, I have to do this over 24 times. Not knowing any better, I used a for loop, but wonder if there's a better way. Any suggestions?

function inbound(x::Float64, y::Float64, bounds::Matrix{Float64})
    lastrow, _ = size(bounds)
    return any(x .>= bounds[1:lastrow, 1] .&&
               x .<= bounds[1:lastrow, 3] .&&
               y .>= bounds[1:lastrow, 2] .&&
               y .<= bounds[1:lastrow, 4])
end
julia> smallarray = [2.2 7.5
                     3.5 8.1
                     1.7 3.6];
julia> nrow_smallarray, _ = size(smallarray);
julia> largearray = [1.6 3.5 1.8 3.7
                     2.0 8.5 2.2 8.7
                     3.0 9.5 3.2 9.7
                     4.0 10.5 4.2 10.7
                     5.0 11.5 5.2 11.7
                     6.0 12.5 6.2 12.7
                     7.0 13.5 7.2 13.7];
julia> result = Vector{Bool}(undef, nrow_smallarray);
julia> for i in 1:nrow_smallarray
           result[i] = inbound(smallarray[i, 1], smallarray[i, 2], largearray)
       end
julia> result
3-element Vector{Bool}:
 0
 0
 1

A couple of things to notice:

  • When you slice an array like bounds[1:lastrow, 1], it actually makes a copy. So you are making a bunch of copies of large vectors when you run on your full data set.
  • An expression like any(a .≤ x .≤ b) will broadcast the comparison operators over the full vector x, and then apply the any function. It's more efficient to use the any(f, x) form, because that form will short-circuit. In other words, it will return early as soon as the predicate f finds a true case.
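To see both points on a toy example (a sketch I'm adding for illustration, not from the original reply):

```julia
v = collect(1.0:1_000_000);

# Broadcasting materializes a full BitVector before `any` ever runs:
any(v .> 5.0)            # allocates a 1_000_000-element BitVector first

# The predicate form short-circuits at the first true:
any(x -> x > 5.0, v)     # stops after six elements, no allocation

# And @views makes slices non-copying:
using Statistics
@views mean(v[1:100])    # indexes into v directly instead of copying the slice
```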

You can address both of those issues with code like this:

function foo(smallarray, largearray)
    map(eachrow(smallarray)) do (x, y)
        any(eachrow(largearray)) do (a, c, b, d)
            a ≤ x ≤ b  &&  c ≤ y ≤ d
        end
    end
end

Notes:

  • I chose to use map because I find it more convenient than a for loop: you don't have to preallocate an array. But the performance for map and a for loop should be similar.
  • Notice the handy pattern matching / destructuring syntax in the do blocks that allows you to unpack an iterable object (in this case a row) into smaller named parts.
  • eachrow is a non-allocating iterator over the rows of a matrix.
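As a quick illustration of the destructuring and eachrow points (my own toy example, not from the thread):

```julia
M = [1 2;
     3 4]

# eachrow is a lazy, non-allocating iterator over the rows of M.
# Destructuring in the do block unpacks each row into named parts:
sums = map(eachrow(M)) do (a, b)
    a + b
end
# sums == [3, 7]
```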

Hereโ€™s a benchmark:

julia> using BenchmarkTools

julia> A = rand(7390, 2); B = rand(482244, 4);

julia> @btime foo($A, $B);
  14.208 ms (1 allocation: 7.38 KiB)

(EDIT: This benchmark might not be the most representative of your problem, since for this random data set any will short-circuit quite quickly on each iteration of map.)

I would recommend using plain arrays for this if you can. Itโ€™s not as fast with a DataFrame because row iteration of DataFrames is not very efficient:

julia> dfA = DataFrame(A, :auto); dfB = DataFrame(B, :auto);

julia> @btime foo($dfA, $dfB);
  2.013 s (67294279 allocations: 1.19 GiB)

or use a specialized function for data frames:

julia> function foo2(smalldf, largedf)
           map(Tables.namedtupleiterator(smalldf)) do x
               any(Tables.namedtupleiterator(largedf[!, 1:4])) do y
                   y[1] ≤ x[1] ≤ y[2]  &&  y[3] ≤ x[2] ≤ y[4]
               end
           end
       end
foo2 (generic function with 1 method)

julia> @btime foo2($dfA, $dfB);
  23.745 ms (206935 allocations: 12.75 MiB)

In general, Tables.namedtupleiterator makes a loop type stable. As @CameronBieganek commented, by default DataFrame is optimized for column oriented operations.
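A minimal sketch of what Tables.namedtupleiterator yields (my own example, assuming DataFrames and Tables are loaded; the column names here are made up):

```julia
using DataFrames, Tables

df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0])

# Each element is a concretely typed NamedTuple, so the loop body is type stable:
for row in Tables.namedtupleiterator(df)
    println(row.a + row.b)   # row isa NamedTuple{(:a, :b), Tuple{Float64, Float64}}
end
```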


Thank you, gentlemen. I love learning new "tricks" to help improve my skills and can't wait to try this out. My DataFrames include other data types (Strings and Bools), so I could extract the Floats into arrays to do this comparison for quicker results. Thanks again for your input and have a great weekend.

To automatically get the names of the columns that hold floats, write names(df, AbstractFloat), e.g.:

julia> df = DataFrame(x=1:3, y=["a", "b", "c"], z1=1.0:3.0, z2=4.0:6.0)
3×4 DataFrame
 Row │ x      y       z1       z2
     │ Int64  String  Float64  Float64
─────┼─────────────────────────────────
   1 │     1  a           1.0      4.0
   2 │     2  b           2.0      5.0
   3 │     3  c           3.0      6.0

julia> names(df, AbstractFloat)
2-element Vector{String}:
 "z1"
 "z2"

FYI, when using the approach offered by @CameronBieganek, the processing time was dramatically reduced on the actual data. A single run went from 10 minutes to ~12 seconds!

julia> size(z1_0_matrix)
(7390, 2)

julia> size(nok_20m_matrix)
(482244, 4)

julia> @btime result = inbound(z1_0_matrix, nok_20m_matrix)
  12.636 s (1 allocation: 7.38 KiB)
7390-element Vector{Bool}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

There are, in fact, expected true results (not just zeros).

julia> count(>(0), result)
497

For completeness (and curiosity's sake), I'll also do it as @bkamins suggested (knowing that we expect it to be somewhat slower).

Thanks, again.
