Can this be made faster?

I have two DataFrames. The size of the first is (7390, 2), represented below by a small (3, 2) array. The size of the second is quite a bit larger, at (482244, 8), represented below by a larger (7, 4) array.

I need to compare the values of EACH row in the small array with the values of ALL rows of the large array.

Obviously, it doesn’t take long in this example, but with my very large DataFrames (and comparing against 8 values in the large one), it takes about 10 minutes. Unfortunately, I have to do this over 24 times. Not knowing any better, I used a for loop, but I wonder if there’s a better way. Any suggestions?

function inbound(x::Float64, y::Float64, bounds::Matrix{Float64})
    lastrow, _ = size(bounds)
    return any(x .>= bounds[1:lastrow, 1] .&&
               x .<= bounds[1:lastrow, 3] .&&
               y .>= bounds[1:lastrow, 2] .&&
               y .<= bounds[1:lastrow, 4])
end
julia> smallarray = [2.2 7.5
                     3.5 8.1
                     1.7 3.6];
julia> nrow_smallarray, _ = size(smallarray);
julia> largearray = [1.6 3.5 1.8 3.7
                     2.0 8.5 2.2 8.7
                     3.0 9.5 3.2 9.7
                     4.0 10.5 4.2 10.7
                     5.0 11.5 5.2 11.7
                     6.0 12.5 6.2 12.7
                     7.0 13.5 7.2 13.7];
julia> result = Vector{Bool}(undef, nrow_smallarray);
julia> for i in 1:nrow_smallarray
           result[i] = inbound(smallarray[i, 1], smallarray[i, 2], largearray)
       end
julia> result
3-element Vector{Bool}:
 0
 0
 1

A couple things to notice:

  • When you slice an array like bounds[1:lastrow, 1], it actually makes a copy (avoidable with @view). So you are making a bunch of copies of large vectors when you run on your full data set.
  • An expression like any(a .≤ x .≤ b) will broadcast the comparison operators over the full vectors, allocating an intermediate Bool array, and then apply the any function. It’s more efficient to use the any(f, itr) form, because that form will short-circuit. In other words, it will return early as soon as the predicate f finds a true case.
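A minimal sketch of both points (small stand-in matrix, illustrative only):

```julia
bounds = rand(10, 4)  # stand-in for the large matrix

# 1. Slicing copies: bounds[:, 1] allocates a fresh vector every
#    time.  A view shares the parent's memory instead:
col1 = @view bounds[:, 1]

# 2. A broadcasted comparison like any(x .>= col1) first builds a
#    full Bool array and then scans it.  The predicate form
#    any(f, itr) short-circuits on the first true instead:
x = 0.5
hit = any(v -> x >= v, col1)  # no intermediate array; may return early
```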

You can address both of those issues with code like this:

function foo(smallarray, largearray)
    map(eachrow(smallarray)) do (x, y)
        any(eachrow(largearray)) do (a, c, b, d)
            a ≤ x ≤ b  &&  c ≤ y ≤ d
        end
    end
end

Notes:

  • I chose to use map because I find it more convenient than a for loop—you don’t have to preallocate an array. But the performance for map and a for loop should be similar.
  • Notice the handy pattern matching / destructuring syntax in the do blocks that allows you to unpack an iterable object (in this case a row) into smaller named parts.
  • eachrow is a non-allocating iterator over the rows of a matrix.
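To make the destructuring concrete, here is a tiny standalone example (toy matrix, not the actual data):

```julia
M = [1.0 2.0
     3.0 4.0]

# eachrow is a lazy, non-allocating iterator of row views; the
# do-block argument (a, b) destructures each two-element row:
sums = map(eachrow(M)) do (a, b)
    a + b
end
```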

Here’s a benchmark:

julia> using BenchmarkTools

julia> A = rand(7390, 2); B = rand(482244, 4);

julia> @btime foo($A, $B);
  14.208 ms (1 allocation: 7.38 KiB)

(EDIT: This benchmark might not be the most representative of your problem, since for this random data set any will short-circuit quite quickly on each iteration of map.)

I would recommend using plain arrays for this if you can. It’s not as fast with a DataFrame because row iteration of DataFrames is not very efficient:

julia> dfA = DataFrame(A, :auto); dfB = DataFrame(B, :auto);

julia> @btime foo($dfA, $dfB);
  2.013 s (67294279 allocations: 1.19 GiB)

or use a specialized function for data frames:

julia> function foo2(smalldf, largedf)
           map(Tables.namedtupleiterator(smalldf)) do x
               any(Tables.namedtupleiterator(largedf[!, 1:4])) do y
                   y[1] ≤ x[1] ≤ y[2]  &&  y[3] ≤ x[2] ≤ y[4]
               end
           end
       end
foo2 (generic function with 1 method)

julia> @btime foo2($dfA, $dfB);
  23.745 ms (206935 allocations: 12.75 MiB)

In general, Tables.namedtupleiterator makes row iteration type stable. As @CameronBieganek commented, by default DataFrame is optimized for column-oriented operations.
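For illustration, a small sketch (made-up columns a and b) of the difference between the two row iterators:

```julia
using DataFrames, Tables

df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0])

# eachrow(df) yields DataFrameRow objects; their column access is
# dynamic, so code that uses them is not type stable.
r = first(eachrow(df))

# Tables.namedtupleiterator yields concrete NamedTuples, so field
# access inside a hot loop compiles to type-stable code:
nt = first(Tables.namedtupleiterator(df))
s = nt.a + nt.b
```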


Thank you, gentlemen. I love learning new “tricks” to help improve my skills and can’t wait to try this out. My DataFrames include other data types (Strings and Bools), so I could extract the Floats into arrays to do this comparison for quicker results. Thanks again for your input and have a great weekend.

To get the names of the float columns automatically, write names(df, AbstractFloat), e.g.:

julia> df = DataFrame(x=1:3, y=["a", "b", "c"], z1=1.0:3.0, z2=4.0:6.0)
3×4 DataFrame
 Row │ x      y       z1       z2
     │ Int64  String  Float64  Float64
─────┼─────────────────────────────────
   1 │     1  a           1.0      4.0
   2 │     2  b           2.0      5.0
   3 │     3  c           3.0      6.0

julia> names(df, AbstractFloat)
2-element Vector{String}:
 "z1"
 "z2"
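Combining that with Matrix gives one way (a sketch, using the same example df) to pull the float columns out into a plain array for the fast comparison:

```julia
using DataFrames

df = DataFrame(x = 1:3, y = ["a", "b", "c"], z1 = 1.0:3.0, z2 = 4.0:6.0)

# Select the AbstractFloat columns by name, then materialize them
# as a plain Matrix{Float64}:
floatmat = Matrix(df[:, names(df, AbstractFloat)])
```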

FYI, when using the approach offered by @CameronBieganek, the processing time was dramatically reduced on the actual data. A single run went from about 10 minutes to ~12 seconds!

julia> size(z1_0_matrix)
(7390, 2)

julia> size(nok_20m_matrix)
(482244, 4)

julia> @btime result = inbound(z1_0_matrix, nok_20m_matrix)
  12.636 s (1 allocation: 7.38 KiB)
7390-element Vector{Bool}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

There are, in fact, expected true results (not just zeros).

julia> count(>(0), result)
497

For completeness (and curiosity’s sake), I’ll also do it as @bkamins suggested (knowing that we expect it to be somewhat slower).

Thanks, again.
