Can this be made faster?

I have two DataFrames. The size of the first is (7390, 2), represented below by a small (3, 2) array. The size of the second is quite a bit larger, at (482244, 8), represented below by a larger (7, 4) array.

I need to compare the values of EACH row in the small array with the values of ALL rows of the large array.

Obviously, it doesn't take long in this example, but using my very large DataFrames (and comparing against 8 values in the large one), it takes about 10 minutes. Unfortunately, I have to do this over 24 times. Not knowing any better, I used a for loop, but wonder if there's a better way. Any suggestions?

function inbound(x::Float64, y::Float64, bounds::Matrix{Float64})
    lastrow, _ = size(bounds)
    return any(x .>= bounds[1:lastrow, 1] .&&
               x .<= bounds[1:lastrow, 3] .&&
               y .>= bounds[1:lastrow, 2] .&&
               y .<= bounds[1:lastrow, 4])
end
julia> smallarray = [2.2 7.5
                     3.5 8.1
                     1.7 3.6];
julia> nrow_smallarray, _ = size(smallarray);
julia> largearray = [1.6 3.5 1.8 3.7
                     2.0 8.5 2.2 8.7
                     3.0 9.5 3.2 9.7
                     4.0 10.5 4.2 10.7
                     5.0 11.5 5.2 11.7
                     6.0 12.5 6.2 12.7
                     7.0 13.5 7.2 13.7];
julia> result = Vector{Bool}(undef, nrow_smallarray);
julia> for i in 1:nrow_smallarray
           result[i] = inbound(smallarray[i, 1], smallarray[i, 2], largearray)
       end
julia> result
3-element Vector{Bool}:
 0
 0
 1

A couple of things to notice:

  • When you slice an array like bounds[1:lastrow, 1], it actually makes a copy. So you are making a bunch of copies of large vectors when you run on your full data set.
  • An expression like any(a .≤ x .≤ b) will broadcast the comparison operators over the full vector x, and then apply the any function. It's more efficient to use the any(f, x) form, because that form will short-circuit. In other words, it will return early as soon as the predicate f finds a true case.
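To see both points on a toy example (a sketch I'm adding for illustration, not from the original reply):

```julia
v = collect(1.0:1_000_000);

# Broadcasting materializes a full BitVector before `any` ever runs:
any(v .> 5.0)            # allocates a 1_000_000-element BitVector first

# The predicate form short-circuits at the first true:
any(x -> x > 5.0, v)     # stops after six elements, no allocation

# And @views makes slices non-copying:
using Statistics
@views mean(v[1:100])    # indexes into v directly instead of copying the slice
```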

You can address both of those issues with code like this:

function foo(smallarray, largearray)
    map(eachrow(smallarray)) do (x, y)
        any(eachrow(largearray)) do (a, c, b, d)
            a ≤ x ≤ b  &&  c ≤ y ≤ d
        end
    end
end

Notes:

  • I chose to use map because I find it more convenient than a for loop: you don't have to preallocate an array. But the performance for map and a for loop should be similar.
  • Notice the handy pattern matching / destructuring syntax in the do blocks that allows you to unpack an iterable object (in this case a row) into smaller named parts.
  • eachrow is a non-allocating iterator over the rows of a matrix.
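As a quick illustration of the destructuring and eachrow points (my own toy example, not from the thread):

```julia
M = [1 2;
     3 4]

# eachrow is a lazy, non-allocating iterator over the rows of M.
# Destructuring in the do block unpacks each row into named parts:
sums = map(eachrow(M)) do (a, b)
    a + b
end
# sums == [3, 7]
```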

Hereโ€™s a benchmark:

julia> using BenchmarkTools

julia> A = rand(7390, 2); B = rand(482244, 4);

julia> @btime foo($A, $B);
  14.208 ms (1 allocation: 7.38 KiB)

(EDIT: This benchmark might not be the most representative of your problem, since for this random data set any will short-circuit quite quickly on each iteration of map.)

I would recommend using plain arrays for this if you can. Itโ€™s not as fast with a DataFrame because row iteration of DataFrames is not very efficient:

julia> dfA = DataFrame(A, :auto); dfB = DataFrame(B, :auto);

julia> @btime foo($dfA, $dfB);
  2.013 s (67294279 allocations: 1.19 GiB)

or use a specialized function for data frames:

julia> function foo2(smalldf, largedf)
           map(Tables.namedtupleiterator(smalldf)) do x
               any(Tables.namedtupleiterator(largedf[!, 1:4])) do y
                   y[1] ≤ x[1] ≤ y[2]  &&  y[3] ≤ x[2] ≤ y[4]
               end
           end
       end
foo2 (generic function with 1 method)

julia> @btime foo2($dfA, $dfB);
  23.745 ms (206935 allocations: 12.75 MiB)

In general, Tables.namedtupleiterator makes a loop type stable. As @CameronBieganek commented, by default DataFrame is optimized for column oriented operations.
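A minimal sketch of what Tables.namedtupleiterator yields (my own example, assuming DataFrames and Tables are loaded; the column names here are made up):

```julia
using DataFrames, Tables

df = DataFrame(a = [1.0, 2.0], b = [3.0, 4.0])

# Each element is a concretely typed NamedTuple, so the loop body is type stable:
for row in Tables.namedtupleiterator(df)
    println(row.a + row.b)   # row isa NamedTuple{(:a, :b), Tuple{Float64, Float64}}
end
```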


Thank you, gentlemen. I love learning new "tricks" to help improve my skills and can't wait to try this out. My DataFrames include other data types (Strings and Bools), so I could extract the Floats into arrays to do this comparison for quicker results. Thanks again for your input and have a great weekend.

To automatically get the names of the columns that hold floats, write names(df, AbstractFloat), e.g.:

julia> df = DataFrame(x=1:3, y=["a", "b", "c"], z1=1.0:3.0, z2=4.0:6.0)
3×4 DataFrame
 Row │ x      y       z1       z2
     │ Int64  String  Float64  Float64
─────┼─────────────────────────────────
   1 │     1  a           1.0      4.0
   2 │     2  b           2.0      5.0
   3 │     3  c           3.0      6.0

julia> names(df, AbstractFloat)
2-element Vector{String}:
 "z1"
 "z2"

FYI, when using the approach offered by @CameronBieganek, the processing time was dramatically reduced on the actual data. A single run went from 10 minutes to ~12 seconds!

julia> size(z1_0_matrix)
(7390, 2)

julia> size(nok_20m_matrix)
(482244, 4)

julia> @btime result = inbound(z1_0_matrix, nok_20m_matrix)
  12.636 s (1 allocation: 7.38 KiB)
7390-element Vector{Bool}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

There are, in fact, expected true results (not just zeros).

julia> count(>(0), result)
497

For completeness (and curiosity's sake), I'll also do it as @bkamins suggested (knowing that we expect it to be somewhat slower).

Thanks, again.
