I have two DataFrames. The size of the first is (7390, 2), represented below by a small (3, 2) array. The size of the second is quite a bit larger, at (482244, 8), represented below by a larger (7, 4) array.

I need to compare the values of EACH row in the small array with the values of ALL rows of the large array.

Obviously, it doesn’t take long in this example, but using my very large-sized DataFrames (and comparing against 8 values in the large one), it takes about 10 minutes. Unfortunately, I have to do this over 24 times. Not knowing any better, I used a `for` loop, but wonder if there’s a better way. Any suggestions?

``````function inbound(x::Float64, y::Float64, bounds::Matrix{Float64})
lastrow, _ = size(bounds)
return any(x .>= bounds[1:lastrow, 1] .&&
x .<= bounds[1:lastrow, 3] .&&
y .>= bounds[1:lastrow, 2] .&&
y .<= bounds[1:lastrow, 4])
end
``````
``````julia> smallarray = [2.2 7.5
3.5 8.1
1.7 3.6];
``````
``````julia> nrow_smallarray, _ = size(smallarray);
``````
``````julia> largearray = [1.6 3.5 1.8 3.7
2.0 8.5 2.2 8.7
3.0 9.5 3.2 9.7
4.0 10.5 4.2 10.7
5.0 11.5 5.2 11.7
6.0 12.5 6.2 12.7
7.0 13.5 7.2 13.7];
``````
``````julia> result = Vector{Bool}(undef, nrow_smallarray);
``````
``````julia> for i in 1:nrow_smallarray
result[i] = inbound(smallarray[i, 1], smallarray[i, 2], largearray)
end
``````
``````julia> result
3-element Vector{Bool}:
0
0
1
``````
1 Like

A couple things to notice:

• When you slice an array like `bounds[1:lastrow, 1]`, it actually makes a copy. So you are making a bunch of copies of large vectors when you run on your full data set.
• An expression like `any(a .≤ x .≤ b)` will broadcast the comparison operators over the full vector `x`, and then apply the `any` function. It’s more efficient to use the `any(f, x)` form, because that form will short-circuit. In other words, it will return early as soon as the predicate `f` finds finds a `true` case.

You can address both of those issues with code like this:

``````function foo(smallarray, largearray)
map(eachrow(smallarray)) do (x, y)
any(eachrow(largearray)) do (a, c, b, d)
a ≤ x ≤ b  &&  c ≤ y ≤ d
end
end
end
``````

Notes:

• I chose to use `map` because I find it more convenient than a `for` loop—you don’t have to preallocate an array. But the performance for `map` and a `for` loop should be similar.
• Notice the handy pattern matching / destructuring syntax in the `do` blocks that allows you to unpack an iterable object (in this case a row) into smaller named parts.
• `eachrow` is a non-allocating iterator over the rows of a matrix.

Here’s a benchmark:

``````julia> using BenchmarkTools

julia> A = rand(7390, 2); B = rand(482244, 4);

julia> @btime foo(\$A, \$B);
14.208 ms (1 allocation: 7.38 KiB)
``````

(EDIT: This benchmark might not be the most representative of your problem, since for this random data set `any` will short-circuit quite quickly on each iteration of `map`.)

I would recommend using plain arrays for this if you can. It’s not as fast with a `DataFrame` because row iteration of `DataFrame`s is not very efficient:

``````julia> dfA = DataFrame(A, :auto); dfB = DataFrame(B, :auto);

julia> @btime foo(\$dfA, \$dfB);
2.013 s (67294279 allocations: 1.19 GiB)
``````
5 Likes

or use a specialized function for data frames:

``````julia> function foo2(smalldf, largedf)
map(Tables.namedtupleiterator(smalldf)) do x
any(Tables.namedtupleiterator(largedf[!, 1:4])) do y
y[1] ≤ x[1] ≤ y[2]  &&  y[3] ≤ x[2] ≤ y[4]
end
end
end
foo2 (generic function with 1 method)

julia> @btime foo2(\$dfA, \$dfB);
23.745 ms (206935 allocations: 12.75 MiB)
``````

In general, `Tables.namedtupleiterator` makes a loop type stable. As @CameronBieganek commented, by default `DataFrame` is optimized for column oriented operations.

4 Likes

Thank you, gentlemen. I love learning new “tricks” to help improve my skills and can’t wait to try this out. My DataFrames include other data types (Strings and Bools), so I could extract the Floats into arrays to do this comparison for quicker results. Thanks again for your input and have a great weekend.

to get column names which are floats automatically write `names(df, AbstractFloat)`, e.g.:

``````julia> df = DataFrame(x=1:3, y=["a", "b", "c"], z1=1.0:3.0, z2=4.0:6.0)
3×4 DataFrame
Row │ x      y       z1       z2
│ Int64  String  Float64  Float64
─────┼─────────────────────────────────
1 │     1  a           1.0      4.0
2 │     2  b           2.0      5.0
3 │     3  c           3.0      6.0

julia> names(df, AbstractFloat)
2-element Vector{String}:
"z1"
"z2"
``````
2 Likes

FYI, when using the approach offered by @CameronBieganek, the processing time was incredibly reduced when applying it to actual data. A single run went from 10 minutes to ~12 seconds!

``````julia> size(z1_0_matrix)
(7390, 2)

julia> size(nok_20m_matrix)
(482244, 4)

julia> @btime result = inbound(z1_0_matrix, nok_20m_matrix)
12.636 s (1 allocation: 7.38 KiB)
7390-element Vector{Bool}:
0
0
0
0
0
0
0
0
0
0
0
⋮
0
0
0
0
0
0
0
0
0
0
0
``````

There are, in fact, expected `true` results (not just zeros).

``````julia> count(>(0), result)
497
``````

For completeness (and curiosity’s sake), I’ll also do it as @bkamins suggested (knowing that we expect it to be somewhat slower).

Thanks, again.

3 Likes