For loop optimization

Hello,

I’m not sure how to optimize the following for loop.

array_1 and array_2 are two Vector{Vector{Int64}}s, and I’m interested in evaluating which combinations (of 2 elements) of their vector components has an element-wise sum equal to a specified sum_target, which of course is a Vector{Int64}.

The code I’m currently using is the following:

using BenchmarkTools

function f()

    n_elements = 50

    array_1 = [Int64.(rand(Int8,500)) for i in 1:n_elements]
    array_2 = [Int64.(rand(Int8,500)) for i in 1:n_elements]

    sum_target = Int64.(rand(Int8,500)) 

    solutions = Vector{Vector{Int64}}[]

    iterator_over_combinations  = Iterators.product(array_1,array_2)

    for combination in iterator_over_combinations
        if all(sum(combination) .== sum_target)
            push!(solutions, combination  )
        end
    end

    return solutions




end

The number of vector elements of array_1 and array_2 is here set to n_elements::Int64 = 50 in order for @btime to work, but in the real application I have like n_elements::Int64 = 10^11 and combine ten arrays instead of just two.

Right now I get:

julia> @btime f() ;
  2.587 ms (10205 allocations: 21.02 MiB)

Thank you very much in advance.

You can (and should) remove all type assertions on the left hand sides. They do not help performance, but make the code harder to read.

all(sum(combination) .== sum_target) This is really slow because you are computing all the equalities and then seeing if they are all true rather than only computing until you find a place where they aren’t equal.

The main issue is that all(sum(combination) .== sum_target) allocates two arrays at each iteration of the for loop, which ends up being very costly. You could instead do the comparisons lazily, in a way that fully avoids allocations. Most importantly, as @Oscar_Smith suggested, this would also allow you to terminate the computations as soon as there is a disagreement, instead of having to compute the whole sum before doing the comparisons.

Below are a few alternatives. Note that I have moved the allocation of the large arrays to outside of the function.

function f2(array_1, array_2, sum_target)
    solutions = Tuple{eltype(array_1), eltype(array_2)}[]

    iterator_over_combinations = Iterators.product(array_1, array_2)

    for combination in iterator_over_combinations
        if all(sum(combination) .== sum_target)
            push!(solutions, combination)
        end
    end

    solutions
end

function f3(array_1, array_2, sum_target)
    solutions = Tuple{eltype(array_1), eltype(array_2)}[]

    iterator_over_combinations = Iterators.product(array_1, array_2)

    for combination in iterator_over_combinations
        success = all(
            vs -> vs[1] + vs[2] == vs[3],
            zip(combination..., sum_target),
        )
        if success
            push!(solutions, combination)
        end
    end

    solutions
end

function f4(array_1, array_2, sum_target)
    solutions = Tuple{eltype(array_1), eltype(array_2)}[]

    iterator_over_combinations = Iterators.product(array_1, array_2)

    for combination in iterator_over_combinations
        success = true
        a, b = combination
        for i ∈ eachindex(sum_target)
            if a[i] + b[i] ≠ sum_target[i]
                success = false
                break
            end
        end
        if success
            push!(solutions, combination)
        end
    end

    solutions
end

n_elements = 50
m = 500
array_1 = [Int64.(rand(Int8, m)) for i in 1:n_elements]
array_2 = [Int64.(rand(Int8, m)) for i in 1:n_elements]
sum_target = Int64.(rand(Int8, m))

@btime f()  # 5.152 ms (10205 allocations: 20.98 MiB)

@btime f2($array_1, $array_2, $sum_target)  # 5.142 ms (10001 allocations: 20.52 MiB)
@btime f3($array_1, $array_2, $sum_target)  # 14.733 μs (1 allocation: 64 bytes)
@btime f4($array_1, $array_2, $sum_target)  # 10.009 μs (1 allocation: 64 bytes)

Note that f2 is nearly the same as f, and the performance is the same, while the f3 and f4 variants are almost 3 orders of magnitude faster. Moreover, f4 is (in my opinion) more readable and ends up being the fastest option.