For loop optimization

Hello,

I’m not sure how to optimize the following for loop.

array_1 and array_2 are two Vector{Vector{Int64}}s, and I’m interested in evaluating which combinations (of 2 elements) of their vector components has an element-wise sum equal to a specified sum_target, which of course is a Vector{Int64}.

The code I’m currently using is the following:

using BenchmarkTools

function f()

    n_elements = 50

    array_1 = [Int64.(rand(Int8,500)) for i in 1:n_elements]
    array_2 = [Int64.(rand(Int8,500)) for i in 1:n_elements]

    sum_target = Int64.(rand(Int8,500)) 

    solutions = Vector{Vector{Int64}}[]

    iterator_over_combinations  = Iterators.product(array_1,array_2)

    for combination in iterator_over_combinations
        if all(sum(combination) .== sum_target)
            push!(solutions, combination  )
        end
    end

    return solutions




end

The number of vector elements of array_1 and array_2 is here set to n_elements::Int64 = 50 in order for @btime to work, but in the real application I have like n_elements::Int64 = 10^11 and combine ten arrays instead of just two.

Right now I get:

julia> @btime f() ;
  2.587 ms (10205 allocations: 21.02 MiB)

Thank you very much in advance.

1 Like

You can (and should) remove all type assertions on the left hand sides. They do not help performance, but make the code harder to read.

1 Like

all(sum(combination) .== sum_target) This is really slow because you are computing all the equalities and then seeing if they are all true rather than only computing until you find a place where they aren’t equal.

4 Likes

The main issue is that all(sum(combination) .== sum_target) allocates two arrays at each iteration of the for loop, which ends up being very costly. You could instead do the comparisons lazily, in a way that fully avoids allocations. Most importantly, as @Oscar_Smith suggested, this would also allow you to terminate the computations as soon as there is a disagreement, instead of having to compute the whole sum before doing the comparisons.

Below are a few alternatives. Note that I have moved the allocation of the large arrays to outside of the function.

function f2(array_1, array_2, sum_target)
    solutions = Tuple{eltype(array_1), eltype(array_2)}[]

    iterator_over_combinations = Iterators.product(array_1, array_2)

    for combination in iterator_over_combinations
        if all(sum(combination) .== sum_target)
            push!(solutions, combination)
        end
    end

    solutions
end

function f3(array_1, array_2, sum_target)
    solutions = Tuple{eltype(array_1), eltype(array_2)}[]

    iterator_over_combinations = Iterators.product(array_1, array_2)

    for combination in iterator_over_combinations
        success = all(
            vs -> vs[1] + vs[2] == vs[3],
            zip(combination..., sum_target),
        )
        if success
            push!(solutions, combination)
        end
    end

    solutions
end

function f4(array_1, array_2, sum_target)
    solutions = Tuple{eltype(array_1), eltype(array_2)}[]

    iterator_over_combinations = Iterators.product(array_1, array_2)

    for combination in iterator_over_combinations
        success = true
        a, b = combination
        for i ∈ eachindex(sum_target)
            if a[i] + b[i] ≠ sum_target[i]
                success = false
                break
            end
        end
        if success
            push!(solutions, combination)
        end
    end

    solutions
end

n_elements = 50
m = 500
array_1 = [Int64.(rand(Int8, m)) for i in 1:n_elements]
array_2 = [Int64.(rand(Int8, m)) for i in 1:n_elements]
sum_target = Int64.(rand(Int8, m))

@btime f()  # 5.152 ms (10205 allocations: 20.98 MiB)

@btime f2($array_1, $array_2, $sum_target)  # 5.142 ms (10001 allocations: 20.52 MiB)
@btime f3($array_1, $array_2, $sum_target)  # 14.733 μs (1 allocation: 64 bytes)
@btime f4($array_1, $array_2, $sum_target)  # 10.009 μs (1 allocation: 64 bytes)

Note that f2 is nearly the same as f, and the performance is the same, while the f3 and f4 variants are almost 3 orders of magnitude faster. Moreover, f4 is (in my opinion) more readable and ends up being the fastest option.

5 Likes