# Help speedup a tight triple loop

Hello Julia,

The following code is inside a multithreaded loop which gets called a lot. It essentially loops over a symmetric matrix M (dimension ~50k at most) and computes the minimum with respect to each row of N in a funny way.

``````using BenchmarkTools, Random

"""
Find minimum of `M_jk - N_ij - N_ik` for each i. M is upper triangular.

Saves minimum value in `best_err` and `j, k` index in `row` and `col`.
"""
function best_index!(
best_err::Vector{T},
row::Vector{Int},
col::Vector{Int},
M::AbstractMatrix{T},
N::AbstractMatrix{T},
) where T <: Real

n, d = size(N)

@inbounds for k in 1:d, j in 1:k
@simd for i in 1:n
score = M[j, k] - N[i, j] - N[i, k]

# save minimum value and index
if score < best_err[i]
best_err[i], row[i], col[i] = score, j, k
end
end
end
end

Random.seed!(2020)
n = 100
p = 100
d = 5000

row_index, col_index = ones(Int, n), ones(Int, n)
best_err = [typemax(Float64) for _ in 1:n]
M = rand(d, d)
N = rand(n, d)

julia> @btime best_index!(\$best_err, \$row_index, \$col_index, \$M, \$N)

1.162 s (0 allocations: 0 bytes)
``````

For context, this piece of code use to work well for some small datasets I was working with. But on a new, larger data with large `d` ~ 20k, this function starts consuming > 95% of time. Since itâ€™s so simple, I have no idea how to further optimize it.

Please give me any suggestion that comes to mind.

This already looks like pretty good code. Maybe the compiler isnâ€™t smart enough to figure out how to SIMD this loop? You can check that with `code_native`. Although since looks very much bandwidth-limited maybe the SIMD doesnâ€™t actually do anything (Iâ€™m not very knowledgeable about this)

I always store the â€ślargestâ€ť dimension (here d) first in arrays - itâ€™s usually the right thing to do mathematically. Here that would mean `N = rand(d, n)`. It might have a performance impact (although I wouldnâ€™t dare try to guess without benchmarking it). Apart from that, I would play with loop ordering, and add threading.

1 Like

I managed to get some speedup my manually unrolling and using SIMD.jl

``````using SIMD

function best_index!(
best_err::Vector{T},
row::Vector{Int},
col::Vector{Int},
M::AbstractMatrix{T},
N::AbstractMatrix{T},
) where T <: Real

n, d = size(N)

@inbounds for k in 1:d, j in 1:k
@fastmath for i in 1:4:n
iv = VecRange{4}(i)
score = M[j, k] - N[iv, j] - N[iv, k]

if score[1] < best_err[i]
best_err[i], row[i], col[i] = score[1], j, k
end

if score[2] < best_err[i+1]
best_err[i+1], row[i+1], col[i+1] = score[2], j, k
end

if score[3] < best_err[i+2]
best_err[i+2], row[i+2], col[i+2] = score[3], j, k
end

if score[4] < best_err[i+3]
best_err[i+3], row[i+3], col[i+3] = score[4], j, k
end
end
end
end
``````

Note that this code only handles cases where the relevant dimension is a multiple of the unrolling factor (4 in the code above).

``````1.255 s (0 allocations: 0 bytes) # default
1.095 s (0 allocations: 0 bytes) # SIMD.jl
``````
1 Like

Wow thank you!! Although Iâ€™ve heard of unrolling loops Iâ€™ve never actually done it before.

Depending on your CPU instruction set, the unrolling factor may need tweaking. I have an ancient CPU where `2` was the optimal factor. Most modern CPUs will see better performance with 4 for Float64 or 8 for Float32.

You can write a generated function which automatically unrolls the loop at a specified factor, but Iâ€™m not sure if youâ€™ll find the performance gain worth it since you will still be limited by the if statements. `Base.Cartesian` helps you do stuff like this.

Note also that the benchmarking might be a bit misleading here. After the first run, `best_err` already contains the result, and the branch prediction then has a trivial job predicting the branches since none of them will be entered. A better benchmark would create new arrays each time so that the branch predictor can not learn the patterns.

1 Like

Itâ€™s worth nothing that that the `@simd` annotation on the innermost for loop doesnâ€™t do anything because of the `if` branch.

Now, I know you said this function is being called in a multi-threaded loop, but have you considered using threads inside this function as well? If the outer multi-threaded loop is using `Threads.@spawn` (or anything derived from that) instead of `Threads.@threads`, then you shouldnâ€™t get any destructive interference from the nested multi-threading and can see performance improvements if thereâ€™s any waiting happening in the outer threaded loop.

One other thing I just noticed, when you write

``````    @inbounds for k in 1:d, j in 1:k
for i in 1:n
score = M[j, k] - N[i, j] - N[i, k]

# save minimum value and index
if score < best_err[i]
best_err[i], row[i], col[i] = score, j, k
end
end
end
``````

the `M[j, k]` access is a constant in the `i` the loop and could be hoisted up a level, but Julia is not doing that. Hence, I see that

``````    @inbounds for k in 1:d, j in 1:k
Mjk = M[j, k]
for i in 1:n
score = Mjk - N[i, j] - N[i, k]
# save minimum value and index
if score < best_err[i]
best_err[i], row[i], col[i] = score, j, k
end
end
end
``````

is ~10% faster than your version on my machine.

1 Like

I see. The outer loop is indeed a `Threads.@threads` and Iâ€™ve also never tried using `Threads.@spawn`. Thank you! Now I have another things to try.