The following code is inside a multithreaded loop which gets called a lot. It essentially loops over a symmetric matrix M (dimension ~50k at most) and computes the minimum with respect to each row of N in a funny way.

using BenchmarkTools, Random

"""
Find minimum of `M_jk - N_ij - N_ik` for each i. M is upper triangular.
Saves minimum value in `best_err` and `j, k` index in `row` and `col`.
"""
function best_index!(
    best_err::Vector{T},
    row::Vector{Int},
    col::Vector{Int},
    M::AbstractMatrix{T},
    N::AbstractMatrix{T},
) where T <: Real
    n, d = size(N)
    @inbounds for k in 1:d, j in 1:k
        @simd for i in 1:n
            score = M[j, k] - N[i, j] - N[i, k]
            # save minimum value and index
            if score < best_err[i]
                best_err[i], row[i], col[i] = score, j, k
            end
        end
    end
end

Random.seed!(2020)
n = 100
p = 100
d = 5000
row_index, col_index = ones(Int, n), ones(Int, n)
best_err = [typemax(Float64) for _ in 1:n]
M = rand(d, d)
N = rand(n, d)

julia> @btime best_index!($best_err, $row_index, $col_index, $M, $N)
  1.162 s (0 allocations: 0 bytes)

For context, this piece of code used to work well for the small datasets I was working with. But on a new, larger dataset with d ~ 20k, this function starts consuming > 95% of the runtime. Since it's so simple, I have no idea how to further optimize it.

This already looks like pretty good code. Maybe the compiler isn't smart enough to figure out how to SIMD this loop? You can check that with code_native. Although since this looks very much bandwidth-limited, maybe the SIMD doesn't actually do anything (I'm not very knowledgeable about this).
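To check, you can dump the assembly or the LLVM IR for the call and look for vector instructions (this uses only the variables from the original post):

```julia
# Look for packed instructions (e.g. vsubpd / vcmppd on x86) to confirm the
# inner loop was vectorized; purely scalar code means the @simd hint did nothing.
@code_native debuginfo=:none best_index!(best_err, row_index, col_index, M, N)

# Alternatively, look for vector types like <4 x double> in the LLVM IR:
@code_llvm debuginfo=:none best_index!(best_err, row_index, col_index, M, N)
```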

I always store the "largest" dimension (here d) first in arrays - it's usually the right thing to do mathematically. Here that would mean N = rand(d, n). It might have a performance impact (although I wouldn't dare try to guess without benchmarking it). Apart from that, I would play with loop ordering, and add threading.
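If you want to try that, a sketch of the transposed variant (best_index_t! is my name; note that with i as the innermost loop, the original n×d layout already gives unit-stride access to N, so only benchmarking can tell which layout wins):

```julia
# Same kernel, but with N stored transposed as a d×n matrix Nt,
# so the "large" dimension d comes first.
function best_index_t!(best_err, row, col, M, Nt)
    d, n = size(Nt)
    @inbounds for k in 1:d, j in 1:k
        Mjk = M[j, k]
        for i in 1:n
            # Nt[j, i] now strides by d over i, whereas N[i, j] was unit-stride
            score = Mjk - Nt[j, i] - Nt[k, i]
            if score < best_err[i]
                best_err[i], row[i], col[i] = score, j, k
            end
        end
    end
end

Nt = permutedims(N)  # d×n copy of N
@btime best_index_t!($best_err, $row_index, $col_index, $M, $Nt)
```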

I managed to get some speedup by manually unrolling and using SIMD.jl:

using SIMD

function best_index!(
    best_err::Vector{T},
    row::Vector{Int},
    col::Vector{Int},
    M::AbstractMatrix{T},
    N::AbstractMatrix{T},
) where T <: Real
    n, d = size(N)
    @inbounds for k in 1:d, j in 1:k
        @fastmath for i in 1:4:n
            iv = VecRange{4}(i)
            score = M[j, k] - N[iv, j] - N[iv, k]
            if score[1] < best_err[i]
                best_err[i], row[i], col[i] = score[1], j, k
            end
            if score[2] < best_err[i+1]
                best_err[i+1], row[i+1], col[i+1] = score[2], j, k
            end
            if score[3] < best_err[i+2]
                best_err[i+2], row[i+2], col[i+2] = score[3], j, k
            end
            if score[4] < best_err[i+3]
                best_err[i+3], row[i+3], col[i+3] = score[4], j, k
            end
        end
    end
end

Note that this code only handles cases where the relevant dimension is a multiple of the unrolling factor (4 in the code above).
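One way to lift that restriction is to stop the vector loop at the last full block and sweep the leftover rows with scalar code. A sketch (best_index_tail! is my name; the four explicit branches are folded into a small fixed-length loop, which the compiler typically unrolls):

```julia
using SIMD

# Same kernel, but the vector loop only covers i in 1:4:n4 where
# n4 = n - n % 4; a plain scalar loop handles the remaining rows.
function best_index_tail!(best_err::Vector{T}, row::Vector{Int}, col::Vector{Int},
                          M::AbstractMatrix{T}, N::AbstractMatrix{T}) where T <: Real
    n, d = size(N)
    n4 = n - n % 4                  # last row covered by full 4-wide blocks
    @inbounds for k in 1:d, j in 1:k
        Mjk = M[j, k]
        for i in 1:4:n4             # vectorized part
            iv = VecRange{4}(i)
            score = Mjk - N[iv, j] - N[iv, k]
            for l in 1:4            # fixed trip count; compiler can unroll
                if score[l] < best_err[i+l-1]
                    best_err[i+l-1], row[i+l-1], col[i+l-1] = score[l], j, k
                end
            end
        end
        for i in n4+1:n             # scalar tail for the leftover rows
            score = Mjk - N[i, j] - N[i, k]
            if score < best_err[i]
                best_err[i], row[i], col[i] = score, j, k
            end
        end
    end
end
```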

1.255 s (0 allocations: 0 bytes) # default
1.095 s (0 allocations: 0 bytes) # SIMD.jl

Depending on your CPU instruction set, the unrolling factor may need tweaking. I have an ancient CPU where 2 was the optimal factor. Most modern CPUs will see better performance with 4 for Float64 or 8 for Float32.

You can write a generated function which automatically unrolls the loop at a specified factor, but I'm not sure if you'll find the performance gain worth it since you will still be limited by the if statements. Base.Cartesian helps you do stuff like this.
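A sketch of that idea (best_index_unrolled! is my name; it still assumes n is a multiple of W). Base.Cartesian.@nexprs stamps out the W comparison branches at compile time, interpolating $W into the quoted body the way the Base.Cartesian docs show:

```julia
using SIMD, Base.Cartesian

# Generic version of the hand-unrolled kernel: the unroll width W is a
# compile-time Val parameter, and @nexprs expands the branch body W times
# with l = 1, …, W.
@generated function best_index_unrolled!(best_err::Vector{T}, row::Vector{Int},
        col::Vector{Int}, M::AbstractMatrix{T}, N::AbstractMatrix{T},
        ::Val{W}) where {T <: Real, W}
    quote
        n, d = size(N)
        @inbounds for k in 1:d, j in 1:k
            Mjk = M[j, k]
            for i in 1:$W:n
                iv = VecRange{$W}(i)
                score = Mjk - N[iv, j] - N[iv, k]
                @nexprs $W l -> begin
                    if score[l] < best_err[i+l-1]
                        best_err[i+l-1], row[i+l-1], col[i+l-1] = score[l], j, k
                    end
                end
            end
        end
    end
end

# usage: best_index_unrolled!(best_err, row_index, col_index, M, N, Val(4))
```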

Note also that the benchmarking might be a bit misleading here. After the first run, best_err already contains the result, so none of the branches are ever entered and branch prediction has a trivial job. A better benchmark would create fresh arrays for each sample so that the branch predictor cannot learn the patterns.
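With BenchmarkTools that is a setup block plus evals = 1, so each timed call starts from a fresh state (uses M, N, n from the original post):

```julia
# Reset the outputs in `setup` before every sample; evals = 1 ensures the
# kernel runs exactly once per setup, so no sample sees pre-filled results.
@btime best_index!(err, r, c, $M, $N) setup = begin
    err = fill(typemax(Float64), $n)
    r = ones(Int, $n)
    c = ones(Int, $n)
end evals = 1
```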

It's worth noting that the @simd annotation on the innermost for loop doesn't do anything because of the if branch.

Now, I know you said this function is being called in a multi-threaded loop, but have you considered using threads inside this function as well? If the outer multi-threaded loop is using Threads.@spawn (or anything derived from that) instead of Threads.@threads, then you shouldn't get any destructive interference from the nested multi-threading and can see performance improvements if there's any waiting happening in the outer threaded loop.
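A sketch of threading inside the function (best_index_threaded! is my name). Since each i is written by exactly one thread, splitting the rows into contiguous blocks per thread avoids any write conflicts on best_err / row / col:

```julia
# Each thread sweeps all (j, k) pairs but only its own block of rows,
# so the j, k loop ordering (and cache behavior on N) is preserved.
function best_index_threaded!(best_err::Vector{T}, row::Vector{Int}, col::Vector{Int},
                              M::AbstractMatrix{T}, N::AbstractMatrix{T}) where T <: Real
    n, d = size(N)
    chunks = collect(Iterators.partition(1:n, cld(n, Threads.nthreads())))
    Threads.@threads for r in chunks
        @inbounds for k in 1:d, j in 1:k
            Mjk = M[j, k]
            for i in r
                score = Mjk - N[i, j] - N[i, k]
                if score < best_err[i]
                    best_err[i], row[i], col[i] = score, j, k
                end
            end
        end
    end
end
```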

@inbounds for k in 1:d, j in 1:k
    for i in 1:n
        score = M[j, k] - N[i, j] - N[i, k]
        # save minimum value and index
        if score < best_err[i]
            best_err[i], row[i], col[i] = score, j, k
        end
    end
end

the M[j, k] access is a constant in the inner i loop and could be hoisted up a level, but Julia is not doing that. Hence, I hoist it manually:

@inbounds for k in 1:d, j in 1:k
    Mjk = M[j, k]
    for i in 1:n
        score = Mjk - N[i, j] - N[i, k]
        # save minimum value and index
        if score < best_err[i]
            best_err[i], row[i], col[i] = score, j, k
        end
    end
end