Help speedup a tight triple loop

Hello Julia,

The following code is inside a multithreaded loop which gets called a lot. It essentially loops over a symmetric matrix M (dimension ~50k at most) and computes the minimum with respect to each row of N in a funny way.

using BenchmarkTools, Random

"""
Find minimum of `M_jk - N_ij - N_ik` for each i. M is upper triangular. 

Saves minimum value in `best_err` and `j, k` index in `row` and `col`. 
"""
function best_index!(
    best_err::Vector{T},
    row::Vector{Int},
    col::Vector{Int},
    M::AbstractMatrix{T},
    N::AbstractMatrix{T},
    ) where T <: Real

    n, d = size(N)

    @inbounds for k in 1:d, j in 1:k 
        @simd for i in 1:n
            score = M[j, k] - N[i, j] - N[i, k]

            # save minimum value and index
            if score < best_err[i]
                best_err[i], row[i], col[i] = score, j, k
            end
        end
    end
end

Random.seed!(2020)
n = 100
p = 100
d = 5000

row_index, col_index = ones(Int, n), ones(Int, n)
best_err = [typemax(Float64) for _ in 1:n]
M = rand(d, d)
N = rand(n, d)


julia> @btime best_index!($best_err, $row_index, $col_index, $M, $N)

  1.162 s (0 allocations: 0 bytes)

For context, this piece of code use to work well for some small datasets I was working with. But on a new, larger data with large d ~ 20k, this function starts consuming > 95% of time. Since it’s so simple, I have no idea how to further optimize it.

Please give me any suggestion that comes to mind.

This already looks like pretty good code. Maybe the compiler isn’t smart enough to figure out how to SIMD this loop? You can check that with code_native. Although since looks very much bandwidth-limited maybe the SIMD doesn’t actually do anything (I’m not very knowledgeable about this)

I always store the “largest” dimension (here d) first in arrays - it’s usually the right thing to do mathematically. Here that would mean N = rand(d, n). It might have a performance impact (although I wouldn’t dare try to guess without benchmarking it). Apart from that, I would play with loop ordering, and add threading.

1 Like

I managed to get some speedup my manually unrolling and using SIMD.jl

using SIMD

function best_index!(
    best_err::Vector{T},
    row::Vector{Int},
    col::Vector{Int},
    M::AbstractMatrix{T},
    N::AbstractMatrix{T},
    ) where T <: Real

    n, d = size(N)

    @inbounds for k in 1:d, j in 1:k
        @fastmath for i in 1:4:n
            iv = VecRange{4}(i)
            score = M[j, k] - N[iv, j] - N[iv, k]

            if score[1] < best_err[i]
                best_err[i], row[i], col[i] = score[1], j, k
            end

            if score[2] < best_err[i+1]
                best_err[i+1], row[i+1], col[i+1] = score[2], j, k
            end

            if score[3] < best_err[i+2]
                best_err[i+2], row[i+2], col[i+2] = score[3], j, k
            end
            
            if score[4] < best_err[i+3]
                best_err[i+3], row[i+3], col[i+3] = score[4], j, k
            end
        end
    end
end

Note that this code only handles cases where the relevant dimension is a multiple of the unrolling factor (4 in the code above).

1.255 s (0 allocations: 0 bytes) # default
1.095 s (0 allocations: 0 bytes) # SIMD.jl
1 Like

Wow thank you!! Although I’ve heard of unrolling loops I’ve never actually done it before.

Depending on your CPU instruction set, the unrolling factor may need tweaking. I have an ancient CPU where 2 was the optimal factor. Most modern CPUs will see better performance with 4 for Float64 or 8 for Float32.

You can write a generated function which automatically unrolls the loop at a specified factor, but I’m not sure if you’ll find the performance gain worth it since you will still be limited by the if statements. Base.Cartesian helps you do stuff like this.

Note also that the benchmarking might be a bit misleading here. After the first run, best_err already contains the result, and the branch prediction then has a trivial job predicting the branches since none of them will be entered. A better benchmark would create new arrays each time so that the branch predictor can not learn the patterns.

1 Like

It’s worth nothing that that the @simd annotation on the innermost for loop doesn’t do anything because of the if branch.

Now, I know you said this function is being called in a multi-threaded loop, but have you considered using threads inside this function as well? If the outer multi-threaded loop is using Threads.@spawn (or anything derived from that) instead of Threads.@threads, then you shouldn’t get any destructive interference from the nested multi-threading and can see performance improvements if there’s any waiting happening in the outer threaded loop.

One other thing I just noticed, when you write

    @inbounds for k in 1:d, j in 1:k 
        for i in 1:n
            score = M[j, k] - N[i, j] - N[i, k]

            # save minimum value and index
            if score < best_err[i]
                best_err[i], row[i], col[i] = score, j, k
            end
        end
    end

the M[j, k] access is a constant in the i the loop and could be hoisted up a level, but Julia is not doing that. Hence, I see that

    @inbounds for k in 1:d, j in 1:k 
        Mjk = M[j, k]
        for i in 1:n
            score = Mjk - N[i, j] - N[i, k]
            # save minimum value and index
            if score < best_err[i]
                best_err[i], row[i], col[i] = score, j, k
            end
        end
    end

is ~10% faster than your version on my machine.

1 Like

I see. The outer loop is indeed a Threads.@threads and I’ve also never tried using Threads.@spawn. Thank you! Now I have another things to try.