Declaring variables inside a for loop using Threads.@threads has a significant impact on performance

Hello. This is my first time asking a question, so I apologize if it’s not in the proper format. I’m also new to programming and Julia.

As the title says, I often index into array data and assign the result to a new variable. This is normally fine, but with the Base.Threads library it causes a serious slowdown (hundreds of times slower).

This code is a simple test; in practice, it’s much more complex.
Using the “# good” coding style doesn’t degrade performance, but it makes the code very long and less readable. Why is this happening, and what is the best way to write code that is both readable and performant?

struct infomations
    nn :: Int64
    A1 :: Matrix{Float64}
    A2 :: Matrix{Float64}
    dx :: Float64
    dy :: Float64
    function infomations(nn,L)
        nn = nn+4
        A1 = zeros(nn,nn)
        A2 = rand(nn,nn)
        dx = L/nn
        dy = L/nn
        new(nn,A1,A2,dx,dy)
    end
end

function get_Div(obj1)
    (;nn, A1, A2, dx, dy) = obj1
    ny,nx  = size(A1)
    Threads.@threads for i = 3:nx-2
        for j = 3:ny-2
            # bad
            ee = A2[j,i+2]; e = A2[j,i+1]; w = A2[j,i-1]; ww = A2[j,i-2]
            nn = A2[j+2,i]; n = A2[j+1,i]; s = A2[j-1,i]; ss= A2[j-2,i]
            ∂B1∂x = (-ee + 8*e - 8*w + ww)/dx/12
            ∂B2∂y = (-nn + 8*n - 8*s + ss)/dy/12
            A1[j,i] = ∂B1∂x + ∂B2∂y

            # good
            ∂B1∂x = (-A2[j,i+2] + 8*A2[j,i+1] - 8*A2[j,i-1] + A2[j,i-2])/dx/12
            ∂B2∂y = (-A2[j+2,i] + 8*A2[j+1,i] - 8*A2[j-1,i] + A2[j-2,i])/dy/12
            A1[j,i] = ∂B1∂x + ∂B2∂y
        end
    end
end

function run!(obj1, iter)
    for i = 1:iter
        get_Div(obj1)
    end
end

obj1 = infomations(2000,3)
@time run!(obj1, 200)

Your “bad” and “good” examples should be equally performant, as far as I can tell.

Unfortunately I cannot run any code right now, but here are a few ideas:

Use BenchmarkTools.jl with @btime (are you sure you have not measured compilation time?). Your example should run pretty fast, so it’s possible that the overhead of many threads is too large to give any speedup.
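To illustrate the compilation-time point, here is a minimal sketch using only Base. The function stencil_sum is a hypothetical stand-in for the real kernel, not code from the post. The first call to a function includes JIT compilation, so timing it twice separates that cost from the actual run time:

```julia
# Hypothetical stand-in kernel: sums a centered difference over the interior.
function stencil_sum(A)
    s = 0.0
    for i in 2:size(A, 2)-1, j in 2:size(A, 1)-1
        s += A[j, i+1] - A[j, i-1]
    end
    return s
end

A = rand(200, 200)
t_first  = @elapsed stencil_sum(A)   # first call: includes JIT compilation
t_second = @elapsed stencil_sum(A)   # second call: already compiled
println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

With BenchmarkTools.jl installed, `@btime get_Div($obj1)` does this more carefully: it runs the call many times and reports the minimum, and the `$` interpolation avoids measuring global-variable overhead.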

Maybe your example is too far removed from your actual use case, but it looks like you are doing convolutions with a small kernel. I can highly recommend the package Stencils.jl for that kind of thing. It has multithreading implemented and is quite performant. (There are also FFT-based options, see DSP.jl, but I think those only become fast once your kernels get larger.)

The line that assigns nn = A2[j+2,i] is the culprit. You overwrite nn, which is first an Int (from destructuring obj1) and later a Float64. Renaming the second nn recovers the lost performance.
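A minimal sketch of that rename, assuming the stencil is pulled out into a standalone function. The name div_fixed! and the temporaries n1, n2, s1, s2 are hypothetical and not from the original post. Because none of the temporaries share a name with a variable defined outside the threaded loop, each one stays local to a single iteration:

```julia
# Hypothetical standalone version of the kernel with non-clashing names.
function div_fixed!(A1, A2, dx, dy)
    ny, nx = size(A1)
    Threads.@threads for i in 3:nx-2
        for j in 3:ny-2
            ee = A2[j, i+2]; e = A2[j, i+1]; w = A2[j, i-1]; ww = A2[j, i-2]
            # n2/n1/s1/s2 replace nn/n/s/ss, so no outer variable is reassigned
            n2 = A2[j+2, i]; n1 = A2[j+1, i]; s1 = A2[j-1, i]; s2 = A2[j-2, i]
            ∂B1∂x = (-ee + 8 * e - 8 * w + ww) / dx / 12
            ∂B2∂y = (-n2 + 8 * n1 - 8 * s1 + s2) / dy / 12
            A1[j, i] = ∂B1∂x + ∂B2∂y
        end
    end
    return A1
end

# Small check on a linear field, where the stencil is exact.
A2 = [3.0 * i + 5.0 * j for j in 1:12, i in 1:12]
A1 = div_fixed!(zeros(12, 12), A2, 1.0, 1.0)
println(A1[5, 5])
```

Another option, if the short names matter, would be to declare the temporaries with `local` inside the loop body so that they cannot refer to an outer variable of the same name.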

@code_warntype reveals that the variable nn is boxed, i.e. its type cannot be inferred. As a result, the type of ∂B2∂y is not inferred either. This leads to allocations, which are devastating for performance in parallel runs. Moreover, the program becomes incorrect because the variable nn is shared among all the threads, so you get race conditions.

Rename your temporary nn in the “bad code” example.

julia> @code_warntype get_Div(obj1)
...
  nn::Core.Box
...
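
The same effect can be reproduced without threads, since the body of a Threads.@threads loop is turned into a closure. The sketch below is an illustration under that assumption; the functions shadowed and renamed are hypothetical names, not part of the original code. A variable that is captured by a closure and reassigned inside it gets boxed:

```julia
using InteractiveUtils  # provides @code_warntype outside the REPL

function shadowed(v::Vector{Float64})
    n = length(v)              # n starts out as an Int
    f = () -> (n = v[1]; n)    # the closure reassigns the OUTER n
    return f()
end

function renamed(v::Vector{Float64})
    n = length(v)
    f = () -> (x = v[1]; x)    # x is a fresh local inside the closure
    return f() + n
end

# Look for Core.Box in the first report; the second should be fully typed.
@code_warntype shadowed([1.0, 2.0])
@code_warntype renamed([1.0, 2.0])
```

Comparing the two reports shows that only the reused name is affected; the rest of the function body is unchanged.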

Oh… I must have missed this, thank you so much.
It’s scary that duplicate names don’t throw an error and only hurt performance.
I’ll check out your other advice as well, thanks again.

Thanks for introducing me to this useful macro, I know a bit more about Julia thanks to you. Have a nice day.