# Declaring variables inside a for loop using Threads.@threads has a significant impact on performance

Hello. This is my first time asking a question, so I apologize if it’s not in the proper form. I’m also new to programming and to Julia.

As the title says, I often code by indexing into array data and assigning the values to new variables. This is normally fine, but when using the Base.Threads library it causes a serious performance degradation (hundreds of times slower).

This code is a simple test; in practice, the real code is much more complex.
Using the style annotated with “# good” doesn’t degrade performance, but it makes the code very long and less readable. Why is this happening, and what is the best way to write code that is both readable and performant?

```julia
using Base.Threads

struct infomations
    nn :: Int64
    A1 :: Matrix{Float64}
    A2 :: Matrix{Float64}
    dx :: Float64
    dy :: Float64
    function infomations(nn, L)
        nn = nn + 4
        A1 = zeros(nn, nn)
        A2 = rand(nn, nn)
        dx = L/nn
        dy = L/nn
        new(nn, A1, A2, dx, dy)
    end
end

function get_Div(obj1)
    (; nn, A1, A2, dx, dy) = obj1
    ny, nx = size(A1)
    Threads.@threads for j = 3:ny-2
        for i = 3:nx-2
            # bad: pull the stencil values into temporary variables first
            ee = A2[j,i+2]; e = A2[j,i+1]; w = A2[j,i-1]; ww = A2[j,i-2]
            nn = A2[j+2,i]; n = A2[j+1,i]; s = A2[j-1,i]; ss = A2[j-2,i]
            ∂B1∂x = (-ee + 8*e - 8*w + ww)/dx/12
            ∂B2∂y = (-nn + 8*n - 8*s + ss)/dy/12
            A1[j,i] = ∂B1∂x + ∂B2∂y

            # good: index the array directly inside the expression
            ∂B1∂x = (-A2[j,i+2] + 8*A2[j,i+1] - 8*A2[j,i-1] + A2[j,i-2])/dx/12
            ∂B2∂y = (-A2[j+2,i] + 8*A2[j+1,i] - 8*A2[j-1,i] + A2[j-2,i])/dy/12
            A1[j,i] = ∂B1∂x + ∂B2∂y
        end
    end
end

function run!(obj1, iter)
    for i = 1:iter
        get_Div(obj1)
    end
end

obj1 = infomations(2000, 3)
@time run!(obj1, 200)
```

Your “bad” and “good” examples should both be equally performant, as far as I can tell.

Unfortunately I cannot run any code right now, but here are a few ideas:

Use BenchmarkTools.jl with `@btime` (are you sure you haven’t measured compilation time?). Your example should run pretty fast, so it’s possible that the overhead of spawning many threads is too large to give any speedup.
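For example, a minimal sketch of `@btime` usage (assuming BenchmarkTools.jl is installed; the array here is just a stand-in for your workload):

```julia
using BenchmarkTools

A = rand(100, 100)
# `$A` interpolates the global variable into the benchmark, so the timing
# excludes global-variable lookup; @btime also runs the code repeatedly
# and reports the minimum, so compilation time is not included.
@btime sum($A)
```

Unlike a single `@time` call, this separates compilation from execution and reports allocations, which is exactly what you want to watch here.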

Maybe your example is too far removed from your actual use case, but it looks like you are doing convolutions with a small kernel. I can highly recommend the package Stencils.jl for that kind of thing. It has multithreading implemented and is quite performant. (There are also FFT-based options, see DSP.jl, but I think those only become fast once your kernels get larger.)

This is the culprit: you overwrite `nn`, which is first an `Int` (the field destructured from `obj1`) and later a `Float64` (the array element). Changing the second `nn` to another name recovers the lost performance.


`@code_warntype` reveals that the variable `nn` is boxed, i.e. its type cannot be inferred, so the type of `∂B2∂y` cannot be inferred either. This leads to allocations, which is devastating for performance in parallel runs. Moreover, the program becomes incorrect: the variable `nn` is shared among all the threads, so you get race conditions.
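The boxing can be reproduced in isolation. The body of a `Threads.@threads` loop is lowered to a closure, and a variable that is assigned both outside and inside a closure gets wrapped in a `Core.Box`; this is a minimal sketch (`boxed_demo` is a hypothetical name, not from the thread):

```julia
function boxed_demo()
    nn = 1                 # bound to an Int64 here...
    f = () -> (nn = 2.5)   # ...and reassigned to a Float64 inside a closure
    f()
    return nn              # `nn` is boxed: its type can no longer be inferred
end
```

Running `@code_warntype boxed_demo()` shows `nn::Core.Box`, the same symptom as in your threaded loop.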

Rename your temporary `nn` in the “bad code” example.

```julia
julia> @code_warntype get_Div(obj1)
...
nn::Core.Box
...
```

Oh… I must have missed this, thank you so much.
It’s scary that duplicate names don’t throw an error and only silently hurt performance.