Because matt1 and matt2 are (without purpose) defined before the loop? Otherwise, I don’t see it. (To be safe, declare vec1, mat, matt1 and matt2 as locals.)
Working on a blog post about it, but here’s the unpublished draft: https://github.com/JuliaLang/www.julialang.org/blob/65e25e87d01111cc45a477c8072c9f5dc5878a39/blog/2023/06/PSA-dont-use-threadid.md
Interesting. I always thought that each task in a for-loop was running separately, in the sense of not sharing local variables with the other iterations. Is that the problem in last instance?
Yes, they run separately. The problem is that to do a summation, you need to somehow merge those separate values, and the various ways people often try to merge those values is incorrect.
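A minimal sketch of a correct merge (my own illustration, not from the draft post): give each task its own accumulator, and combine the per-task partial results only after all tasks finish:

```julia
using Base.Threads: @spawn, nthreads

# Each task sums its own chunk into a task-local accumulator;
# the partial sums are merged only after every task has finished.
function threaded_sum(xs)
    chunks = Iterators.partition(eachindex(xs), max(1, length(xs) ÷ nthreads()))
    tasks = map(chunks) do chunk
        @spawn begin
            acc = zero(eltype(xs))
            for i in chunk
                acc += xs[i]
            end
            acc
        end
    end
    sum(fetch, tasks)
end
```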
But is the problem you mention arising here? If each thread is running separately and you store the result in Arg1[i], then you can sum Arg1 outside the loop.
Yes, that one is fine because you haven’t used threadid. But if you want to be efficient, you should avoid allocating an N-element vector Arg1 and can get away with nthreads() different elements, and that’s where people start getting themselves into trouble.
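For the record, the trouble usually looks like this (a deliberately broken sketch of the pattern the draft post warns against):

```julia
using Base.Threads: @threads, nthreads, threadid

# BROKEN sketch: the buf[threadid()] pattern. The read-modify-write on
# buf[threadid()] is not atomic, and since Julia 1.7 tasks can migrate
# between threads at yield points, so threadid() may change mid-iteration
# and updates can be silently lost.
function racy_sum(xs)
    buf = zeros(nthreads())
    @threads for i in eachindex(xs)
        buf[threadid()] += xs[i]  # unsafe: may give a wrong total
    end
    sum(buf)
end
```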
To get things back on topic, here’s the fastest version I could make (essentially just what @gdalle already wrote):
#+begin_src julia
using LinearAlgebra: mul!, ⋅
using BenchmarkTools: @btime

function vec_prod1(N, n)
    Arg = 0.0
    mat = Array{Float64}(undef, n, n)
    matt1 = Array{Float64}(undef, n, n)
    matt2 = Array{Float64}(undef, n, n)
    for i ∈ 1:N
        vec1 = 1:n
        mat .= vec1 .* vec1'
        matt1 .= (mat .* mat) ./ n^2
        matt2 .= mul!(matt2, mat, mat) ./ n^2  # mul! writes into matt2, then the broadcast rescales it in place
        Arg += (matt1 ⋅ matt2) / N
    end
    Arg
end
@btime vec_prod1(10, 1000)
#+end_src
#+RESULTS:
: 84.133 ms (6 allocations: 22.89 MiB)
: 2.094817739604174e19
I found this was actually faster than using explicit multithreading since the matrix multiplication is already multithreaded for 1000 x 1000
matrices and BLAS multithreading is very efficient.
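Relatedly, BLAS threading is configured separately from Julia’s own thread count; a quick sketch for checking or pinning it (using the LinearAlgebra stdlib):

```julia
using LinearAlgebra

# BLAS has its own thread pool, independent of Threads.nthreads().
# When adding explicit @spawn-level parallelism around BLAS calls,
# pinning BLAS to one thread avoids oversubscribing the cores.
nblas = BLAS.get_num_threads()
BLAS.set_num_threads(1)      # e.g. before an explicitly threaded section
BLAS.set_num_threads(nblas)  # restore afterwards
```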
Here’s the explicitly multithreaded version for comparison
#+begin_src julia
using LinearAlgebra: mul!, ⋅
using BenchmarkTools: @btime

function vec_prod3(N, n)
    chunks = Iterators.partition(1:N, max(1, N ÷ Threads.nthreads()))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            Arg = 0.0
            mat = Array{Float64}(undef, n, n)
            matt1 = Array{Float64}(undef, n, n)
            matt2 = Array{Float64}(undef, n, n)
            for i ∈ chunk
                vec1 = 1:n
                mat .= vec1 .* vec1'
                matt1 .= (mat .* mat) ./ n^2
                matt2 .= mul!(matt2, mat, mat) ./ n^2
                Arg += (matt1 ⋅ matt2) / N
            end
            Arg
        end
    end
    sum(fetch, tasks)
end
@btime vec_prod3(10, 1000)
#+end_src
#+RESULTS:
: 97.034 ms (84 allocations: 228.88 MiB)
: 2.094817739604174e19
Sorry, I don’t know why I said this was fine. You’ll still get race conditions from different threads overwriting mat, matt1 and matt2 (assuming you were actually calculating new values each iteration, instead of reusing the exact same values as is done here).
Now I see what you mean. The variables inside the loop weren’t local, because they were already defined outside the loop.
Yeah, redefining variables inside a function is error-prone. For example, for type stability:
#+begin_src julia
# STABLE
function type_stable()
    e = 1
    parameter() = e
    return e
end

# UNSTABLE: e is captured by the closure and then reassigned
function type_unstable1()
    e = 1
    parameter() = e
    e = 1
    return e
end

# UNSTABLE: the type declaration doesn't prevent the capture from being boxed
function type_unstable2()
    e = 1
    parameter() = e
    e::Int64 = 1
    return e
end
#+end_src
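One way to see the difference (my own sketch, using Base.return_types; the boxed capture from the issue linked below widens the inferred return type):

```julia
# Sketch: inspecting inferred return types. A local that is captured by a
# closure AND reassigned gets boxed (JuliaLang/julia#15276), so inference
# widens its type to Any.
function stable()
    e = 1
    parameter() = e
    return e
end

function unstable()
    e = 1
    parameter() = e
    e = 1
    return e
end

only(Base.return_types(stable, ()))    # Int
only(Base.return_types(unstable, ()))  # Any (the capture is boxed)
```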
Yeah, this is my least favorite issue in Julia: https://github.com/JuliaLang/julia/issues/15276. It’s one of the language’s most annoying performance gotchas.
Type instability isn’t the problem here, though; it’s a race condition: https://en.wikipedia.org/wiki/Race_condition
I know, I meant that redefining variables can lead to several problems (e.g. race conditions, type instability)
In this case, the variable is never reassigned; it’s just that the memory of the array it points to is overwritten.
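To illustrate that distinction (my own minimal sketch): the binding never changes, but two tasks write into the same array memory:

```julia
using Base.Threads: @spawn

a = zeros(3)          # the binding `a` is set once and never reassigned
t1 = @spawn (a .= 1)  # mutates the memory `a` points to
t2 = @spawn (a .= 2)  # mutates the same memory: a data race
wait(t1); wait(t2)
# each element ends up as 1.0 or 2.0 depending on scheduling
```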