Hello,
I am surprised that when using the below code, I have the following execution time:
using .Threads
n = 10000;
s = 1
Y = zeros(n, n);
@time begin
    for i in 1:n
        for m in 1:30
            if s == 1
                Y[i, i] += m;
            end
        end
    end
end
0.039520 seconds (629.95 k allocations: 9.811 MiB, 21.17% compilation time)
but when I use @threads as below, I have a longer execution time:
using .Threads
n = 10000;
s = 1
Y = zeros(n, n);
@time begin
    @threads for i in 1:n
        for m in 1:30
            if s == 1
                Y[i, i] += m;
            end
        end
    end
end
0.076712 seconds (1.80 M allocations: 30.648 MiB, 92.55% compilation time)
While threads do have overhead, there are some other issues. You are using variables in global scope, so their types can't be inferred during compilation, and you are also running the loop itself in global scope, which causes similar problems. On top of that, you are including compilation time in your @time measurements: the threaded version spends 92% of its time compiling, compared to 21% for the non-threaded one. A fairer comparison is this:
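As a minimal sketch of the fix (the function and variable names here are my own, not from the original post), wrapping the loop in a function and passing everything in as arguments lets the compiler infer concrete types:

```julia
using Base.Threads

# All inputs are arguments, so every type is known at compile time
# and nothing is read from global scope.
function add_to_diag!(Y::AbstractMatrix, s::Integer)
    n = size(Y, 1)
    @threads for i in 1:n
        for m in 1:30
            if s == 1
                Y[i, i] += m
            end
        end
    end
    return Y
end

Y = zeros(100, 100)
add_to_diag!(Y, 1)
Y[1, 1]  # each diagonal entry gets sum(1:30) == 465.0
```

The `!` suffix is the usual Julia convention for a function that mutates its argument.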
julia> function foo()
    n = 10000
    s = 1
    for i in 1:n
        for m in 1:30
            if s == 1
                Y[i, i] += m;
            end
        end
    end
end
foo (generic function with 1 method)
julia> foo()
julia> @time foo()
0.033625 seconds (1.74 M allocations: 26.530 MiB)
julia> function foo2()
    n = 10000
    s = 1
    @threads for i in 1:n
        for m in 1:30
            if s == 1
                Y[i, i] += m;
            end
        end
    end
end
foo2 (generic function with 1 method)
julia> foo2()
julia> @time foo2()
0.011409 seconds (1.74 M allocations: 26.532 MiB)
julia> Threads.nthreads()
4
julia> @btime begin
    for i in 1:$n
        for m in 1:30
            if $s == 1
                $Y[i, i] += m
            end
        end
    end
end
316.747 μs (0 allocations: 0 bytes)
julia> @btime begin
    Threads.@threads for i in 1:$n
        for m in 1:30
            if $s == 1
                $Y[i, i] += m
            end
        end
    end
end
182.378 μs (20 allocations: 1.59 KiB)
where the $s interpolate the global variables into the benchmarked expression.
You can obtain even better performance with LoopVectorization.jl, though:
julia> @btime begin
    @turbo for i in 1:$n
        for m in 1:30
            Y[i, i] += m * ($s == 1)
        end
    end
end
163.157 μs (15 allocations: 448 bytes)
julia> @btime begin
    @tturbo for i in 1:$n  # multi-threaded
        for m in 1:30
            Y[i, i] += m * ($s == 1)
        end
    end
end
46.587 μs (15 allocations: 448 bytes)
@turbo does single-threaded SIMD vectorization; @tturbo does the same but with lightweight multithreading from Polyester.jl. Note that you have to eliminate the if statement (and effectively evaluate both sides of the branch) to use this, but it still comes out ahead.
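To see why the branchless form gives the same answer, here is a plain-Julia sketch (function names are my own) comparing the two formulations:

```julia
# Branchy version: the conditional guards the accumulation.
function diag_sum_branchy(s)
    acc = 0
    for m in 1:30
        if s == 1
            acc += m
        end
    end
    return acc
end

# Branchless version: multiplying by the Bool (false == 0, true == 1)
# evaluates the addend unconditionally but keeps the loop body
# free of branches, which is what @turbo/@tturbo require.
function diag_sum_branchless(s)
    acc = 0
    for m in 1:30
        acc += m * (s == 1)
    end
    return acc
end

diag_sum_branchy(1) == diag_sum_branchless(1)  # both give sum(1:30) == 465
```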
I always recommend taking a look at Performance Tips · The Julia Language; there you will find most of the pitfalls I explained, plus some others, leading to better performance and fewer issues while benchmarking.
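One of those tips is worth a quick illustration: if a global really must be read inside a function, declaring it const fixes its type for the compiler. A small sketch with made-up names:

```julia
# Non-const global: its type could change at any time, so functions
# reading it cannot specialize and must dispatch dynamically.
offset = 1.0
shift_nonconst(x) = x + offset

# const global: the type is fixed, so the call compiles to a plain add.
const OFFSET = 1.0
shift_const(x) = x + OFFSET

shift_nonconst(1.0) == shift_const(1.0)  # same answer, different speed
```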
using .Threads

function foo()
    n = 10000;
    s = 1
    Y = zeros(n, n);
    for i in 1:n
        for m in 1:30
            if s == 1
                Y[i, i] += m;
            end
        end
    end
end
julia> @time foo()
0.162492 seconds (2 allocations: 762.940 MiB, 4.30% gc time)
julia> nthreads()
12
using .Threads

function foo2()
    n = 10000;
    s = 1
    Y = zeros(n, n);
    @threads for i in 1:n
        for m in 1:30
            if s == 1
                Y[i, i] += m;
            end
        end
    end
end
julia> @time foo2()
0.175901 seconds (63.74 k allocations: 766.888 MiB, 3.07% gc time, 10.05% compilation time)
The top one in your screenshot is counting compilation again. You should probably use @btime or @benchmark to avoid that happening by accident.
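If you don't want a BenchmarkTools dependency, the minimal Base-only pattern (a sketch with my own function name) is to call the function once so it compiles, and only then measure:

```julia
function fill_diag!(Y)
    n = size(Y, 1)
    for i in 1:n, m in 1:30
        Y[i, i] += m
    end
    return nothing
end

Y = zeros(1000, 1000)
fill_diag!(Y)        # first call triggers compilation
@time fill_diag!(Y)  # second call measures only the run time
```

@btime does this warm-up (plus repeated sampling) for you automatically.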
That is allocating a massive amount of memory (~763 MiB), but note that this is exactly the size of the 10000×10000 Float64 array created by zeros(n, n) inside the function, so the allocation is the array itself rather than a global sneaking in. Moving that allocation out of the function (or passing Y in as an argument) would take it out of the measurement.
You might also try copy-pasting the snippets I wrote above, which use $ for interpolation.
Do you think my way of using @benchmark is correct? The results of the foo and foo2 functions are close to each other, but foo2 should give better performance.