Question about lower performance when using @threads in a for loop

Hello,
I am surprised that when using the below code, I have the following execution time:

using .Threads

n=10000;
s=1
Y=zeros(n,n);

@time begin
    for i in 1:n
        for m in 1:30
            if s == 1
                Y[i,i] += m;
            end
        end
    end
end
0.039520 seconds (629.95 k allocations: 9.811 MiB, 21.17% compilation time)

but when I use @threads as below, I have a longer execution time:

using .Threads

n=10000;
s=1
Y=zeros(n,n);

@time begin
    @threads for i in 1:n
        for m in 1:30
            if s == 1
                Y[i,i] += m;
            end
        end
    end
end
0.076712 seconds (1.80 M allocations: 30.648 MiB, 92.55% compilation time)

Any explanation or suggestion please?
Thank you!

Threads can have pretty high overhead; see for example

Though actually one should switch to timing with @btime before drawing any serious conclusions from the above

While threads do have overhead, there are some other issues. You are using variables in global scope, so their types can't be inferred during compilation. You are also running the loop itself in global scope, which can cause further problems. Finally, you are including compilation time in your @time measurements: the output shows that the threaded version spends 92% of its time compiling, compared to 21% for the non-threaded one. A fairer comparison is this:

julia> function foo()
           n = 10000
           s = 1
           for i in 1:n
               for m in 1:30
                   if s == 1
                       Y[i,i] += m
                   end
               end
           end
       end
foo (generic function with 1 method)

julia> foo()

julia> @time foo()
  0.033625 seconds (1.74 M allocations: 26.530 MiB)

julia> function foo2()
           n = 10000
           s = 1
           @threads for i in 1:n
               for m in 1:30
                   if s == 1
                       Y[i,i] += m
                   end
               end
           end
       end
foo2 (generic function with 1 method)

julia> foo2()

julia> @time foo2()
  0.011409 seconds (1.74 M allocations: 26.532 MiB)

Here you go, I should've included it in the first one:

julia> @benchmark foo()
BenchmarkTools.Trial: 282 samples with 1 evaluations.
 Range (min … max):  14.471 ms … 41.847 ms  ┊ GC (min … max): 0.00% … 63.10%
 Time  (median):     16.291 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.727 ms ±  5.913 ms  ┊ GC (mean ± σ):  9.13% ± 15.23%

  ▂  ▇█▁
  █▆████▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▅▁▇▇▆ ▅
  14.5 ms      Histogram: log(frequency) by time      40.4 ms <

 Memory estimate: 26.53 MiB, allocs estimate: 1738680.

julia> @benchmark foo2()
BenchmarkTools.Trial: 761 samples with 1 evaluations.
 Range (min … max):  3.973 ms … 117.255 ms  ┊ GC (min … max):  0.00% … 94.69%
 Time  (median):     4.648 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.557 ms ±  14.078 ms  ┊ GC (mean ± σ):  28.28% ± 12.44%

  █
  █▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▄▄ ▆
  3.97 ms      Histogram: log(frequency) by time       113 ms <

 Memory estimate: 26.53 MiB, allocs estimate: 1738701.

Yeah, similarly:

julia> Threads.nthreads()
4

julia> @btime begin
          for i in 1:$n
              for m in 1:30
                  if $s == 1
                      $Y[i,i] += m
                  end
              end
          end
       end
  316.747 μs (0 allocations: 0 bytes)

julia> @btime begin
          Threads.@threads for i in 1:$n
              for m in 1:30
                  if $s == 1
                      $Y[i,i] += m
                  end
              end
          end
       end
  182.378 μs (20 allocations: 1.59 KiB)

where the $ signs interpolate the global variables into the benchmark.
You can obtain even better performance with LoopVectorization though:

julia> @btime begin
          @turbo for i in 1:$n
              for m in 1:30
                  Y[i,i] += m * ($s == 1)
              end
          end
       end
  163.157 μs (15 allocations: 448 bytes)

julia> @btime begin
          @tturbo for i in 1:$n  # multi-threaded
              for m in 1:30
                  Y[i,i] += m * ($s == 1)
              end
          end
       end
  46.587 μs (15 allocations: 448 bytes)

@turbo does single-threaded SIMD vectorization; @tturbo does the same but adds lightweight multithreading from Polyester.jl. Note that you have to eliminate the if statement (and effectively evaluate both sides of the branch) to use this, but it still comes out ahead.
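
In case the branch elimination is unclear, here is a minimal sketch in plain Julia (no LoopVectorization needed) showing that multiplying by the Bool gives the same result as the if statement. The function names `with_branch!` and `branch_free!` are made up for illustration, and the sketch uses a plain vector rather than the matrix diagonal from the thread.

```julia
# Hypothetical helpers, not from the posts above.
function with_branch!(y, s)
    for i in eachindex(y), m in 1:30
        if s == 1
            y[i] += m
        end
    end
    return y
end

function branch_free!(y, s)
    for i in eachindex(y), m in 1:30
        # (s == 1) is a Bool; in arithmetic it acts as 0 or 1,
        # so the update is always evaluated but adds 0 when s != 1.
        y[i] += m * (s == 1)
    end
    return y
end

# Both forms agree whether or not the condition holds.
println(with_branch!(zeros(4), 1) == branch_free!(zeros(4), 1))
println(with_branch!(zeros(4), 0) == branch_free!(zeros(4), 0))
```

The branch-free form does a little extra arithmetic when the condition is false, which is the trade-off mentioned above.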

I will always recommend taking a look at Performance Tips · The Julia Language. There you will find most of the pitfalls I explained and some others, which leads to better performance and fewer issues while benchmarking.

Yeah, https://github.com/JuliaSIMD/LoopVectorization.jl and JuliaFolds/FLoops.jl on GitHub (fast sequential, threaded, and distributed for-loops for Julia) are good resources for speeding things up.

I am really confused when I am running my code

using .Threads

function foo()
    n = 10000
    s = 1
    Y = zeros(n, n)

    for i in 1:n
        for m in 1:30
            if s == 1
                Y[i,i] += m
            end
        end
    end
end
julia> @time foo()
  0.162492 seconds (2 allocations: 762.940 MiB, 4.30% gc time)

julia> nthreads()
12

using .Threads

function foo2()
    n = 10000
    s = 1
    Y = zeros(n, n)

    @threads for i in 1:n
        for m in 1:30
            if s == 1
                Y[i,i] += m
            end
        end
    end
end
julia> @time foo2()
  0.175901 seconds (63.74 k allocations: 766.888 MiB, 3.07% gc time, 10.05% compilation time)

The top one in your screenshot is counting compilation again. You should probably use @btime or @benchmark to avoid that happening by accident all the time.
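
To illustrate the compilation point without any packages, here is a rough sketch that times the same call twice with `@elapsed` from Base. The helper `count_updates` is made up for this example; the first call includes JIT compilation, the second runs already-compiled code. Exact numbers will vary by machine, so none are shown.

```julia
# Hypothetical helper, not from the posts above: same loop shape, no matrix.
function count_updates(n)
    acc = 0.0
    for i in 1:n, m in 1:30
        acc += m
    end
    return acc
end

first_call  = @elapsed count_updates(10_000)   # includes compilation of count_updates
second_call = @elapsed count_updates(10_000)   # compiled code only

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

@btime and @benchmark run the function many times and report statistics, so the one-off compilation cost does not dominate the result.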

Is this the correct way to use @benchmark?

That seems to be allocating a massive amount of memory (700 MB!), so I suspect a global variable is sneaking in there somehow (perhaps if the function foo uses any variables that are not either explicitly passed to it or defined within it).

You might try copy-pasting the snippets I wrote above, which use $ for interpolation

The allocated memory is the n×n matrix; it's quite large:

a = zeros(10000,10000)
julia> varinfo()
  name                    size summary
  –––––––––––––––– ––––––––––– –––––––––––––––––––––––––––
  Base                         Module
  Core                         Module
  InteractiveUtils 256.530 KiB Module
  Main                         Module
  a                762.939 MiB 10000×10000 Matrix{Float64}
  ans              762.939 MiB 10000×10000 Matrix{Float64}

You can see that while it uses a lot of memory, it needs very few allocations.
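
Since creating the matrix is a large part of what foo and foo2 do, one way to compare just the loops is to allocate Y once outside and pass it in. Below is a minimal sketch under that assumption; `add_diag!` and `add_diag_threaded!` are made-up names, n is reduced to 2000 only to keep memory modest, and `@elapsed` after a warm-up call stands in for @btime so the snippet needs no packages.

```julia
using Base.Threads

# Hypothetical helpers: Y and s are arguments, so no globals are used inside.
function add_diag!(Y, s)
    for i in axes(Y, 1)
        for m in 1:30
            if s == 1
                Y[i, i] += m
            end
        end
    end
    return Y
end

function add_diag_threaded!(Y, s)
    # Each iteration writes only to its own Y[i, i], so threads do not conflict.
    @threads for i in axes(Y, 1)
        for m in 1:30
            if s == 1
                Y[i, i] += m
            end
        end
    end
    return Y
end

n = 2_000            # smaller than the 10000 used in the thread (assumption)
Y1 = zeros(n, n)     # allocated once, outside the timed region
Y2 = zeros(n, n)

# Warm-up calls so compilation is not part of the measurement.
add_diag!(Y1, 1)
add_diag_threaded!(Y2, 1)

t_serial   = @elapsed add_diag!(Y1, 1)
t_threaded = @elapsed add_diag_threaded!(Y2, 1)
println("serial:   ", t_serial, " s")
println("threaded: ", t_threaded, " s")
```

With BenchmarkTools, the equivalent would be along the lines of `@btime add_diag!($Y1, 1)`, interpolating the matrix as in the earlier snippets. For a loop this small, the threaded version may or may not come out ahead, depending on the thread count and scheduling overhead.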

Do you think I am using @benchmark correctly? The results for foo and foo2 are close to each other, but foo2 should give better performance.