Question about lower performance when using @threads in a for loop

Amro · July 9, 2021, 1:17am

Hello,
I am surprised that with the code below, I get the following execution time:

using .Threads

n=10000;
s=1
Y=zeros(n,n);

@time begin
    for i in 1:n
        for m in 1:30
            if s == 1
                Y[i,i] += m;
            end
        end
    end
end
0.039520 seconds (629.95 k allocations: 9.811 MiB, 21.17% compilation time)

But when I use @threads as below, the execution time is longer:

using .Threads

n=10000;
s=1
Y=zeros(n,n);

@time begin
    @threads for i in 1:n
        for m in 1:30
            if s == 1
                Y[i,i] += m;
            end
        end
    end
end
0.076712 seconds (1.80 M allocations: 30.648 MiB, 92.55% compilation time)

Any explanation or suggestion please?
Thank you!

brenhinkeller · July 9, 2021, 1:25am

Threads can have pretty high overhead – see for example

brenhinkeller · July 9, 2021, 1:30am

Though actually one should switch to timing with @btime before drawing any serious conclusions from the above

gbaraldi · July 9, 2021, 1:31am

While threads do have overhead, there are some other issues here. You are using variables in global scope, so their types can't be inferred during compilation. The timed code itself also runs in global scope, which can cause problems of its own. Finally, your @time measurements include compilation time: the threaded version spends 92% of its time compiling, compared to 21% for the non-threaded one. A fairer comparison is this:

function foo()
       n=10000
       s = 1
        for i in 1:n
                    for m in 1:30
                          if s == 1
                              Y[i,i] += m;
                          end
                      end
                  end
       end
foo (generic function with 1 method)

julia> foo()

julia> @time foo()
  0.033625 seconds (1.74 M allocations: 26.530 MiB)

julia> function foo2()
       n=10000
       s = 1
        @threads for i in 1:n
                      for m in 1:30
                          if s == 1
                              Y[i,i] += m;
                          end
                      end
                  end
                end
foo2 (generic function with 1 method)

julia> foo2()

julia> @time foo2()
  0.011409 seconds (1.74 M allocations: 26.532 MiB)

Sorry for the crappy indenting.
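
For reference, a minimal sketch of the same comparison with Y and s passed in as arguments, so that nothing inside the functions is read from global scope (foo and foo2 above still pick up Y as a global, which is probably where the 1.74 M allocations come from). The names fill_diag! and fill_diag_threaded! are made up for this example, and n is reduced to keep it quick:

```julia
using Base.Threads

# Hypothetical helper: Y and s are arguments, so their types are known
# to the compiler when the function is specialized.
function fill_diag!(Y, s)
    for i in axes(Y, 1)
        for m in 1:30
            if s == 1
                Y[i, i] += m
            end
        end
    end
    return Y
end

# Same loop, with the outer iterations split across threads.
# Each iteration writes only to Y[i, i], so there is no data race.
function fill_diag_threaded!(Y, s)
    @threads for i in axes(Y, 1)
        for m in 1:30
            if s == 1
                Y[i, i] += m
            end
        end
    end
    return Y
end

n = 1_000                      # smaller than in the thread, for a quick run
Y1 = zeros(n, n)
Y2 = zeros(n, n)

fill_diag!(Y1, 1)              # warm-up calls: trigger compilation
fill_diag_threaded!(Y2, 1)

@time fill_diag!(Y1, 1)        # timed calls no longer include compilation
@time fill_diag_threaded!(Y2, 1)
```

Each function is called once before timing so that @time does not count compilation, and the matrix is allocated outside the functions so only the loop is measured.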

gbaraldi · July 9, 2021, 1:40am

Here you go, should’ve included it on the first one

julia> @benchmark foo()
BechmarkTools.Trial: 282 samples with 1 evaluations.
 Range (min … max):  14.471 ms … 41.847 ms  ┊ GC (min … max): 0.00% … 63.10%
 Time  (median):     16.291 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.727 ms ±  5.913 ms  ┊ GC (mean ± σ):  9.13% ± 15.23%

  ▂  ▇█▁
  █▆████▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▅▁▇▇▆ ▅
  14.5 ms      Histogram: log(frequency) by time      40.4 ms <

 Memory estimate: 26.53 MiB, allocs estimate: 1738680.

julia> @benchmark foo2()
BechmarkTools.Trial: 761 samples with 1 evaluations.
 Range (min … max):  3.973 ms … 117.255 ms  ┊ GC (min … max):  0.00% … 94.69%
 Time  (median):     4.648 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.557 ms ±  14.078 ms  ┊ GC (mean ± σ):  28.28% ± 12.44%

  █
  █▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▄▄ ▆
  3.97 ms      Histogram: log(frequency) by time       113 ms <

 Memory estimate: 26.53 MiB, allocs estimate: 1738701.

brenhinkeller · July 9, 2021, 1:45am

Yeah, similarly:

julia> Threads.nthreads()
4

julia> @btime begin
          for i in 1:$n
              for m in 1:30
                  if $s == 1
                      $Y[i,i] += m
                  end
              end
          end
       end
  316.747 μs (0 allocations: 0 bytes)

julia> @btime begin
          Threads.@threads for i in 1:$n
              for m in 1:30
                  if $s == 1
                      $Y[i,i] += m
                  end
              end
          end
       end
  182.378 μs (20 allocations: 1.59 KiB)

where the $s interpolate the global variables into the benchmark.
You can obtain even better performance with LoopVectorization though:

julia> @btime begin
          @turbo for i in 1:$n
              for m in 1:30
                  Y[i,i] += m * ($s == 1)
              end
          end
       end
  163.157 μs (15 allocations: 448 bytes)

julia> @btime begin 
          @tturbo for i in 1:$n #Multi-threaded
              for m in 1:30
                  Y[i,i] += m * ($s == 1)
              end
          end
       end
  46.587 μs (15 allocations: 448 bytes)

@turbo does single-threaded SIMD vectorization; @tturbo does the same but with lightweight multithreading from Polyester.jl. Note that you have to eliminate the if statement (and effectively evaluate both sides of the branch) to use this, but it still comes out ahead.
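
To illustrate the point about removing the if, here is a small sketch in plain Julia (no LoopVectorization needed) showing that multiplying by the Boolean s == 1 gives the same result as the branch. The function names are hypothetical:

```julia
# Original form: the update is guarded by a branch.
function diag_branch!(Y, s)
    for i in axes(Y, 1), m in 1:30
        if s == 1
            Y[i, i] += m
        end
    end
    return Y
end

# Branch-free form: (s == 1) is a Bool, and m * true == m while m * false == 0,
# so the addition always runs but adds zero when the condition is false.
function diag_branchless!(Y, s)
    for i in axes(Y, 1), m in 1:30
        Y[i, i] += m * (s == 1)
    end
    return Y
end

A = diag_branch!(zeros(100, 100), 1)
B = diag_branchless!(zeros(100, 100), 1)
C = diag_branchless!(zeros(100, 100), 0)   # condition false: Y stays all zeros
println(A == B, " ", all(iszero, C))
```

The branch-free version does a little extra arithmetic when the condition is false, which is the "evaluate both sides" cost mentioned above.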

gbaraldi · July 9, 2021, 1:46am

I always recommend taking a look at the Performance Tips section of the Julia manual. There you will find most of the pitfalls I explained, and some others, which leads to better performance and fewer issues while benchmarking.

gbaraldi · July 9, 2021, 1:48am

Yeah, https://github.com/JuliaSIMD/LoopVectorization.jl and FLoops.jl (JuliaFolds/FLoops.jl: fast sequential, threaded, and distributed for-loops for Julia) are good resources for speeding things up.

Amro · July 9, 2021, 2:01am

I am really confused by what I see when I run my code:

using .Threads

function foo()
    n = 10000
    s = 1
    Y = zeros(n, n)

    for i in 1:n
        for m in 1:30
            if s == 1
                Y[i,i] += m
            end
        end
    end
end

julia> @time foo()
  0.162492 seconds (2 allocations: 762.940 MiB, 4.30% gc time)

julia> nthreads()
12

using .Threads

function foo2()
    n = 10000
    s = 1
    Y = zeros(n, n)

    @threads for i in 1:n
        for m in 1:30
            if s == 1
                Y[i,i] += m
            end
        end
    end
end

julia> @time foo2()
  0.175901 seconds (63.74 k allocations: 766.888 MiB, 3.07% gc time, 10.05% compilation time)

brenhinkeller · July 9, 2021, 2:23am

The top one in your screenshot is counting compilation again. You should probably use @btime or @benchmark to avoid that happening by accident all the time.
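
If you want to see the effect without BenchmarkTools, a rough sketch is to time the same call twice with @elapsed from Base. The first call includes JIT compilation and the second does not. The function diag_sum is a made-up stand-in for the loop in this thread:

```julia
# Hypothetical stand-in for the loop being timed.
function diag_sum(n)
    total = 0.0
    for i in 1:n, m in 1:30
        total += m
    end
    return total
end

t_first  = @elapsed diag_sum(10_000)   # first call: compilation + run time
t_second = @elapsed diag_sum(10_000)   # second call: run time only
println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

@btime and @benchmark run the code repeatedly and report statistics over those runs, so a one-off compilation does not end up in the number you read.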

Amro · July 9, 2021, 2:39am

Is this the correct way to use @benchmark?

brenhinkeller · July 9, 2021, 2:48am

That seems to be allocating a massive amount of memory (700 MB!), so I suspect a global variable is sneaking in there somehow (perhaps if the function foo uses any variables that are not either explicitly passed to it or defined within it).

You might try copy-pasting the snippets I wrote above, which use $ for interpolation

gbaraldi · July 9, 2021, 1:53pm

The allocs are the n×n matrix; it's quite large:

julia> a = zeros(10000,10000);

julia> varinfo()
  name                    size summary
  –––––––––––––––– ––––––––––– –––––––––––––––––––––––––––
  Base                         Module
  Core                         Module
  InteractiveUtils 256.530 KiB Module
  Main                         Module
  a                762.939 MiB 10000×10000 Matrix{Float64}
  ans              762.939 MiB 10000×10000 Matrix{Float64}

You can see that although it uses a lot of memory, it takes very few allocations.
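
The size is easy to check by hand: 10000 × 10000 Float64 values at 8 bytes each. The sketch below does that arithmetic and then, as a suggestion rather than something taken from the posts above, times the allocation and the loop separately. add_diag! is a hypothetical name, and a smaller matrix is used so it runs quickly:

```julia
# Size of an n×n Float64 matrix, matching the varinfo() output above.
n = 10_000
bytes = n * n * sizeof(Float64)    # 8 bytes per Float64
mib = bytes / 2^20                 # bytes -> MiB
println(bytes, " bytes = ", round(mib; digits = 3), " MiB")

# Hypothetical helper: the loop from the thread, operating on a matrix
# that is passed in instead of being allocated inside the timed function.
function add_diag!(Y, s)
    for i in axes(Y, 1), m in 1:30
        if s == 1
            Y[i, i] += m
        end
    end
    return Y
end

k = 2_000                            # smaller size so the sketch runs quickly
t_alloc = @elapsed zeros(k, k)       # time to allocate and zero a k×k matrix
Yk = zeros(k, k)
add_diag!(Yk, 1)                     # warm-up call (compiles add_diag!)
t_loop = @elapsed add_diag!(Yk, 1)   # time spent in the loop alone
println("allocation: ", t_alloc, " s, loop: ", t_loop, " s")
```

If the allocation turns out to dominate, then timing a function that calls zeros(n,n) internally mostly measures the allocation, whichever loop follows it.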

Amro · July 9, 2021, 2:37pm

Do you think I am using @benchmark correctly? The results for foo and foo2 are close to each other, but foo2 should give better performance.
