Slowdown when multi-threading is used


#1

I’m trying to improve the performance of the LombScargle.jl package. The functions used in two methods involve for-loops that can be computed in parallel, so I thought it would be a nice idea to use Threads.@threads; however, I experience a large performance degradation when I do so.

For example, without multi-threading (current master):

julia> using LombScargle, BenchmarkTools

julia> t = collect(linspace(0.01, 10, 100)); s = sin.(t);

julia> @benchmark lombscargle(t, s, fast = false, fit_mean = false)
BenchmarkTools.Trial: 
  memory estimate:  13.36 kb
  allocs estimate:  18
  --------------
  minimum time:     4.187 ms (0.00% GC)
  median time:      4.192 ms (0.00% GC)
  mean time:        4.205 ms (0.00% GC)
  maximum time:     5.395 ms (0.00% GC)
  --------------
  samples:          1189
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Now I add Threads.@threads before the for in that loop and repeat the same benchmark with just one thread:

julia> @benchmark lombscargle(t, s, fast = false, fit_mean = false)
BenchmarkTools.Trial: 
  memory estimate:  15.67 mb
  allocs estimate:  1026273
  --------------
  minimum time:     21.778 ms (0.00% GC)
  median time:      24.138 ms (7.90% GC)
  mean time:        24.079 ms (5.88% GC)
  maximum time:     32.270 ms (7.19% GC)
  --------------
  samples:          208
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This is a pretty large slowdown. What’s most interesting is that if I move the code inside the loop to a function of its own and enable multi-threading, there is an improvement compared to the latter case (this is done in the current multi-threading branch):

julia> @benchmark lombscargle(t, s, fast = false, fit_mean = false)
BenchmarkTools.Trial: 
  memory estimate:  91.66 kb
  allocs estimate:  5023
  --------------
  minimum time:     4.504 ms (0.00% GC)
  median time:      4.555 ms (0.00% GC)
  mean time:        4.583 ms (0.17% GC)
  maximum time:     6.786 ms (31.55% GC)
  --------------
  samples:          1091
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This has the same performance as the non-threaded code, and using more threads speeds up the calculations.

Why does moving the body of the for-loop to a function change the result of the benchmark?
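For reference, the change amounts to a function barrier; roughly like this (a sketch with hypothetical names — the real kernel in the package computes the periodogram, not this toy expression):

```julia
# Sketch of the function-barrier workaround (hypothetical names).
# Moving the loop body into its own function gives the compiler a
# type-stable entry point, instead of relying on inference inside the
# closure that Threads.@threads generates.
function kernel!(P, freqs, n)
    # stand-in for the real per-frequency periodogram computation
    P[n] = sin(freqs[n])^2
end

function compute!(P, freqs)
    Threads.@threads for n in eachindex(freqs)
        kernel!(P, freqs, n)
    end
    return P
end
```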


#2

The most likely reason is https://github.com/JuliaLang/julia/issues/15276 . See the workarounds I’ve mentioned in the comments there.


#3

You might want to see this?

It may be related to https://github.com/JuliaLang/julia/issues/15276 . Last time I checked, there were still a few inference bugs around multithreading, so test things out and see how function barriers change the results.


#4

By moving the whole loop to a function of its own I get the same performance as when moving only the body of the loop to a function. So I assume I’m hitting bug #15276.

I have no rush, I can wait for this bug to be fixed before adding multi-threading to the package (because the bug will be eventually fixed, right? ;-)).

Edit: I stand corrected: the timing of the first benchmark in the first post of the thread (which I’m going to update now) was wrong; the time should have been on the order of ~4 ms, i.e., the same as the threaded code with the loop body or the whole loop moved to a function. So I think this is exactly bug #15276, and your workaround indeed fixes it.


#5

Thank you. I’m not sure I got your suggestion right. I changed

    @inbounds Threads.@threads for n in eachindex(freqs)

to

    local uidx::Base.OneTo{Int64} = eachindex(freqs)
    @inbounds Threads.@threads for n in uidx

but the code is still 24× slower than the non-threaded code (i.e., there is no difference from @inbounds Threads.@threads for n in eachindex(freqs)).

You said that it may be a problem with eachindex, so I tried replacing eachindex(freqs) with 1:length(freqs), to no avail.


#6

@yuyichao After your post, I implemented the trick (moving the body of the loop to an external function) in lombscargle and I had a nice linear scaling with the number of used threads (tested with up to 4 threads). I didn’t change that part of the code after that.

However, after commit (found with git bisect)


I’m experiencing a slowdown of the lombscargle function when multi-threading is used, even when using your trick. At least, after the fix for bug #19876 the results are correct (thank you for that!), but the degradation when running 4 threads is ~20% compared to single-threaded mode. Any hint?


#7

Ok, I managed to produce a very minimal working example. The culprit seems to be the trigonometric functions. Consider the following function and variable:

function loop!(P)
    @inbounds Threads.@threads for n in eachindex(P)
        P[n] = sin(2.0)
    end
end

P = Vector{Float64}(1000000)

In Julia 0.5 with 1 thread:

julia> @benchmark loop!($P)
BenchmarkTools.Trial: 
  memory estimate:  48.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     11.080 ms (0.00% GC)
  median time:      11.477 ms (0.00% GC)
  mean time:        11.549 ms (0.00% GC)
  maximum time:     13.508 ms (0.00% GC)
  --------------
  samples:          433
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

With 4 threads:

julia> @benchmark loop!($P)
BenchmarkTools.Trial: 
  memory estimate:  48.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     2.969 ms (0.00% GC)
  median time:      2.996 ms (0.00% GC)
  mean time:        3.030 ms (0.00% GC)
  maximum time:     5.497 ms (0.00% GC)
  --------------
  samples:          1650
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In Julia master (clean build: git clean -fxd && make) with 1 thread:

julia> versioninfo()
Julia Version 0.6.0-dev.2181
Commit 71fa0a60f (2017-01-19 08:01 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> @benchmark loop!($P)
BenchmarkTools.Trial: 
  memory estimate:  48.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     11.093 ms (0.00% GC)
  median time:      11.302 ms (0.00% GC)
  mean time:        11.379 ms (0.00% GC)
  maximum time:     12.902 ms (0.00% GC)
  --------------
  samples:          440
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

with 4 threads:

julia> @benchmark loop!($P)
BenchmarkTools.Trial: 
  memory estimate:  48.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     20.141 ms (0.00% GC)
  median time:      20.149 ms (0.00% GC)
  mean time:        20.163 ms (0.00% GC)
  maximum time:     21.095 ms (0.00% GC)
  --------------
  samples:          248
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

If I replace sin inside loop! with other functions (I tried exp, log, and gamma, and also direct assignment to a constant) there is no slowdown; instead there is the expected nice scaling.

Can someone reproduce this issue?


#8

I think it has to be related to that, then. Or:

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=int+box+julia


#9

@ChrisRackauckas That actually seems to be the answer. Using the system libm (replacing sin(2.0) with ccall(:sin, Float64, (Float64,), 2.0)), as suggested by @yuyichao there, does enable scaling.
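Concretely, the modified MWE is the same loop! as above with just the ccall swapped in:

```julia
# Same MWE as before, but calling the system libm's sin through ccall,
# which bypasses the Julia sin implementation that triggers the slowdown.
function loop_libm!(P)
    @inbounds Threads.@threads for n in eachindex(P)
        P[n] = ccall(:sin, Float64, (Float64,), 2.0)
    end
    return P
end
```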

I wonder why on my system the behavior seems to have changed after https://github.com/JuliaLang/julia/commit/673e6f0fdb852c462ea67069c560b5c722b4aa69.