Slowdown when multi-threading is used


#1

I’m trying to improve the performance of the LombScargle.jl package. The functions used in two methods involve for-loops that can be computed in parallel, so I thought it would be a nice idea to use Threads.@threads; however, I experience a large performance degradation when I do so.

For example, without multi-threading (current master):

julia> using LombScargle, BenchmarkTools

julia> t = collect(linspace(0.01, 10, 100)); s = sin.(t);

julia> @benchmark lombscargle(t, s, fast = false, fit_mean = false)
BenchmarkTools.Trial: 
  memory estimate:  13.36 kb
  allocs estimate:  18
  --------------
  minimum time:     4.187 ms (0.00% GC)
  median time:      4.192 ms (0.00% GC)
  mean time:        4.205 ms (0.00% GC)
  maximum time:     5.395 ms (0.00% GC)
  --------------
  samples:          1189
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Now I add Threads.@threads before the for in that loop and repeat the same benchmark with just one thread:

julia> @benchmark lombscargle(t, s, fast = false, fit_mean = false)
BenchmarkTools.Trial: 
  memory estimate:  15.67 mb
  allocs estimate:  1026273
  --------------
  minimum time:     21.778 ms (0.00% GC)
  median time:      24.138 ms (7.90% GC)
  mean time:        24.079 ms (5.88% GC)
  maximum time:     32.270 ms (7.19% GC)
  --------------
  samples:          208
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This is a pretty large slowdown. What’s most interesting is that if I move the code inside the loop to a function of its own and enable multi-threading, there is an improvement compared to the latter case (this is done in the current multi-threading branch):

julia> @benchmark lombscargle(t, s, fast = false, fit_mean = false)
BenchmarkTools.Trial: 
  memory estimate:  91.66 kb
  allocs estimate:  5023
  --------------
  minimum time:     4.504 ms (0.00% GC)
  median time:      4.555 ms (0.00% GC)
  mean time:        4.583 ms (0.17% GC)
  maximum time:     6.786 ms (31.55% GC)
  --------------
  samples:          1091
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

This has the same performance as the non-threaded code, and using more threads speeds up the calculations.

Why does moving the body of the for-loop to a function change the result of the benchmark?
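For reference, the change amounts to a function barrier; roughly like this (a sketch with hypothetical names — the real kernel in the package computes the periodogram, not this toy expression):

```julia
# Sketch of the function-barrier workaround (hypothetical names).
# Moving the loop body into its own function gives the compiler a
# type-stable entry point, instead of relying on inference inside the
# closure that Threads.@threads generates.
function kernel!(P, freqs, n)
    # stand-in for the real per-frequency periodogram computation
    P[n] = sin(freqs[n])^2
end

function compute!(P, freqs)
    Threads.@threads for n in eachindex(freqs)
        kernel!(P, freqs, n)
    end
    return P
end
```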


#2

The most likely reason is https://github.com/JuliaLang/julia/issues/15276 . See the workarounds I’ve mentioned in the comments there.


#3

You might want to see this?

It may be related to https://github.com/JuliaLang/julia/issues/15276 . Last time I checked, there were still a few inference bugs around multithreading, so test things out and see how function barriers change the results.


#4

By moving the whole loop to a function of its own I get the same performance as when moving only the body of the loop to a function. So I assume I’m hitting bug #15276.

I have no rush, I can wait for this bug to be fixed before adding multi-threading to the package (because the bug will be eventually fixed, right? ;-)).

Edit: I stand corrected: the timing of the first benchmark in the first post of the thread (which I’m going to update now) was wrong; the time should have been on the order of ~4 ms, i.e., the same as the threaded code with the loop body or the whole loop moved to a function. So I think this is exactly bug #15276, and your workaround indeed fixes it.


#5

Thank you. I’m not sure I got your suggestion right. I changed

    @inbounds Threads.@threads for n in eachindex(freqs)

to

    local uidx::Base.OneTo{Int64} = eachindex(freqs)
    @inbounds Threads.@threads for n in uidx

but the code is still 24× slower than the non-threaded code (i.e., there is no difference from @inbounds Threads.@threads for n in eachindex(freqs)).

You said that it may be a problem with eachindex, so I tried replacing eachindex(freqs) with 1:length(freqs), to no avail.


#6

@yuyichao After your post, I implemented the trick (moving the body of the loop to an external function) in lombscargle and I had a nice linear scaling with the number of used threads (tested with up to 4 threads). I didn’t change that part of the code after that.

However, after commit (found with git bisect)


I’m experiencing a slowdown of the lombscargle function when multi-threading is used, even when using your trick. At least, after the fix for bug #19876 the results are correct (thank you for that!), but the degradation when running 4 threads is ~20% compared to single-threaded mode. Any hint?


#7

Ok, I managed to produce a very minimal working example. The culprit seems to be the trigonometric functions. Consider the following function and variable:

function loop!(P)
    @inbounds Threads.@threads for n in eachindex(P)
        P[n] = sin(2.0)
    end
end

P = Vector{Float64}(1000000)

In Julia 0.5 with 1 thread:

julia> @benchmark loop!($P)
BenchmarkTools.Trial: 
  memory estimate:  48.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     11.080 ms (0.00% GC)
  median time:      11.477 ms (0.00% GC)
  mean time:        11.549 ms (0.00% GC)
  maximum time:     13.508 ms (0.00% GC)
  --------------
  samples:          433
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

With 4 threads:

julia> @benchmark loop!($P)
BenchmarkTools.Trial: 
  memory estimate:  48.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     2.969 ms (0.00% GC)
  median time:      2.996 ms (0.00% GC)
  mean time:        3.030 ms (0.00% GC)
  maximum time:     5.497 ms (0.00% GC)
  --------------
  samples:          1650
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

In Julia master (clean build: git clean -fxd && make) with 1 thread:

julia> versioninfo()
Julia Version 0.6.0-dev.2181
Commit 71fa0a60f (2017-01-19 08:01 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> @benchmark loop!($P)
BenchmarkTools.Trial: 
  memory estimate:  48.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     11.093 ms (0.00% GC)
  median time:      11.302 ms (0.00% GC)
  mean time:        11.379 ms (0.00% GC)
  maximum time:     12.902 ms (0.00% GC)
  --------------
  samples:          440
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

with 4 threads:

julia> @benchmark loop!($P)
BenchmarkTools.Trial: 
  memory estimate:  48.00 bytes
  allocs estimate:  2
  --------------
  minimum time:     20.141 ms (0.00% GC)
  median time:      20.149 ms (0.00% GC)
  mean time:        20.163 ms (0.00% GC)
  maximum time:     21.095 ms (0.00% GC)
  --------------
  samples:          248
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

If I replace sin inside loop! with other functions (I tried exp, log, and gamma, and also direct assignment to a constant) there is no slowdown; instead there is the expected nice scaling.

Can someone reproduce this issue?


#8

I think it has to be related to that, then. Or:

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=int+box+julia


#9

@ChrisRackauckas That actually seems to be the answer. Using the system libm (replacing sin(2.0) with ccall(:sin, Float64, (Float64,), 2.0)), as suggested by @yuyichao there, does enable scaling.
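Concretely, the modified MWE is the same loop! as above with just the ccall swapped in:

```julia
# Same MWE as before, but calling the system libm's sin through ccall,
# which bypasses the Julia sin implementation that triggers the slowdown.
function loop_libm!(P)
    @inbounds Threads.@threads for n in eachindex(P)
        P[n] = ccall(:sin, Float64, (Float64,), 2.0)
    end
    return P
end
```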

I wonder why on my system the behavior seems to have changed after https://github.com/JuliaLang/julia/commit/673e6f0fdb852c462ea67069c560b5c722b4aa69.