Multithreading performance regressions in 0.6?

I’ve noticed two different performance regressions using the @threads macro when running my code on 0.6.0-rc1 vs 0.5.2. The first is the more significant regression (~400x slower) but has a simple workaround. The second is less severe (~2x slower), but I do not currently have a solution for it. I know that multithreading support is still experimental, but any insights into these issues would be appreciated.

The first regression occurs when there are multiple threaded loops within a single function. An MWE is given below (I realize the two loops could be combined into one here, but in my actual code the arrays have different sizes).

function test_threads!{T}(f2::Array{T}, f3::Array{T}, f::Array{T})
    Threads.@threads for i = 1:length(f)
        @inbounds f2[i] = f[i]*f[i]
    end
    Threads.@threads for i = 1:length(f)
        @inbounds f3[i] = f[i]*f[i]*f[i]
    end
end

For testing I used

f = Array{Float64}(2^20)
f2 = similar(f)
f3 = similar(f)

with JULIA_NUM_THREADS=4 on a MacBook Pro with an i7 that has 4 cores/8 threads. After an initial run to compile the function, BenchmarkTools on 0.5.2 gives

julia> @benchmark test_threads!(f2, f3, f)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  4
  --------------
  minimum time:     670.940 μs (0.00% GC)
  median time:      683.380 μs (0.00% GC)
  mean time:        714.575 μs (0.00% GC)
  maximum time:     2.368 ms (0.00% GC)
  --------------
  samples:          6812
  evals/sample:     1

while on 0.6.0-rc1 I get

julia> @benchmark test_threads!(f2, f3, f)
BenchmarkTools.Trial:
  memory estimate:  62.98 MiB
  allocs estimate:  3207424
  --------------
  minimum time:     270.782 ms (0.00% GC)
  median time:      296.131 ms (0.00% GC)
  mean time:        293.501 ms (0.00% GC)
  maximum time:     303.643 ms (0.00% GC)
  --------------
  samples:          18
  evals/sample:     1

The median time is about 400x slower, and there is significant memory allocation when there should be almost none, since the arrays are preallocated!

Fortunately, this can be fixed by splitting the two loops into separate functions as shown below.

function square_threads!{T}(fout::Array{T}, f::Array{T})
    Threads.@threads for i = 1:length(f)
        @inbounds fout[i] = f[i]*f[i]
    end
end

function cube_threads!{T}(fout::Array{T}, f::Array{T})
    Threads.@threads for i = 1:length(f)
        @inbounds fout[i] = f[i]*f[i]*f[i]
    end
end

function test_threads_fun!{T}(f2::Array{T}, f3::Array{T}, f::Array{T})
    square_threads!(f2, f)
    cube_threads!(f3, f)
end

Repeating the test with these new functions on 0.5.2 gives

julia> @benchmark test_threads_fun!(f2, f3, f)
BenchmarkTools.Trial:
  memory estimate:  128 bytes
  allocs estimate:  4
  --------------
  minimum time:     671.783 μs (0.00% GC)
  median time:      759.912 μs (0.00% GC)
  mean time:        770.134 μs (0.00% GC)
  maximum time:     3.385 ms (0.00% GC)
  --------------
  samples:          6317
  evals/sample:     1

and 0.6.0-rc1 gives

julia> @benchmark test_threads_fun!(f2, f3, f)
BenchmarkTools.Trial:
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     741.631 μs (0.00% GC)
  median time:      797.224 μs (0.00% GC)
  mean time:        807.132 μs (0.00% GC)
  maximum time:     2.577 ms (0.00% GC)
  --------------
  samples:          6147
  evals/sample:     1

The median times and memory allocations are now roughly the same between the two versions, though slightly slower than test_threads! on 0.5.2. While this workaround is fine, does anyone know the cause of the excess memory allocation (and the corresponding slowdown)?
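
For what it's worth, the 3207424 allocations reported above for test_threads! are on the order of the total number of loop iterations (2 × 2^20 = 2097152), which suggests something is being allocated on essentially every iteration rather than a fixed per-call cost from @threads. Below is a quick scaling check I put together (a sketch only; the array lengths are arbitrary and were not part of the benchmarks above).

for n in (2^16, 2^18, 2^20)
    a  = Array{Float64}(n)
    a2 = similar(a)
    a3 = similar(a)
    test_threads!(a2, a3, a)                     # warm up so compilation is not counted
    bytes = @allocated test_threads!(a2, a3, a)  # bytes allocated by a single call
    println(n, " => ", bytes)                    # grows roughly linearly with n if something
                                                 # is allocated on every iteration
end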

The second issue involves complex numbers in multithreaded loops. Consider the following MWE.

function test!{T}(fout::Array{T, 3}, f::Array{T, 3}, g::Array{Float64, 2})
    Threads.@threads for i = 1:size(f, 2)
        @inbounds for k = 1:size(f, 3)
            for j = 1:size(f, 1)
                fout[j,i,k] = (g[j,i] + f[j,i,1] + f[j,i,2] + f[j,i,3])*f[j,i,k]
            end
        end
    end
end

If the arrays are of type Float64, the results on 0.5.2 and 0.6.0-rc1 are similar, as shown below.

g = Array{Float64}(1024, 1024)
f = Array{Float64}(1024, 1024, 3)
fout = similar(f)

0.5.2

julia> @benchmark test!(fout, f, g)
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     1.317 ms (0.00% GC)
  median time:      1.341 ms (0.00% GC)
  mean time:        1.408 ms (0.00% GC)
  maximum time:     4.105 ms (0.00% GC)
  --------------
  samples:          3503
  evals/sample:     1

0.6.0-rc1

julia> @benchmark test!(fout, f, g)
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.299 ms (0.00% GC)
  median time:      1.320 ms (0.00% GC)
  mean time:        1.386 ms (0.00% GC)
  maximum time:     5.470 ms (0.00% GC)
  --------------
  samples:          3591
  evals/sample:     1

However, if the array f is complex (Complex128), 0.6.0-rc1 is ~2x slower.

g = Array{Float64}(1024, 1024)
f = Array{Complex128}(1024, 1024, 3)
fout = similar(f)

0.5.2

julia> @benchmark test!(fout, f, g)
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  2
  --------------
  minimum time:     2.732 ms (0.00% GC)
  median time:      2.795 ms (0.00% GC)
  mean time:        2.896 ms (0.00% GC)
  maximum time:     7.574 ms (0.00% GC)
  --------------
  samples:          1713
  evals/sample:     1

0.6.0-rc1

julia> @benchmark test!(fout, f, g)
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     4.415 ms (0.00% GC)
  median time:      4.654 ms (0.00% GC)
  mean time:        4.742 ms (0.00% GC)
  maximum time:     8.219 ms (0.00% GC)
  --------------
  samples:          1054
  evals/sample:     1

In this case there is no excessive memory allocation, but there is still a slowdown. I would appreciate any insights into the cause and/or a solution.
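
As a possible next diagnostic (a sketch only, which I have not benchmarked; test_serial! is a name introduced here for illustration), one could time a serial version of the same kernel with the complex arrays, to see whether the regression is tied to the closure that Threads.@threads generates or to complex-number code generation in general.

function test_serial!{T}(fout::Array{T, 3}, f::Array{T, 3}, g::Array{Float64, 2})
    # Same loop body as test!, but without Threads.@threads, so no closure is created.
    for i = 1:size(f, 2)
        @inbounds for k = 1:size(f, 3)
            for j = 1:size(f, 1)
                fout[j,i,k] = (g[j,i] + f[j,i,1] + f[j,i,2] + f[j,i,3])*f[j,i,k]
            end
        end
    end
end

If the serial timings agree between 0.5.2 and 0.6.0-rc1 while the threaded ones do not, that would point at the threaded closure rather than the complex arithmetic itself.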

This is somehow caused by the change in ccall lowering: https://github.com/JuliaLang/julia/issues/21906

Also, you can see why it is slowing down by looking at @code_lowered. Look for Box, :contents, etc. If you want to know why, see the issue I linked and performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub.
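
For example, something along these lines (assuming f, f2, and f3 are still the Float64 vectors from the first example; output omitted):

@code_lowered test_threads!(f2, f3, f)  # look for Core.Box allocations and :contents accesses
                                        # in and around the closure that Threads.@threads generates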
