I’ve noticed two different performance regressions using the @threads macro when running my code on 0.6.0-rc1 vs. 0.5.2. The first is a significant regression (~400x slower) with a simple solution. The second is smaller (~2x slower), and I do not currently have a solution for it. I know that multithreading support is still experimental, but any insights into these issues would be appreciated.
The first regression occurs when there are multiple threaded loops within a single function. An MWE is given below (I realize the two loops could be combined into one, but in my actual code the arrays are of different sizes).
function test_threads!{T}(f2::Array{T}, f3::Array{T}, f::Array{T})
    Threads.@threads for i = 1:length(f)
        @inbounds f2[i] = f[i]*f[i]
    end
    Threads.@threads for i = 1:length(f)
        @inbounds f3[i] = f[i]*f[i]*f[i]
    end
end
For testing I used
f = Array{Float64}(2^20)
f2 = similar(f)
f3 = similar(f)
with JULIA_NUM_THREADS=4 on a MacBook Pro with an i7 that has 4 cores/8 threads. After an initial run to compile the function, BenchmarkTools on 0.5.2 gives
julia> @benchmark test_threads!(f2, f3, f)
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 4
--------------
minimum time: 670.940 μs (0.00% GC)
median time: 683.380 μs (0.00% GC)
mean time: 714.575 μs (0.00% GC)
maximum time: 2.368 ms (0.00% GC)
--------------
samples: 6812
evals/sample: 1
while on 0.6.0-rc1 I get
julia> @benchmark test_threads!(f2, f3, f)
BenchmarkTools.Trial:
memory estimate: 62.98 MiB
allocs estimate: 3207424
--------------
minimum time: 270.782 ms (0.00% GC)
median time: 296.131 ms (0.00% GC)
mean time: 293.501 ms (0.00% GC)
maximum time: 303.643 ms (0.00% GC)
--------------
samples: 18
evals/sample: 1
The median time is about 400x slower, and there is significant memory allocation when there should be almost none, since the arrays are preallocated!
Fortunately, this can be fixed by splitting the two loops into separate functions as shown below.
function square_threads!{T}(fout::Array{T}, f::Array{T})
    Threads.@threads for i = 1:length(f)
        @inbounds fout[i] = f[i]*f[i]
    end
end

function cube_threads!{T}(fout::Array{T}, f::Array{T})
    Threads.@threads for i = 1:length(f)
        @inbounds fout[i] = f[i]*f[i]*f[i]
    end
end

function test_threads_fun!{T}(f2::Array{T}, f3::Array{T}, f::Array{T})
    square_threads!(f2, f)
    cube_threads!(f3, f)
end
Repeating the test with these new functions on 0.5.2 gives
julia> @benchmark test_threads_fun!(f2, f3, f)
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 4
--------------
minimum time: 671.783 μs (0.00% GC)
median time: 759.912 μs (0.00% GC)
mean time: 770.134 μs (0.00% GC)
maximum time: 3.385 ms (0.00% GC)
--------------
samples: 6317
evals/sample: 1
and 0.6.0-rc1 gives
julia> @benchmark test_threads_fun!(f2, f3, f)
BenchmarkTools.Trial:
memory estimate: 96 bytes
allocs estimate: 2
--------------
minimum time: 741.631 μs (0.00% GC)
median time: 797.224 μs (0.00% GC)
mean time: 807.132 μs (0.00% GC)
maximum time: 2.577 ms (0.00% GC)
--------------
samples: 6147
evals/sample: 1
The median times and memory allocations are now roughly the same between the two versions, though slightly slower than test_threads! on 0.5.2. While this workaround is fine, does anyone know the cause of the excess memory allocation (and the corresponding slowdown)?
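My best guess, and it is only a guess, is that @threads lowers each loop body into a closure, and two closures capturing the same variables inside one function may force those captures to be boxed, which would explain the allocations. If anyone wants to check this hypothesis, the generated closures and any boxed or non-concrete captures should be visible in the reflection output:

```julia
# If the captured arrays (f, f2, f3) appear wrapped in Core.Box or with
# non-concrete types in the generated #threadsfor closures, that would
# point to boxed captures as the source of the allocations.
@code_warntype test_threads!(f2, f3, f)

# The lowered AST shows the closures @threads generates for each loop body.
@code_lowered test_threads!(f2, f3, f)
```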
The second issue involves complex numbers in multithreaded loops. Consider the following MWE.
function test!{T}(fout::Array{T, 3}, f::Array{T, 3}, g::Array{Float64, 2})
    Threads.@threads for i = 1:size(f, 2)
        @inbounds for k = 1:size(f, 3)
            for j = 1:size(f, 1)
                fout[j,i,k] = (g[j,i] + f[j,i,1] + f[j,i,2] + f[j,i,3])*f[j,i,k]
            end
        end
    end
end
If the arrays are of type Float64 the results between 0.5.2 and 0.6.0-rc1 are similar as shown below.
g = Array{Float64}(1024, 1024)
f = Array{Float64}(1024, 1024, 3)
fout = similar(f)
0.5.2
julia> @benchmark test!(fout, f, g)
BenchmarkTools.Trial:
memory estimate: 64 bytes
allocs estimate: 2
--------------
minimum time: 1.317 ms (0.00% GC)
median time: 1.341 ms (0.00% GC)
mean time: 1.408 ms (0.00% GC)
maximum time: 4.105 ms (0.00% GC)
--------------
samples: 3503
evals/sample: 1
0.6.0-rc1
julia> @benchmark test!(fout, f, g)
BenchmarkTools.Trial:
memory estimate: 48 bytes
allocs estimate: 1
--------------
minimum time: 1.299 ms (0.00% GC)
median time: 1.320 ms (0.00% GC)
mean time: 1.386 ms (0.00% GC)
maximum time: 5.470 ms (0.00% GC)
--------------
samples: 3591
evals/sample: 1
However, if the array f is complex, 0.6.0-rc1 is ~2x slower.
g = Array{Float64}(1024, 1024)
f = Array{Complex128}(1024, 1024, 3)
fout = similar(f)
0.5.2
julia> @benchmark test!(fout, f, g)
BenchmarkTools.Trial:
memory estimate: 64 bytes
allocs estimate: 2
--------------
minimum time: 2.732 ms (0.00% GC)
median time: 2.795 ms (0.00% GC)
mean time: 2.896 ms (0.00% GC)
maximum time: 7.574 ms (0.00% GC)
--------------
samples: 1713
evals/sample: 1
0.6.0-rc1
julia> @benchmark test!(fout, f, g)
BenchmarkTools.Trial:
memory estimate: 48 bytes
allocs estimate: 1
--------------
minimum time: 4.415 ms (0.00% GC)
median time: 4.654 ms (0.00% GC)
mean time: 4.742 ms (0.00% GC)
maximum time: 8.219 ms (0.00% GC)
--------------
samples: 1054
evals/sample: 1
In this case there is no excessive memory allocation, but there is still a slowdown. I would appreciate any insights into the cause and/or a solution.
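In case it helps with diagnosis: one variant I have been meaning to try (purely my own experiment idea, not something from the benchmarks above) is annotating the innermost loop with @simd, to see whether vectorization of the complex arithmetic is what differs between the two versions:

```julia
# Same MWE as test!, but with @simd on the innermost loop to hint
# vectorization of the complex multiply-accumulate. Whether this
# recovers the 0.5.2 timing on 0.6.0-rc1 is exactly what I don't know.
function test_simd!{T}(fout::Array{T, 3}, f::Array{T, 3}, g::Array{Float64, 2})
    Threads.@threads for i = 1:size(f, 2)
        for k = 1:size(f, 3)
            @inbounds @simd for j = 1:size(f, 1)
                fout[j,i,k] = (g[j,i] + f[j,i,1] + f[j,i,2] + f[j,i,3])*f[j,i,k]
            end
        end
    end
end
```

Comparing @code_llvm for the Float64 and Complex128 instantiations of the inner loop might also show whether one vectorizes and the other does not.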