Hello,
I know that multi-threading is still an experimental feature, but I want to ask about the following observation. It appears that the Threads.@threads
macro works quite well when there is one threaded loop in a function, but when there are two threaded loops in a function, the performance of the second threaded loop is poor. This appears to be because the contents of the second loop is allocating memory. I try to illustrate this with the following simple example run on a machine with 20 cpu cores.
Julia code:
function foo(x,y,z)
Threads.@threads for i = 1:length(z)
@inbounds z[i] = sin(x[i]) + y[i]^2
end
Threads.@threads for i = 1:length(z)
@inbounds z[i] = cos(x[i]) + y[i]^3
end
end
function bar(x,y,z)
function _bar1(x,y,z)
Threads.@threads for i=1:length(z)
@inbounds z[i] = sin(x[i]) + y[i]^2
end
end
function _bar2(x,y,z)
Threads.@threads for i=1:length(z)
@inbounds z[i] = cos(x[i]) + y[i]^3
end
end
_bar1(x,y,z)
_bar2(x,y,z)
end
function main(n=100_000)
x,y,z = map(i->rand(n),1:3)
info("running on $(Threads.nthreads()) threads...")
info("foo...")
foo(x,y,z)
@time foo(x,y,z)
@show sum(z)
info("bar...")
bar(x,y,z)
@time bar(x,y,z)
@show sum(z)
info("...")
end
function bootstrap()
for nthreads in (1,2,4,8,16)
run(`env JULIA_NUM_THREADS=$(nthreads) julia -e 'include("threadtest.jl");main(1_000_000)'`)
end
end
Result from running the bootstrap method
using julia -e 'include("threadtest.jl");bootstrap()'
:
INFO: running on 1 threads...
INFO: foo...
0.557323 seconds (9.00 M allocations: 152.565 MiB, 1.31% gc time)
sum(z) = 1.091729300780358e6
INFO: bar...
0.028365 seconds (2 allocations: 96 bytes)
sum(z) = 1.091729300780358e6
INFO: ...
INFO: running on 2 threads...
INFO: foo...
0.641402 seconds (7.66 M allocations: 129.905 MiB, 11.76% gc time)
sum(z) = 1.0912962064405035e6
INFO: bar...
0.014065 seconds (2 allocations: 96 bytes)
sum(z) = 1.0912962064405035e6
INFO: ...
INFO: running on 4 threads...
INFO: foo...
0.519757 seconds (3.88 M allocations: 66.268 MiB, 12.29% gc time)
sum(z) = 1.0917261042575433e6
INFO: bar...
0.005928 seconds (2 allocations: 96 bytes)
sum(z) = 1.0917261042575433e6
INFO: ...
INFO: running on 8 threads...
INFO: foo...
0.501076 seconds (1.88 M allocations: 32.413 MiB, 16.15% gc time)
sum(z) = 1.092236339811011e6
INFO: bar...
0.003090 seconds (2 allocations: 96 bytes)
sum(z) = 1.092236339811011e6
INFO: ...
INFO: running on 16 threads...
INFO: foo...
0.551340 seconds (983.99 k allocations: 16.533 MiB, 20.69% gc time)
sum(z) = 1.091073196018634e6
INFO: bar...
0.001597 seconds (2 allocations: 96 bytes)
sum(z) = 1.091073196018634e6
INFO: ...
This shows reasonable speed-up for the bar()
method, but no speed-up for the foo()
method. To try and understand this, I use the main()
method to look at the memory allocations. In particular, the results from running the main() method using julia --track-allocation -e 'include("threadtest.jl");main()'
is the following .mem
file:
- function foo(x,y,z)
32 Threads.@threads for i = 1:length(z)
0 @inbounds z[i] = sin(x[i]) + y[i]^2
- end
0 Threads.@threads for i = 1:length(z)
16004032 @inbounds z[i] = cos(x[i]) + y[i]^3
- end
- end
-
- function bar(x,y,z)
0 function _bar1(x,y,z)
0 Threads.@threads for i=1:length(z)
0 @inbounds z[i] = sin(x[i]) + y[i]^2
- end
- end
0 function _bar2(x,y,z)
0 Threads.@threads for i=1:length(z)
0 @inbounds z[i] = cos(x[i]) + y[i]^3
- end
- end
0 _bar1(x,y,z)
- _bar2(x,y,z)
- end
-
- function main(n=100_000)
19095004 x,y,z = map(i->rand(n),1:3)
-
2247193 info("running on $(Threads.nthreads()) threads...")
-
1760 info("foo...")
0 foo(x,y,z)
0 @time foo(x,y,z)
65716 @show sum(z)
-
1760 info("bar...")
0 bar(x,y,z)
0 @time bar(x,y,z)
128 @show sum(z)
It seems clear that the second threaded loop in the foo
method is allocating memory, which I would not expect to happen. So, finally, my question: is this expected behaviour, or should a bug-report be filed?
Thanks you very much for your time!
Sam