More than one threaded loop in a function breaks threading?

multithreading

#1

Hello,

I know that multi-threading is still an experimental feature, but I want to ask about the following observation. It appears that the Threads.@threads macro works quite well when there is one threaded loop in a function, but when there are two threaded loops in a function, the performance of the second threaded loop is poor. This appears to be because the contents of the second loop is allocating memory. I try to illustrate this with the following simple example run on a machine with 20 cpu cores.

Julia code:

function foo(x,y,z)
    Threads.@threads for i = 1:length(z)
        @inbounds z[i] = sin(x[i]) + y[i]^2
    end
    Threads.@threads for i = 1:length(z)
        @inbounds z[i] = cos(x[i]) + y[i]^3
    end
end

function bar(x,y,z)
    function _bar1(x,y,z)
        Threads.@threads for i=1:length(z)
            @inbounds z[i] = sin(x[i]) + y[i]^2
        end
    end
    function _bar2(x,y,z)
        Threads.@threads for i=1:length(z)
            @inbounds z[i] = cos(x[i]) + y[i]^3
        end
    end
    _bar1(x,y,z)
    _bar2(x,y,z)
end

function main(n=100_000)
    x,y,z = map(i->rand(n),1:3)
    
    info("running on $(Threads.nthreads()) threads...")
 
    info("foo...")
    foo(x,y,z)
    @time foo(x,y,z)
    @show sum(z)

    info("bar...")
    bar(x,y,z)
    @time bar(x,y,z)
    @show sum(z)

    info("...")
end

function bootstrap()
    for nthreads in (1,2,4,8,16)
        run(`env JULIA_NUM_THREADS=$(nthreads) julia -e 'include("threadtest.jl");main(1_000_000)'`)
    end
end

Result from running the bootstrap method using julia -e 'include("threadtest.jl");bootstrap()':

INFO: running on 1 threads...
INFO: foo...
  0.557323 seconds (9.00 M allocations: 152.565 MiB, 1.31% gc time)
sum(z) = 1.091729300780358e6
INFO: bar...
  0.028365 seconds (2 allocations: 96 bytes)
sum(z) = 1.091729300780358e6
INFO: ...
INFO: running on 2 threads...
INFO: foo...
  0.641402 seconds (7.66 M allocations: 129.905 MiB, 11.76% gc time)
sum(z) = 1.0912962064405035e6
INFO: bar...
  0.014065 seconds (2 allocations: 96 bytes)
sum(z) = 1.0912962064405035e6
INFO: ...
INFO: running on 4 threads...
INFO: foo...
  0.519757 seconds (3.88 M allocations: 66.268 MiB, 12.29% gc time)
sum(z) = 1.0917261042575433e6
INFO: bar...
  0.005928 seconds (2 allocations: 96 bytes)
sum(z) = 1.0917261042575433e6
INFO: ...
INFO: running on 8 threads...
INFO: foo...
  0.501076 seconds (1.88 M allocations: 32.413 MiB, 16.15% gc time)
sum(z) = 1.092236339811011e6
INFO: bar...
  0.003090 seconds (2 allocations: 96 bytes)
sum(z) = 1.092236339811011e6
INFO: ...
INFO: running on 16 threads...
INFO: foo...
  0.551340 seconds (983.99 k allocations: 16.533 MiB, 20.69% gc time)
sum(z) = 1.091073196018634e6
INFO: bar...
  0.001597 seconds (2 allocations: 96 bytes)
sum(z) = 1.091073196018634e6
INFO: ...

This shows reasonable speed-up for the bar() method, but no speed-up for the foo() method. To try and understand this, I use the main() method to look at the memory allocations. In particular, the results from running the main() method using julia --track-allocation -e 'include("threadtest.jl");main()' is the following .mem file:

        - function foo(x,y,z)
       32     Threads.@threads for i = 1:length(z)
        0         @inbounds z[i] = sin(x[i]) + y[i]^2
        -     end
        0     Threads.@threads for i = 1:length(z)
 16004032         @inbounds z[i] = cos(x[i]) + y[i]^3
        -     end
        - end
        - 
        - function bar(x,y,z)
        0     function _bar1(x,y,z)
        0         Threads.@threads for i=1:length(z)
        0             @inbounds z[i] = sin(x[i]) + y[i]^2
        -         end
        -     end
        0     function _bar2(x,y,z)
        0         Threads.@threads for i=1:length(z)
        0             @inbounds z[i] = cos(x[i]) + y[i]^3
        -         end
        -     end
        0     _bar1(x,y,z)
        -     _bar2(x,y,z)
        - end
        - 
        - function main(n=100_000)
 19095004     x,y,z = map(i->rand(n),1:3)
        -     
  2247193     info("running on $(Threads.nthreads()) threads...")
        -  
     1760     info("foo...")
        0     foo(x,y,z)
        0     @time foo(x,y,z)
    65716     @show sum(z)
        - 
     1760     info("bar...")
        0     bar(x,y,z)
        0     @time bar(x,y,z)
      128     @show sum(z)

It seems clear that the second threaded loop in the foo method is allocating memory, which I would not expect to happen. So, finally, my question: is this expected behaviour, or should a bug-report be filed?

Thanks you very much for your time!

Sam


#2

Nothing is really broken (since the behavior is the same) and it’s not about threading. https://github.com/JuliaLang/julia/issues/15276