More than one threaded loop in a function breaks threading?

Hello,

I know that multi-threading is still an experimental feature, but I want to ask about the following observation. It appears that the Threads.@threads macro works quite well when there is one threaded loop in a function, but when there are two threaded loops in a function, the performance of the second threaded loop is poor. This appears to be because the contents of the second loop is allocating memory. I try to illustrate this with the following simple example run on a machine with 20 cpu cores.

Julia code:

function foo(x,y,z)
    Threads.@threads for i = 1:length(z)
        @inbounds z[i] = sin(x[i]) + y[i]^2
    end
    Threads.@threads for i = 1:length(z)
        @inbounds z[i] = cos(x[i]) + y[i]^3
    end
end

function bar(x,y,z)
    function _bar1(x,y,z)
        Threads.@threads for i=1:length(z)
            @inbounds z[i] = sin(x[i]) + y[i]^2
        end
    end
    function _bar2(x,y,z)
        Threads.@threads for i=1:length(z)
            @inbounds z[i] = cos(x[i]) + y[i]^3
        end
    end
    _bar1(x,y,z)
    _bar2(x,y,z)
end

function main(n=100_000)
    x,y,z = map(i->rand(n),1:3)
    
    info("running on $(Threads.nthreads()) threads...")
 
    info("foo...")
    foo(x,y,z)
    @time foo(x,y,z)
    @show sum(z)

    info("bar...")
    bar(x,y,z)
    @time bar(x,y,z)
    @show sum(z)

    info("...")
end

function bootstrap()
    for nthreads in (1,2,4,8,16)
        run(`env JULIA_NUM_THREADS=$(nthreads) julia -e 'include("threadtest.jl");main(1_000_000)'`)
    end
end

Result from running the bootstrap method using julia -e 'include("threadtest.jl");bootstrap()':

INFO: running on 1 threads...
INFO: foo...
  0.557323 seconds (9.00 M allocations: 152.565 MiB, 1.31% gc time)
sum(z) = 1.091729300780358e6
INFO: bar...
  0.028365 seconds (2 allocations: 96 bytes)
sum(z) = 1.091729300780358e6
INFO: ...
INFO: running on 2 threads...
INFO: foo...
  0.641402 seconds (7.66 M allocations: 129.905 MiB, 11.76% gc time)
sum(z) = 1.0912962064405035e6
INFO: bar...
  0.014065 seconds (2 allocations: 96 bytes)
sum(z) = 1.0912962064405035e6
INFO: ...
INFO: running on 4 threads...
INFO: foo...
  0.519757 seconds (3.88 M allocations: 66.268 MiB, 12.29% gc time)
sum(z) = 1.0917261042575433e6
INFO: bar...
  0.005928 seconds (2 allocations: 96 bytes)
sum(z) = 1.0917261042575433e6
INFO: ...
INFO: running on 8 threads...
INFO: foo...
  0.501076 seconds (1.88 M allocations: 32.413 MiB, 16.15% gc time)
sum(z) = 1.092236339811011e6
INFO: bar...
  0.003090 seconds (2 allocations: 96 bytes)
sum(z) = 1.092236339811011e6
INFO: ...
INFO: running on 16 threads...
INFO: foo...
  0.551340 seconds (983.99 k allocations: 16.533 MiB, 20.69% gc time)
sum(z) = 1.091073196018634e6
INFO: bar...
  0.001597 seconds (2 allocations: 96 bytes)
sum(z) = 1.091073196018634e6
INFO: ...

This shows reasonable speed-up for the bar() method, but no speed-up for the foo() method. To try and understand this, I use the main() method to look at the memory allocations. In particular, the results from running the main() method using julia --track-allocation -e 'include("threadtest.jl");main()' is the following .mem file:

        - function foo(x,y,z)
       32     Threads.@threads for i = 1:length(z)
        0         @inbounds z[i] = sin(x[i]) + y[i]^2
        -     end
        0     Threads.@threads for i = 1:length(z)
 16004032         @inbounds z[i] = cos(x[i]) + y[i]^3
        -     end
        - end
        - 
        - function bar(x,y,z)
        0     function _bar1(x,y,z)
        0         Threads.@threads for i=1:length(z)
        0             @inbounds z[i] = sin(x[i]) + y[i]^2
        -         end
        -     end
        0     function _bar2(x,y,z)
        0         Threads.@threads for i=1:length(z)
        0             @inbounds z[i] = cos(x[i]) + y[i]^3
        -         end
        -     end
        0     _bar1(x,y,z)
        -     _bar2(x,y,z)
        - end
        - 
        - function main(n=100_000)
 19095004     x,y,z = map(i->rand(n),1:3)
        -     
  2247193     info("running on $(Threads.nthreads()) threads...")
        -  
     1760     info("foo...")
        0     foo(x,y,z)
        0     @time foo(x,y,z)
    65716     @show sum(z)
        - 
     1760     info("bar...")
        0     bar(x,y,z)
        0     @time bar(x,y,z)
      128     @show sum(z)

It seems clear that the second threaded loop in the foo method is allocating memory, which I would not expect to happen. So, finally, my question: is this expected behaviour, or should a bug-report be filed?

Thanks you very much for your time!

Sam

Nothing is really broken (since the behavior is the same) and it’s not about threading. performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub