Mysterious type instability (& performance hit) with simple @threads

Hello all - I’m really confused about the following large performance hit, I believe due to type instability, in a really simple/standard piece of code. It took a while to boil it down to this MWE. It is triggered when an array is (seemingly trivially) allocated more than once, and then accessed in a @threads loop:

julia> function unstable()
         x = zeros(0)       # if you comment out this line, it becomes fast and type-stable
         x = zeros(1_000_000)
         Threads.@threads for i in eachindex(x)
           x[i] = 1.0
         end
         x
       end
unstable (generic function with 1 method)

julia> Threads.nthreads()
1

julia> @btime unstable();
  17.816 ms (999502 allocations: 22.88 MiB)

Note that x is unambiguously a Vector{Float64} in every operation (the type can be annotated explicitly, but of course to no effect). Yet somehow the number of allocations is close to the iteration count. But if the first allocation (the commented line) is removed, so that x is assigned only once, one gets:

julia> @btime unstable();
  1.382 ms (8 allocations: 7.63 MiB)

Note that I am running Julia with one thread, so it’s not a “too many threads” problem. Now…

@code_warntype unstable();

includes Body::Any in red, while the second version has the expected Body::Vector{Float64}.
In the first version it does not matter what size the first allocation is; it could be the same as the 2nd, but there have to be two of them (in my code the allocation was conditional on an input arg, but it seems that is not needed for a MWE).
Fixes include removing @threads, or inserting a let x=x block around the loop (I don’t understand this).
But I want the loop to use multithreading, and the array to be reallocated based on a condition. This is a really simple piece of textbook code, basically taken straight from the manual, so I’m rather concerned by it and by the >10x speed hit (which also happened in my original code).
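For reference, a minimal sketch of the let x=x fix mentioned above, under the same setup as the MWE. The function name stable_let and the argument n are made up for illustration:

```julia
using Base.Threads: @threads

# Hypothetical variant of the MWE with the `let x = x` workaround applied.
function stable_let(n::Integer = 1_000_000)
    x = zeros(0)
    x = zeros(n)              # second assignment, as in the MWE
    let x = x                 # fresh binding that is assigned exactly once
        @threads for i in eachindex(x)
            x[i] = 1.0        # the loop body only sees the inner binding
        end
    end
    return x                  # outer x; same array as the inner one
end
```

Calling stable_let(10) should return a vector of ten ones, and @code_warntype stable_let(10) can be used to check whether Body::Any is gone.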

Here’s the first code_warntype output (in my original example there were no warnings about threadsfor_fun, merely a Body::Any):

julia> @code_warntype unstable();
MethodInstance for unstable()
  from unstable() @ Main REPL[16]:1
Arguments
  #self#::Core.Const(unstable)
Locals
  threadsfor_fun::var"#81#threadsfor_fun#14"{var"#81#threadsfor_fun#13#15"{_A}} where _A
  x@_3::Core.Box
  threadsfor_fun#13::var"#81#threadsfor_fun#13#15"
  range::Any
  x@_6::Union{}
  x@_7::Union{}
Body::Any
1 ── Core.NewvarNode(:(threadsfor_fun))
│    (x@_3 = Core.Box())
│    %3 = Main.zeros(0)::Vector{Float64}
│    Core.setfield!(x@_3, :contents, %3)
│    %5 = Main.zeros(1000000)::Vector{Float64}
│    Core.setfield!(x@_3, :contents, %5)
│    %7 = Core.isdefined(x@_3, :contents)::Bool
└─── goto #3 if not %7
2 ── goto #4
3 ── Core.NewvarNode(:(x@_6))
└─── x@_6
4 ┄─ %12 = Core.getfield(x@_3, :contents)::Any
│    %13 = Main.eachindex(%12)::Any
│    (range = %13)
│    %15 = Main.:(var"#81#threadsfor_fun#13#15")::Core.Const(var"#81#threadsfor_fun#13#15")
│    %16 = Core.typeof(range)::DataType
│    %17 = Core.apply_type(%15, %16)::Type{var"#81#threadsfor_fun#13#15"{_A}} where _A
│    %18 = x@_3::Core.Box
│    (threadsfor_fun#13 = %new(%17, %18, range))
│    %20 = Main.:(var"#81#threadsfor_fun#14")::Core.Const(var"#81#threadsfor_fun#14")
│    %21 = Core.typeof(threadsfor_fun#13)::Type{var"#81#threadsfor_fun#13#15"{_A}} where _A
│    %22 = Core.apply_type(%20, %21)::Type{var"#81#threadsfor_fun#14"{var"#81#threadsfor_fun#13#15"{_A}}} where _A
│    (threadsfor_fun = %new(%22, threadsfor_fun#13))
│    %24 = threadsfor_fun::var"#81#threadsfor_fun#14"{var"#81#threadsfor_fun#13#15"{_A}} where _A
│    Core.ifelse(false, false, %24)
└─── goto #6 if not true
5 ── Base.Threads.threading_run(threadsfor_fun, false)
└─── goto #7
6 ── Core.Const(:($(Expr(:foreigncall, :(:jl_in_threaded_region), Int32, svec(), 0, :(:ccall)))))
│    Core.Const(:(%29 != 0))
│    Core.Const(:(goto %34 if not %30))
│    Core.Const(:(Base.Threads.error("`@threads :static` cannot be used concurrently or nested")))
│    Core.Const(:(goto %35))
└─── Core.Const(:(Base.Threads.threading_run(threadsfor_fun, true)))
7 ┄─ Base.Threads.nothing
│    %36 = Core.isdefined(x@_3, :contents)::Bool
└─── goto #9 if not %36
8 ── goto #10
9 ── Core.NewvarNode(:(x@_7))
└─── x@_7
10 ┄ %41 = Core.getfield(x@_3, :contents)::Any
└─── return %41

I am on an 8-core ryzen2 Ubuntu laptop, running Julia 1.10.0. I tried other releases, with no difference. Thanks for any help, since this caused me a couple of hours of painful debugging!


This is the infamous closure performance issue ("performance of captured variables in closures", Issue #15276 in JuliaLang/julia on GitHub), because @threads creates a closure under the hood. See also the section of the performance tips on captured variables and various discussions here on Discourse.

It’s an annoying performance bug that has been with us for a long, long time, and fixing it requires a lot of fundamental work.
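To see that the closure is the culprit rather than threading itself, here is a small sketch without any @threads. The names boxed and unboxed are made up for illustration, and the behaviour described is what I would expect on the Julia versions discussed in this thread:

```julia
# Hypothetical names; no threading involved, only a closure capturing x.
function boxed()
    x = zeros(0)
    x = zeros(3)          # x is assigned twice ...
    f = () -> sum(x)      # ... and captured, so it ends up in a Core.Box
    return f()
end

function unboxed()
    x = zeros(3)          # x is assigned once
    f = () -> sum(x)      # captured with its concrete type
    return f()
end

# Inferred return types: boxed is expected to report Any, unboxed Float64.
println(Base.return_types(boxed, ()))
println(Base.return_types(unboxed, ()))
```

Both functions return the same value; only the inferred type differs, which is the same pattern as Body::Any versus Body::Vector{Float64} in the original post.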

The way I typically work around it is to put the multithreaded kernel by itself in a separate function, so that conditionals (and potential post-processing steps) and the multithreaded loop are not in the same function.
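A minimal sketch of that workaround, assuming a conditional allocation like the one described in the original post. The names fill_ones! and driver, the flag big, and the sizes are all invented for illustration:

```julia
using Base.Threads: @threads

# Hypothetical kernel: only the threaded loop lives here. x arrives as an
# argument and is never reassigned, so the closure made by @threads can
# capture it without boxing.
function fill_ones!(x::AbstractVector{Float64})
    @threads for i in eachindex(x)
        x[i] = 1.0
    end
    return x
end

# Hypothetical driver: the conditional reallocation happens here, where no
# closure captures x.
function driver(big::Bool)
    x = zeros(10)
    if big
        x = zeros(1_000_000)
    end
    return fill_ones!(x)
end
```

The function call acts as a barrier: whatever happens to x in driver, the kernel is compiled for the concrete argument type it receives.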


Note that even if this is usually called a “performance issue”, it’s really a correctness issue:

using Base.Threads: @spawn

function g(n)
    r1 = sum(fetch, [
        @spawn begin
            x = i
            Libc.systemsleep(1e-6*rand())
            return x
        end for i in 1:n
    ])
    false && (x = 0)
    return r1
end

This g function cannot be relied on to compute sum(1:n) when Threads.nthreads() > 1. The (x = 0) is never executed (and most likely entirely optimized away), but its mere presence makes x a local variable of g, so the x assigned inside every spawned task is that one shared variable and tasks can overwrite each other’s values.
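One possible way to make each task keep its own x, sketched under the assumption that the rest of g stays as it is, is to declare x local inside the spawned block. The name g_fixed is made up here:

```julia
using Base.Threads: @spawn

# Hypothetical fixed variant of g: `local x` gives every task its own x.
function g_fixed(n)
    r1 = sum(fetch, [
        @spawn begin
            local x = i                      # task-local, shadows the outer x
            Libc.systemsleep(1e-6 * rand())
            x                                # value returned by the task
        end for i in 1:n
    ])
    false && (x = 0)                         # still makes x a local of g_fixed
    return r1
end
```

With the local declaration, g_fixed(n) should equal sum(1:n) regardless of the number of threads, since no task writes to a variable another task reads.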
