Type instability because of @threads boxing variables

Hm, I’m wondering why. Do you have any suspicion?

In a recent thread I had a pretty big kernel, which was sped up by Polyester.

I’m not sure profiling information is always 100% exact (for example, I don’t even know if and how macros have to manage LineNumberNodes), and I believe I have seen some oddities when profiling. I never bothered enough to investigate (especially as I prefer an outdated profiler).

In my case the threading time is irrelevant relative to the actual time performing the computations.

I am somewhat worried about the allocations there, though, because I have a use case where I would be calling the function that performs those operations thousands of times, and if each call allocates ~30 MB (which is what I see), that will start to be a problem.

Specifically here I think I can take the burden of the threading on a higher level, and since the non-threaded loop is non-allocating, that solves that specific use case. But it is not what I expect to do in general.
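A rough sketch of what I mean by moving the threading up a level (hypothetical names, just to illustrate):

function kernel!(out)  # stands in for the serial, non-allocating inner loop
    for i in eachindex(out)
        out[i] += rand()
    end
end

function run_many!(outputs)  # thread over the independent calls instead
    Threads.@threads for k in eachindex(outputs)
        kernel!(outputs[k])
    end
end

outputs = [zeros(10^4) for _ in 1:1000]
run_many!(outputs)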

I tried your example code a couple of times, but with 4 threads I never see any allocations with julia 1.7.1. Are you using a different version and/or widely varying numbers of threads?

1.9.0-DEV and julia -t 5 here.

Hmm, with -t5 I indeed now also see a few runs reporting allocations.

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i5-11500H @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12

Nothing special (note that the allocations are reported only by @allocated). This is with the first code snippet.

Actually, for the moment, in any minimal example I try I get 85 or 86 allocations (~71 KiB), which would be fine. The build-up of allocations I’m seeing in my actual example is not easy to reproduce. I have to sort out what is causing that.

So, I think I have an MWE (and it is a strange type instability):

Copy and paste the code below, and you will see that running:

run()

will result in a type instability or not, depending on which of these two lines of the test function is active:

    reduce!(list, list_threaded) # type stable
    #list = reduce!(list, list_threaded) # type unstable

The problem is inferring the type of list. Because it is mutable, and reduce! also returns list, both calls above should be equivalent (and they are, except when using @threads).

Thus, removing the reassignment of list on the return of reduce! solves the problem. Except that the reason I reassign is that it lets me use the same function to operate on mutable or immutable list variables; if I remove the reassignment, I cannot use the same interface for immutables. In any case, it seems like an issue that this is affected by the use of @threads in some other part of the code.
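For concreteness, the interface pattern I mean is roughly the following (the SVector method is only a hypothetical sketch of the immutable case):

using StaticArrays

# immutables cannot be updated in place, so this method returns a new value;
# the caller writes `list = reduce!(list, list_threaded)` in both cases
reduce!(list::SVector, list_threaded) = sum(list_threaded)

list = @SVector zeros(3)
list_threaded = [rand(SVector{3,Float64}) for _ in 1:4]
list = reduce!(list, list_threaded)  # same call site as with Vector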

The example:

using InteractiveUtils # for @code_warntype when running as a script

function reduce!(list, list_threaded)
    list .= list_threaded[1]          # initialize with the first buffer
    for i in 2:length(list_threaded)
        list .+= list_threaded[i]     # accumulate the remaining buffers
    end
    return list
end

function test(list,list_threaded)
    #for it in 1:Threads.nthreads() # type stable
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    #reduce!(list, list_threaded) # type stable
    list = reduce!(list, list_threaded) # type unstable
    return list
end

function run()
    list = zeros(10^4)
    list_threaded = [ copy(list) for _ in 1:Threads.nthreads() ]
    @code_warntype test(list, list_threaded)
end

FWIW, @floop detects this and warns about it. You can also set FLoops.assistant(:error) (the default ATM is :warn) in a test suite to throw an error instead. See also How to avoid Box · FLoops for a case-by-case how-to guide on working around this.
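On the MWE above, the usage would look something like this (a rough sketch, untested here):

using FLoops

FLoops.assistant(:error)  # upgrade the boxed-variable warning to an error

function test_floop(list, list_threaded)
    @floop for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    list = reduce!(list, list_threaded)  # this reassignment boxes `list`,
    return list                          # which the assistant should flag
end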

Polyester tries to pass everything as a function argument instead of through a closure, which is how it reduces allocations in code like this.
Might be worth considering a similar approach in @floop?

Closures are a crucial abstraction. I prefer to fix the compiler.

Sure, that is a better use of time long term.
But it has been a problem for years, so unless someone is actively working on it, I wouldn’t expect this to change.

Personally, I’d much prefer semantics that didn’t allow mutation of bindings. If there was a single thing I could change about Julia the language (not Julia the implementation), it would be this.
That’d eliminate the performance foot gun, and also eliminate this source of confusing/unexpected behavior. Allowing changing bindings only has downsides as far as I’m concerned.

I think the hurdle is rather coming up with an argument that can convince the OG devs that the complexities are worth the trouble (I’m working on this point). I think/hope the implementation is actually relatively straightforward.

I agree that @threads seems to affect (break?) type inference, which I’m unsure how to assess (somewhat surprising but not completely unexpected?). It seems to me we can fix this with

using BenchmarkTools

function reduce!(list, list_threaded)
    list .= list_threaded[1]          # initialize with the first buffer
    for i in 2:length(list_threaded)
        list .+= list_threaded[i]     # accumulate the remaining buffers
    end
    return list
end

function test_serial(list,list_threaded)
    for it in 1:Threads.nthreads() 
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    list = reduce!(list, list_threaded) 
    return list
end

function test_parallel_broken(list,list_threaded)
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    list = reduce!(list, list_threaded)
    return list
end

function test_parallel(list,list_threaded)
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    result = reduce!(list, list_threaded) 
    return result
end

list = zeros(10^4)
list_threaded = [ copy(list) for _ in 1:Threads.nthreads() ]
@btime test_serial($list, $list_threaded)
@btime test_parallel_broken($list, $list_threaded)
@btime test_parallel($list, $list_threaded)
println()

yielding

  32.000 μs (0 allocations: 0 bytes)        # serial
  282.900 μs (59020 allocations: 1.06 MiB)  # parallel broken
  31.700 μs (31 allocations: 3.30 KiB)      # parallel

Or am I missing something?

There are workarounds for that specific code, yes, but they work somewhat by chance. In my case I can trigger the type instability by changing other parts of the code, so they do not solve the underlying problem.

In my specific case I check whether list_threaded is nothing, and if it is, I initialize it. This check (whether or not it is satisfied) “causes” the type instability, so the underlying problem is the same, but the “hack” to solve it is different (thus, likely something I cannot really rely on).
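Roughly the pattern I mean (a hypothetical reconstruction, not my actual code):

function test2(list, list_threaded)
    if isnothing(list_threaded)
        list_threaded = [zero(list) for _ in 1:Threads.nthreads()] # this reassignment...
    end
    Threads.@threads for it in 1:Threads.nthreads() # ...plus this capture boxes `list_threaded`
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    return reduce!(list, list_threaded)
end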

Just to confirm, this is the infamous:

https://github.com/JuliaLang/julia/issues/15276

again, right?

Yes.
You could try adding let blocks.
Note that Threads.@spawn has sugar where using $ on variables automatically creates let blocks for you:

julia> @macroexpand Threads.@spawn foo($x)
quote
    #= threadingconstructs.jl:229 =#
    let var"##394" = x
        #= threadingconstructs.jl:230 =#
        local var"#14#task" = Base.Threads.Task((()->begin
                            #= threadingconstructs.jl:226 =#
                            foo(var"##394")
                        end))
        #= threadingconstructs.jl:231 =#
        (var"#14#task").sticky = false
        #= threadingconstructs.jl:232 =#
        if $(Expr(:islocal, Symbol("##sync#48")))
            #= threadingconstructs.jl:233 =#
            Base.Threads.put!(var"##sync#48", var"#14#task")
        end
        #= threadingconstructs.jl:235 =#
        Base.Threads.schedule(var"#14#task")
        #= threadingconstructs.jl:236 =#
        var"#14#task"
    end
end

But this isn’t supported for Threads.@threads, although it’s probably easy enough to create the let blocks manually, unless you have an absurd number of variables. (Fun fact: let used to be O(N^2) in the number of variables until very recent versions of Julia [1.8+?], where it is now O(N), thanks to some code with thousands of lets making this an actual problem.) This is the classic workaround.
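Applied to the MWE above, the manual version would look something like this (untested sketch):

function test_let(list, list_threaded)
    let list = list, list_threaded = list_threaded
        # the closure now captures the let-locals, which are never reassigned,
        # so nothing gets boxed
        Threads.@threads for it in 1:Threads.nthreads()
            for i in it:Threads.nthreads():length(list)
                list_threaded[it][i] += rand()
            end
        end
    end
    list = reduce!(list, list_threaded)
    return list
end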

You could also manually chunk your iteration range, perhaps separating the code that iterates over a chunk into a separate function, and use @spawn, as sketched below. I also often do this anyway when I want more control over task-local state, e.g. RNGs. @spawn is of course more dynamic than @threads, which you may or may not want.
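A sketch of that chunking pattern, with the function barrier and hypothetical helper names:

function work_on_chunk!(buf, list, it, nchunks)  # the function barrier
    for i in it:nchunks:length(list)
        buf[i] += rand()
    end
end

function test_spawn(list, list_threaded)
    nchunks = length(list_threaded)
    @sync for it in 1:nchunks
        Threads.@spawn work_on_chunk!(list_threaded[it], list, it, nchunks)
    end
    return reduce!(list, list_threaded)
end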

I’m actually using @spawn in my code. Thanks for the tip.

I think that will be the best workaround here.

Regardless of whether or not there are performance implications, I would still strongly prefer it if mutating bindings from closures were not allowed.

julia> @testset "Closure" begin
       function foo(x)
         z = 2x
         return 3z
       end
       z = 3
       @test foo(z) == 18
       @test z == 3
       end
Closure: Test Failed at REPL[6]:8
  Expression: z == 3
   Evaluated: 6 == 3
Stacktrace:
 [1] macro expansion
   @ ~/Documents/languages/julia/usr/share/julia/stdlib/v1.9/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ REPL[6]:8 [inlined]
 [3] macro expansion
   @ ~/Documents/languages/julia/usr/share/julia/stdlib/v1.9/Test/src/Test.jl:1360 [inlined]
 [4] top-level scope
   @ REPL[6]:2
Test Summary: | Pass  Fail  Total  Time
Closure       |    1     1      2  0.3s
ERROR: Some tests did not pass: 1 passed, 1 failed, 0 errored, 0 broken.

This is just confusing.
If you encounter this bug in the wild – without knowing to look for it – you might try copy/pasting it into the REPL, only to find out that the tests pass / you can’t reproduce it.
Spooky heisenbugs at a distance.

When someone actually does want this behavior, it’d be much better to make it explicit via capturing and mutating a RefValue or some other mutable struct.
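Something like this minimal sketch of the explicit style I have in mind:

function counter()
    n = Ref(0)               # explicit mutable cell
    return () -> (n[] += 1)  # the closure mutates the Ref’s contents, not the binding
end

inc = counter()
inc()  # 1
inc()  # 2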

Still, it would be nice for the performance to get fixed, too.

Very interesting. My workflow for common performance pitfalls is usually to ask a profiler and/or JET.jl for advice. Do you think the compiler itself should warn us somehow (this would resemble Clang’s sanitizers)?
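For reference, the JET check I mean looks roughly like this on the MWE above (from memory; the exact report format may differ):

using JET

@report_opt test(list, list_threaded)  # should report the captured (boxed) variable
                                       # in addition to any runtime dispatch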