Type-instability because of @threads boxing variables

See Type-instability because of @threads boxing variables - #12 by lmiq for a more precise description of the problem.

Why are allocations occurring (or at least being reported by @allocated) in this example, seemingly at random?

Note that I’m measuring the allocations of the innermost operation, which I would think is computationally independent of the threading (or is it not?).

julia> function test(x)
           nt = Threads.nthreads()
           s = zeros(eltype(x),nt)
           @sync for it in 1:nt
               Threads.@spawn for i in it:nt:length(x)
                   a = @allocated begin
                       s[it] += x[i]
                   end; a > 0 && @show a
               end
           end
           return sum(s)
       end
test (generic function with 1 method)

julia> x = rand(5);

julia> test(x)
a = 16
3.630101473711777

julia> test(x)
a = 496
a = 64
3.630101473711777

julia> test(x)
3.630101473711777

julia> test(x)
a = 368
a = 32
3.630101473711777


Measuring the same thing with BenchmarkTools’ @ballocated does not report allocations, but I don’t know if that is because I am helping the specialization of something by introducing a new scope:

julia> function test(x)
           nt = Threads.nthreads()
           s = zeros(eltype(x),nt)
           @sync for it in 1:nt
               Threads.@spawn for i in it:nt:length(x)
                   a = @ballocated let i = $i, it = $it, s = $s, x = $x
                       s[it] += x[i]
                   end; a > 0 && @show a
               end
           end
           return sum(s)
       end
test (generic function with 1 method)

julia> test(x)
3.811788778147674e7

julia> test(x)
3.811788778147674e7

julia> test(x)
3.811788778147674e7

Uhm… I think I recall that this is because threading creates a closure, and this is the famous problem of captured variables in closures?
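
For reference, here is a minimal sketch (not code from this thread) of the boxing behavior in question: when a closure captures a variable that is also reassigned, lowering wraps it in a Core.Box, and accesses through the box are type-unstable and may allocate.

function boxed_example()
    x = 0.0
    f = () -> x + 1.0   # the closure captures `x`
    x = 1.0             # reassigning a captured variable forces a Core.Box
    return f()          # inferred as Any; @code_warntype boxed_example() shows the box
end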

Any easy workaround for that in this specific case?

No workaround and no solution, only small improvements (which might not apply, because I believe you are @spawn-ing intentionally). This was a good opportunity for me to check out Polyester with atomics. Here is the benchmark:

using Random
using BenchmarkTools
using Polyester

function test1(x)
    nt = Threads.nthreads()
    s = zeros(eltype(x),nt)
    @sync for it in 1:nt
        Threads.@spawn for i in it:nt:length(x)
            s[it] += x[i]
        end
    end
    return sum(s)
end

function test2(x)
    nt = Threads.nthreads()
    s = zeros(eltype(x),nt)
    Threads.@threads for it in 1:nt
        for i in it:nt:length(x)
            s[it] += x[i]
        end
    end
    return sum(s)
end

mutable struct Atomic{T}; @atomic x::T; end

function test3(x)
    nt = Threads.nthreads()
    s = Atomic{Float64}(0)
    Threads.@threads for it in 1:nt
        @atomic s.x += sum(@view x[it:nt:length(x)])
    end
    s.x
end

function test4(x, nb)
    s = Atomic{Float64}(0)
    @batch for it in 1:nb
        @atomic s.x += sum(@view x[it:nb:length(x)])
    end
    s.x
end

Random.seed!(42)
x=rand(1000000)
res1 = test1(x)
res2 = test2(x)
@assert res2 ≈ res1
res3 = test3(x)
@assert res3 ≈ res1
res4 = test4(x, 10)
@assert res4 ≈ res1

@btime test1(x) setup=(Random.seed!(42); x=rand(1000000))
@btime test2(x) setup=(Random.seed!(42); x=rand(1000000))
@btime test3(x) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 1) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 10) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 100) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 1000) setup=(Random.seed!(42); x=rand(1000000))

yielding

  654.600 μs (43 allocations: 3.81 KiB)  # @spawn
  646.700 μs (31 allocations: 3.44 KiB)  # @threads
  438.400 μs (31 allocations: 3.36 KiB)  # @threads + atomics
  397.900 μs (1 allocation: 16 bytes) # @batch number of batches = 1
  217.400 μs (2 allocations: 64 bytes) # @batch number of batches = 10
  208.200 μs (2 allocations: 64 bytes) # @batch number of batches = 100
  208.300 μs (2 allocations: 64 bytes) # @batch number of batches = 1000

In my specific case Polyester does not improve the performance (though it does remove most of the allocations).

But what intrigues me most is not that threading allocates, but that allocations are reported as if they were occurring in the operations that are local to each thread, while these should all be allocation-free (and they are, without threading or when using Polyester).

That makes it quite hard to track the problem, or related problems.
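
One plausible mechanism (my speculation; it is not confirmed in this thread): when a captured variable is boxed, each write of a Float64 result into the Core.Box heap-allocates the value, so @allocated attributes the allocation to the innermost statement. A minimal sketch:

function boxed_alloc()
    s = 0.0
    f = () -> (s += 1.0)   # `s` is captured and reassigned, so it is boxed
    a = @allocated f()     # storing the new Float64 into the box heap-allocates
    return a               # often reports 16 bytes, like the `a = 16` above
end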

Hm, I’m wondering why. Do you have any suspicion?

In a recent thread I had a pretty big kernel which was sped up by Polyester.

I’m not sure if profiling information is always 100% exact (for example, I don’t even know if and how macros have to manage LineNumberNodes), and I believe I have seen some oddities when profiling. I never bothered enough to investigate (especially as I prefer an outdated profiler).

In my case the threading time is irrelevant relative to the actual time performing the computations.

I am somewhat worried about the allocations there, though, because I have a use case where I would be calling the function that performs those operations thousands of times, and if each call allocates ~30 MB (which is what I see), that will start to be a problem.

Specifically, here I think I can take the burden of the threading at a higher level, and since the non-threaded loop is non-allocating, that solves this specific use case. But it is not what I expect to do in general.

I tried your example code a couple of times, but with 4 threads I never see any allocations on Julia 1.7.1. Are you using a different version and/or widely varying numbers of threads?


1.9.0-DEV and julia -t 5 here.

Hmm, with -t 5 I indeed now also see a few runs reporting allocations.

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i5-11500H @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 12

Nothing special (note that the allocations are reported with @allocated only), running the first code snippet.

Actually, for the moment, in any minimal example I try I get 85 or 86 allocations (71 KiB), which would be fine. The build-up of allocations I’m seeing in my actual example is not easy to reproduce. I have to sort out what is causing that.

So, I think I have an MWE (and it is a strange type instability):

Copy and paste the code below, and you will see that running:

run()

will result in a type instability, or not, depending on these lines of the test function:

    reduce!(list, list_threaded) # type stable
    #list = reduce!(list, list_threaded) # type unstable

The problem is inferring the type of list. Because list is mutable, and reduce! also returns list, both calls above should be equivalent (and they are, except when using @threads).

Thus, removing the reassignment of list on the return of reduce! solves the problem. Except that the reason I am using that pattern is that it lets me use the same function to operate on mutable or immutable list variables (see the sketch after the example below). If I remove the reassignment, I cannot use the same interface to operate on immutables. In any case, it seems an issue that this is affected by the use of @threads in some other part of the code.

The example:

function reduce!(list,list_threaded)
    list .= list_threaded[1]
    for i in 2:length(list_threaded)
        list .+= list_threaded[i]
    end
    return list
end

function test(list,list_threaded)
    #for it in 1:Threads.nthreads() # type stable
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    #reduce!(list, list_threaded) # type stable
    list = reduce!(list, list_threaded) # type unstable
    return list
end

function run()
    list = zeros(10^4)
    list_threaded = [ copy(list) for _ in 1:Threads.nthreads() ]
    @code_warntype test(list, list_threaded)
end
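
For context, here is a hedged sketch of the mutable/immutable interface mentioned above, with a hypothetical method for a tuple-based list (my illustration, not code from the thread):

# Hypothetical non-mutating method so the same `list = reduce!(list, list_threaded)`
# call shape works for immutable containers as well.
function reduce!(list::NTuple{N,T}, list_threaded) where {N,T}
    return ntuple(i -> sum(lt[i] for lt in list_threaded), Val(N))
end

list = (0.0, 0.0)
list_threaded = [(1.0, 2.0), (3.0, 4.0)]
list = reduce!(list, list_threaded)   # (4.0, 6.0); same call as for a Vector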

FWIW, @floop detects and warns about this. You can also set FLoops.assistant(:error) (the default ATM is :warn) in a test suite to throw an error. See also How to avoid Box · FLoops for a case-by-case how-to guide on working around this.
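
Applied to the MWE above (reusing its reduce!), the standard rebinding workaround from that guide would look something like this; my adaptation, not code posted in the thread. Wrapping the @threads loop in let list = list gives the closure a fresh binding that is never reassigned, so it is not boxed:

function test_let(list, list_threaded)
    let list = list  # fresh local binding; the @threads closure captures this one
        Threads.@threads for it in 1:Threads.nthreads()
            for i in it:Threads.nthreads():length(list)
                list_threaded[it][i] += rand()
            end
        end
    end
    list = reduce!(list, list_threaded)  # the reassignment no longer forces a box
    return list
end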


Polyester tries to pass everything as a function argument instead of through a closure, which is how it reduces allocations in code like this.
Might be worth considering a similar approach in @floop?
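
A sketch of that argument-passing style applied to the first example in this thread (my adaptation; this is not Polyester’s internal code): moving the loop body into a separate function means the hot loop runs inside a compiled function with concretely typed arguments, and the @spawn closure only captures locals that are never reassigned:

function worker!(s, x, it, nt)
    for i in it:nt:length(x)
        s[it] += x[i]
    end
    return nothing
end

function test_args(x)
    nt = Threads.nthreads()
    s = zeros(eltype(x), nt)
    @sync for it in 1:nt
        Threads.@spawn worker!(s, x, it, nt)
    end
    return sum(s)
end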

Closures are a crucial abstraction. I prefer to fix the compiler.


Sure, that is a better use of time long term.
But it has been a problem for years, so unless someone is actively working on it, I wouldn’t expect this to change.

Personally, I’d much prefer semantics that didn’t allow mutation of bindings. If there were a single thing I could change about Julia the language (not Julia the implementation), it would be this.
That would eliminate the performance footgun, and also eliminate this source of confusing/unexpected behavior. Allowing the rebinding of variables only has downsides as far as I’m concerned.

I think the hurdle is rather coming up with an argument that can convince the OG devs that the complexities are worth the trouble (I’m working on this point). I think/hope the implementation is actually relatively straightforward.


I agree that @threads seems to affect (break?) type inference, which I’m unsure how to assess (somewhat surprising but not completely unexpected?). It seems to me we can fix this with

using BenchmarkTools

function reduce!(list,list_threaded)
    list .= list_threaded[1]
    for i in 2:length(list_threaded)
        list .+= list_threaded[i]
    end
    return list
end

function test_serial(list,list_threaded)
    for it in 1:Threads.nthreads() 
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    list = reduce!(list, list_threaded) 
    return list
end

function test_parallel_broken(list,list_threaded)
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    list = reduce!(list, list_threaded)
    return list
end

function test_parallel(list,list_threaded)
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    result = reduce!(list, list_threaded) 
    return result
end

list = zeros(10^4)
list_threaded = [ copy(list) for _ in 1:Threads.nthreads() ]
@btime test_serial(list, list_threaded)
@btime test_parallel_broken(list, list_threaded)
@btime test_parallel(list, list_threaded)
println()

yielding

  32.000 μs (0 allocations: 0 bytes)        # serial
  282.900 μs (59020 allocations: 1.06 MiB)  # parallel broken
  31.700 μs (31 allocations: 3.30 KiB)      # parallel

Or am I missing something?

There are these workarounds for that specific code, yes, but they work somewhat by chance. In my case I can trigger the type instability by changing other parts of the code, so this does not solve the underlying problem.

In my specific case I check whether list_threaded is nothing, and if it is, I initialize it. This check (whether or not it is satisfied) “causes” the type instability, so the underlying problem is the same, but the “hack” to solve it is different (thus, likely something I cannot really rely on).
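
A hedged reconstruction of that pattern (my illustration, reusing reduce! from the MWE; not the actual code): the conditionally initialized argument is assigned twice, once as the parameter and once in the branch, so as soon as the @threads closure captures it, it gets boxed:

function test_init(list, list_threaded::Union{Nothing,Vector{Vector{Float64}}})
    if list_threaded === nothing
        # second assignment to `list_threaded`; combined with the capture
        # below, this forces a Core.Box whether or not the branch is taken
        list_threaded = [copy(list) for _ in 1:Threads.nthreads()]
    end
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    return reduce!(list, list_threaded)
end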


Just to confirm, this is the infamous performance of captured variables in closures issue again, right?