# Type-instability because of @threads boxing variables

See Type-instability because of @threads boxing variables - #12 by lmiq
for a more precise description of the problem.

Why are allocations occurring (or being reported, at least, by `@allocated`) in this example, sort of randomly?

Note that I’m measuring the allocations of the innermost operation, which I would think should be computationally independent of the threading.

```julia
julia> function test(x)
           nt = Threads.nthreads()
           s = zeros(eltype(x), nt)
           @sync for it in 1:nt
               Threads.@spawn for i in it:nt:length(x)
                   a = @allocated begin
                       s[it] += x[i]
                   end; a > 0 && @show a
               end
           end
           return sum(s)
       end
test (generic function with 1 method)

julia> x = rand(5);

julia> test(x)
a = 16
3.630101473711777

julia> test(x)
a = 496
a = 64
3.630101473711777

julia> test(x)
3.630101473711777

julia> test(x)
a = 368
a = 32
3.630101473711777
```

Measuring the same thing with `@ballocated` does not report allocations, but I don’t know if that is because I am helping the specialization of something by introducing a new scope:

```julia
julia> function test(x)
           nt = Threads.nthreads()
           s = zeros(eltype(x), nt)
           @sync for it in 1:nt
               Threads.@spawn for i in it:nt:length(x)
                   a = @ballocated let i = $i, it = $it, s = $s, x = $x
                       s[it] += x[i]
                   end; a > 0 && @show a
               end
           end
           return sum(s)
       end
test (generic function with 1 method)

julia> test(x)
3.811788778147674e7

julia> test(x)
3.811788778147674e7

julia> test(x)
3.811788778147674e7

julia> test(x)
3.811788778147674e7

julia> test(x)
3.811788778147674e7

julia> test(x)
3.811788778147674e7
```

Uhm… I think I recall that this happens because threading creates a closure, and this is the famous problem of captured variables in closures?

Any easy workaround for that in this specific case?
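
The boxing can in fact be reproduced without any threading at all. A minimal sketch (the function names here are made up for illustration):

```julia
# If a closure captures a variable that is also rebound, lowering
# stores that variable in a `Core.Box`, and every access through the
# box is type-unstable.
function boxed()
    x = 1
    f = () -> x    # `x` is captured here...
    x = 2          # ...and rebound here, so `x` gets boxed
    return f()
end

# Introducing a new binding instead of rebinding avoids the box.
function unboxed()
    x = 1
    y = x          # `y` is assigned exactly once
    f = () -> y
    return f()
end
```

`@code_warntype boxed()` reports `x::Core.Box`, while `unboxed()` infers cleanly.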

No workaround and no solution, only small improvements (which might not apply, because I believe you are `@spawn`-ing intentionally). This was a good opportunity for me to check out `Polyester` with atomics. Here is the benchmark:

```julia
using Random
using BenchmarkTools
using Polyester

function test1(x) # @spawn
    nt = Threads.nthreads()
    s = zeros(eltype(x), nt)
    @sync for it in 1:nt
        Threads.@spawn for i in it:nt:length(x)
            s[it] += x[i]
        end
    end
    return sum(s)
end

function test2(x) # @threads
    nt = Threads.nthreads()
    s = zeros(eltype(x), nt)
    Threads.@threads for it in 1:nt
        for i in it:nt:length(x)
            s[it] += x[i]
        end
    end
    return sum(s)
end

mutable struct Atomic{T}; @atomic x::T; end

function test3(x) # @threads + atomics
    nt = Threads.nthreads()
    s = Atomic{Float64}(0)
    Threads.@threads for it in 1:nt
        @atomic s.x += sum(@view x[it:nt:length(x)])
    end
    s.x
end

function test4(x, nb) # @batch
    s = Atomic{Float64}(0)
    @batch for it in 1:nb
        @atomic s.x += sum(@view x[it:nb:length(x)])
    end
    s.x
end

Random.seed!(42)
x = rand(1000000)
res1 = test1(x)
res2 = test2(x)
@assert res2 ≈ res1
res3 = test3(x)
@assert res3 ≈ res1
res4 = test4(x, 10)
@assert res4 ≈ res1

@btime test1(x) setup=(Random.seed!(42); x=rand(1000000))
@btime test2(x) setup=(Random.seed!(42); x=rand(1000000))
@btime test3(x) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 1) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 10) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 100) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 1000) setup=(Random.seed!(42); x=rand(1000000))
```

yielding

```
  654.600 μs (43 allocations: 3.81 KiB)  # @spawn
  646.700 μs (31 allocations: 3.44 KiB)  # @threads
  438.400 μs (31 allocations: 3.36 KiB)  # @threads + atomics
  397.900 μs (1 allocation: 16 bytes)    # @batch, number of batches = 1
  217.400 μs (2 allocations: 64 bytes)   # @batch, number of batches = 10
  208.200 μs (2 allocations: 64 bytes)   # @batch, number of batches = 100
  208.300 μs (2 allocations: 64 bytes)   # @batch, number of batches = 1000
```

In my specific case Polyester does not improve the performance (it does remove most allocations).

But what intrigues me the most is not that threading allocates, but that the allocations are reported as if they occurred in the operations local to each thread, which should all be allocation-free (and they are, without threading or when using Polyester).

That makes it quite hard to track the problem, or related problems.

Hm, I’m wondering why. Do you have any suspicion?

In a recent thread I had a pretty big kernel which was sped up by `Polyester`.

I’m not sure profiling information is always 100% exact (for example, I don’t even know if and how macros have to manage `LineNumberNode`s), and I believe I have seen some oddities when profiling. I never bothered enough to investigate (especially as I prefer an outdated profiler).

In my case the threading time is irrelevant relative to the actual time performing the computations.

I am somewhat worried about the allocations there, though, because I have a use case where I would be calling the function that performs those operations thousands of times, and if each call allocates ~30 MB (which is what I see), that will start to be a problem.

Specifically here I think I can take the burden of the threading on a higher level, and since the non-threaded loop is non-allocating, that solves that specific use case. But it is not what I expect to do in general.

I tried your example code a couple of times, but with 4 threads I never see any allocations with julia 1.7.1. Are you using a different version and/or widely varying numbers of threads?


1.9.0-DEV and `julia -t 5` here.

Hmm, with `-t5` I indeed now also see a few runs reporting allocations.

```julia
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 11th Gen Intel(R) Core(TM) i5-11500H @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake)
Environment:
  JULIA_EDITOR = code
```

Nothing special (note that the allocations are reported by `@allocated` only), running the first code snippet.

Actually, for the moment, in any minimal example I try I get 85 or 86 allocations (~71 KB), which would be fine. The build-up of allocations I’m seeing in my actual example is not easy to reproduce; I have to sort out what is causing that.

So, I think I have a MWE (and it is a strange type-instability):

Copy and paste the code below, and you will see that running:

```julia
run()
```

will result in a type instability, or not, depending on these lines of the `test` function:

```julia
    reduce!(list, list_threaded)          # type stable
    #list = reduce!(list, list_threaded)  # type unstable
```

The problem is inferring the type of `list`. Because it is mutable and `reduce!` also returns `list`, both calls above should be equivalent (and they are, except when using `@threads`).

Thus, removing the reassignment of `list` on the return of `reduce!` solves the problem. Except that the reason I use the reassignment is that it lets the same function operate on both mutable and immutable `list` variables; if I remove it, I cannot use the same interface for immutables. In any case, it seems an issue that this is affected by the use of `@threads` in some other part of the code.
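
One pattern that sidesteps this (a sketch with invented names, not the actual MWE code) is a function barrier: move the `@threads` loop into its own function, so that the function which rebinds `list` contains no closure at all:

```julia
# Hypothetical reduce: accumulates the per-thread buffers into `list`.
function my_reduce!(list, list_threaded)
    for lt in list_threaded
        list .+= lt
    end
    return list
end

# The threaded part lives in its own function: the closure created by
# @threads captures only this function's arguments, which are never
# rebound, so nothing gets boxed.
function fill_chunks!(list_threaded)
    Threads.@threads for it in eachindex(list_threaded)
        list_threaded[it] .= it
    end
    return nothing
end

function test_stable!(list, list_threaded)
    fill_chunks!(list_threaded)
    list = my_reduce!(list, list_threaded)  # rebinding is harmless here
    return list
end
```

Since `test_stable!` creates no closure, the rebinding of `list` no longer interacts with `@threads` at all.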

## The example:

```julia
function reduce!(list, list_threaded)
    list .= list_threaded[1]
    for i in 2:length(list_threaded)
        list .+= list_threaded[i]
    end
    return list
end

function test(list, list_threaded)
    Threads.@threads for it in 1:Threads.nthreads()
        list_threaded[it] .= it
    end
    #reduce!(list, list_threaded) # type stable
    list = reduce!(list, list_threaded) # type unstable
    return list
end

function run()
    list = zeros(10^4)
    list_threaded = [zero(list) for _ in 1:Threads.nthreads()]
    return test(list, list_threaded)
end
```

FWIW, `@floop` detects this and warns about it. You can also set `FLoops.assistant(:error)` (the default ATM is `:warn`) in a test suite to throw an error instead. See also How to avoid Box · FLoops for a case-by-case how-to guide on working around this.


Polyester tries to pass everything as a function argument instead of through a closure, which is how it reduces allocations in code like this.
Might be worth considering a similar approach in `@floop`?
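
The same trick works with plain `@spawn` by hand; a sketch (the names here are invented), assuming a chunked sum like the earlier examples:

```julia
# Everything the task needs is passed as a function argument, so the
# closure created by @spawn captures only local, never-rebound
# variables and nothing gets boxed.
function chunk_sum!(s, x, it, nt)
    for i in it:nt:length(x)
        s[it] += x[i]
    end
    return nothing
end

function test_args(x)
    nt = Threads.nthreads()
    s = zeros(eltype(x), nt)
    @sync for it in 1:nt
        Threads.@spawn chunk_sum!(s, x, it, nt)
    end
    return sum(s)
end
```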

Closure is a crucial abstraction. I prefer to fix the compiler.


Sure, that is a better use of time long term.
But it has been a problem for years, so unless someone is actively working on it, I wouldn’t expect this to change.

Personally, I’d much prefer semantics that didn’t allow mutation of bindings. If there was a single thing I could change about Julia the language (not Julia the implementation), it would be this.
That’d eliminate the performance foot gun, and also eliminate this source of confusing/unexpected behavior. Allowing changing bindings only has downsides as far as I’m concerned.

I think the hurdle is rather coming up with an argument that can convince the OG devs that the complexities are worth the trouble (I’m working on this point). I think/hope the implementation is actually relatively straightforward.


I agree that `@threads` seems to affect (break?) type inference, which I’m unsure how to assess (somewhat surprising, but not completely unexpected?). It seems to me we can fix this with

```julia
using BenchmarkTools

function reduce!(list, list_threaded)
    list .= list_threaded[1]
    for i in 2:length(list_threaded)
        list .+= list_threaded[i]
    end
    return list
end

function test_serial!(list, list_threaded)
    for it in 1:Threads.nthreads()
        list_threaded[it] .= it
    end
    list = reduce!(list, list_threaded)
    return list
end

function test_parallel_broken!(list, list_threaded)
    Threads.@threads for it in 1:Threads.nthreads()
        list_threaded[it] .= it
    end
    list = reduce!(list, list_threaded) # rebinds the captured `list`
    return list
end

function test_parallel!(list, list_threaded)
    Threads.@threads for it in 1:Threads.nthreads()
        list_threaded[it] .= it
    end
    result = reduce!(list, list_threaded) # new binding, no boxing
    return result
end

list = zeros(10^4)
list_threaded = [zero(list) for _ in 1:Threads.nthreads()]
@btime test_serial!($list, $list_threaded)
@btime test_parallel_broken!($list, $list_threaded)
@btime test_parallel!($list, $list_threaded)
println()
```

yielding

```
  32.000 μs (0 allocations: 0 bytes)       # serial
  282.900 μs (59020 allocations: 1.06 MiB) # parallel, broken
  31.700 μs (31 allocations: 3.30 KiB)     # parallel
```

Or am I missing something?

There are workarounds for that specific code, yes, but they work somewhat by chance. In my case I can trigger the type instability by changing other parts of the code, so it does not solve the underlying problem.

In my specific case I check whether `list_threaded` is `nothing`, and if it is, I initialize it. This check (whether or not it is satisfied) “causes” the type instability, so the underlying problem is the same, but the “hack” that solves it is different (thus, likely something I cannot really rely on).
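
That pattern can be sketched like this (hypothetical code, not the actual one):

```julia
# `list_threaded` defaults to `nothing` and is initialized on first
# use, so the binding is assigned more than once; when it is also
# captured by the closure that @threads creates, it may be stored in
# a `Core.Box`, making accesses to it type-unstable.
function test!(list, list_threaded = nothing)
    if list_threaded === nothing
        list_threaded = [zero(list) for _ in 1:Threads.nthreads()]
    end
    Threads.@threads for it in eachindex(list_threaded)
        list_threaded[it] .= it
    end
    for lt in list_threaded
        list .+= lt
    end
    return list
end
```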


Just to confirm, this is the infamous “performance of captured variables in closures” issue again, right?