Why are allocations occurring (or at least being reported by @allocated) in this example, seemingly at random?
Note that I am measuring the allocations of the innermost operation, which I would expect to be computationally independent of the threading (or isn’t it?).
julia> function test(x)
           nt = Threads.nthreads()
           s = zeros(eltype(x), nt)
           @sync for it in 1:nt
               Threads.@spawn for i in it:nt:length(x)
                   a = @allocated begin
                       s[it] += x[i]
                   end; a > 0 && @show a
               end
           end
           return sum(s)
       end
test (generic function with 1 method)
julia> x = rand(5);
julia> test(x)
a = 16
3.630101473711777
julia> test(x)
a = 496
a = 64
3.630101473711777
julia> test(x)
3.630101473711777
julia> test(x)
a = 368
a = 32
3.630101473711777
Measuring the same thing with @ballocated does not report allocations, but I don’t know if that is because I am helping the specialization of something by introducing a new scope:
julia> using BenchmarkTools  # for @ballocated

julia> function test(x)
           nt = Threads.nthreads()
           s = zeros(eltype(x), nt)
           @sync for it in 1:nt
               Threads.@spawn for i in it:nt:length(x)
                   a = @ballocated let i = $i, it = $it, s = $s, x = $x
                       s[it] += x[i]
                   end; a > 0 && @show a
               end
           end
           return sum(s)
       end
test (generic function with 1 method)
julia> test(x)
3.811788778147674e7
julia> test(x)
3.811788778147674e7
julia> test(x)
3.811788778147674e7
julia> test(x)
3.811788778147674e7
julia> test(x)
3.811788778147674e7
julia> test(x)
3.811788778147674e7
No workaround and no solution from me, only small improvements (which might not apply, because I believe you are @spawn-ing intentionally). This was a good opportunity for me to check out Polyester with atomics. Here is the benchmark:
using Random
using BenchmarkTools
using Polyester
function test1(x)
    nt = Threads.nthreads()
    s = zeros(eltype(x), nt)
    @sync for it in 1:nt
        Threads.@spawn for i in it:nt:length(x)
            s[it] += x[i]
        end
    end
    return sum(s)
end
function test2(x)
    nt = Threads.nthreads()
    s = zeros(eltype(x), nt)
    Threads.@threads for it in 1:nt
        for i in it:nt:length(x)
            s[it] += x[i]
        end
    end
    return sum(s)
end
mutable struct Atomic{T}; @atomic x::T; end
function test3(x)
    nt = Threads.nthreads()
    s = Atomic{Float64}(0)
    Threads.@threads for it in 1:nt
        @atomic s.x += sum(@view x[it:nt:length(x)])
    end
    s.x
end
function test4(x, nb)
    s = Atomic{Float64}(0)
    @batch for it in 1:nb
        @atomic s.x += sum(@view x[it:nb:length(x)])
    end
    s.x
end
Random.seed!(42)
x=rand(1000000)
res1 = test1(x)
res2 = test2(x)
@assert res2 ≈ res1
res3 = test3(x)
@assert res3 ≈ res1
res4 = test4(x, 10)
@assert res4 ≈ res1
@btime test1(x) setup=(Random.seed!(42); x=rand(1000000))
@btime test2(x) setup=(Random.seed!(42); x=rand(1000000))
@btime test3(x) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 1) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 10) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 100) setup=(Random.seed!(42); x=rand(1000000))
@btime test4(x, 1000) setup=(Random.seed!(42); x=rand(1000000))
yielding
654.600 μs (43 allocations: 3.81 KiB) # @spawn
646.700 μs (31 allocations: 3.44 KiB) # @threads
438.400 μs (31 allocations: 3.36 KiB) # @threads + atomics
397.900 μs (1 allocation: 16 bytes) # @batch number of batches = 1
217.400 μs (2 allocations: 64 bytes) # @batch number of batches = 10
208.200 μs (2 allocations: 64 bytes) # @batch number of batches = 100
208.300 μs (2 allocations: 64 bytes) # @batch number of batches = 1000
In my specific case Polyester does not improve the performance (though it does remove most of the allocations).
But what intrigues me the most is not that threading allocates, but that allocations are detected as if they were occurring in the operations that are local to each thread, even though those should all be allocation-free (and they are, without threading or when using Polyester).
That makes it quite hard to track the problem, or related problems.
In a recent thread I had a pretty big kernel which was sped up by Polyester.
I’m not sure profiling information is always 100% exact (for example, I don’t even know if and how macros have to manage LineNumberNodes), and I believe I have seen some oddities when profiling. I never bothered enough to investigate (especially as I prefer an outdated profiler).
In my case the threading time is irrelevant relative to the actual time performing the computations.
I am somewhat worried about the allocations there, though, because I have a use case where I would be calling the function that performs those operations thousands of times, and if each call allocates ~30 MB (which is what I see), that will start to be a problem.
Specifically, here I think I can move the threading to a higher level, and since the non-threaded loop is non-allocating, that solves this specific use case. But it is not what I expect to do in general.
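For illustration, a rough sketch (with made-up names, not my actual code) of what moving the threading to a higher level can look like: the inner kernel stays serial and allocation-free, and threading happens only over independent work items at the call site.

function serial_sum(x)                      # allocation-free inner kernel
    s = zero(eltype(x))
    for xi in x
        s += xi
    end
    return s
end

function process_all(inputs)                # `inputs`: a vector of arrays (hypothetical)
    results = zeros(length(inputs))         # a single allocation at the outer level
    Threads.@threads for k in eachindex(inputs)
        results[k] = serial_sum(inputs[k])  # each task runs the serial kernel
    end
    return results
end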
I tried your example code a couple of times, but with 4 threads I never see any allocations on Julia 1.7.1. Are you using a different version and/or a widely varying number of threads?
Actually, for the moment, in any minimal example I try I get 85 or 86 allocations (71 kB), which would be fine. The build-up of allocations I’m seeing in my actual example is not easy to reproduce. I have to sort out what is causing that.
So, I think I have an MWE (and it is a strange type instability):
Copy and paste the code below, and you will see that running
run()
will result in a type instability, or not, depending on these lines of the test function:
reduce!(list, list_threaded) # type stable
#list = reduce!(list, list_threaded) # type unstable
The problem is inferring the type of list. Because it is mutable, and reduce! also returns list, both calls above should be equivalent (and they are, except when using @threads).
Thus, removing the reassignment of list on the return of reduce! solves the problem. Except that the reason I use that pattern is that it lets the same function operate on mutable or immutable list variables. If I remove the reassignment, I cannot use the same interface to operate on immutables. In any case, the fact that this is affected by the use of @threads in some other part of the code seems like an issue.
The example:
function reduce!(list, list_threaded)
    list .= list_threaded[1]
    for i in 2:length(list_threaded)
        list .+= list_threaded[i]
    end
    return list
end
function test(list, list_threaded)
    #for it in 1:Threads.nthreads() # type stable
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    #reduce!(list, list_threaded) # type stable
    list = reduce!(list, list_threaded) # type unstable
    return list
end
function run()
    list = zeros(10^4)
    list_threaded = [ copy(list) for _ in 1:Threads.nthreads() ]
    @code_warntype test(list, list_threaded)
end
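As an aside on the interface point above, a hypothetical extra method (not part of the example) shows why returning list lets the same call work for immutable accumulators too:

# Hypothetical non-mutating method for an immutable accumulator (e.g. a
# number or an SVector): it builds and returns a new value, mirroring the
# vector method, instead of mutating in place.
function reduce!(list::Number, list_threaded)
    list = list_threaded[1]
    for i in 2:length(list_threaded)
        list += list_threaded[i]
    end
    return list
end

# Either way, the caller writes the same thing:
# list = reduce!(list, list_threaded)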
FWIW, @floop detects this and warns about it. You can also set FLoops.assistant(:error) (the default ATM is :warn) in a test suite to throw an error. See also How to avoid Box · FLoops for a case-by-case how-to guide on working around this.
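For reference, a minimal sketch using the @reduce syntax from the FLoops documentation (illustrative only, names made up): with @floop the per-task accumulation is handled by the macro, so there is no reassigned binding for a closure to box.

using FLoops

FLoops.assistant(:error)        # turn the captured-variable warning into an error

function sum_floop(x)           # assumes x is non-empty
    @floop for xi in x
        @reduce(s += xi)        # FLoops manages the per-task partial sums
    end
    return s
end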
Polyester tries to pass everything as a function argument instead of through a closure, which is how it reduces allocations in code like this.
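For illustration, a rough sketch of that idea applied to the earlier example (not how Polyester actually does it; names are made up): the hot loop lives in a function that receives everything it needs as arguments, so it runs behind a function barrier with concretely typed arguments instead of inside a closure over local bindings.

function partial_sum!(s, x, it, nt)     # concretely typed arguments
    for i in it:nt:length(x)
        s[it] += x[i]
    end
    return nothing
end

function test_args(x)
    nt = Threads.nthreads()
    s = zeros(eltype(x), nt)
    @sync for it in 1:nt
        Threads.@spawn partial_sum!(s, x, it, nt)  # the thunk only forwards arguments
    end
    return sum(s)
end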
Might be worth considering a similar approach in @floop?
Sure, that is a better use of time long term.
But it has been a problem for years, so unless someone is actively working on it, I wouldn’t expect this to change.
Personally, I’d much prefer semantics that didn’t allow mutation of bindings. If there were a single thing I could change about Julia the language (not Julia the implementation), it would be this.
That would eliminate the performance footgun, and also eliminate this source of confusing/unexpected behavior. Allowing bindings to be reassigned only has downsides as far as I’m concerned.
I think the hurdle is rather coming up with an argument that can convince the OG devs that the complexities are worth the trouble (I’m working on this point). I think/hope the implementation is actually relatively straightforward.
I agree that @threads seems to affect (break?) type inference, which I’m unsure how to assess (somewhat surprising, but not completely unexpected?). It seems to me we can fix this with:
using BenchmarkTools
function reduce!(list, list_threaded)
    list .= list_threaded[1]
    for i in 2:length(list_threaded)
        list .+= list_threaded[i]
    end
    return list
end
function test_serial(list, list_threaded)
    for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    list = reduce!(list, list_threaded)
    return list
end
function test_parallel_broken(list, list_threaded)
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    list = reduce!(list, list_threaded)
    return list
end
function test_parallel(list, list_threaded)
    Threads.@threads for it in 1:Threads.nthreads()
        for i in it:Threads.nthreads():length(list)
            list_threaded[it][i] += rand()
        end
    end
    result = reduce!(list, list_threaded)
    return result
end
list = zeros(10^4)
list_threaded = [ copy(list) for _ in 1:Threads.nthreads() ]
@btime test_serial(list, list_threaded)
@btime test_parallel_broken(list, list_threaded)
@btime test_parallel(list, list_threaded)
println()
There are these workarounds for that specific code, yes, but they work somewhat by chance. In my case I can trigger the type instability by changing other parts of the code, so they do not solve the underlying problem.
In my specific case I check whether list_threaded is nothing, and if it is, I initialize it. This check (whether or not it is satisfied) “causes” the type instability, so the underlying problem is the same, but the “hack” to solve it is different (and thus likely something I cannot really rely on).
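For completeness, a rough sketch (made-up names) of the workaround from the How to avoid Box guide applied to that pattern: do the nothing check first, then rebind the variable in a let block so that the closure created by @threads only captures a binding that is never reassigned.

function test_init(list, list_threaded)
    # the reassignment happens here, before any closure is created
    if isnothing(list_threaded)
        list_threaded = [copy(list) for _ in 1:Threads.nthreads()]
    end
    let list_threaded = list_threaded   # fresh binding, never reassigned below
        Threads.@threads for it in 1:Threads.nthreads()
            for i in it:Threads.nthreads():length(list)
                list_threaded[it][i] += rand()
            end
        end
    end
    return reduce!(list, list_threaded)
end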