While playing with Base multithreading I noticed the following odd behaviour (MRE follows):
# Single-threaded code always results in exactly one allocation
julia> a = 0.0; @allocations a += π
1 # Reproducible over a lot of trials
# Multithreaded code results in seemingly random numbers of allocations for the same operation on thread-local memory
julia> lk = ReentrantLock(); Threads.@threads for i in 1:Threads.nthreads()
           a = 0.0; v = @allocations a += π
           @lock lk begin display(v) end
       end
1
1
0
1
1
0
0
19
0
0
1
1
0
1
0
1
Unfortunately, when scaled up to many operations on large matrices, this allocation overhead becomes so significant that it largely nullifies the benefit of multithreading in my main application. Am I doing something wrong, or is this expected? As suggested by other posts on this forum, I have made sure my code is type-stable, optimised away as many allocations as possible from the single-threaded code by pre-allocating, and even tried Polyester.jl and FLoops.jl, but I still get astronomical memory usage.
(I should specify that I have seen a lot of posts on the forum about threads allocating more memory, but none of them gave a definitive explanation, and neither do the docs.)
I am working in the global scope for the sake of providing the most minimal reproducible example possible. The problem still occurs with this function-based MRE:
function foo()
    a = 0.0
    return @allocations a += π
end

function threaded_foo()
    v = zeros(Int, Threads.nthreads())
    Threads.@threads :static for _ in 1:Threads.nthreads()
        a = 0.0
        v[Threads.threadid()] = @allocations a += π
    end
    return v
end
julia> threaded_foo()
16-element Vector{Int64}:
0
0
5
0
5
5
0
0
5
5
0
0
0
0
0
0
I think those are benchmarking artifacts. @threads does allocate, but not inside the loop, so the number of allocations does not depend on the number of iterations:
julia> function threaded_foo(n)
           v = zeros(Threads.nthreads())
           Threads.@threads :static for _ in 1:n
               a = 0.0
               v[Threads.threadid()] = a += π
           end
           return v
       end
threaded_foo (generic function with 2 methods)
julia> @btime threaded_foo(10);
3.410 μs (54 allocations: 5.44 KiB)
julia> @btime threaded_foo(100);
3.444 μs (54 allocations: 5.44 KiB)
Polyester.@batch does get rid of these allocations, but in terms of performance, the issue in your real example is very likely a different one:
julia> function threaded_foo(n)
           v = zeros(Threads.nthreads())
           Polyester.@batch for _ in 1:n
               a = 0.0
               v[Threads.threadid()] = a += π
           end
           return v
       end
threaded_foo (generic function with 2 methods)
julia> @btime threaded_foo(10);
483.844 ns (2 allocations: 144 bytes)
Thank you for the counter-examples, but the problem I’m noticing only concerns inner allocations, which I find do not necessarily add up to the number of outer allocations (which is what you’re @btime-ing). Outer allocations are, as you have shown, constant and independent of the number of threads or iterations.
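To make that distinction concrete, reusing threaded_foo from my MRE above (just a sketch):

inner = threaded_foo()               # per-thread counts from @allocations inside the loop body
outer = @allocations threaded_foo()  # one count for the whole call, including task setup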
What I’m observing in my main application is that if, say, 1 thread yields n allocations per loop, then 2 threads yield approximately 2n allocations per loop, whereas I expect the allocation count of a given loop to be constant and independent of the number of threads used. While debugging, I have tracked the issue down to the behaviour shown in my MRE.
Also, I’m having trouble seeing how benchmarking artifacts could occur in such a simple benchmarking setting (i.e., when comparing @allocations on a single line of code).
Thank you for the links. I don’t think my full code has these issues either, and obviously here is not the place to be posting it all, but I’m really lost regarding further debugging steps I could take.
I’ve spent a day closing in on the issue with successive @time and @allocations macros deeper and deeper in my code, until I could pinpoint the issue arising even in very simple arithmetic operations within the threads themselves (as in my MRE).
It makes no sense to me at all that parallelisation would make such simple arithmetic randomly take up to twice as long as the unparallelised case, especially when I went to ridiculous lengths to ensure no memory was shared between the threads.
The fact that none of the usual multithreading packages helps is also strange and seems to rule out a fundamental Threads.@threads flaw. I have also tried the @sync for ... Threads.@spawn pattern, in vain.
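For reference, by that pattern I mean roughly the following (a sketch using the same toy body as my MRE):

using Base.Threads

results = Vector{Int}(undef, nthreads())
@sync for i in 1:nthreads()
    Threads.@spawn begin
        a = 0.0                              # thread-local; nothing is shared
        results[i] = @allocations a += π
    end
end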
It seems @allocations has some problems. I modified your function slightly:
using Base.Threads

function foo()
    a = 0.0
    return @allocations a += π
end

function threaded_foo(n)
    v = Atomic{Int}(0)
    Threads.@threads :static for _ in 1:n
        a = 0.0
        atomic_add!(v, @allocations foo())
    end
    return v[]
end
Now, foo never allocates. But threaded_foo returns an integer between 20 and 160, depending somewhat on whether n is 10 or 1000000. Anyway, I think @allocations gets this slightly wrong when running in parallel. It might be that it counts the total allocations taking place while the argument executes, including those in other threads (allocations aren’t serialized). It might pick up a variable number of other threads’ setup allocations in its count.
Note, however, that any allocation in the loop body, like in v += w for vectors v and w, will be performed on every thread, and thus increase with the number of threads. If your serial loop is not allocation-free, the number of allocations will be multiplied by the number of threads.
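A quick sketch of that effect (hypothetical names, not from your code): the body’s w + w allocates a fresh vector on every iteration, on whichever thread runs it, so the total allocation count scales with the work the threads do rather than with the fixed task-setup cost of @threads:

using Base.Threads

function allocating_body(n)
    w = ones(1000)
    Threads.@threads for _ in 1:n
        v = w + w            # allocates one new 1000-element vector per iteration
        v[1] > 0 || error()  # use v so the addition is not dead code
    end
end

Benchmarking allocating_body(10) against allocating_body(1000) from the outside should show the allocation count growing with n.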
Are you implying that macros such as @time and @allocations report memory allocations for ALL active threads currently executing the line they’re called on? Even when called within threads?
When I look inside the @allocations macro, there’s a call to Base.gc_num() before the argument is executed, then a call to Base.gc_num() afterwards, and a difference is taken. I don’t know exactly what gc_num does, but it might well fetch a global state, yes.
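For reference, the measurement logic looks roughly like this (a simplified paraphrase, not the exact Base source, which varies between Julia versions):

macro allocations_sketch(ex)
    quote
        local stats = Base.gc_num()                      # snapshot of the allocation counters
        $(esc(ex))
        local diff = Base.GC_Diff(Base.gc_num(), stats)  # counter delta over the expression
        Base.gc_alloc_count(diff)                        # malloc + realloc + poolalloc + bigalloc
    end
end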
I feel your pain; this might be a hard issue to track down. I would try to isolate the threaded part of the code into a function, use @ballocated to inspect the allocations of the whole function call, and then progressively simplify the code inside the function to narrow down the possibilities.
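Something along these lines (a sketch; my_kernel! is a hypothetical stand-in for your real threaded function):

using BenchmarkTools

my_kernel!(out, x) = (out .= 2 .* x; out)   # placeholder for the real threaded code

out = zeros(1000); x = rand(1000)
@ballocated my_kernel!($out, $x)            # bytes allocated by the whole call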
Isn’t this code referring to GC usage? As of Julia 1.10, one can choose the number of threads used by the GC. I’m not literate enough in C to find out, but it feels like this is calculating GC usage over all GC threads, not allocations across all regular threads.
If this is the case, it really feels like something that should be mentioned in the docs. Currently, the @time docstring says:
A macro to execute an expression, printing the time it took to execute, the number of allocations, and the total number of bytes its execution caused to be allocated, before returning the value of the expression
This really sounds like it reports the time spent on that precise expression (not on that expression across all threads).
Anyway, in general, running threaded does not change the allocation patterns of the code, so unwanted allocations can be hunted down in the serial code. Also, it’s often not the actual allocations that slow down multithreaded code; it’s the garbage collections (of the “stop the world” type) that are often devastating. You can see them by switching on logging (with GC.enable_logging(true)). However, some usage patterns can be problematic, e.g. some threads trying to allocate while others are CPU-bound.
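For example (on recent Julia versions):

GC.enable_logging(true)     # each collection now logs its pause time
# ... run the threaded workload here ...
GC.enable_logging(false)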
Thanks for the tips. I have already shaved as many allocations as possible off the serial code. I know it’s not a GC issue either, because this serial/threaded discrepancy happens even at reduced problem sizes that do not involve any garbage collection.
I guess it is a benchmarking issue one way or another, as the total allocation patterns are consistent between the serial and threaded cases (again, only the nested allocation counts differ).
Also, one last thing to ponder about the Base benchmarking macros potentially reading a global state: if this is true, why would my original MRE report a value of 19 (which is more than 16, the actual number of threads)? If we were aggregating 0 or 1 allocation across several threads doing one iteration each, one would expect values ranging from 0 to 16.
It’s important to note that BLAS is already multithreaded by default, so if you’re spending significant time in BLAS calls then additional parallelization is fighting BLAS for thread time. When performing parallel calculations where BLAS calls may be involved, it’s often worth using LinearAlgebra.BLAS.set_num_threads(1) so that you only parallelize at the top level.
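A sketch of that setup (hypothetical sizes and workload):

using LinearAlgebra, Base.Threads

BLAS.set_num_threads(1)   # keep BLAS single-threaded; let Julia's threads own the cores

A = [rand(200, 200) for _ in 1:nthreads()]
B = [rand(200, 200) for _ in 1:nthreads()]
C = [zeros(200, 200) for _ in 1:nthreads()]

@threads for i in 1:nthreads()
    mul!(C[i], A[i], B[i])   # each task performs its own single-threaded BLAS multiply
end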