I am trying to understand why I get such poor performance from multithreading.
The context is a finite element program: I need to loop over all the elements of the model and evaluate the residual force vector and the tangent stiffness matrix of each element. This looks like a typical embarrassingly parallel problem, since the elements are independent of each other.
The following is the serial implementation of such a loop:

```julia
getϕ_serial(elems, u, d) = map(elem -> getϕ(elem, u[:,elem.nodes], d[elem.nodes]), elems)
```
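For reproducibility, here is a self-contained toy version of this setup; the `Elem` struct and the body of `getϕ` below are placeholders that only mimic the shape of the real computation, not my actual element code:

```julia
using LinearAlgebra

# Placeholder element type: nodes holds the global node indices of the element.
struct Elem
    nodes::Vector{Int}
end

# Dummy stand-in for the real residual/stiffness evaluation: dense work
# on small per-element arrays, returning a (residual, stiffness) pair.
function getϕ(elem::Elem, ue::AbstractMatrix, de::AbstractVector)
    K = ue' * ue + Diagonal(de)   # fake "stiffness", size n×n
    r = K * de                    # fake "residual", length n
    return (r, K)
end

nNodes, nElems = 1_000, 200
elems = [Elem(rand(1:nNodes, 8)) for _ in 1:nElems]
u = rand(3, nNodes)   # dofs stored column-per-node
d = rand(nNodes)

Φ = map(elem -> getϕ(elem, u[:,elem.nodes], d[elem.nodes]), elems)
```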
where `elems` is the array with all the elements, `u` and `d` are the arrays with the degrees of freedom of the model, and `getϕ` is the function that evaluates the residual and the stiffness matrix of one element. This is the cost of a single call (the cost is quite similar across elements):

```julia
julia> @btime getϕ(elems[1], u[:,elems[1].nodes], d[elems[1].nodes]);
  665.930 μs (3955 allocations: 1.51 MiB)
```
I tried to implement a parallel loop using 32 cores on a single CPU of a cluster, starting Julia as `julia -t 32`. Below are a few different implementations of the parallel loop, each with its timing and the corresponding speed-up.
```julia
function getϕ_threads(elems, u, d)
    nElems = length(elems)
    Φ = Vector(undef, nElems)
    Threads.@threads for ii = 1:nElems
        Φ[ii] = getϕ(elems[ii], u[:,elems[ii].nodes], d[elems[ii].nodes])
    end
    return Φ
end

julia> @btime Φ = getϕ_threads(elems, u, d);
  5.694 s (61793808 allocations: 23.11 GiB)
```

Speed-up: 12.631/5.694 = 2.21×.
```julia
function getϕ_sync(elems, u, d)
    nElems = length(elems)
    nt = Threads.nthreads()
    Φ = Vector(undef, nElems)
    Threads.@sync for id = 1:nt
        Threads.@spawn for ii = id:nt:nElems
            Φ[ii] = getϕ(elems[ii], u[:,elems[ii].nodes], d[elems[ii].nodes])
        end
    end
    return Φ
end

julia> @btime Φ = getϕ_sync(elems, u, d);
  5.562 s (61793849 allocations: 23.11 GiB)
```

Speed-up: 12.631/5.562 = 2.27×.
```julia
getϕ_ThreadsX(elems, u, d) = ThreadsX.map(elem -> getϕ(elem, u[:,elem.nodes], d[elem.nodes]), elems)

julia> @btime Φ = getϕ_ThreadsX(elems, u, d);
  5.420 s (61875708 allocations: 23.43 GiB)
```

Speed-up: 12.631/5.420 = 2.33×.
Thus, even though I was using 32 cores, I couldn't get a speed-up larger than 2.33×, on a problem that should apparently parallelize well, considering that the individual computations are not trivial. Even chopping the whole workload into `nthreads` batches in `getϕ_sync`, to avoid conflicts in accessing the variables, didn't give good results.
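One thing I notice in the numbers above is the huge allocation count (~62 million allocations, 23 GiB). Part of it comes from the slices `u[:,elems[ii].nodes]` and `d[elems[ii].nodes]`, which allocate a fresh copy on every call. A quick self-contained check of copy vs. `@view` allocation per slice (the array sizes here are made up for illustration):

```julia
u = rand(3, 10_000)            # stand-in for the dof array
nodes = rand(1:10_000, 1_000)  # stand-in for one element's node list

slice_copy(u, nodes) = u[:, nodes]        # indexing with a vector copies the data
slice_view(u, nodes) = @view u[:, nodes]  # view: no copy, only a small wrapper

slice_copy(u, nodes); slice_view(u, nodes)  # warm up / compile first
a_copy = @allocated slice_copy(u, nodes)
a_view = @allocated slice_view(u, nodes)
println((a_copy, a_view))
```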
Also, when monitored with `htop`, all the CPUs on the node went to 100%, so they were all doing something.
Can anybody help me understand why this is happening and how to improve performance?
Thanks in advance.