A question about parallel performance in multithreading

Hi all,

I am trying to understand why I get such poor performance from multithreading.

The context is a finite element program: I need to loop over all the elements of the model and evaluate the residual force vector and the tangent stiffness matrix of each element. This looks like a typical embarrassingly parallel problem, since the elements are independent of each other.

This is the serial implementation of such a loop:

getϕ_serial(elems, u, d) = map(elem->getϕ(elem, u[:,elem.nodes], d[elem.nodes]), elems)

where elems is the array with all the elements, u and d are the arrays with the degrees of freedom of the model, and getϕ is the function that evaluates the residual and the stiffness matrix of the element. This is the cost of a single function call (the costs of all elements are quite similar to each other):

@btime getϕ(elems[1], u[:,elems[1].nodes], d[elems[1].nodes]);
   665.930 μs (3955 allocations: 1.51 MiB)

(I had originally posted 10.059 μs (283 allocations: 12.89 KiB) here, but that figure was wrong; the one above is the correct one.)

I tried to implement a parallel loop using 32 cores on a single CPU of a cluster, starting Julia as julia -t 32. Below are a few different implementations of the parallel loop, each with the corresponding speed-up (relative to the 12.631 s of the serial version) for 15640 elements.

function getϕ_threads(elems, u, d)
  nElems = length(elems)
  Φ      = Vector(undef, nElems)
  Threads.@threads for ii=1:nElems 
      Φ[ii] = getϕ(elems[ii], u[:,elems[ii].nodes], d[elems[ii].nodes])
  end
  return Φ
end

julia> @btime Φ = getϕ_threads(elems, u, d);
  5.694 s (61793808 allocations: 23.11 GiB)

speed up is 12.631/5.694 = 2.21

function getϕ_sync(elems, u, d)
  nElems  = length(elems)
  nt      = Threads.nthreads()
  Φ       = Vector(undef, nElems)
  Threads.@sync for id=1:nt
    Threads.@spawn for ii=id:nt:nElems
      Φ[ii] = getϕ(elems[ii], u[:,elems[ii].nodes], d[elems[ii].nodes])
    end
  end
  return Φ
end

julia> @btime Φ = getϕ_sync(elems, u, d);
  5.562 s (61793849 allocations: 23.11 GiB)

speed up is 12.631/5.562 = 2.27

getϕ_ThreadsX(elems, u, d) = ThreadsX.map(elem->getϕ(elem, u[:,elem.nodes], d[elem.nodes]), elems)
julia> @btime Φ = getϕ_ThreadsX(elems, u, d);
  5.420 s (61875708 allocations: 23.43 GiB)

speed up is 12.631/5.420 = 2.33

Thus, even though I was using 32 cores, I couldn't get a speed-up larger than 2.33x on a problem that should apparently allow good acceleration, considering that the individual calculations are not trivial. Even chopping the whole set of elements into nthreads batches in getϕ_sync, to avoid conflicts in accessing the variables, didn't give good results.

Also, when monitoring with htop, all the CPUs on the node went to 100%, so they were all doing something.

Can anybody help me understand why this is happening and how to improve performance?

thanks in advance,
Andrea

It's hard to say without knowing getϕ, but here are two things that come to mind:

  • Try using views instead of slices, i.e. @view u[:,elems[ii].nodes] instead of u[:,elems[ii].nodes], which might prevent unnecessary copying of data.
  • 283 allocations inside a kernel function seems quite high. Maybe you're hitting some type instabilities there? (See the sketch below for checking both points.)

If your code is memory bound, the additional cores won't help much.
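
For concreteness, both points can be checked on a single element call. A minimal sketch, reusing elems, u, d and getϕ from the post above:

using BenchmarkTools, InteractiveUtils

el = elems[1]

# 1. views instead of slices: no copy of the columns of u or the entries of d
uv = @view u[:, el.nodes]
dv = @view d[el.nodes]
@btime getϕ($el, $uv, $dv);

# 2. look for type instabilities (Any / Union entries) in the kernel call
@code_warntype getϕ(el, u[:, el.nodes], d[el.nodes])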


There is quite a lot going on in getϕ, since the residual force vector and the tangent stiffness matrix are assembled while doing forward-mode automatic differentiation, but the types should be quite stable.

I tried to move the copies out of the part that is sent to the threads, and even to not collect the results at all, to avoid competition in accessing the data, as in the following:

function getϕu_threads_view(elems, u, d)
  nElems = length(elems)

  Φ = Vector(undef, nElems)
  @sync for ii=1:nElems
    uu    = deepcopy(u[:,elems[ii].nodes])
    dd    = deepcopy(d[elems[ii].nodes])
    elem  = deepcopy(elems[ii])

    Threads.@spawn getϕ(elem, uu, dd)   # result deliberately not collected into Φ, to avoid write contention
  end

  return Φ
end

but I didn't get significant improvements:

@btime Φ = getϕu_threads_view(elems, u, d);
  5.499 s (62904119 allocations: 23.30 GiB)

Should I really try to use multiple processes instead of multiple threads, given that the individual calculations are complex and memory demanding?

Why are you deepcopying here? Can't you just do

function getϕ_sync(elems, u, d)
  nElems  = length(elems)
  nt      = Threads.nthreads()
  Φ       = Vector(undef, nElems)
  Threads.@sync for id=1:nt
    @views Threads.@spawn for ii=id:nt:nElems
      Φ[ii] = getϕ(elems[ii], u[:,elems[ii].nodes], d[elems[ii].nodes])
    end
  end
  return Φ
end

The reason is that getϕ expects arrays of Float, and it was throwing an error when given views over arrays. Instead of changing the definition of getϕ, I just copied the chunks before sending the calculation to the thread; this way there would be no competition to access the data, but the gain in time was not significant.

I am quite new to parallel computing, but it seems that it really works when you parallelize simple operations, like sums of dot products between small vectors that do not need to allocate much memory, and this is not the case here.
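
A quick way to test that intuition is to run the same threaded loop once with an allocating kernel and once with a non-allocating one (a toy example, unrelated to getϕ):

using Base.Threads, BenchmarkTools

alloc_kernel(n)     = sum(abs2, rand(n))    # allocates a fresh vector on every call
noalloc_kernel(buf) = sum(abs2, buf)        # only reads a preallocated buffer

function run_demo(allocating::Bool, ntasks, n)
  out  = zeros(ntasks)
  bufs = [rand(n) for _ in 1:nthreads()]    # one read-only scratch buffer per thread
  @threads :static for i in 1:ntasks
      out[i] = allocating ? alloc_kernel(n) : noalloc_kernel(bufs[threadid()])
  end
  return sum(out)
end

# @btime run_demo(true,  20_000, 1024);   # every task allocates
# @btime run_demo(false, 20_000, 1024);   # no allocations inside the loop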

The problem with this deepcopy approach is that you are copying the data single-threaded, and those copies take roughly 3 seconds (half your total runtime). You should just remove the type assert in getϕ; it isn't making your code better.


Thanks for your answer. The reason getϕ expects an Array is that I make (heavy) use of multiple dispatch. I modified the declaration to accept Union{Array{T}, SubArray{T}} where T and it works, but I didn't get any sensible acceleration; the execution time was essentially the same with or without @views.

Can I ask how you estimated the 3 seconds for copying the function arguments?

I am more and more convinced that multithreading really yields acceleration only when the individual tasks do not allocate much memory, or any at all. But getϕ does a lot of allocation, since it uses forward-mode automatic differentiation: for every operation it has to make room for the gradient and the Hessian of the intermediate results, and free the memory of the arguments, since these are stored in tuples and cannot be overwritten.

I wonder if multiprocessing has the same issue.
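
To make the allocation point concrete, here is a toy example (not my getϕ, just a stand-in function) showing the per-call allocations of a ForwardDiff Hessian, and that reusing a preallocated DiffResults buffer removes part of them:

using ForwardDiff, DiffResults, BenchmarkTools

g(x) = sum(abs2, x) * exp(-sum(x))      # toy stand-in for an element energy
x = rand(12)

@btime ForwardDiff.hessian($g, $x);     # allocates the result and an internal config on every call

res = DiffResults.HessianResult(x)      # preallocated storage for value, gradient and Hessian
cfg = ForwardDiff.HessianConfig(g, res, x)
@btime ForwardDiff.hessian!($res, $g, $x, $cfg);   # reuses res and cfg, so fewer allocations per call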

The general solution here is to use ::AbstractArray{T} where T instead of the Union. I got the 3-second estimate by defining

function f(n)
    for i = 1:n
        x = rand(1024)   # allocate a fresh 1024-element Float64 vector (~8 KiB) per iteration
    end
end

and running @time f(1024*3560), which allocates roughly the same amount of memory.
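
For concreteness, the signature change would look roughly like this (the actual argument list of getϕ isn't posted, so the element type and body below are just placeholders):

# placeholder element type and body, only to show the relaxed signature
struct Element
  nodes::Vector{Int}
end

# accepts Array, SubArray (views) and any other array type with a single method
function getϕ(elem::Element, u::AbstractArray{T}, d::AbstractArray{T}) where {T<:Real}
  return sum(u) + sum(d)               # stand-in for the residual/stiffness computation
end

u = rand(3, 10); d = rand(10); el = Element([1, 4, 7])
getϕ(el, u[:, el.nodes], d[el.nodes])                  # plain arrays (copies)
getϕ(el, @view(u[:, el.nodes]), @view(d[el.nodes]))    # views, no copies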


Thanks, ::AbstractArray{T} is indeed preferable and more elegant. Thanks also for the feedback on the time estimate.

Looking at the number of allocations, I guess a large part of your time is spent in garbage collection, which is bad for multithreading in Julia. (Also, if that is the case, you should use @time instead of @btime, since @btime more or less reports the timing without garbage collection.) Using multiple processes helps because then the garbage collection is trivially parallelised. If, however, the problem is actually memory bound, then multiple processes won't help.
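
A multi-process version could look roughly like this (a sketch; it assumes getϕ and the element type are defined in some file of yours, here hypothetically FEModel.jl, that every worker can load):

using Distributed
addprocs(32)                        # or start Julia with julia -p 32
@everywhere include("FEModel.jl")   # hypothetical file defining getϕ and the element type

# pmap sends each element, together with the closure (which captures u and d),
# to a worker process, where allocation and GC happen independently of the others
getϕ_pmap(elems, u, d) =
  pmap(elem -> getϕ(elem, u[:, elem.nodes], d[elem.nodes]), elems)

# if u and d are large, a CachingPool avoids re-sending the closure on every call,
# e.g. pmap(f, CachingPool(workers()), elems)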


Thanks for the reply. I think you must be correct in pointing to the memory allocation. While trying to understand what is happening inside getϕ, I ran into something that I can't explain, and it seems the problem is indeed in memory allocation. I posted a separate question here.