Parallel Computing with Threads.@threads in HPC is slow?

Hello everyone!

  • I am wondering why parallel computing with “Threads.@threads” is faster on my own PC than on the HPC cluster.
  • The maximum “Threads.nthreads()” I can set on my PC is only 16, while on the HPC cluster I can increase this number to 128 or 256.
  • Does anyone know the reason? And how can we take advantage of HPC to run Julia faster using “Threads.@threads” on a for loop?
    Thank you so much for your help.
    Best regards,
    image

My guess is you are doing some assembling in the finite element method.
I’m interested in that, so if you wish, drop me a line. We can try to sort it out.

Hello!

Welcome to the forum :blush:

Perhaps you could post a more detailed example, following the guide in "Please read: make it easier to help you"?

Looking at your code, I can guess at a few possible reasons. Your problem may be so small that it does not gain the full advantage of 256 threads because of communication overhead. You may not be reusing memory, so each iteration allocates a lot and slows the code down. And the way you index into FE.edofMat[e,:] may be inefficient: that expression copies the row, so you should use a @view.
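To illustrate the @view suggestion, here is a minimal sketch. The arrays `edofMat` and `U` and both function names are hypothetical stand-ins, not the original poster's code. Slicing with `edofMat[e, :]` allocates a fresh vector on every iteration, while `@view` only wraps the existing memory.

```julia
# Hypothetical stand-ins for the element DOF table and the solution vector.
nel, ndof_per_el, ndof = 1_000, 8, 500
edofMat = rand(1:ndof, nel, ndof_per_el)
U = rand(ndof)

# Slicing copies: two temporary vectors are allocated per element.
function sum_with_copies(edofMat, U)
    s = 0.0
    for e in axes(edofMat, 1)
        edofs = edofMat[e, :]        # allocates a new Vector{Int}
        s += sum(U[edofs])           # allocates a new Vector{Float64}
    end
    return s
end

# Views reference the existing arrays, so the loop body does not copy.
function sum_with_views(edofMat, U)
    s = 0.0
    for e in axes(edofMat, 1)
        edofs = @view edofMat[e, :]  # no copy of the row
        s += sum(@view U[edofs])     # no copy of the gathered values
    end
    return s
end

println(sum_with_copies(edofMat, U) ≈ sum_with_views(edofMat, U))
```

Comparing the two with `@time` or `@allocated` should show the difference in allocations. Because Julia arrays are column-major, storing the table as `ndof_per_el × nel` and taking column views may help further, but that is a guess without seeing the real code.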

We’ll need more details, but I find it likely that the answer has to do with Non-Uniform Memory Access (NUMA).

HPC processors like Xeons are built for different tasks than consumer processors. Consumer processors often have lower memory latency, whereas HPC processors have to coordinate memory caches among multiple processor units.

Dear Professor Petr Krysl,
Thank you so much for your comment!
The FEM assembly is done in other functions.
Right now, I am doing postprocessing to compute the maximum von Mises stress over the entire model. To accelerate this, I run Julia on the HPC cluster and use “Threads.@threads” on the for loop to compute the stresses of multiple elements simultaneously. However, it is still very time-consuming, even when I use a few hundred cores.
Because my FEM model is complicated and has many sub-functions, I cannot publish all of my code here. I am sorry for this.
Best regards,
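
For readers following along, a threaded maximum over elements might be sketched as below. The function `element_stress` is a hypothetical placeholder for the real per-element von Mises computation, which was not posted. Each chunk of elements writes its own partial maximum, so no two threads update the same variable.

```julia
using Base.Threads

# Hypothetical placeholder for the real per-element stress evaluation.
element_stress(e) = 100.0 * abs(sin(0.001 * e))

function max_stress_threaded(nel)
    nel == 0 && return -Inf
    # Split the element range into roughly one chunk per thread.
    chunks = collect(Iterators.partition(1:nel, cld(nel, nthreads())))
    partial = fill(-Inf, length(chunks))   # one slot per chunk, no sharing
    @threads for c in eachindex(chunks)
        m = -Inf
        for e in chunks[c]
            m = max(m, element_stress(e))
        end
        partial[c] = m
    end
    return maximum(partial)                # serial reduction of a few values
end

println(max_stress_threaded(10_000))
```

Whether this scales on a given machine still depends on how much work and how many allocations each element evaluation involves.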

Dear Ahmed Salih,
Thanks a lot for the recommendation.
My problem is huge.
Maybe the way I code is not optimal, because I ported my code directly from MATLAB to Julia.
Thank you so much.
Best regards,

Dear mkitti,
Thanks for teaching me something new about HPC.
If this is the issue, how can we overcome it to make parallel computing in Julia faster on the HPC cluster?
Best regards,

FYI, here is the Slurm script I use to run Julia on the HPC cluster:

#!/bin/bash
#SBATCH --job-name="HPC"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=126
#SBATCH --partition=general
#SBATCH --output="HPC.out"

module load julia
export JULIA_NUM_THREADS=126
julia My_Code.jl

The problem is that you haven’t told us very much. When you say the simulation is faster on your desktop, do you literally mean that the wall-clock time is shorter? And since different numbers of threads are involved, what exactly does faster or slower mean here?

Let us begin with the PC: how fast are the processors? How does the computation scale with the number of threads (what is the parallel efficiency)?

Then we can look at the HPC platform: how fast are the processors? How many threads did you use? How did the computation scale?

The script requests 126 processes (tasks) from Slurm, each with one CPU. You should use --cpus-per-task=126 instead of --ntasks-per-node=126 to request 126 CPUs for a single multithreaded process.
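
A quick way to check what the job actually received is to print the thread settings from inside Julia. This is only a diagnostic sketch, and `thread_report` is a hypothetical helper name. To my understanding, Slurm sets `SLURM_CPUS_PER_TASK` only when `--cpus-per-task` is given, and `Sys.CPU_THREADS` reports the logical CPUs of the node rather than the CPUs allocated to the job.

```julia
using Base.Threads

# Collect the thread-related settings visible to this Julia process.
function thread_report()
    return (
        nthreads            = nthreads(),          # threads Julia started with
        node_cpu_threads    = Sys.CPU_THREADS,     # logical CPUs on the node
        julia_num_threads   = get(ENV, "JULIA_NUM_THREADS", "unset"),
        slurm_cpus_per_task = get(ENV, "SLURM_CPUS_PER_TASK", "unset"),
    )
end

println(thread_report())
```

If `nthreads` is 126 but the job was only granted one CPU per task, all threads may end up competing for the same core, which would be consistent with the slowdown described in this thread.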

Thank you so much, Professor Petr Krysl!
Concretely, I am running an optimization to find optimal designs.
On my PC with 16 CPUs (Intel(R) Core™ i7-7820X CPU @ 3.60GHz (16 CPUs), ~3.6GHz), each iteration takes only around 2 minutes.
However, on the HPC cluster, each iteration takes around 25-30 minutes, even when I use 64, 128, or 256 CPUs.
I think my issue could come from data transfer between CPUs on the HPC cluster.
Thank you so much for your consideration.
I am still working on this issue. If I find a solution, I will update this post.
Best regards,

Thanks a lot, mikkoku!
However, it still does not work.
Thank you so much!
Best regards,

Hello everyone,
Thank you so much for your consideration.
I found a solution to this issue.
It is already described in the Julia documentation, in the chapter "Multi-processing and Distributed Computing".
To run a for loop in parallel, I no longer use Threads.@threads. Instead, I now use Julia's Distributed package with 50 worker processes on the HPC cluster. Here is a Julia example:

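# ==== The file name is MyHPC.jl
using Distributed
@everywhere using SharedArrays   # load SharedArrays on every worker
n = 1000
A = SharedArray{Float64}(n)      # array visible to all workers on this node
# Split the loop across the workers and wait until all of them finish
@sync @distributed for i in 1:n
    A[i] = 100
end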

And here is my Slurm job submission script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=55
#SBATCH --partition=general
module load julia
julia -p 50 MyHPC.jl

Best regards,
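
As a variation on the example above, `@distributed` also accepts a reduction operator, which fits a "maximum over all elements" computation without a shared array. This is a sketch under assumptions: `element_stress` is a hypothetical placeholder for the real stress evaluation, and the `addprocs(2)` line is only there so the snippet also runs when Julia was started without `-p`.

```julia
using Distributed

# Start two local workers only if none were requested on the command line.
nprocs() == 1 && addprocs(2)

# Hypothetical placeholder; it must be defined on every process.
@everywhere element_stress(e) = 100.0 * abs(sin(0.001 * e))

n = 10_000
# Each worker reduces its share with max; the partial results are merged.
max_stress = @distributed (max) for e in 1:n
    element_stress(e)
end

println(max_stress)
```

With a reduction operator the macro waits for the result, so no `@sync` is needed in this form.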
