Parallel Computing with Threads.@threads in HPC is slow?

Van_Sy_Tnut · April 6, 2024, 12:32am

Hello everyone!

I am wondering why parallel computing with “Threads.@threads” is faster on my own PC than on an HPC cluster.
The maximum “Threads.nthreads()” I can set on my PC is only 16, while on the HPC cluster I can increase this number to 128 or 256.
Does anyone know the reason? And how can we take advantage of HPC to run Julia faster by using “Threads.@threads” in a FOR loop?
Thank you so much for your help.
Best regards,

PetrKryslUCSD · April 6, 2024, 12:55am

My guess is you are doing some assembling in the finite element method.
I’m interested in that, so if you wish, drop me a line. We can try to sort it out.

Ahmed_Salih · April 6, 2024, 2:42am

Hello!

Welcome to the forum

Perhaps you could post a more detailed example, following the guide in “Please read: make it easier to help you”?

Looking at your code, I am guessing at a few possible reasons. Your problem may be so small that it does not gain the full advantage of 256 threads because of communication overhead. You may not be reusing memory, so each iteration allocates a lot and slows the code down. And the way you index into FE.edofMat[e,:] may be inefficient; you should use a @view.
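
A minimal sketch of the @view suggestion, assuming a small made-up `edofMat` and displacement vector `u` (both hypothetical stand-ins, not the original poster's data):

```julia
# Hypothetical stand-in for FE.edofMat: one row of DOF indices per element.
edofMat = [1 2 3 4; 3 4 5 6; 5 6 7 8]
u = collect(1.0:8.0)   # hypothetical global displacement vector

# edofMat[e, :] and u[...] each allocate a fresh vector on every call ...
sum_copy(e) = sum(u[edofMat[e, :]])

# ... while @views turns both indexing operations into non-copying views.
sum_view(e) = sum(@views u[edofMat[e, :]])

results = zeros(size(edofMat, 1))
Threads.@threads for e in axes(edofMat, 1)
    results[e] = sum_view(e)   # each iteration writes only to its own slot
end
println(results)
```

Both functions return the same values; the difference is only in how much memory each call allocates, which can be checked with `@allocated sum_copy(1)` versus `@allocated sum_view(1)` after a warm-up call.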

mkitti · April 6, 2024, 3:36am

We’ll need more details, but I find it likely the answer has to do with Non-Uniform Memory Access:

HPC processors like Xeons are built to perform different tasks from consumer processors. Consumer processors often have lower memory latency, whereas HPC processors have to coordinate memory caches among multiple processor units.

Van_Sy_Tnut · April 6, 2024, 1:13pm

Dear Professor Petr Krysl!
Thank you so much for your comment!
The FEM assembly is done in other functions.
Right now, I am doing postprocessing to calculate the maximum von Mises stress over the entire model. To accelerate this, I run Julia on HPC and use “Threads.@threads” in a FOR loop to compute the stresses of multiple elements simultaneously. However, it is still very time-consuming even when I use a few hundred cores on HPC.
Because my FEM model is complicated and has many sub-functions, I cannot publish all of my code here. I am sorry for this.
Best regards,
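
The postprocessing described in this post (a threaded loop over elements that reduces to one maximum von Mises stress) can be sketched roughly as follows. This is only an illustration under stated assumptions: `element_stress` is a hypothetical placeholder for the real stress recovery, and the chunked `Threads.@spawn` reduction is one possible pattern, not the poster's actual code.

```julia
# Von Mises stress from the six independent stress components
# (order assumed here: s11, s22, s33, s23, s13, s12).
function von_mises(s)
    s11, s22, s33, s23, s13, s12 = s
    return sqrt(0.5 * ((s11 - s22)^2 + (s22 - s33)^2 + (s33 - s11)^2) +
                3.0 * (s23^2 + s13^2 + s12^2))
end

# Hypothetical placeholder: uniaxial stress that grows with the element number.
element_stress(e) = (100.0 * e, 0.0, 0.0, 0.0, 0.0, 0.0)

# Assumes nelem >= 1. Each task scans one contiguous chunk of elements and
# keeps a local maximum, so no two tasks ever write to the same variable.
function max_von_mises(nelem; ntasks = Threads.nthreads())
    chunks = Iterators.partition(1:nelem, cld(nelem, ntasks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            local_max = -Inf
            for e in chunk
                local_max = max(local_max, von_mises(element_stress(e)))
            end
            local_max
        end
    end
    return maximum(fetch, tasks)   # combine the per-chunk maxima
end

println(max_von_mises(10))
```

Keeping one local maximum per task avoids both locks and a shared result array, and the per-element work allocates nothing because the stresses are returned as tuples.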

Van_Sy_Tnut · April 6, 2024, 1:17pm

Dear Ahmed Salih,
Thanks a lot for the recommendation.
My problem is huge.
Maybe the way I code is not optimal, because I just directly transferred my code from Matlab to Julia.
Thank you so much.
Best regards.

Van_Sy_Tnut · April 6, 2024, 1:23pm

Dear mkitti,
Thanks for letting me know a new thing about HPC.
So, if this is the issue, how can we overcome it to make parallel computing in Julia FASTER on HPC?
Best regards,

Van_Sy_Tnut · April 6, 2024, 1:46pm

FYI, here is my script to run Julia on HPC:

#!/bin/bash
#SBATCH --job-name="HPC"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=126
#SBATCH --partition=general
#SBATCH --output="HPC.out"

module load julia
export JULIA_NUM_THREADS=126
julia My_Code.jl

PetrKryslUCSD · April 6, 2024, 2:39pm

Problem is you haven’t told us very much. When you say the simulation is faster on your desktop, do you literally mean that the wall clock time is less? But then there are the multiple threads, so what does it mean faster or slower?

Let us begin with the PC: how fast are the processors? How does the computation scale with the number of threads (what is the parallel efficiency)?

Then we can look at the HPC platform: how fast are the processors? How many threads did you use? How did the computation scale?
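
For reference, the scaling questions above reduce to two numbers: speedup, the one-thread time divided by the p-thread time, and parallel efficiency, the speedup divided by p. A minimal sketch, where the function names and the timings are illustrative only and are not measurements from this thread:

```julia
# Speedup and parallel efficiency from measured wall-clock times.
# t1: time with 1 thread, tp: time with p threads, same problem size.
speedup(t1, tp) = t1 / tp
efficiency(t1, tp, p) = speedup(t1, tp) / p

# Illustrative numbers only (seconds), not data from this thread.
t1, tp, p = 120.0, 20.0, 8
println("speedup = ", speedup(t1, tp), ", efficiency = ", efficiency(t1, tp, p))
```

The timings themselves can be collected with `@elapsed` around the loop, starting Julia with a different `-t N` each time.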

mikkoku · April 6, 2024, 3:10pm

The script requests 126 tasks (processes) from Slurm, each with one CPU. You should use --cpus-per-task=126 (with a single task) to request 126 CPUs for one multithreaded process.
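
One way to check what the job actually received is to print the thread count from inside Julia. This is a small sketch; it assumes the job runs under Slurm, which sets the `SLURM_CPUS_PER_TASK` environment variable only when the `--cpus-per-task` option is given:

```julia
# Put this at the top of the Julia file run by the job script.
nthreads = Threads.nthreads()

# Falls back to "unset" when the variable is missing (e.g. outside Slurm).
requested = get(ENV, "SLURM_CPUS_PER_TASK", "unset")

println("Julia threads: ", nthreads, ", SLURM_CPUS_PER_TASK: ", requested)
```

If the two numbers disagree, the threads are probably sharing fewer cores than intended.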

Van_Sy_Tnut · April 7, 2024, 9:29pm

Thank you so much Professor Petr Krysl!
Concretely, I am running an optimization to find optimal designs.
If I use my PC with 16 CPUs (Intel(R) Core™ i7-7820X CPU @ 3.60GHz (16 CPUs), ~3.6GHz), each iteration takes only around 2 minutes.
However, on HPC, even if I use 64, 128, or 256 CPUs, each iteration takes around 25-30 minutes.
I think my issue could come from data transfer between CPUs on HPC.
Thank you so much for your consideration.
I am still fixing this issue. If I find the solution, I will update this post.
Best regards,

Van_Sy_Tnut · April 7, 2024, 9:32pm

Thanks a lot, mikkoku!
However, it still does not work.
Thank you so much!
Best regards,

Van_Sy_Tnut · July 21, 2024, 9:24pm

Hello everyone,
Thank you so much for your consideration.
For this issue, I found a solution.
It is already available in the Julia documentation: Multi-processing and Distributed Computing · The Julia Language
For example, to run a FOR loop in parallel, I no longer use Threads.@threads. Instead, I now use the Distributed package with 50 worker processes on HPC. Here is a Julia example:

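# ==== The file name is MyHPC.jl
using Distributed
@everywhere using SharedArrays   # load SharedArrays on every worker process
n = 1000
A = SharedArray{Float64}(n)      # shared memory, visible to all workers on this node
# Split 1:n across the workers; @sync waits until every chunk has finished.
@sync @distributed for i = 1:n
    A[i] = 100
end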
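
When only a single number is needed from the loop, such as a maximum, `@distributed` also accepts a reduction function, which avoids the shared array. A hedged sketch, where `abs(sin(i))` is a made-up stand-in for the real per-element computation:

```julia
using Distributed

n = 1000
# Each worker reduces its share of 1:n with max; the partial results are then
# combined with max on the calling process. With a reducer, @distributed
# waits for the result, so no @sync is needed.
peak = @distributed (max) for i in 1:n
    abs(sin(i))   # stand-in for the per-element quantity
end
println("peak = ", peak)
```

This also runs without `-p`, in which case the loop simply executes on the main process.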

And here is my slurm job’s submission:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=55
#SBATCH --partition=general
module load julia
julia -p 50 MyHPC.jl

Best regards,
