Benchmark ParallelStencil & ImplicitGlobalGrid performance on cluster

I am trying to use `ParallelStencil` together with `ImplicitGlobalGrid` in order to run in parallel with MPI and multithreading. I have a working code solving the wave equation, and I would like to measure the performance of the code, check whether it scales, etc.

I know there are tools like HPCToolkit, but I haven’t seen any mention of the Julia language there.

Which tool would you suggest to benchmark my code and check performance?

PS: I am running on a cluster with Slurm


Hello @svretina. As usual I ask: can you give us a little background on the work you are doing with Julia?
Will the wall-clock time not be sufficient for the scaling part of your study?
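
For basic scaling measurements, per-rank wall-clock timing is often enough. A minimal sketch, assuming MPI.jl is set up (here `step!` is a hypothetical stand-in for your actual update kernel, and the snippet must be launched with `mpiexec`/`srun`):

```julia
using MPI
MPI.Init()
comm = MPI.COMM_WORLD

nt = 100                        # number of timesteps to time (illustrative)
MPI.Barrier(comm)               # synchronize before starting the clock
t0 = MPI.Wtime()
for it = 1:nt
    # step!(...)                # your RHS / update kernel goes here
end
t = MPI.Wtime() - t0
# Report the slowest rank; that is what limits scaling.
tmax = MPI.Reduce(t, MPI.MAX, 0, comm)
MPI.Comm_rank(comm) == 0 && println("time per step: ", tmax / nt, " s")
MPI.Finalize()
```

Running this for a fixed problem size per rank while increasing the rank count gives a weak-scaling curve from wall-clock time alone.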

I am currently implementing the wave equation in 3D on a cartesian grid.
My main project involves solving the wave equation but for every grid point I need to perform a 3D quadrature with say at least `11^3` points.

This scales as `5*Ng^3*Nq^3`, where `Ng` is the number of grid points per direction, `Nq` the number of quadrature points per direction, and 5 the number of variables in my state vector. I have implemented threading and I run my code on 80 threads (40 physical cores), but the code is very slow, so I want to further use MPI parallelization to speed it up. To get started, I went back to the plain wave equation.
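
To put that scaling in numbers, a back-of-the-envelope sketch (the grid size `Ng = 100` is an assumed, illustrative value; `Nq = 11` is the quadrature resolution mentioned above):

```julia
Ng    = 100   # grid points per direction (illustrative assumption)
Nq    = 11    # quadrature points per direction, as stated above
nvars = 5     # variables in the state vector

# Work per RHS evaluation: one Nq^3 quadrature per grid point per variable
work = nvars * Ng^3 * Nq^3
println(work)   # ≈ 6.7e9 quadrature evaluations per timestep
```

So even a modest grid implies billions of quadrature evaluations per timestep, which explains why threading alone is not enough.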

edit: my RHS function is non-allocating and, to the best of my knowledge, I have exhausted what can currently be sped up.

edit2: I use second-order finite differences for my derivatives, with some conditionals to choose the stencil accordingly, and the same for boundary conditions.

Hi @svretina. Solving the wave equation with 3D FDTD seems to be an ideal candidate for GPU acceleration (especially if you have already written your application using ParallelStencil and ImplicitGlobalGrid). Now, performance assessment may differ between CPU and GPU, and I am mostly familiar with GPU profiling (there you could use the Nsight suite from NVIDIA…).

3D FDTD wave solvers are almost certainly memory bound. In this case, I would first check that your implementation achieves good memcpy values (without performing any stencil operation or neighbouring-cell access), and then incrementally add the physics back. This would allow you to spot the major bottleneck(s). You could try to assess the effective memory throughput in a similar fashion as proposed here.
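
The effective-memory-throughput idea can be sketched in plain Julia (a minimal sketch; the memcopy kernel and array size are illustrative, not your actual solver):

```julia
# Effective memory throughput T_eff = A_eff / t_it, where A_eff counts
# only the data that must be read and written once per iteration.
n = 256
A = zeros(n, n, n)
B = rand(n, n, n)

function memcopy!(A, B)
    Threads.@threads for i in eachindex(A)
        @inbounds A[i] = B[i]
    end
    return nothing
end

memcopy!(A, B)                            # warm-up (triggers compilation)
t_it  = @elapsed memcopy!(A, B)           # time one "iteration"
A_eff = 2 * length(A) * sizeof(Float64)   # 1 read + 1 write per element
T_eff = A_eff / t_it / 1e9                # GB/s
println("T_eff ≈ ", round(T_eff, digits=1), " GB/s")
```

Comparing this number against the throughput of your full stencil update tells you how far the solver is from the memory-bandwidth ceiling.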

Note that using hyper-threads (80 threads on 40 physical cores) often results in suboptimal performance. Also, note that ParallelStencil’s `threads` backend is not as optimised as the GPU backends, and we currently don’t implement communication/computation overlap when combined with ImplicitGlobalGrid for CPU execution.

My main project involves a 3D quadrature for every grid cell. Would that be suitable for a GPU? At the moment my code has performance issues, and being memory bound is not the bottleneck yet.

With `ParallelStencil.jl`, even if I start 80 threads, I see the machine using only the odd-numbered threads, so ~40.

What do you mean by that? Is hybrid parallelization not possible for CPU?

The cluster has GPUs but to be honest, I am a bit afraid to go into the GPU realm.

Hi @svretina - seems there are possibly several challenges to be resolved.

For the performance issues, it would be interesting to nail them down. I am not familiar with “3D quadrature”. Taking this out, your FDTD wave equation is most likely memory bound. So one of the first optimisations would be to avoid relying on too many temporary fields and to make sure no allocations occur within your time/iteration loop.

On the `threads` issue: most likely hyper-threading is deactivated on the server.

Is hybrid parallelization not possible for CPU?

What do you mean by hybrid parallelisation? I am referring to the fact that, on the GPU, it is possible to overlap MPI communication with physics computation in order to “hide” the time spent updating the halo in the distributed-memory parallelisation.
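
For reference, ParallelStencil exposes this overlap through its `@hide_communication` macro on GPU backends; a sketch following the pattern from the ParallelStencil documentation (the field and kernel names here are illustrative, not from your code):

```julia
# Inside the time loop: the boundary region is computed first, then the
# halo update is overlapped with the computation of the inner points.
@hide_communication (16, 2, 2) begin
    @parallel step!(U2, U, V, dt, dx, dy, dz)
    update_halo!(U2)
end
```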

As suggested by @johnh , it may be easier to provide further help if you could share a MWE or point to your implementation workflow.

Here: GitHub - svretina/ScalarWave3D.jl you can find my implementation of the 3D wave equation. Note that I have not put in the 3D quadrature step inside my RHS function. This repo is my attempt to use `ParallelStencil.jl` and `ImplicitGlobalGrid.jl`.

My RHS function is non-allocating.

By 3D quadrature I mean that I have to calculate a 3D integral for every grid point. This is because I need to project a function onto some basis functions, and this needs to happen at each timestep for every grid point.

By hybrid parallelization I meant using both MPI and threading on the CPU. I didn’t understand what you meant by overlap, because I am not very experienced with HPC terminology.

If the 3D integral at each timestep is not a dealbreaker for the GPU, then I would be happy to explore this further.

Combining MPI and multi-threading works, but it is only useful if you are parallelizing across multiple machines. On a single machine you generally just want to use multi-threading.
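
On a Slurm cluster, such a hybrid MPI + threads setup might be launched roughly like this (a sketch only; node counts, time limit, and the script name `wave3d.jl` are placeholders for your site’s configuration):

```shell
#!/bin/bash
#SBATCH --nodes=4                # one MPI rank per node (placeholder values)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40       # one thread per physical core
#SBATCH --time=01:00:00

# Give each MPI rank one Julia thread per allocated core
export JULIA_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun julia --project wave3d.jl   # wave3d.jl is a placeholder script name
```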

hi @Oscar_Smith,
yes, the cluster I have access to has multiple nodes.