Choosing between KernelAbstractions, AcceleratedKernels, ParallelStencil, or just CUDA.jl

Hello,

Until today, I have used mainly CUDA.jl to run simulations. I mostly do hydrodynamics and solve various PDEs.

I would like to make my code available to other users in my lab, to be able to run simulations on CPU or Metal GPUs as well, and to compare performance between GPUs/CPUs and with ParallelStencil.jl, which claims to be faster than my naive CUDA.jl implementation.

The problem is that I am a bit lost in the choice of packages. AcceleratedKernels.jl claims to be faster than a naive use of KernelAbstractions.jl, but I don’t know if it is still maintained today.

ParallelStencil.jl looks nice for my finite-difference simulations but is not made for spectral methods. Also, with KernelAbstractions.jl, I can precompile the kernels as I do in CUDA.jl:

# CUDA.jl
f! = @cuda launch=false f_kernel(A, B)
f!(A, B; threads=threads, blocks=blocks)
# KernelAbstractions.jl
f! = f_kernel!(backend, block_size)
f!(A, B; ndrange=size(A))
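
For reference, a minimal self-contained KernelAbstractions.jl version of this pattern looks like the following (the kernel `f_kernel!` and all sizes are illustrative; a CPU backend is used here so it runs anywhere):

```julia
using KernelAbstractions

# A simple element-wise kernel; @index(Global, Linear) gives the global
# work-item index within the launched ndrange.
@kernel function f_kernel!(A, @Const(B))
    i = @index(Global, Linear)
    A[i] = 2 * B[i]
end

backend = CPU()                       # or CUDABackend(), MetalBackend(), ...
A = zeros(Float32, 1024)
B = ones(Float32, 1024)

f! = f_kernel!(backend, 256)          # instantiate with a workgroup size
f!(A, B; ndrange = length(A))         # launch over the whole array
KernelAbstractions.synchronize(backend)
```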

I don’t know whether it is possible to do that using ParallelStencil.jl?

What would be the best choice for me? I would prefer to use the same package for all my simulations (based on pseudo-spectral methods or finite differences).

I am mainly solving Navier-Stokes equations using the Lattice Boltzmann Method, Poisson equations using FFTW or iterative schemes, and advection-diffusion equations. Nothing very complicated, so I suppose a package like WaterLily.jl, Trixi.jl, or FiniteVolumeMethod.jl would also make sense? But here again, the number of choices is huge.

Thanks in advance for your help!

Best,


AcceleratedKernels.jl is a package providing several common algorithms that uses KernelAbstractions.jl to target the GPU backends, and it is still maintained (it is used by AMDGPU.jl, for example).

Similarly, ParallelStencil.jl and Chmy.jl both use KernelAbstractions.jl underneath, but provide many more domain-specific tools on top of it.
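
To give a feel of those domain-specific tools, here is a minimal ParallelStencil.jl sketch of a 2D diffusion step, in the style of the package's miniapp examples (names and parameters here are illustrative; the Threads backend is used so it runs on CPU):

```julia
using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(Threads, Float64, 2)   # backend, number type, ndims

# One explicit diffusion step over inner grid points; @inn, @d2_xi and
# @d2_yi are ParallelStencil's finite-difference macros.
@parallel function diffusion_step!(T2, T, lam, dt, dx, dy)
    @inn(T2) = @inn(T) + dt * lam * (@d2_xi(T) / dx^2 + @d2_yi(T) / dy^2)
    return
end

nx, ny = 64, 64
T  = @rand(nx, ny)
T2 = copy(T)
@parallel diffusion_step!(T2, T, 1.0, 0.1, 1.0, 1.0)
```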

WaterLily.jl and Trixi.jl sit another abstraction step higher.

Trixi has a Lattice-Boltzmann example here: Trixi.jl/examples/tree_3d_dgsem/elixir_lbm_taylor_green_vortex.jl at 5178770a1fa5e9e7021b4a77e78d788b565b33cd · trixi-framework/Trixi.jl · GitHub


Thank you for your reply.
Unfortunately, AcceleratedKernels.jl does not seem suitable for 2D/3D finite differences, as foreachindex() gives only linear indices, and ParallelStencil.jl does not really help with pseudo-spectral methods.
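
(For what it's worth, linear indices can be mapped back to Cartesian ones in plain Julia, which is one possible workaround; a CPU-only sketch:)

```julia
# Recover (i, j) from a linear index k using a precomputed map.
A  = zeros(Int, 4, 5)
ci = CartesianIndices(A)           # linear → Cartesian index map
for k in eachindex(A)              # linear indices, as foreachindex gives
    i, j = Tuple(ci[k])
    A[i, j] = i + 10j
end
```
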
I’ll use KernelAbstractions.jl.
Thank you!

Hi @Ludovic_Dumoulin
You might not want to use a single package for writing your entire code. The Julia language and its ecosystem are very modular, and you can use different packages for different parts of your code. For example, ParallelStencil.jl combines naturally with standard high-level Julia programming, e.g. for using FFT libraries, but can of course also be used together with other packages such as those you mentioned. Besides, JustRelax.jl, which is built on top of ParallelStencil.jl and ImplicitGlobalGrid.jl and provides pseudo-transient accelerated iterative solvers, might also be of interest to you.

Now, regarding specific points brought up concerning ParallelStencil.jl:

  • ParallelStencil.jl emits hardware-backend-specific CUDA.jl / AMDGPU.jl / Metal.jl code directly, except when KernelAbstractions.jl is used as a backend, in which case it emits KernelAbstractions.jl code. The KernelAbstractions.jl backend is primarily of interest for switching the target hardware within an interactive REPL session, i.e. for prototyping where switching frequently between different hardware is useful. However, the other backends are preferable for production code and package development (thanks to Julia's package-extension feature, a package can easily support multiple backends; see e.g. this minimal example: ParallelStencil.jl/test/test_projects/Diffusion at main · omlins/ParallelStencil.jl · GitHub, and check out JustRelax.jl for a real-world example), as they offer features that are not available in the KernelAbstractions.jl backend, such as support for hiding communication (implemented with streams, which KernelAbstractions.jl does not expose), or the convenience Data/TData modules containing useful types (e.g., Data.Array, Data.Number) which are defined at parse time depending on the backend chosen.

  • ParallelStencil.jl passes on to the backend all keyword arguments that it does not handle itself. So the launch=false keyword argument is also passed to the backend (e.g. to CUDA.jl), and you can use it to compile a kernel without launching it. This can be useful for profiling/optimization purposes, including the use of backend-specific tools such as the rich set of tools provided by CUDA.jl. Note, however, that a ParallelStencil kernel always needs to be launched with the @parallel macro. That said, the compilation is cached, so there is in general no need to compile your kernels explicitly before launching them: the first launch triggers the compilation, and all subsequent launches reuse the cached version of the kernel.
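
A minimal sketch of this pattern (hypothetical kernel f!; assuming the launch=false keyword is forwarded to CUDA.jl as described):

```julia
# Compile the kernel without launching it (launch=false is forwarded
# to the CUDA.jl backend):
@parallel launch=false f!(A, B)

# Subsequent launches reuse the cached compiled kernel:
@parallel f!(A, B)
```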

Cheers,
Sam


Thank you for your detailed answer.

I agree, but here is the issue: in my laboratory, I am the one trying to introduce Julia and high-performance computing. I am converting my colleagues, but it takes time. Currently, we only have NVIDIA GPUs, so my code is written using CUDA.jl.
It was very difficult for me to convince my colleagues to stop ‘coding badly’ in C and try Julia and CUDA.jl instead. I will be leaving the lab soon and would like to free them from the NVIDIA constraint (we bought H100s a few months ago, but we still have no idea when they will arrive). Considering how much inertia my colleagues have when it comes to changing their coding habits, I absolutely need one package that can do everything! XD

I benchmarked ParallelStencil.jl and KernelAbstractions.jl on my Mac (CPU and GPU) and on an A30. On GPU, ParallelStencil.jl is about twice as slow as KernelAbstractions.jl.
Results:
on M4 Max (40-core GPU, 64 GB), FP32: GPU / 12-core CPU
ParallelStencil.jl → 61.8 ms / 130 ms
KernelAbstractions.jl → 29.2 ms / 188 ms
AcceleratedKernels.jl + KernelAbstractions.jl → 32 ms / 189 ms
on NVIDIA A30 (FP64): 100 itr / 1000 itr
ParallelStencil.jl → 30 ms / 298 ms
KernelAbstractions.jl → 15 ms / 153 ms
CUDA.jl → 16 ms / 157 ms
AcceleratedKernels.jl + CUDA.jl → 48.5 ms / 479 ms

On my MacBook’s GPU:
Using ParallelStencil.jl leads to an effective GPU utilization of ~60% but causes a memory usage explosion. A large portion of the simulation time is then spent clearing the allocated memory.
In contrast, using KernelAbstractions.jl results in a steady GPU utilization of ~92%. While a significant part of the simulation still involves clearing memory, the memory fills up much more slowly.

Are you sure? The behavior of ParallelStencil.jl during my simulations reminds me of when I was using CUDA.jl before I knew about launch=false and kernel compilation.

Thank you for your help !
Best


This is not normal; something is off here. For high-performance computations you should pre-allocate all memory. Also, use the effective memory throughput metric in order to give meaning to observed runtimes: GitHub - omlins/ParallelStencil.jl: Package for writing high-level code for parallel high-performance stencil computations that can be deployed on both GPUs and CPUs · GitHub


Yes, indeed. I was relying on max_error = maximum(abs.(v1 .- v2)); this line was causing the memory usage to explode. While ParallelStencil.jl performs the same memory allocations as KernelAbstractions.jl and CUDA.jl, it seems to trigger the garbage collector more frequently, which leads to longer simulation times.

I replaced this line with max_error = mapreduce((x, y) -> abs(x - y), max, v1, v2) to avoid creating temporary arrays. ParallelStencil.jl is still slower than KernelAbstractions.jl and CUDA.jl on the A30 GPU (36 s vs 22 s vs 25 s). Same thing on the M4 Max GPU (65 s vs 59 s).
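
The difference between the two formulations can be seen in plain Julia (CPU arrays here for illustration):

```julia
v1 = rand(Float64, 1_000)
v2 = rand(Float64, 1_000)

# Allocating version: abs.(v1 .- v2) materializes a temporary array
# before maximum() reduces it.
max_error_alloc = maximum(abs.(v1 .- v2))

# Fused version: mapreduce applies abs(x - y) element-wise and reduces
# with max on the fly, without any temporary array.
max_error = mapreduce((x, y) -> abs(x - y), max, v1, v2)
```
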

Good to hear that you could make some progress. The remaining difference in runtime won’t be due to the time it takes to execute the kernel itself, but to things happening around it.

You can use GC.enable(false) to disable the GC before the loop to see if that makes still a difference.
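
A sketch of that experiment (the loop body here is just a stand-in for the real kernel launches):

```julia
function hot_loop!(x, nt)
    GC.enable(false)            # suspend garbage collection during the loop
    try
        for _ in 1:nt
            x .= x .+ 1.0       # stand-in for the real time step
        end
    finally
        GC.enable(true)         # always re-enable the GC afterwards
    end
    return x
end

x = zeros(10)
hot_loop!(x, 100)
```
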

It seems that the GC isn’t the problem here; I tried GC.enable(false) and saw no real impact on total time.

ParallelStencil.jl (with GC enable)

34.349171 seconds (7.98 M CPU allocations: 352.583 MiB, 0.30% gc time) (2.00 k GPU allocations: 445.312 KiB, 0.02% memmgmt time)

KernelAbstractions.jl (with synchronize(backend) that I forgot before)

23.058713 seconds (28.92 M CPU allocations: 787.055 MiB, 0.57% gc time) (2.00 k GPU allocations: 445.312 KiB, 0.04% memmgmt time)

CUDA.jl (with CUDA.@sync as before)

20.797576 seconds (23.82 M CPU allocations: 533.271 MiB, 0.42% gc time) (2.00 k GPU allocations: 445.312 KiB, 0.04% memmgmt time)

I really don’t get what is happening with ParallelStencil.jl: fewer CPU allocations, but slower…

The issue with ParallelStencil.jl was the default number of blocks. I only know a little about GPU computing, so I didn’t realize it could matter quite so much.
By setting threads=(16,16) and blocks=(Nx,Ny) .÷ threads I get

22.434988 seconds (13.81 M CPU allocations: 587.483 MiB, 2.22% gc time) (2.00 k GPU allocations: 445.312 KiB, 0.03% memmgmt time)

And it is even faster with threads=(32,8).
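
One detail worth noting with this kind of manual configuration: integer division undercovers the grid when the array size is not a multiple of the thread count, whereas cld (ceiling division) always covers the whole array (plain-Julia sketch):

```julia
Nx, Ny  = 1000, 1000                     # not a multiple of 32 in x
threads = (32, 8)

blocks_floor = (Nx, Ny) .÷ threads       # (31, 125): misses the last columns in x
blocks_ceil  = cld.((Nx, Ny), threads)   # (32, 125): covers the whole array
```
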

Thank you for your help,
Best regards


This thread-block configuration should also have been computed by default. I cannot reproduce a case where it doesn’t do so. Would you mind opening an issue on ParallelStencil.jl with an MWE, to allow me to figure out why it didn’t in your case? Or else share an MWE or your code with me in a direct message? Thanks!

Sorry to hijack this thread. I’ve just been wondering how JACC.jl fits into this and how it differs from KernelAbstractions.jl.


Hello, thank you, here is the link to the issue. It is the first time I have opened an issue on GitHub.


With a naive implementation using JACC.jl, I get:

 83.703915 seconds (76.31 M CPU allocations: 2.016 GiB, 0.30% gc time) (2.00 k GPU allocations: 445.312 KiB, 0.02% memmgmt time)

…whereas it only takes ~20s with CUDA.jl.


Thank you all for the help! I’ve managed to close the performance gap.

It turns out there were two distinct issues:

  1. Thread-Block Configuration: In this specific case, I am using two parallel indices, but one of my arrays has three dimensions. Because of this mismatch, the package could not automatically guess the correct thread-block configuration. Once I manually defined the ranges (or the thread-block configuration directly), the performance improved significantly.
  2. Warm-up: The remaining difference was due to the lack of a ‘warm-up’ period. In my CUDA code, I was using @cuda launch=false. By running a short warm-up of about 10 time steps before starting the timer, I no longer see any significant difference between CUDA.jl and ParallelStencil.jl.
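
The warm-up pattern in point 2 can be sketched like this (step! is a stand-in for one simulation step; on GPU one would also synchronize before and inside the timing):

```julia
step!(state) = (state .+= 1.0)   # stand-in for one simulation time step

state = zeros(100)
for _ in 1:10                    # untimed warm-up: triggers compilation
    step!(state)
end
t = @elapsed for _ in 1:1_000    # timed run reuses the compiled code
    step!(state)
end
```
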

Both versions now take approximately 17s to complete the simulation, compared to the 20–22s I was seeing originally. It’s great to see that ParallelStencil.jl can match CUDA.jl performance once the configuration is tuned and the kernels are cached!

Since ParallelStencil.jl is as fast as CUDA.jl and KernelAbstractions.jl but much easier to use, I’m going to go with it!


One thing to double-check is the kernel times reported by vendor profilers, e.g., ncu, not the Julia timers. Do you have a link to your benchmark code? Happy to take a look and identify gaps. The JACC.jl time sounds like it is downloading the backend. The JACC.jl SC-W’24 paper shows near-zero overheads over backends on common kernels, especially if your kernel is memory-bound and/or tuned.

JACC.jl is more high-level (arrays, parallel_for/parallel_reduce) and doesn’t require users to code fine-grained GPU kernels, annotate functions, or set backends in code, keeping it 100% portable. It also adds multi-GPU support. JACC.jl also provides knobs for blocks and threads as optional APIs. It’s basically complementary to existing solutions.
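
For a feel of the model, here is a sketch in the style of the axpy example from the JACC.jl README (the exact API and backend initialization may vary between versions; treat this as illustrative only):

```julia
import JACC

# The kernel is a plain function whose first argument is the parallel index.
function axpy(i, alpha, x, y)
    @inbounds x[i] += alpha * y[i]
    return nothing
end

N = 1_000
x = JACC.Array(ones(Float64, N))   # JACC.Array maps to the chosen backend's array type
y = JACC.Array(ones(Float64, N))
JACC.parallel_for(N, axpy, 2.0, x, y)
```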