With ParallelStencil, is it possible to launch multiple kernels and sync later?

I have written code to run on GPUs with ParallelStencil, but I don't see a significant speedup from moving to the GPU. I am not sure whether I am using the framework as intended and whether there is an easy fix for this. Any help to get more performance would be greatly appreciated.

To illustrate my problem, here is a minimal example. The code launches two big kernels (UpdateA! and UpdateB!) which are supposed to do most of the computation, and a number of small kernels which each do only a fraction of the actual computation. The code ends up spending a lot more time on the smaller kernels.


const USE_GPU = true
# using BenchmarkTools
using ParallelStencil
using ParallelStencil.FiniteDifferences2D

@static if USE_GPU
    @init_parallel_stencil(CUDA, Float64, 2);
else
    @init_parallel_stencil(Threads, Float64, 2);
end

function main2D()
    # Numerics
    nx, ny   = 1024, 512;                               # Number of gridpoints in dimensions x and y
    nt       = 10000;                                          # Number of time steps
    c0       = 10.0

    # Array initializations
    A   = @zeros(nx, ny);
    B   = @zeros(nx, ny);
    # A2  = @zeros(nx, ny);
    C   = @rand(nx, ny);

    # Initial conditions
    A  .= 1.5;
 
    # Time loop
    dt    = 1/nt;
    t_tic = 0.0;                                            # Wall time of the measured steps
    for it = 1:nt
        if (it == 11)
            GC.enable(false)
            t_tic = time()                                  # Start measuring time after 10 warm-up steps
        end

        @parallel UpdateA!(A, B, C)

        # The for loop is just a stand-in to launch many small kernels;
        # in the actual code these are different small kernels, not called in a loop.
        for i in 1:10:100
            @parallel (i:i+10, 1:ny) ASubset!(A)
        end

        @parallel UpdateB!(A, B, C)

        # The for loop is just a stand-in to launch many small kernels;
        # in the actual code these are different small kernels, not called in a loop.
        for i in 1:10:300
            @parallel (1:nx, i:i+10) BSubset!(B)
        end
    end
    time_s = time() - t_tic
    GC.enable(true)
    println("Time for time steps 11 to $nt: $time_s s")
    return
end


@parallel_indices (ix, iy) function UpdateA!(A, B, C)
    A[ix, iy] = A[ix, iy] + C[ix, iy] * B[ix, iy]
    return
end

@parallel_indices (ix, iy) function UpdateB!(A, B, C)
    B[ix, iy] = B[ix, iy] + C[ix, iy] * A[ix, iy]
    return
end

@parallel_indices (ix, iy) function ASubset!(A)
    # Reduce the value by 10%
    A[ix, iy] = 0.9*A[ix, iy]
    return
end


@parallel_indices (ix, iy) function BSubset!(B)
    # Increase the value by 10%
    B[ix, iy] = 1.1*B[ix, iy]
    return
end

The smaller kernels are not run in a for loop in the actual code; there, the same kernel is called repeatedly but with different sets of arguments.

My guess is that it takes a lot more time to synchronize a kernel than to run it. But the smaller kernels (ASubset! and BSubset!) do not need to be synchronized individually. Is it possible to launch them without syncing?

This is a screenshot from the profiler which shows the problem.

Alternatively, I might be completely wrong about what kills the performance and would be happy to know what is wrong and how to get more performance.

If I got your question right, then this is the answer:

julia> using ParallelStencil

help?> @parallel_async
  @parallel_async kernelcall 
  @parallel_async ∇=... kernelcall

  │ Advanced
  │
  │  @parallel_async ranges kernelcall
  │  @parallel_async nblocks nthreads kernelcall
  │  @parallel_async ranges nblocks nthreads kernelcall
  │  @parallel_async (...) configcall=... backendkwargs... kernelcall
  │  @parallel_async ∇=... ad_mode=... ad_annotations=... (...) backendkwargs... kernelcall

  Declare the kernelcall parallel as with @parallel (see @parallel for more
  information); deactivates however automatic synchronization at the end of the call.
  Use @synchronize for synchronizing.

  │ Performance note
  │
  │  @parallel_async falls currently back to running synchronously if the
  │  package Threads or Polyester was selected with @init_parallel_stencil.

  See also: @synchronize, @parallel

julia> 

Then you can synchronize them later with @synchronize (which can also synchronize a single stream).

Thanks for the answer. It partly addresses my question but doesn’t unlock much performance.

I replaced the calls to the second small kernel (BSubset!) with async calls as suggested.

        for i in 1:10:300
            @parallel_async (1:nx, i:i+10) BSubset!(B)
        end
        @synchronize

As seen from the profiler output, it does speed things up a little, since the kernels are launched with less delay between subsequent launches. However, the kernels still run serially rather than all at once.

The smaller kernels in total represent about 30% of the computation of the big kernel. Yet the big kernel takes about 9 microseconds, while the smaller kernels together take about 80 microseconds. Is there a way to address this?

To run them all at once, you need to run them on different streams. You can pass the keyword argument stream = ParallelStencil.ParallelKernel.@get_stream(i) to @parallel_async, where i is a stream index starting at 1. Then you can synchronize each of these streams with @synchronize ParallelStencil.ParallelKernel.@get_stream(i).
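Applied to the loop from the question, that pattern would look roughly as follows (a sketch only; the 30 streams simply mirror the 30 bands of the MWE, and stream_k is just a local variable holding the stream handle):

        # Launch each band on its own stream, then synchronize the streams one by one.
        for (k, i) in enumerate(1:10:300)
            stream_k = ParallelStencil.ParallelKernel.@get_stream(k)
            @parallel_async (1:nx, i:i+10) stream = stream_k BSubset!(B)
        end
        for k in 1:30
            @synchronize ParallelStencil.ParallelKernel.@get_stream(k)
        end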

If these small kernels can also overlap with the large kernels, and you have also communication to hide then this can all automatically be done with the @hide_communication macro (see ?@hide_communication). I guess one could add a macro to automatically overlap kernels in cases like yours (besides the one to hide communication and overlap boundary condition computations with inner point computations). However, it could typically be a better approach to create heavier kernels, computing also for example multiple batches within one kernel.
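For completeness, the @hide_communication pattern looks roughly like this. This is only a sketch: it assumes a distributed run where ImplicitGlobalGrid provides update_halo! (neither appears in the original example), and the boundary width (16, 8) is an arbitrary illustrative choice:

using ImplicitGlobalGrid   # assumed for update_halo!; not part of the original MWE

# The boundary regions are computed first, so that the halo exchange can run
# while the inner points are still being computed.
@hide_communication (16, 8) begin
    @parallel UpdateB!(A, B, C)
    update_halo!(B)
end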

This seems to be the solution I am looking for. I tried launching the small kernels on different streams as suggested, e.g. @parallel_async (1:nx, i:i+10) stream=stream BSubset!(B). They do show up on different streams in the profiler, but they start one after the other, making them effectively serial: only a single stream is running at any given time.

Why does this happen, and are there any suggestions on how to avoid it?

Check out the following: CUDA streams do not overlap

… and note that you can also use ParallelStencil.ParallelKernel.@get_priority_stream(i).
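Following the same pattern as the stream example above, that would just mean swapping the stream macro (again only a sketch):

        stream_k = ParallelStencil.ParallelKernel.@get_priority_stream(k)
        @parallel_async (1:nx, i:i+10) stream = stream_k BSubset!(B)
        # ... and later:
        @synchronize ParallelStencil.ParallelKernel.@get_priority_stream(k)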

However, you might rather want to create one or a few larger kernels instead of all these small kernels…
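As an illustration of that idea applied to the MWE: the 30 small launches of BSubset! can be folded into a single kernel by mapping the second index onto a band number and a row within that band. This is only a sketch; BSubsetAll! and its extra arguments are hypothetical and not part of the original code:

@parallel_indices (ix, iy) function BSubsetAll!(B, bandwidth, stride)
    ib  = (iy - 1) ÷ bandwidth                     # band number (0-based)
    row = ib * stride + (iy - 1) % bandwidth + 1   # global row index in B
    B[ix, row] = 1.1 * B[ix, row]
    return
end

# One launch instead of 30: bands of 11 rows starting every 10 rows, as in the MWE.
nbands, bandwidth, stride = 30, 11, 10
@parallel (1:nx, 1:nbands*bandwidth) BSubsetAll!(B, bandwidth, stride)

In the real code, where the same kernel is called with different sets of arguments, the per-call parameters could similarly be passed in (for example as a tuple) and selected inside the kernel via the batch index.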

Yes, it seems that writing one bigger aggregated kernel is the better choice. I wanted to avoid that, as the code will become more complex and less readable.

Thanks for referring to the previous issue. I will try to run the code with priority streams to see if I can make my kernels overlap.

If using CUDA.jl > 5.4, it could be that some implicit synchronization is occurring when accessing the same underlying GPU memory from different streams, as discussed in Ability to opt out of / improved automatic synchronization between tasks for shared array usage · Issue #2617 · JuliaGPU/CUDA.jl · GitHub. If this is the case, then one solution may be Support disabling implicit synchronization by vchuravy · Pull Request #2662 · JuliaGPU/CUDA.jl · GitHub.