With ParallelStencil, is it possible to launch multiple kernels and sync later?

To run the kernels concurrently, you need to launch them on different streams. You can pass the keyword argument stream = ParallelStencil.ParallelKernel.@get_stream(i) to @parallel_async, where i is a stream index starting at 1. Afterwards, you can synchronize the streams by calling @synchronize ParallelStencil.ParallelKernel.@get_stream(i) for each stream index i that you used.
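As a rough, untested sketch of the pattern described above (it assumes the CUDA backend and a CUDA-capable GPU; the kernel scale!, the array sizes and the number of kernels are made up for illustration and are not from the question):

```julia
using CUDA
using ParallelStencil
@init_parallel_stencil(CUDA, Float64, 2)

# Hypothetical small kernel: scales every entry of A by s.
@parallel_indices (ix, iy) function scale!(A, s)
    if ix <= size(A, 1) && iy <= size(A, 2)
        A[ix, iy] = s * A[ix, iy]
    end
    return
end

# Four independent small arrays, allocated on the device.
arrays = [@ones(64, 64) for _ in 1:4]

# Launch one kernel per stream without blocking; stream indices start at 1.
for (i, A) in enumerate(arrays)
    s = ParallelStencil.ParallelKernel.@get_stream(i)
    @parallel_async stream=s scale!(A, 2.0)
end

# Later: wait for each stream that was used.
for i in eachindex(arrays)
    @synchronize ParallelStencil.ParallelKernel.@get_stream(i)
end
```

Whether the kernels actually overlap on the device depends on how much of the GPU each one occupies, so it is worth checking with a profiler.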

If these small kernels can also overlap with the large kernels, and you also have communication to hide, then all of this can be done automatically with the @hide_communication macro (see ?@hide_communication). I guess one could add a macro that automatically overlaps kernels in cases like yours (in addition to the existing one, which hides communication and overlaps boundary condition computations with inner point computations). However, it would typically be a better approach to create heavier kernels, for example kernels that also compute multiple batches within a single kernel.
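To illustrate the "heavier kernel" idea, here is a minimal, untested sketch that folds the batch index into the kernel's index space, so that a single launch processes all batches. It assumes the CUDA backend and that the batches have the same shape and can be stored in one 3-D array; scale_batched! and the sizes are hypothetical:

```julia
using CUDA
using ParallelStencil
@init_parallel_stencil(CUDA, Float64, 3)

# Hypothetical batched kernel: the third index ib selects the batch.
@parallel_indices (ix, iy, ib) function scale_batched!(A, s)
    if ix <= size(A, 1) && iy <= size(A, 2) && ib <= size(A, 3)
        A[ix, iy, ib] = s * A[ix, iy, ib]
    end
    return
end

# Four 64x64 batches stored in a single device array.
A = @ones(64, 64, 4)

# One launch covers all batches; @parallel synchronizes at the end.
@parallel scale_batched!(A, 2.0)
```

This trades several tiny launches for one larger one, which usually keeps the GPU busier than relying on stream overlap alone.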
