To run them all at once, you need to launch them on different streams. You can pass the keyword argument stream = ParallelStencil.ParallelKernel.@get_stream(i) to @parallel_async, where i is a stream index starting at 1. You can then synchronize the streams by calling @synchronize ParallelStencil.ParallelKernel.@get_stream(i) for each stream index i you used.
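As a rough, untested sketch of what this could look like: the kernel scale!, the array sizes and the number of kernels below are made up for illustration, and I am assuming the CUDA backend. Only the stream keyword, @get_stream and @synchronize usage follow what is described above.

```julia
using CUDA                      # assumption: CUDA backend; adapt for AMDGPU
using ParallelStencil
@init_parallel_stencil(CUDA, Float64, 1)

# Hypothetical small kernel, used only to have something to launch.
@parallel_indices (ix) function scale!(A, c)
    A[ix] = c * A[ix]
    return
end

# Hypothetical workload: several small, independent arrays.
arrays = [@rand(64) for _ in 1:4]

# Launch each kernel asynchronously on its own stream (stream index starts at 1).
for (i, A) in enumerate(arrays)
    @parallel_async stream=ParallelStencil.ParallelKernel.@get_stream(i) scale!(A, 2.0)
end

# Wait for every stream that was used before touching the results.
for i in eachindex(arrays)
    @synchronize ParallelStencil.ParallelKernel.@get_stream(i)
end
```

Whether the kernels actually overlap on the device depends on how many resources each one occupies, so it is worth checking with a profiler.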
If these small kernels can also overlap with the large kernels, and you also have communication to hide, then all of this can be done automatically with the @hide_communication macro (see ?@hide_communication). I guess one could add a macro to automatically overlap kernels in cases like yours (besides the existing one, which hides communication and overlaps boundary condition computations with inner point computations). However, it is often a better approach to write heavier kernels, for example kernels that each compute multiple batches.