Most efficient way of _waiting_ for GPU results?



I am using the following pattern today to synchronize the results of GPU processing back to the CPU (CUDAnative, CuArrays, CUDAdrv):

# data_out_gpu and data_in_gpu are CuArrays
# do_some_gpu_processing() will launch kernels etc.
do_some_gpu_processing(data_out_gpu, data_in_gpu)
# calling Array will force a wait for the GPU to finish processing
data_out_cpu = Array(data_out_gpu)

However, I noticed that with such a pattern the CPU gets loaded to 100% while waiting for the GPU. Thus I wonder if there is a pattern that is gentler on the CPU? Some type of event that fires on “ok, the GPU is done, you can come and pick up your data”, or some other recommended way to wait?

For reference: the GPU processing I have takes somewhere between 10 ms and 1 s, averaging around 250 ms, and all the processing I do on the CPU is on the order of 1 ms. data_out_gpu is really a tiny array with results (3–24 Float32 numbers). So I would expect the CPU to be idle most of the time…

This is normally not a problem with one Julia instance. But I run several Julia instances per GPU (to be able to run several kernels in parallel) and have several GPUs in the system. Thus before you know it, the CPU is 100% busy and starts having trouble feeding the GPUs with new kernels.

The only thing I found so far is the GPU event: I was planning to measure its performance vs. Array(gpu_result) synchronization. Is that the more canonical way?

Hints and pointers to a more efficient way of waiting for the results are highly appreciated.


CUDA events would be the best approach, but we haven’t wrapped the necessary functionality: either creating an event with the blocking sync flag set, or manually querying the state of the event. Alternatively, creating the context with the CU_CTX_SCHED_BLOCKING_SYNC flag set should accomplish the same.

Bottom line, a couple of low-level ways to accomplish this, nothing user friendly yet :slight_smile:
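For the record, both low-level routes go through the CUDA driver API. A rough, untested sketch of the event-based one, calling libcuda directly from Julia (the function names and the CU_EVENT_BLOCKING_SYNC value come from the driver API documentation; the bare ccalls and the omitted error handling are simplifications):

```julia
# Sketch: create an event with the blocking-sync flag by calling the CUDA
# driver API directly. CU_EVENT_BLOCKING_SYNC makes cuEventSynchronize
# block the calling thread instead of busy-spinning on the GPU.
const CUevent = Ptr{Cvoid}
const CUstream = Ptr{Cvoid}
const CU_EVENT_BLOCKING_SYNC = Cuint(0x01)

ev_ref = Ref{CUevent}(C_NULL)
ccall((:cuEventCreate, "libcuda"), Cint,
      (Ptr{CUevent}, Cuint), ev_ref, CU_EVENT_BLOCKING_SYNC)
ev = ev_ref[]

# ... launch kernels here ...

# record the event on the default stream, then wait (without spinning)
ccall((:cuEventRecord, "libcuda"), Cint, (CUevent, CUstream), ev, C_NULL)
ccall((:cuEventSynchronize, "libcuda"), Cint, (CUevent,), ev)

# the other route: poll the event and yield to other work in between
# (cuEventQuery returns 0 / CUDA_SUCCESS once the event has completed)
# while ccall((:cuEventQuery, "libcuda"), Cint, (CUevent,), ev) != 0
#     sleep(0.001)
# end

ccall((:cuEventDestroy_v2, "libcuda"), Cint, (CUevent,), ev)
```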


Looks like CUDAdrv already has it:

@enum(CUctx_flags, SCHED_AUTO           = 0x00,
                   SCHED_SPIN           = 0x01,
                   SCHED_YIELD          = 0x02,
                   SCHED_BLOCKING_SYNC  = 0x04,
                   MAP_HOST             = 0x08,
                   LMEM_RESIZE_TO_MAX   = 0x10)

so this should just be a matter of calling:
ctx = CuContext(dev, CUctx_flags(4))
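One way to double-check that the flag actually took effect (a hedged sketch; cuCtxGetFlags is a documented driver API call, but reaching it via a raw ccall into libcuda is my assumption about how to do it from Julia):

```julia
# After creating the context, query its scheduling flags straight from
# the driver. Bit 0x04 set means CU_CTX_SCHED_BLOCKING_SYNC is active.
flags_ref = Ref{Cuint}(0)
ccall((:cuCtxGetFlags, "libcuda"), Cint, (Ptr{Cuint},), flags_ref)
if flags_ref[] & 0x04 != 0
    println("blocking sync enabled")
end
```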


Yes, but CUDAnative manages your context, and there’s no API for setting flags there. You can try changing the constructor call in CUDAnative’s initialization code. Also, you can use CUDAdrv.SCHED_BLOCKING_SYNC instead of the raw value.

Using other contexts, i.e. constructing and activating a new one disregarding what CUDAnative has constructed before, might break some functionality. AFAIK this is similar to how CUDA treats contexts.


@maleadt, Thanks a bunch!

I guess that means there are no obvious APIs/patterns I am missing. No low-hanging fruit.

And there are several modifications to the library code one can make if one wants to push forward here. Makes sense.


Yeah, nothing specific to CUDAnative here. The low-level enhancements aren’t difficult to implement though, so feel free to give it a try or file an issue on CUDAdrv to have them implemented. But the underlying “issue”, where blocking on a GPU task results in a CPU-intensive busy loop, is also present with plain CUDA. There’s probably a reason why blocking sync isn’t the default, so I’m not sure we should change it for all of CUDAnative/CuArrays.

EDIT: although we could always expose a blocking CuEvent through e.g. an argument to CuArrays.@sync, or just make it the default there (where CUDAdrv.synchronize() would then still be a CUDA-style busy-looping sync). Feel free to make suggestions if you have any.
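To make that suggestion concrete, such a wrapper could look roughly like this. Purely a sketch: `blocking_sync` is a hypothetical name, not an existing CuArrays API, and the direct libcuda ccalls stand in for proper CUDAdrv wrappers:

```julia
const CUevent = Ptr{Cvoid}
const CUstream = Ptr{Cvoid}
const CU_EVENT_BLOCKING_SYNC = Cuint(0x01)

# Hypothetical helper: run `f()` (which launches kernels), then wait for
# the GPU with a blocking-sync event instead of a busy-spinning sync.
function blocking_sync(f)
    ev_ref = Ref{CUevent}(C_NULL)
    ccall((:cuEventCreate, "libcuda"), Cint,
          (Ptr{CUevent}, Cuint), ev_ref, CU_EVENT_BLOCKING_SYNC)
    f()
    ccall((:cuEventRecord, "libcuda"), Cint,
          (CUevent, CUstream), ev_ref[], C_NULL)
    ccall((:cuEventSynchronize, "libcuda"), Cint, (CUevent,), ev_ref[])
    ccall((:cuEventDestroy_v2, "libcuda"), Cint, (CUevent,), ev_ref[])
end

# usage sketch, matching the pattern from the original question:
# blocking_sync() do
#     do_some_gpu_processing(data_out_gpu, data_in_gpu)
# end
# data_out_cpu = Array(data_out_gpu)
```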