I am using the following pattern today to synchronize results of the GPU processing back to CPU (CUDAnative, CuArrays, CUDAdrv)
```julia
# data_out_gpu and data_in_gpu are CuArrays

# do_some_gpu_processing() will launch kernels etc.
do_some_gpu_processing(data_out_gpu, data_in_gpu)

# calling Array will force a wait for the GPU to finish processing
data_out_cpu = Array(data_out_gpu)
```
However, I noticed that with this pattern the CPU is loaded at 100% while waiting for the GPU. So I wonder: is there a pattern that is gentler on the CPU? Some kind of event that fires on "OK, the GPU is done, you can come and pick up your data", or some other recommended way to wait?
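To illustrate what I mean by a gentler wait, here is a minimal sketch of a cooperative polling loop (this is my own guess, assuming `CUDAdrv.record` and `CUDAdrv.query` work on a `CuEvent` as in the linked source; `do_some_gpu_processing` is the placeholder from above):

```julia
using CUDAdrv, CuArrays

# Record an event on the default stream after launching the kernels,
# then poll it with sleep() so this task yields the CPU instead of
# spinning at 100% inside Array().
ev = CuEvent(CUDAdrv.EVENT_DISABLE_TIMING)

do_some_gpu_processing(data_out_gpu, data_in_gpu)  # launches kernels (placeholder)
CUDAdrv.record(ev)  # event completes once the preceding kernels finish

while !CUDAdrv.query(ev)
    sleep(0.001)  # yield to the OS / other Julia instances on this machine
end

data_out_cpu = Array(data_out_gpu)  # GPU is done; this no longer spins
```

With ~250 ms kernels, a 1 ms polling interval should add negligible latency while keeping the CPU mostly idle. But maybe there is something more idiomatic?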
For reference: the GPU processing I have takes somewhere between 10 ms and 1 s, averaging around 250 ms, while all the processing I do on the CPU is on the order of 1 ms. data_out_gpu is a really tiny array of results (3-24 Float32 numbers). So I would expect the CPU to be idle most of the time…
This is normally not a problem with a single Julia instance. But I run several Julia instances per GPU (to be able to run several kernels in parallel) and have several GPUs in the system. So before you know it, the CPU is 100% busy and starts having trouble feeding the GPUs with new kernels.
The only thing I have found so far is the GPU event API: https://github.com/JuliaGPU/CUDAdrv.jl/blob/2a77dff0eaad0df12abd8cf05e73b3f9d5968ad5/src/events.jl#L73-L87 I was planning to measure its performance vs. the Array(gpu_result) synchronization. Is that the more canonical way?
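Concretely, what I was going to try is creating the event with the blocking-sync flag, which (per the CUDA driver documentation for `cuEventSynchronize` / `CU_EVENT_BLOCKING_SYNC`) makes the waiting CPU thread sleep instead of spin. A sketch, assuming CUDAdrv.jl exposes that flag as `EVENT_BLOCKING_SYNC` and supports `record`/`synchronize` on a `CuEvent`:

```julia
using CUDAdrv, CuArrays

# EVENT_BLOCKING_SYNC asks the driver to block (sleep) the calling CPU
# thread inside synchronize() rather than busy-spin on the event.
ev = CuEvent(CUDAdrv.EVENT_BLOCKING_SYNC | CUDAdrv.EVENT_DISABLE_TIMING)

do_some_gpu_processing(data_out_gpu, data_in_gpu)  # launches kernels (placeholder)
CUDAdrv.record(ev)       # enqueue the event after the kernels on the default stream
CUDAdrv.synchronize(ev)  # should block with near-zero CPU use until the GPU is done

data_out_cpu = Array(data_out_gpu)  # data is ready; the copy returns promptly
```

If that works as advertised, it would also play nicely with several Julia instances per GPU, since none of them would be spinning.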
Hints and pointers to a more efficient way of waiting for the results are highly appreciated.