Most efficient way of _waiting_ for GPU results?



I am using the following pattern today to synchronize the results of GPU processing back to the CPU (CUDAnative, CuArrays, CUDAdrv):

# data_out_gpu and data_in_gpu are CuArrays
# do_some_gpu_processing() will launch kernels etc.
do_some_gpu_processing(data_out_gpu, data_in_gpu)
# calling Array will force a wait for the GPU to finish processing
data_out_cpu = Array(data_out_gpu)

However, I noticed that with such a pattern the CPU gets loaded to 100% while waiting for the GPU. Thus I wonder if there is a pattern that is gentler on the CPU? Some type of event that fires on “ok, the GPU is done, you can come and pick up your data”, or some other recommended way to wait?

For reference: the GPU processing I have takes somewhere between 10 ms and 1 s, averaging around 250 ms, and all the processing I do on the CPU is on the order of 1 ms. data_out_gpu is really a tiny array with results (3–24 Float32 numbers). So I would expect the CPU to be idle most of the time…

This is normally not a problem with one Julia instance. But I run several Julia instances per GPU (to be able to run several kernels in parallel) and have several GPUs in the system. Thus before you know it, the CPU is 100% busy and starts having trouble feeding the GPUs with new kernels.

The only thing I found so far is the GPU event: I was planning to measure its performance vs. Array(gpu_result) synchronization. Is that the more canonical way?

Hints and pointers to a more efficient way of waiting for the results are highly appreciated.


CUDA events would be the best approach, but we haven’t wrapped the necessary functionality: either creating an event with the blocking sync flag set, or manually querying the state of the event. Alternatively, creating the context with the CU_CTX_SCHED_BLOCKING_SYNC flag set should accomplish the same.

Bottom line, a couple of low-level ways to accomplish this, nothing user friendly yet :slight_smile:
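For the record, both low-level routes go through the CUDA driver API. A rough, untested sketch of the event-based one, calling libcuda directly from Julia (the function names and the CU_EVENT_BLOCKING_SYNC value come from the driver API documentation; the bare ccalls and the omitted error handling are simplifications):

```julia
# Sketch: create an event with the blocking-sync flag by calling the CUDA
# driver API directly. CU_EVENT_BLOCKING_SYNC makes cuEventSynchronize
# block the calling thread instead of busy-spinning on the GPU.
const CUevent = Ptr{Cvoid}
const CUstream = Ptr{Cvoid}
const CU_EVENT_BLOCKING_SYNC = Cuint(0x01)

ev_ref = Ref{CUevent}(C_NULL)
ccall((:cuEventCreate, "libcuda"), Cint,
      (Ptr{CUevent}, Cuint), ev_ref, CU_EVENT_BLOCKING_SYNC)
ev = ev_ref[]

# ... launch kernels here ...

# record the event on the default stream, then wait (without spinning)
ccall((:cuEventRecord, "libcuda"), Cint, (CUevent, CUstream), ev, C_NULL)
ccall((:cuEventSynchronize, "libcuda"), Cint, (CUevent,), ev)

# the other route: poll the event and yield to other work in between
# (cuEventQuery returns 0 / CUDA_SUCCESS once the event has completed)
# while ccall((:cuEventQuery, "libcuda"), Cint, (CUevent,), ev) != 0
#     sleep(0.001)
# end

ccall((:cuEventDestroy_v2, "libcuda"), Cint, (CUevent,), ev)
```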


Looks like CUDAdrv already has it:

@enum(CUctx_flags, SCHED_AUTO           = 0x00,
                   SCHED_SPIN           = 0x01,
                   SCHED_YIELD          = 0x02,
                   SCHED_BLOCKING_SYNC  = 0x04,
                   MAP_HOST             = 0x08,
                   LMEM_RESIZE_TO_MAX   = 0x10)

so this should just be a matter of calling:
ctx = CuContext(dev, CUctx_flags(4))
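One way to double-check that the flag actually took effect (a hedged sketch; cuCtxGetFlags is a documented driver API call, but reaching it via a raw ccall into libcuda is my assumption about how to do it from Julia):

```julia
# After creating the context, query its scheduling flags straight from
# the driver. Bit 0x04 set means CU_CTX_SCHED_BLOCKING_SYNC is active.
flags_ref = Ref{Cuint}(0)
ccall((:cuCtxGetFlags, "libcuda"), Cint, (Ptr{Cuint},), flags_ref)
if flags_ref[] & 0x04 != 0
    println("blocking sync enabled")
end
```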


Yes, but CUDAnative manages your context, and there’s no API for setting flags there. You can try changing the constructor call in CUDAnative’s initialization code. Also, you can use CUDAdrv.SCHED_BLOCKING_SYNC instead of the raw value.

Using other contexts, i.e. constructing and activating a new one disregarding what CUDAnative has constructed before, might break some functionality. AFAIK this is similar to how CUDA treats contexts.


@maleadt, Thanks a bunch!

I guess that means there are no obvious APIs/patterns I am missing. No low-hanging fruit.

And there are several modifications to the library code one can make if one wants to push forward here. Makes sense.


Yeah, nothing specific to CUDAnative here. The low-level enhancements aren’t difficult to implement though, so feel free to give it a try or file an issue on CUDAdrv to have them implemented. But the underlying “issue”, where blocking on a GPU task results in a CPU-intensive busy loop, is also present with plain CUDA. There’s probably a reason why blocking sync isn’t the default, so I’m not sure we should change it for all of CUDAnative/CuArrays.

EDIT: although we could always expose a blocking CuEvent through e.g. an argument to CuArrays.@sync, or just make it the default there (where CUDAdrv.synchronize() would then still be a CUDA-style busy-looping sync). Feel free to make suggestions if you have any.
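To make that suggestion concrete, such a wrapper could look roughly like this. Purely a sketch: `blocking_sync` is a hypothetical name, not an existing CuArrays API, and the direct libcuda ccalls stand in for proper CUDAdrv wrappers:

```julia
const CUevent = Ptr{Cvoid}
const CUstream = Ptr{Cvoid}
const CU_EVENT_BLOCKING_SYNC = Cuint(0x01)

# Hypothetical helper: run `f()` (which launches kernels), then wait for
# the GPU with a blocking-sync event instead of a busy-spinning sync.
function blocking_sync(f)
    ev_ref = Ref{CUevent}(C_NULL)
    ccall((:cuEventCreate, "libcuda"), Cint,
          (Ptr{CUevent}, Cuint), ev_ref, CU_EVENT_BLOCKING_SYNC)
    f()
    ccall((:cuEventRecord, "libcuda"), Cint,
          (CUevent, CUstream), ev_ref[], C_NULL)
    ccall((:cuEventSynchronize, "libcuda"), Cint, (CUevent,), ev_ref[])
    ccall((:cuEventDestroy_v2, "libcuda"), Cint, (CUevent,), ev_ref[])
end

# usage sketch, matching the pattern from the original question:
# blocking_sync() do
#     do_some_gpu_processing(data_out_gpu, data_in_gpu)
# end
# data_out_cpu = Array(data_out_gpu)
```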