Most efficient way of _waiting_ for GPU results?

Hello,

I am using the following pattern today to synchronize results of GPU processing back to the CPU (CUDAnative, CuArrays, CUDAdrv):

# data_out_gpu and data_in_gpu are CuArrays
# do_some_gpu_processing() will launch kernels etc.
do_some_gpu_processing(data_out_gpu, data_in_gpu)
# calling Array will force a wait for the GPU to finish processing
data_out_cpu = Array(data_out_gpu)

However, I noticed that with such a pattern the CPU gets loaded to 100% while waiting for the GPU. So I wonder if there is a pattern that is gentler on the CPU? Some type of event that fires on “OK, the GPU is done, you can come and pick up your data”, or some other recommended way to wait?

For reference: the GPU processing I have takes somewhere between 10 ms and 1 s, averaging around 250 ms, while all the processing I do on the CPU is on the order of 1 ms. data_out_gpu is a really tiny array of results (3-24 Float32 numbers). So I would expect the CPU to be idle most of the time…

This is normally not a problem with one Julia instance. But I run several Julia instances per GPU (to be able to run several kernels in parallel) and have several GPUs in the system. So before you know it, the CPU is 100% busy and starts having trouble feeding the GPUs with new kernels.

The only thing I found so far is the GPU event: CUDAdrv.jl/events.jl at 2a77dff0eaad0df12abd8cf05e73b3f9d5968ad5 · JuliaGPU/CUDAdrv.jl · GitHub. I was planning to measure its performance vs. Array(gpu_result) synchronization. Is that the more canonical way?
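
Concretely, something along these lines is what I had in mind (just a sketch, assuming CuEvent, record and synchronize from CUDAdrv behave the way their names suggest):

using CUDAdrv: CuEvent, CuDefaultStream, record, synchronize

do_some_gpu_processing(data_out_gpu, data_in_gpu)  # queue the kernels
ev = CuEvent()
record(ev, CuDefaultStream())  # mark the end of the queued work
synchronize(ev)                # wait for the GPU to reach that point
data_out_cpu = Array(data_out_gpu)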

Hints and pointers on more efficient ways of waiting for the results are highly appreciated.

CUDA events would be the best approach, but we haven’t wrapped the necessary functionality yet: either creating an event with the blocking-sync flag set, or manually querying the state of the event; resp. CUDA Driver API :: CUDA Toolkit Documentation and CUDA Driver API :: CUDA Toolkit Documentation. Alternatively, creating the context with the CU_CTX_SCHED_BLOCKING_SYNC flag set should accomplish the same.

Bottom line: a couple of low-level ways to accomplish this, nothing user-friendly yet :slight_smile:
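
For illustration, this is roughly what those unwrapped driver calls would look like if done by hand via ccall (a hypothetical sketch, not the CUDAdrv.jl API; error checking omitted, and the driver library is named nvcuda on Windows rather than libcuda):

const CU_EVENT_BLOCKING_SYNC = Cuint(1)

# create an event whose synchronization blocks the CPU thread instead of spinning
ev = Ref{Ptr{Cvoid}}(C_NULL)
ccall((:cuEventCreate, "libcuda"), Cint, (Ptr{Ptr{Cvoid}}, Cuint), ev, CU_EVENT_BLOCKING_SYNC)

# queue your GPU work, then record the event on the default (NULL) stream ...
ccall((:cuEventRecord, "libcuda"), Cint, (Ptr{Cvoid}, Ptr{Cvoid}), ev[], C_NULL)

# ... and wait: with the flag set, the calling thread sleeps until the GPU is done
ccall((:cuEventSynchronize, "libcuda"), Cint, (Ptr{Cvoid},), ev[])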

re: CU_CTX_SCHED_BLOCKING_SYNC (CUDA Driver API :: CUDA Toolkit Documentation)

Looks like CUDAdrv already has it:
https://github.com/JuliaGPU/CUDAdrv.jl/blob/adffa0a260e91ccbf89bf3cd22dd46dece962bd2/src/context.jl

@enum(CUctx_flags, SCHED_AUTO           = 0x00,
                   SCHED_SPIN           = 0x01,
                   SCHED_YIELD          = 0x02,
                   SCHED_BLOCKING_SYNC  = 0x04,
                   MAP_HOST             = 0x08,
                   LMEM_RESIZE_TO_MAX   = 0x10)

so this should just be a matter of calling:
ctx = CuContext(dev, CUctx_flags(4))

Yes, but CUDAnative manages your context and there’s no API for setting flags there: CUDAnative.jl/init.jl at 98baf5840a9ae65f5c569cac30b1981e7bbfb071 · JuliaGPU/CUDAnative.jl · GitHub
You can try changing the constructor call at that link: CUDAnative.jl/init.jl at 98baf5840a9ae65f5c569cac30b1981e7bbfb071 · JuliaGPU/CUDAnative.jl · GitHub
Also, you can use CUDAdrv.SCHED_BLOCKING_SYNC instead of CUctx_flags(4).

Using other contexts, i.e. constructing and activating a new one while disregarding what CUDAnative has constructed before, might break some functionality. AFAIK this is similar to how CUDA itself treats contexts.
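
For completeness, a minimal sketch of that manual route (with the caveat above that it bypasses the context CUDAnative manages):

using CUDAdrv

dev = CuDevice(0)
ctx = CuContext(dev, CUDAdrv.SCHED_BLOCKING_SYNC)  # instead of the default CuContext(dev)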

@maleadt, Thanks a bunch!

I guess that means there are no obvious APIs/patterns I am missing. No low-hanging fruit.

And there are several modifications one can make to the library code if one wants to push forward here. Makes sense.

Yeah, nothing specific to CUDAnative here. The low-level enhancements aren’t difficult to implement though; feel free to give it a try or file an issue on CUDAdrv to have them implemented. But the underlying “issue”, where blocking on a GPU task results in a CPU-intensive busy loop, is also present with plain CUDA. There’s probably a reason why blocking sync isn’t the default, so I’m not sure we should change it for all of CUDAnative/CuArrays.

EDIT: although we could always expose a blocking CuEvent through e.g. an argument to CuArrays.@sync, or just make it the default there (where CUDAdrv.synchronize() would then still be a CUDA-style busy-looping sync). Feel free to make suggestions if you have any.
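
To sketch the idea only (hypothetical code, not the actual implementation; it assumes an event constructor that accepts the blocking-sync flag, which is exactly the part that isn’t wrapped yet):

macro blocking_sync(ex)
    quote
        local ret = $(esc(ex))                   # queue the GPU work
        local ev = CuEvent(EVENT_BLOCKING_SYNC)  # assumed constructor/flag
        record(ev, CuDefaultStream())            # record after the queued work
        synchronize(ev)                          # CPU thread sleeps instead of spinning
        ret
    end
end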

https://github.com/JuliaGPU/CuArrays.jl/pull/245

Thanks for introducing the new @sync macro.

Do I understand it correctly that the idea is that where I used to have

y_gpu = some_gpu_func(x_gpu)
y_cpu = Array(y_gpu) #<- this sync is expensive

I would use

y_gpu = @sync(some_gpu_func(x_gpu)) #<- this sync is cheaper
y_cpu = Array(y_gpu) #<- "no sync" here anymore

?

Anyway, I tried that and didn’t notice any difference in CPU load :frowning: one thread is still fully utilized even if the GPU work takes a second or two.

OTOH, at the same time I moved from Julia 1.0.2 to 1.1.0 (CuArrays from 0.8.6 to 0.9.0) and updated all packages, and my whole program got almost 5-10x slower (with both methods). So it looks like there is a much bigger problem somewhere, maybe masking any potential gains.

The only code change I had to make was moving from the CuArray{Float32, 2}(x, y) style of array creation to CuArray{Float32, 2}(undef, x, y) :frowning:

It should be more efficient, yes. Are you sure that the thread is busy waiting for the @sync?
EDIT: actually, there seems to be something wrong with the new @sync… I’ll look into it as soon as I have some time.

Bummer. We really need GPU benchmarking as part of CI…
0.9.0 has been a very large release, both for CuArrays.jl and CUDAnative.jl. If you could give me a kernel + launch sequence that has regressed, I’d be happy to give it a look.

I think I managed to reduce a pile of code down to a simple example. It looks like the compilation of a user-generated function takes dramatically longer now in CUDAnative 1.0.1 (Julia 1.1.0). With that much CPU time spent compiling, no wonder we had no chance of seeing improvements in waiting time.

Hope that helps

https://github.com/JuliaGPU/CUDAnative.jl/issues/336

You were complaining about kernel performance, but now you mention compilation time?

What do you mean by this? Kernels are compiled once and cached; this has nothing to do with the time spent in sync, which is a busy loop within CUDA. Baseless accusations are of no help here.
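
For reference, a minimal sketch that shows the caching behaviour: the first launch of a given kernel pays the compilation cost, subsequent launches don’t.

using CUDAnative, CuArrays

function fill_val(x, val)
    x[threadIdx().x] = val
    return
end

x = CuArray{Float32, 1}(undef, 32)
@time @cuda threads=32 fill_val(x, 1f0)  # first call: includes kernel compilation
@time @cuda threads=32 fill_val(x, 2f0)  # second call: cached kernel, launch cost only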

Fixed in Record the sync event before synchronizing. · JuliaGPU/CuArrays.jl@8e45a27 · GitHub, and verified that CuArrays.@sync does not take any CPU time anymore (whereas CUDAdrv.synchronize() still does).

I really didn’t mean to offend, and I am really sorry if my reports came across like that. It looks like my joke about not being able to see the improvement didn’t land too well; that was not the intention at all.

Quite the opposite: I really, really appreciate the work you and the team do to push Julia and GPU programming forward. I use it every day and it helps me tremendously. Thank you very much.

With that background: I took the latest CUDAnative and noticed that our whole (complex) software runs noticeably slower, which I reported since the devil was probably there, maybe masking everything else. After a fair bit of investigation I noticed that the slowdown was probably related to the compilation part, and created a simple piece of code one can run to reproduce the issue, hoping this would help the developers find it more easily (looks like it did; thanks for fixing it so fast!).

Again, I am not complaining; I am grateful for the work you and the team did. Obviously I will often run into something I don’t understand (or guess wrong about the source of a problem, which might turn out to be somewhere else entirely, as this thread evolved), and every so often I will even bump into a bug. So I ask and report. Just trying to help and learn; no offence meant at all.

Guess I misinterpreted that joke indeed :slight_smile: Your MWE is appreciated though, and makes it much easier to diagnose issues like this. Let me know if your @sync problem is now solved too.

Now, with the compilation problem out of the way (I pulled the latest CUDAnative from git), I still can’t seem to make @sync work. Here is a sample with a long GPU processing function and an attempt to use the @sync macro. It doesn’t seem to have an effect:

using CUDAdrv: CuDevice, CuContext, synchronize

use_dev = 0
dev = CuDevice(use_dev)
ctx = CuContext(dev)
println("Running on ", dev)

using CUDAnative
CUDAnative.device!(use_dev)
using CuArrays

function long_gpu_compute(x, num)
    for i = 1:100000000
        x[i % 2 + 1] = num
    end
    return
end

ex = :(()->begin
    x = CuArray{Float32, 1}(undef, 2)
    @cuda threads=1 long_gpu_compute(x, 42.0f0)
    return x
end)

f = eval(ex)

for i = 1:3
   println("**************************")
   println("x_out = CUDAnative.@sync...")
   @time( x_out = CUDAnative.@sync(Base.invokelatest(f)))
   println("synchronize(ctx)...")
   @time( synchronize(ctx) )
   println("x_out_cpu = Array(x_out)...")
   @time( x_out_cpu = Array(x_out))
   println("result it ", x_out_cpu[1], ":", x_out_cpu[1])
end

Judging by the output, the @sync macro returned before the context was synchronized (if I remove synchronize(ctx), I just spend those 4 seconds in x_out_cpu = Array(x_out), like before).

Julia Version 1.1.0 (2019-01-21), official https://julialang.org/ release

Running on CuDevice(0): GeForce RTX 2080 Ti
**************************
x_out = CUDAnative.@sync...
  4.328115 seconds (16.18 M allocations: 817.926 MiB, 4.16% gc time)
synchronize(ctx)...
  4.256406 seconds
x_out_cpu = Array(x_out)...
  0.020469 seconds (93.58 k allocations: 4.704 MiB)
result it 42.0:42.0
**************************
x_out = CUDAnative.@sync...
  0.000068 seconds (25 allocations: 768 bytes)
synchronize(ctx)...
  4.247434 seconds
x_out_cpu = Array(x_out)...
  0.000103 seconds (5 allocations: 208 bytes)
result it 42.0:42.0
**************************
x_out = CUDAnative.@sync...
  0.000055 seconds (25 allocations: 768 bytes)
synchronize(ctx)...
  4.247422 seconds
x_out_cpu = Array(x_out)...
  0.000104 seconds (5 allocations: 208 bytes)
result it 42.0:42.0

Any idea how to move forward?
I run Win10

NVIDIA 2080 Ti (driver 417.35)
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:04_Central_Daylight_Time_2018
Cuda compilation tools, release 10.0, V10.0.130
Julia 1.1.0
  [c52e3926] Atom v0.7.14
  [6e4b80f9] BenchmarkTools v0.4.2
  [c5f51814] CUDAdrv v1.0.1
  [be33ccc6] CUDAnative v1.0.1+ [`dev\CUDAnative`]
  [3a865a2d] CuArrays v0.9.0
  [5789e2e9] FileIO v1.0.5
  [033835bb] JLD2 v0.1.2
  [e5e0dc1b] Juno v0.5.4

Cute issue: you’re using CUDAnative.@sync, which resolves to Base.@sync. You need to use CuArrays.@sync.

You aren’t the first to run into this; I’m growing to dislike this behavior…

Oh, that was sloppy of me. Sorry. However, I still get the same numbers even with the right macro:

for i = 1:3
   println("**************************")
   println("x_out = CuArrays.@sync...")
   @time( x_out = CuArrays.@sync(Base.invokelatest(f)))
   println("synchronize(ctx)...")
   @time( synchronize(ctx) )
   println("x_out_cpu = Array(x_out)...")
   @time( x_out_cpu = Array(x_out))
   println("result it ", x_out_cpu[1], ":", x_out_cpu[1])
end

output

x_out = CuArrays.@sync...
  0.000069 seconds (24 allocations: 640 bytes)
synchronize(ctx)...
  4.247411 seconds
x_out_cpu = Array(x_out)...
  0.000185 seconds (5 allocations: 208 bytes)
result it 42.0:42.0

And double-checking, it looks like it resolved to the right piece of code.

That is strange, I’ve verified it just works here:

**************************
x_out = CUDAnative.@sync...
  1.960479 seconds (31 allocations: 768 bytes)
synchronize(ctx)...
  0.000023 seconds
x_out_cpu = Array(x_out)...
  0.000039 seconds (5 allocations: 208 bytes)
result it 42.0:42.0
**************************

Are you sure you are on CuArrays#master? Your Pkg output doesn’t seem to indicate that.

Yes, that was the trick (I somehow assumed it was part of the 0.9 release):
(v1.1) pkg> develop --local CuArrays

Now I see the expected results

x_out = CuArrays.@sync...
  4.247996 seconds (26 allocations: 704 bytes)
synchronize(ctx)...
  0.000088 seconds
x_out_cpu = Array(x_out)...
  0.000138 seconds (5 allocations: 208 bytes)
result it 42.0:42.0

Thanks a bunch!