CUDA.jl - When to synchronize

The project I’m working on runs entirely on the GPU: there is no copying back and forth between the GPU and CPU until the very end. Along the way, I go through five main types of operations:

Array programming
Kernels (three functions I couldn’t express with array programming)
mapreducedim-type function calls
Calls to cuFINUFFT in Python (the GPU version of the Flatiron Institute’s NUFFT code)
Calls to a GPU-compiled LAMMPS through LAMMPS.jl (a molecular dynamics code written in C++)

I’m not quite sure where I need to put explicit synchronize calls in the code. I think array programming handles this for me, so I’m guessing I need to synchronize before and after the kernels, the calls to the cuFINUFFT library in Python, and the calls to LAMMPS. Is this correct?

Is there a general rule that I should follow, or is it case dependent?


Every time you pass data to be processed on another stream, you need to synchronize beforehand.

As an example, if you are passing a CuArray to a C++ library that launches CUDA operations internally, you will need to synchronize before the ccall (and at the end of the C++ code) to make sure that all operations launched by Julia have finished before C++ operates on the memory, and vice versa.
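
For concreteness, here is a minimal sketch of that pattern from the Julia side. The library name libexternal and the function process_array are placeholders for whatever C++/CUDA code you are calling into, not a real API:

using CUDA

x = CUDA.rand(Float32, 1024)

# Finish all work CUDA.jl has queued on `x` before the external library
# touches the same memory.
CUDA.synchronize()

# Hypothetical external call that launches its own CUDA work on `x`.
ccall((:process_array, "libexternal"), Cvoid,
      (CuPtr{Float32}, Cint), x, length(x))

# If the library returns while work is still pending on its own stream,
# synchronize the whole device before CUDA.jl reads the results again.
CUDA.device_synchronize()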

Awesome, thank you. But the same does not apply to kernels compiled with @cuda, correct? These would not be on another stream.

Correct. In fact, with recent versions of CUDA.jl it is no longer strictly required to synchronize when performing operations on other streams, as CUDA.jl will synchronize for you: CUDA.jl 5.4: Memory management mayhem ⋅ JuliaGPU. This of course does not hold when calling out to non-CUDA.jl code.
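
As a small illustration of that automatic behaviour (a sketch, assuming a recent CUDA.jl where every Julia task runs on its own stream):

using CUDA

a = CUDA.rand(Float32, 1_000_000)
b = a .+ 1  # queued on the current task's stream

# A new task uses a different stream; CUDA.jl tracks which stream last
# wrote `b` and synchronizes automatically before the reduction reads it,
# so no explicit synchronize() is needed here.
t = @async sum(b)
fetch(t)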


Hi, I have a very basic question about synchronize when using KernelAbstractions.jl. In the examples, such as Matmul · KernelAbstractions.jl, there is a call to KernelAbstractions.synchronize(backend). I ran a quick test using the code below, and it appears that removing sync1, sync2, and sync3 still produces the correct results. Are all three synchronizations unnecessary?

using KernelAbstractions, Test
using CUDA
using BenchmarkTools

# Increase x by 1
@kernel function kernel1!(x)
    i = @index(Global)
    x[i] += 1
end

# y = x^2
@kernel function kernel2!(y, x)
    i = @index(Global)
    y[i] = x[i]^2
end

function test_fun(y, x)
    backend = KernelAbstractions.get_backend(x)
    kernel1!(backend)(x; ndrange=length(x))
    
    KernelAbstractions.synchronize(backend) # sync1 

    kernel2!(backend)(y, x; ndrange=length(x))
    
    return nothing
end

N = 10000
x = CUDA.zeros(N)
y = CUDA.zeros(N)
y_cpu = zeros(N)
for i in 1:100
    test_fun(y, x)

    backend = KernelAbstractions.get_backend(x)
    KernelAbstractions.synchronize(backend) # sync2
end

backend = KernelAbstractions.get_backend(x)
KernelAbstractions.synchronize(backend) # sync3
copyto!(y_cpu, y)

@test all(y .== 100^2)
@test all(y_cpu .== 100^2)