CUDA.jl - When to synchronize

The project I’m working on runs entirely on the GPU; that is, there is no copying back and forth between the GPU and CPU until the very end. Along the way, I go through five main types of operations:

1. Array programming
2. Kernels (three functions I couldn’t figure out with array programming)
3. mapreducedim-type function calls
4. Calls to cufinufft in Python (the GPU version of the Flatiron Institute’s NUFFT code)
5. Calls to a GPU-compiled LAMMPS through LAMMPS.jl (a molecular dynamics code written in C++)

I’m not quite sure where I need to put explicit synchronize calls in the code. I think array programming handles this for me, so I’m thinking I may need to call it before and after the kernels, the call to the cufinufft library in Python, and the call to LAMMPS. Is this correct?

Is there a general rule that I should follow, or is it case dependent?


Every time you pass data to be processed on another stream, you need to synchronize beforehand.

So, as an example, if you are passing a CuArray to a C++ library that launches CUDA operations internally, you will need to synchronize before the ccall (and at the end of the C++ code) to make sure that all operations launched by Julia are finished before C++ operates on the memory, and vice versa.
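A minimal sketch of that pattern (here call_external_library! is a hypothetical stand-in for any ccall, Python, or LAMMPS.jl call that launches its own CUDA work):

using CUDA

x = CUDA.rand(1024)
x .+= 1                      # queued asynchronously on CUDA.jl's task-local stream

CUDA.synchronize()           # ensure Julia's GPU work on x has finished
call_external_library!(x)    # hypothetical external call that runs CUDA work on its own stream
# The external code should synchronize its own stream before returning,
# so that subsequent CUDA.jl operations see its results.
y = sum(x)                   # safe to continue with CUDA.jl operations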

Awesome, thank you. But the same does not apply to kernels compiled with @cuda, correct? These would not be on another stream.

Correct. In fact, with the latest version of CUDA.jl it’s no longer strictly required to synchronize when performing operations on other streams, as CUDA.jl will synchronize for you: CUDA.jl 5.4: Memory management mayhem ⋅ JuliaGPU. This of course does not hold when calling out to non-CUDA.jl code.
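As a rough illustration of that behaviour (a sketch, assuming CUDA.jl ≥ 5.4; each Julia task gets its own stream):

using CUDA

x = CUDA.zeros(1024)

t = Threads.@spawn begin
    x .+= 1              # runs on this task's own stream
end
wait(t)

# No explicit CUDA.synchronize() should be needed here: CUDA.jl tracks which
# stream last touched x and waits for that work before copying to the CPU.
@assert all(Array(x) .== 1)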


Hi, I have a very basic question about synchronize when using KernelAbstractions.jl. In examples such as Matmul · KernelAbstractions.jl, there is a call to KernelAbstractions.synchronize(backend). I ran a quick test using the following code, and it appears that removing sync1, sync2, and sync3 still produces the correct results. Are all three synchronizations unnecessary?

using KernelAbstractions, Test
using CUDA
using BenchmarkTools

# Increase x by 1
@kernel function kernel1!(x)
    i = @index(Global)
    x[i] += 1
end

# y = x^2
@kernel function kernel2!(y,x)
    i = @index(Global)
    y[i] = x[i]^2
end

function test_fun(y, x)
    backend = KernelAbstractions.get_backend(x)
    kernel1!(backend)(x; ndrange=length(x))
    
    KernelAbstractions.synchronize(backend) # sync1 

    kernel2!(backend)(y, x; ndrange=length(x))
    
    return nothing
end

N = 10000
x = CUDA.zeros(N)
y = CUDA.zeros(N)
y_cpu = zeros(N)
for i in 1:100
    test_fun(y, x)

    backend = KernelAbstractions.get_backend(x)
    KernelAbstractions.synchronize(backend) # sync2
end

backend = KernelAbstractions.get_backend(x)
KernelAbstractions.synchronize(backend) # sync3
copyto!(y_cpu, y)

@test all(y .==  100^2)
@test all(y_cpu .==  100^2)

Copying to CPU memory automatically synchronizes.

Kernel execution is ordered on the task-local stream, so there’s no need to synchronize in between.
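Concretely, the test above should reduce to something like this (a sketch that reuses kernel1!, kernel2!, test_fun, x, y, and y_cpu from the snippet above):

for i in 1:100
    test_fun(y, x)    # kernel1! and kernel2! stay ordered on the same task-local stream
end
copyto!(y_cpu, y)     # the device-to-host copy synchronizes implicitly
@test all(y_cpu .== 100^2)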


Many thanks. So all three synchronizations are unnecessary? Does this conclusion also hold for the other GPUs supported by KernelAbstractions.jl?

Yes, they are all unnecessary. That should hold for all our back-ends.


Yeah, this is mostly an artifact from a previous version of KernelAbstractions.jl where one had to be more explicit.