Synchronize streams in CUDA.jl

Which of the synchronize commands is supposed to be used to synchronize streams in CUDA? I am able to synchronize them with device_synchronize(), but this seems to have a huge impact on speed. Is there perhaps a more ‘lightweight’ command that could be used?

CUDA.synchronize() syncs streams. Also, unless you need to access data across different (non-default) streams, time kernel execution, or perform a few other specific operations (mostly across tasks and streams), kernel execution is stream-ordered, meaning you may skip explicit stream synchronization.
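
For illustration, here’s a minimal sketch of the difference between the commands (the toy kernel and array are made up for the example):

using CUDA

# toy kernel, just for illustration
function fill_kernel!(a, val)
    a[threadIdx().x] = val
    return
end

a_d = CUDA.zeros(Float32, 256)
s = CuStream()
@cuda threads=256 stream=s fill_kernel!(a_d, 1f0)

synchronize(s)         # lightweight: waits only for work queued on stream `s`
synchronize()          # waits for the current task's (default) stream
device_synchronize()   # heavyweight: blocks until *all* work on the device is done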

Thanks for the quick reply! I might be using streams in a slightly unconventional way: I’m launching multiple kernels asynchronously using streams (one kernel launch per stream), and then performing a single large DtoH memcopy.

I notice that the memcopy, which uses the default stream, starts before all streams have completed kernel execution. This obviously causes incorrect results. I want to ensure all streams complete kernel execution before the memcopy, but CUDA.synchronize() did not seem to do this.

We don’t use the traditional default stream. Instead, CUDA.jl allocates a default stream per task, so one possible solution here is to launch the kernels on different Julia tasks (using @async), call synchronize() at the end of each task, and wait or @sync (the Base one) on those tasks. After that, you should be safe to perform the memory copy without ever having to perform a blocking, let alone device-wide, synchronization.
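
As a rough sketch of that pattern (the toy kernel and sizes here are made up):

using CUDA

function col_kernel!(a, i)   # toy kernel: fill column i
    a[threadIdx().x, i] = Float32(i)
    return
end

a_d = CUDA.zeros(Float32, 256, 4)

@sync for i in 1:4           # Base.@sync: waits for the Julia tasks below
    @async begin
        # each task gets its own task-local stream from CUDA.jl
        @cuda threads=256 col_kernel!(a_d, i)
        synchronize()        # waits only for this task's stream
    end
end

a = Array(a_d)               # safe: all per-task streams have drained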

There are alternatives, like using CuEvents to make the streams depend on one another, but the above is probably the most idiomatic.
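
For reference, an event-based dependency would look roughly like this; the exact record/wait signatures may differ between CUDA.jl versions, so treat this as a sketch:

using CUDA

s1 = CuStream()
s2 = CuStream()
ev = CuEvent()

# ... launch a kernel on s1 ...
record(ev, s1)   # mark this point in s1's work queue
wait(ev, s2)     # s2 won't proceed past this point until the event fires
# ... queue dependent work, e.g. a copy, on s2 ...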

Note that if you’re working on a single array per stream, CUDA.jl will be able to automatically figure out the need for synchronization on another stream, see CUDA.jl 5.4: Memory management mayhem ⋅ JuliaGPU, “Using multiple streams”.

That’s interesting!
Why do you suggest calling synchronize at the end of each task? Wouldn’t the @sync that wraps the block synchronize all threads/streams when they finish execution?

@sync != CUDA.@sync. The Base macro only waits for the Julia tasks to finish; it doesn’t wait for the GPU work those tasks queued, which is why each task should call synchronize() before it ends.

Thanks @maleadt and @carstenbauer for your tips! I was able to make some progress.

Unfortunately I’m not working with a single array per stream, and I’m now trying to figure out how to copy a matrix slice over a stream. I saw that unsafe_copy2d! was my only option if I wanted to specify a stream in the arguments, but I haven’t gotten it to work correctly yet.

Should I use that to copy an array slice from device to host, or is there a better option?

Here’s an MWE to clarify what I’m trying to achieve:

using CUDA

function kernel!(a, i)
    # A large computation
    m::Int = 10^4
    for j = 1:m 
        for k = 1:m
            a[threadIdx().x, i] = i
        end
    end
    return
end


function main()
    n = 2  # No. of streams

    a = rand(256, n)
    a_d = CuArray(a)

    @sync for i in 1:n
        @async begin
            stream = CuStream()
            
            # Each stream handles a column of a_d
            @cuda threads=256 stream=stream kernel!(a_d, i)
            
            # Copy each column of a_d to a
            # THIS NEEDS TO BE DONE USING THE CURRENT STREAM
            a[:, i] .= Array(a_d[:, i])
        end
    end 
end 

CUDA.@profile main()

From the profiler output, the memcopies occur on different streams from the kernel launches and aren’t synchronized with them. I’d like each copy to occur on the same stream, after the kernel on that stream has completed.

What do you mean by that, specifically? If you just want the copy on another stream, you can use CUDA.stream! with do-block syntax to temporarily switch the current task to another stream, and perform a regular copy within. That’s even supposed to work when the kernels producing the data haven’t finished yet, as CUDA.jl should synchronize the previous stream the array was used on when it performs the copy and detects a different stream, but I haven’t tested that specifically.
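
Roughly, as a sketch (a_d here stands in for data produced by kernels on another stream):

using CUDA

a_d = CUDA.rand(256)             # stand-in for kernel-produced data
a   = Array{Float32}(undef, 256)
s   = CuStream()

CUDA.stream!(s) do
    copyto!(a, a_d)              # a regular copy, now ordered on stream `s`
end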

Sorry if that wasn’t clear.
What I meant was, in the example above, when I do

@sync for i = 1:n
  @async begin
    stream = CuStream()
    @cuda threads=256 stream=stream kernel!(a_d, i)

    a[:, i] .= Array(a_d[:, i])
  end
end

for each i, the kernel launch and the memcopy occur on two different streams.

I want the memcopy to occur on the same stream after the kernel computation on that stream is completed.

I didn’t know about the do ... end block with streams though! That looks promising.

That’s even supposed to work when the kernels producing the data haven’t finished yet, as CUDA.jl should synchronize the previous stream the array was used on when it performs the copy and detects a different stream, but I haven’t tested that specifically.

I wonder if this functionality isn’t working for me because all streams are accessing the same array. I don’t have a single array per stream.

I wouldn’t do that. It’s possible, but just change the stream for that task / @async block by calling stream! (without the do-block syntax).
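
Applied to the MWE above, that would look something like this (untested sketch, reusing kernel!, a, a_d, and n from the earlier example):

@sync for i in 1:n
    @async begin
        stream!(CuStream())                  # switch this task to a fresh stream
        @cuda threads=256 kernel!(a_d, i)    # launches on the task's stream
        a[:, i] .= Array(a_d[:, i])          # copy is ordered on the same stream
    end
end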

Yeah, that could be problematic, as it will cause additional synchronization. That assumption is currently baked into CUDA.jl’s memory handling: CUDA.jl/src/memory.jl at 76e2972814a0e7910f35ed3ad17b1a9198628f34 · JuliaGPU/CUDA.jl · GitHub

What’s the reason behind sharing arrays between different streams? Do you really have several kernels operating on subsets of the memory, only to copy the pieces of memory that are ready?

The reason for sharing a single array between different streams was to avoid redundant HtoD data copies.

I’m working on a particle interaction problem where certain common clusters of particles interact with ‘target’ particles. I was initially using separate arrays per stream, but then those common clusters were copied to the GPU redundantly.
When each stream works on a subset of one large array, there are no redundant copy operations, just a single large initial copy.

At this point, these are experimental algorithms, so I’m still figuring out if there’s a better way to do it, but for now, the stream! do ... end block seems to do the job for me like so:

using CUDA  # kernel!, a, a_d, and nstreams as defined earlier

streams = Vector{CuStream}(undef, nstreams)

# Launch all streams asynchronously
for i = 1:nstreams
    streams[i] = CuStream()
    @cuda threads=256 stream=streams[i] kernel!(a_d, i)
end

# Memcopy using the same streams to ensure synchronization
for i = 1:nstreams
    stream!(streams[i]) do
        a[:, i] .= Array(a_d[:, i])
    end
end

However, like you mentioned, I think there might be additional synchronization occurring that’s costing me performance.