CUDA async is not working properly

I am trying to run saxpy_kernel asynchronously with CUDA.jl, and I need to launch it on a user-specified stream. It works with CUDA.@sync, but not with CUDA.@async.

Here is a minimal working example.
File: stream_pure.jl

using CUDA

# Define the SAXPY kernel
function saxpy_kernel(z, a, x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        z[i] = a * x[i] + y[i]
    end
    return
end

# Main function demonstrating CUDA.@async
function main()
    # Array size
    n = 8
    # Create random input data
    x = CUDA.rand(Float32, n)
    y = CUDA.rand(Float32, n)
    z = CUDA.zeros(Float32, n)
    a = 2.5f0

    # Create a CUDA stream
    stream = CUDA.CuStream()

    # Launch SAXPY on the user-specified stream from inside a task
    CUDA.@async begin
        println("Launching task on stream")
        @cuda stream=stream threads=256 blocks=ceil(Int, length(x) / 256) saxpy_kernel(z, a, x, y)
        synchronize(stream)  # Ensure the stream completes
        println("Task completed.")
    end

    z_host = Array(z)
    println("All tasks completed.", z_host)
end

# Run the main function
main()

How to run:

$ julia stream_pure.jl

Here is the outcome:

All tasks completed.Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Expected outcome:
All elements of z should be nonzero, and they should be printed after the "Task completed." message.

You are not waiting for the task you launched with @async to complete, so z may be copied back to the host before the kernel has run. CUDA.@async is the same macro as Base.@async and has nothing to do with CUDA or GPU programming; names from Base are accessible through every module. You could wait() on that task, but unless you have other tasks to run on the same thread, you should drop the @async altogether. The stream argument to @cuda and the subsequent synchronize(stream) do everything you need.
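For reference, here is a minimal sketch of both options applied to the MWE. The helper and function names (make_inputs, saxpy_with_task, saxpy_without_task) are made up for illustration, and the extra synchronize() after allocating the inputs is a conservative assumption to make sure they are initialized before a different stream uses them.

```julia
using CUDA

# Same SAXPY kernel as in the MWE.
function saxpy_kernel(z, a, x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        z[i] = a * x[i] + y[i]
    end
    return
end

# Hypothetical helper: allocate the inputs and wait for their initialization,
# which runs on the task-local stream (conservative assumption, see above).
function make_inputs(n)
    x = CUDA.rand(Float32, n)
    y = CUDA.rand(Float32, n)
    z = CUDA.zeros(Float32, n)
    synchronize()
    return x, y, z
end

# Option 1: keep the task, but wait for it before reading z.
function saxpy_with_task(n = 8, a = 2.5f0)
    x, y, z = make_inputs(n)
    stream = CuStream()
    task = @async begin
        @cuda stream=stream threads=256 blocks=cld(n, 256) saxpy_kernel(z, a, x, y)
        synchronize(stream)
    end
    wait(task)  # the step that is missing from the MWE
    return Array(z), a .* Array(x) .+ Array(y)
end

# Option 2: no task at all; launch on the stream and synchronize it.
function saxpy_without_task(n = 8, a = 2.5f0)
    x, y, z = make_inputs(n)
    stream = CuStream()
    @cuda stream=stream threads=256 blocks=cld(n, 256) saxpy_kernel(z, a, x, y)
    synchronize(stream)
    return Array(z), a .* Array(x) .+ Array(y)
end

z1, expected1 = saxpy_with_task()
z2, expected2 = saxpy_without_task()
println("with task matches reference:    ", z1 ≈ expected1)
println("without task matches reference: ", z2 ≈ expected2)
```

Both functions return the GPU result together with a reference computed on the host, so you can check them against each other instead of eyeballing the printed values.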

Thank you. That resolved the issue.

In addition, it may be better to use CUDA.stream! to switch the stream of the current task (either temporarily using do-block syntax, or permanently). If you only pass stream to @cuda, other operations will still use the default task-local stream. If you switch the task's stream instead, you should be able to copy the result by calling Array without having to synchronize explicitly. Of course, this may not apply to your actual application, but it does to this MWE.
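A rough sketch of the do-block form, assuming the same kernel as in the MWE. The function name saxpy_on_stream is made up for illustration. Everything inside the block, including the allocations and the copy back to the host, is issued on the user-specified stream, so there is no stream keyword on @cuda and no explicit synchronize.

```julia
using CUDA

# Same SAXPY kernel as in the MWE.
function saxpy_kernel(z, a, x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        z[i] = a * x[i] + y[i]
    end
    return
end

# Hypothetical wrapper: run the whole computation with `stream` as the
# current task's stream, restoring the previous stream afterwards.
function saxpy_on_stream(stream, n = 8, a = 2.5f0)
    results = Ref{Any}(nothing)
    CUDA.stream!(stream) do
        x = CUDA.rand(Float32, n)
        y = CUDA.rand(Float32, n)
        z = CUDA.zeros(Float32, n)
        # No stream= keyword: @cuda uses the task's current stream.
        @cuda threads=256 blocks=cld(n, 256) saxpy_kernel(z, a, x, y)
        # The copy is ordered after the kernel on the same stream.
        results[] = (Array(z), a .* Array(x) .+ Array(y))
    end
    return results[]
end

z_host, expected = saxpy_on_stream(CuStream())
println("matches reference: ", z_host ≈ expected)
```

The result is passed out through a Ref so the sketch does not rely on what the do-block form of stream! returns.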