I am trying to run saxpy_kernel asynchronously with CUDA.jl. I need to launch the kernel on a user-specified stream. It works with CUDA.@sync, but not with CUDA.@async.
Here is a minimal working example.
File: stream_pure.jl
using CUDA

# Define the SAXPY kernel
function saxpy_kernel(z, a, x, y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        z[i] = a * x[i] + y[i]
    end
    return
end

# Main function demonstrating CUDA.@async
function main()
    # Array size
    n = 8
    # Create random input data
    x = CUDA.rand(Float32, n)
    y = CUDA.rand(Float32, n)
    z = CUDA.zeros(Float32, n)
    a = 2.5f0
    # Create a CUDA stream
    stream = CUDA.CuStream()
    # Launch SAXPY on the user-specified stream from an async task
    CUDA.@async begin
        println("Launching task on stream")
        @cuda stream=stream threads=256 blocks=ceil(Int, length(x) / 256) saxpy_kernel(z, a, x, y)
        synchronize(stream) # Ensure the stream completes
        println("Task completed.")
    end
    z_host = Array(z)
    println("All tasks completed.", z_host)
end

# Run the main function
main()
How to run:
$ julia stream_pure.jl
Actual output:
All tasks completed.Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Expected output:
All elements of z should be non-zero, and they should be printed after the "Task completed." message.
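For comparison, here is a GPU-free sketch of the same launch pattern in plain Julia. It assumes that CUDA.@async behaves like Base's @async here, which I have not confirmed. The names demo and result are hypothetical and used only for illustration. The sketch shows that a task created with @async is only scheduled, and its body does not run until the launching task waits on it or otherwise yields. This may be related to the zeros above, but I am not certain.

```julia
# Plain-Julia sketch, no GPU required.
# `demo` and `result` are hypothetical names, not part of CUDA.jl.
function demo(wait_for_task::Bool)
    result = zeros(Float32, 4)
    # @async only schedules the task; the body runs once the current
    # task yields or explicitly waits on it.
    t = @async begin
        result .= 1.0f0   # stands in for the kernel launch + synchronize
    end
    # Block until the task has finished, if requested.
    wait_for_task && wait(t)
    # Snapshot the array at this point in time.
    return copy(result)
end

println("without wait: ", demo(false))
println("with wait:    ", demo(true))
```

If this is the same effect, then keeping the task handle and calling wait on it (or wrapping the block in Base's @sync) before Array(z) might be worth trying in the CUDA version.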