Using a stream-per-CPU-thread pattern

The NVIDIA Technical Blog post "GPU Pro Tip: CUDA 7 Streams Simplify Concurrency" (see the section "A Multi-threading Example") demonstrates how multiple CPU threads can be spawned, each with its own stream, so that their work can run concurrently on the GPU. Ultimately, I’d like to have a multi-threaded program where each thread can launch kernels on its own stream and sync/wait for them.

I’m trying to replicate this by modifying the MWE from another active thread, "CUDA streams do not overlap". I’m just adding a Threads.@threads loop.

My code and output are:

using CUDAdrv, CUDAnative, CuArrays

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    A[ix] = B[ix]
    return
end

function main()
    nx = 128*1024^2
    nt = 100
    nthreads = 1024
    nblocks = ceil(Int, nx/nthreads)
    Threads.@threads for i = 1:2
        A = CuArray(zeros(nx))
        B = CuArray(ones(nx))
        s = CuStream()
        @cuda blocks=nblocks threads=nthreads stream=s memcopy!(A, B);
        CUDAdrv.synchronize(s)
    end
end

main()

This fails only when the Threads.@threads loop is present, and produces:

Error thrown in threaded loop on thread 1: CUDAdrv.CuError(code=201, meta=nothing)
julia>

Without multiple threads, I have no problem. Is there a better way to do the multithreading, or to use the CUDAdrv/CUDAnative packages here?

Thanks!


This is currently not possible. Julia’s threading support is experimental, and combining it with CUDA is not something that currently works (as far as I know).
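For reference, code=201 is CUDA_ERROR_INVALID_CONTEXT in the CUDA driver API, which fits a thread that has no current CUDA context. Since the question says the single-threaded version works, one possible interim approach is to keep a single CPU thread and give each piece of work its own stream. The following is only a sketch under that assumption: it reuses the calls from the post (CuStream, @cuda ... stream=, CUDAdrv.synchronize), while the name main_streams, the smaller default nx, and the bounds check are illustrative additions and not part of the original MWE.

```julia
using CUDAdrv, CUDAnative, CuArrays

function memcopy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    # Guard added for the sketch, so nx need not be a multiple of the block size.
    if ix <= length(A)
        A[ix] = B[ix]
    end
    return
end

# Hypothetical helper: one CPU thread, one stream per independent piece of work.
function main_streams(; nx = 1024^2, nstreams = 2)
    nthreads = 1024
    nblocks = cld(nx, nthreads)
    streams = [CuStream() for _ in 1:nstreams]
    As = [CuArray(zeros(nx)) for _ in 1:nstreams]
    Bs = [CuArray(ones(nx)) for _ in 1:nstreams]
    # Kernel launches are asynchronous, so queue all the work first...
    for i in 1:nstreams
        s, A, B = streams[i], As[i], Bs[i]
        @cuda blocks=nblocks threads=nthreads stream=s memcopy!(A, B)
    end
    # ...and only then wait on each stream.
    for s in streams
        CUDAdrv.synchronize(s)
    end
    # Copy back to the host and check that every A now holds the ones from B.
    return all(all(Array(A) .== 1) for A in As)
end
```

Calling main_streams() should return true if both copies completed. Whether the kernels actually overlap on the device depends on how much of the GPU each launch occupies, so this only sets up the conditions for concurrency and does not guarantee it.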