Can I move an array asynchronously from main program to CUDA?

I have an application that would benefit from passing an array asynchronously from the main program to my CUDA GPU while I am doing other computations on my CPU. The GPU would only be working on receiving the array.
Also, can threads help here?

EDIT: Updated to reflect @maleadt’s comments.

I think we can do something like this.

using CUDA

# in this example, we want to multiply D*M batches on the GPU.
D = CuMatrix(rand(1_000, 1_000))
op = Base.Fix1(*, D)

M = rand(1_000, 10*1024)
CUDA.pin(M)  # https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/#pinned_host_memory
idx_batches = Iterators.partition(axes(M, 2), 1024)

# We make three channels: copying to GPU, doing our operation, and copying back.
# We use spawned Julia threads for each. Not sure if this is necessary.
ch_cpu_to_gpu = Channel{CuMatrix{Float64}}(; spawn=true) do ch
    foreach(idx_batches) do idx
        put!(ch, CuMatrix(M[:, idx]))
    end
end
ch_op = Channel{CuMatrix{Float64}}(; spawn=true) do ch
    foreach(ch_cpu_to_gpu) do rhs
        put!(ch, op(rhs))
    end
end
ch_gpu_to_cpu = Channel{Matrix{Float64}}(; spawn=true) do ch
    foreach(ch_op) do res
        put!(ch, Matrix(res))
    end
end

We can now do other work on the CPU while collecting the GPU batches:

@time for batch in ch_gpu_to_cpu
    sleep(0.1)
end
# > 1.013279 seconds (663 allocations: 125.015 MiB, 1.43% gc time)

Compared to just moving the memory, without actually computing the multiplication:

@time begin
    for idx in idx_batches
        Matrix(CuMatrix(M[:, idx]))
        sleep(0.1)
    end
end
# > 1.089683 seconds (296 allocations: 156.258 MiB, 0.41% gc time)

There’s also Dagger.jl, but it didn’t work so well for me.


This will probably not execute asynchronously because copies to and from pageable host memory (i.e., what Julia arrays are by default) are mostly synchronous. See “How to Overlap Data Transfers in CUDA C/C++” on the NVIDIA Technical Blog, or the “Pinned memory” section of Learning/Courses/AdvancedCUDA/part1/2-2-memory_management.ipynb in the JuliaGPU/Learning repository on GitHub. You want to make sure you use page-locked CPU memory, either by using CUDA’s HostMemory, or by pinning the array.


Note that the CUDA.Mem submodule has been deprecated in recent versions of CUDA.jl; see the JuliaGPU blog post “CUDA.jl 5.4: Memory management mayhem”.

Thanks @maleadt. I’ve updated my example according to your comments, and it looks like it works now. It would still be good to confirm this with an NVIDIA profiler, though.

Actually it looks like we still get unpinned memory when we index into M?

M = rand(1_000, 1_000)
CUDA.pin(M)

CUDA.@profile CuMatrix(M)
"""
┌──────────┬────────────┬───────┬────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                           │
├──────────┼────────────┼───────┼────────────────────────────────┤
│   99.35% │    6.46 ms │     1 │ [copy pinned to device memory] │
└──────────┴────────────┴───────┴────────────────────────────────┘
"""

Here the profiler reports “copy pinned to device memory”, but indexing into M changes that:

CUDA.@profile CuMatrix(M[:, 1:100])
"""
┌──────────┬────────────┬───────┬──────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                             │
├──────────┼────────────┼───────┼──────────────────────────────────┤
│   21.83% │   64.13 µs │     1 │ [copy pageable to device memory] │
└──────────┴────────────┴───────┴──────────────────────────────────┘
"""

Now it reports “copy pageable to device memory”, and similarly with a view:

CUDA.@profile CuMatrix(@view M[:, 1:100])
"""
┌──────────┬────────────┬───────┬──────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                             │
├──────────┼────────────┼───────┼──────────────────────────────────┤
│   15.17% │   64.61 µs │     1 │ [copy pageable to device memory] │
└──────────┴────────────┴───────┴──────────────────────────────────┘
"""

So actually the memory pinning doesn’t help if we have to index afterwards, even with @view?

That’s a copying slice, not a view.

Yeah, that’s unfortunate. Right now, we only allow CuArray construction from Arrays; all other types (e.g. the SubArray here) are first copied to an Array, losing the pin. Supporting this could be a good addition to CUDA.jl, but IIRC we removed it at some point because of the huge number of ambiguities we ran into.
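
For anyone following along, the difference between the copying slice and the view can be checked without a GPU. This is a minimal CPU-only sketch (the small matrix and variable names are just for illustration); it only shows the Base Julia behaviour that explains why both forms end up as a fresh, unpinned Array by the time CuMatrix receives them.

```julia
M = reshape(collect(1.0:24.0), 4, 6)

s = M[:, 1:2]          # indexing with ranges allocates a new Array (a copy)
v = @view M[:, 1:2]    # a view wraps the parent array without copying

s[1, 1] = -1.0         # modifies the copy only, M is untouched
v[2, 1] = -2.0         # writes through to M

a = Array(v)           # materializing the view allocates a new Array again

println(v isa SubArray)              # the view is a SubArray, not an Array
println(pointer(v) == pointer(M))    # view shares memory with M
println(pointer(s) == pointer(M))    # the slice does not
println(pointer(a) == pointer(M))    # neither does the materialized view
```

So the slice is already a different buffer than the one that was pinned, and the view, while it shares memory with M, is a SubArray that has to be converted to an Array first, which is again a new buffer.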

I’ve found a way to copy slices of one big in-memory array, through the use of CartesianIndices:

# host data, pinned to enable async transfer.
M_hst = rand(1_000, 1_000)
CUDA.pin(M_hst)
idx_hst = CartesianIndices((axes(M_hst, 1), 201:300));

# device buffer
M_dev = CuArray{eltype(M_hst)}(undef, size(M_hst, 1), 100);
idx_dev = CartesianIndices((axes(M_hst, 1), 1:100));

CUDA.@profile copyto!(M_dev, idx_dev, M_hst, idx_hst)
# yields
┌──────────┬────────────┬───────┬────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                           │
├──────────┼────────────┼───────┼────────────────────────────────┤
│   61.96% │   82.73 µs │     1 │ [copy pinned to device memory] │
└──────────┴────────────┴───────┴────────────────────────────────┘

Note: We’re looking for this to read “copy pinned to device memory” instead of “copy pageable to device memory”.

Moving data back the same way also works:

# we also check the backwards transfer
M_dst = zeros(eltype(M_hst), size(M_hst))
CUDA.pin(M_dst)
CUDA.@profile copyto!(M_dst, idx_hst, M_dev, idx_dev)
# yields
┌──────────┬────────────┬───────┬────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                           │
├──────────┼────────────┼───────┼────────────────────────────────┤
│   53.73% │    87.5 µs │     1 │ [copy device to pinned memory] │
└──────────┴────────────┴───────┴────────────────────────────────┘
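
To loop this over all batches with one reusable buffer, the index bookkeeping could look roughly like the sketch below. It runs on plain CPU arrays so it can be tried without a GPU: `src`, `dst`, and `buf` are hypothetical stand-ins for the pinned host input, the pinned host output, and the device buffer, and the doubling step is a placeholder for the real GPU operation. It relies only on Base’s four-argument `copyto!` with CartesianIndices; I haven’t profiled the GPU version of this loop, so treat it as an outline rather than a tested pipeline.

```julia
# CPU-only stand-ins (hypothetical names):
#   src ~ pinned host input, dst ~ pinned host output, buf ~ device buffer
src = reshape(collect(1.0:60.0), 3, 20)
dst = zeros(size(src))
batch = 8
buf = Matrix{Float64}(undef, size(src, 1), batch)

for cols in Iterators.partition(axes(src, 2), batch)
    n = length(cols)                                   # the last batch may be shorter
    idx_src = CartesianIndices((axes(src, 1), cols))   # block of columns in the big array
    idx_buf = CartesianIndices((axes(src, 1), 1:n))    # matching block in the buffer
    copyto!(buf, idx_buf, src, idx_src)                # "upload" without an intermediate slice
    @views buf[:, 1:n] .*= 2                           # placeholder for the GPU computation
    copyto!(dst, idx_src, buf, idx_buf)                # "download" into the right columns
end

println(dst == 2 .* src)
```

The point of building `idx_buf` from `1:n` rather than `1:batch` is that both index ranges passed to `copyto!` must have the same size, which matters when the number of columns is not a multiple of the batch size.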