Can I move an array asynchronously from main program to CUDA?

I have an application that would benefit from passing an array asynchronously from the main program to my CUDA GPU while I do other computations on the CPU. The GPU would only be receiving the array at that point.
Also, can threads help here?

EDIT: Updated to reflect @maleadt’s comments.

I think we can do something like this.

using CUDA

# In this example, we want to multiply D * M in batches on the GPU.
D = CuMatrix(rand(1_000, 1_000))
op = Base.Fix1(*, D)

M = rand(1_000, 10*1024)
CUDA.pin(M)  # https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/#pinned_host_memory
idx_batches = Iterators.partition(axes(M, 2), 1024)

# We make three channels: copying to the GPU, doing our operation, and copying back.
# Each one runs in its own spawned task (spawn=true); not sure if that's necessary.
ch_cpu_to_gpu = Channel{CuMatrix{Float64}}(; spawn=true) do ch
    foreach(idx_batches) do idx
        put!(ch, CuMatrix(M[:, idx]))
    end
end
ch_op = Channel{CuMatrix{Float64}}(; spawn=true) do ch
    foreach(ch_cpu_to_gpu) do rhs
        put!(ch, op(rhs))
    end
end
ch_gpu_to_cpu = Channel{Matrix{Float64}}(; spawn=true) do ch
    foreach(ch_op) do res
        put!(ch, Matrix(res))
    end
end

We can now do something else on the CPU while fetching the GPU batches:

@time for batch in ch_gpu_to_cpu
    sleep(0.1)
end
# > 1.013279 seconds (663 allocations: 125.015 MiB, 1.43% gc time)

Compared to just moving the memory, without actually computing the multiplication:

@time begin
    for idx in idx_batches
        Matrix(CuMatrix(M[:, idx]))
        sleep(0.1)
    end
end
# > 1.089683 seconds (296 allocations: 156.258 MiB, 0.41% gc time)

There’s also Dagger.jl, but it didn’t work so well for me.


This will probably not execute asynchronously because copies to and from pageable host memory (i.e., what Julia arrays are by default) are mostly synchronous. See How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Technical Blog, or the β€œPinned memory” section of Learning/Courses/AdvancedCUDA/part1/2-2-memory_management.ipynb at main Β· JuliaGPU/Learning Β· GitHub. You want to make sure you use page-locked CPU memory, either by using CUDA’s HostMemory, or by pinning the array.
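For reference, here is a minimal sketch of what the pinned-memory path can look like (the array names, sizes, and the amount of overlap you actually get are assumptions for illustration):

using CUDA

A = rand(Float64, 1_000, 1_000)
CUDA.pin(A)                        # page-lock the host array so the copy can overlap with CPU work

dA = CuArray{Float64}(undef, size(A))
copyto!(dA, A)                     # enqueued on the task-local stream; with pinned memory this
                                   # can return before the transfer has completed
# ... other CPU work can happen here while the transfer is in flight ...
CUDA.synchronize()                 # block until the transfer (and any queued work) has finished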


Note that the CUDA.Mem submodule has been deprecated in recent versions of CUDA.jl: CUDA.jl 5.4: Memory management mayhem β‹… JuliaGPU

Thanks @maleadt. I’ve updated my example according to your comments, and it looks like it works now. It would still be good to verify with an NVIDIA profiler, though.

Actually it looks like we still get unpinned memory when we index into M?

M = rand(1_000, 1_000)
CUDA.pin(M)

CUDA.@profile CuMatrix(M)
"""
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Time (%) β”‚ Total time β”‚ Calls β”‚ Name                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   99.35% β”‚    6.46 ms β”‚     1 β”‚ [copy pinned to device memory] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
"""

Here we see β€œcopy pinned to device memory”, but

CUDA.@profile CuMatrix(M[:, 1:100])
"""
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Time (%) β”‚ Total time β”‚ Calls β”‚ Name                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   21.83% β”‚   64.13 Β΅s β”‚     1 β”‚ [copy pageable to device memory] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
"""

here we see β€œcopy pageable to device memory”, and similarly

CUDA.@profile CuMatrix(@view M[:, 1:100])
"""
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Time (%) β”‚ Total time β”‚ Calls β”‚ Name                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   15.17% β”‚   64.61 Β΅s β”‚     1 β”‚ [copy pageable to device memory] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
"""

So actually the memory pinning doesn’t help if we have to index afterwards, even with @view?

That’s a copying slice, not a view.

Yeah, that’s unfortunate. Right now, we only allow CuArray construction from Arrays; all other types (e.g. the SubArray here) are first copied to an Array, losing the pin. Could be a good addition to CUDA.jl, but IIRC we removed this at some point because of the huge number of ambiguities we ran into.
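A possible workaround, as an untested sketch (the staging buffer and batch size here are made up for illustration): copy each batch into a pre-pinned staging Array on the CPU first, so the host-to-device transfer always starts from page-locked memory.

using CUDA

M = rand(1_000, 10 * 1024)
batchsize = 1024

# A reusable staging buffer that stays pinned across batches.
staging = Matrix{Float64}(undef, 1_000, batchsize)
CUDA.pin(staging)

for idx in Iterators.partition(axes(M, 2), batchsize)
    copyto!(staging, view(M, :, idx))   # plain CPU copy into the pinned buffer
    d = CuMatrix(staging)               # this construction now copies from pinned memory
    # ... use d on the GPU ...
end

Whether this actually helps depends on how expensive the extra CPU copy is compared to the pageable transfer it replaces.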