I have an application that would benefit from passing an array asynchronously from the main program to my CUDA GPU while I am doing other computations on my CPU. The GPU would only be working on receiving the array.
Also, can threads help here?
EDIT: Updated to reflect @maleadt's comments.
I think we can do something like this.
using CUDA
# in this example, we want to multiply D*M batches on the GPU.
D = CuMatrix(rand(1_000, 1_000))
op = Base.Fix1(*, D)
M = rand(1_000, 10*1024)
CUDA.pin(M) # https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/#pinned_host_memory
idx_batches = Iterators.partition(axes(M, 2), 1024)
# We make three channels: copying to GPU, doing our operation, and copying back.
# We use spawned Julia threads for each. Not sure if this is necessary.
ch_cpu_to_gpu = Channel{CuMatrix{Float64}}(; spawn=true) do ch
    foreach(idx_batches) do idx
        put!(ch, CuMatrix(M[:, idx]))
    end
end
ch_op = Channel{CuMatrix{Float64}}(; spawn=true) do ch
    foreach(ch_cpu_to_gpu) do rhs
        put!(ch, op(rhs))
    end
end
ch_gpu_to_cpu = Channel{Matrix{Float64}}(; spawn=true) do ch
    foreach(ch_op) do res
        put!(ch, Matrix(res))
    end
end
We can now do something else on the CPU while we collect the GPU batches:
@time for batch in ch_gpu_to_cpu
sleep(0.1)
end
# > 1.013279 seconds (663 allocations: 125.015 MiB, 1.43% gc time)
Compared to just moving the memory, without actually computing the multiplication:
@time begin
    for idx in idx_batches
        # `M` is the pinned host matrix defined above
        Matrix(CuMatrix(M[:, idx]))
        sleep(0.1)
    end
end
# > 1.089683 seconds (296 allocations: 156.258 MiB, 0.41% gc time)
There's also Dagger.jl, but it didn't work so well for me.
This will probably not execute asynchronously because copies to and from pageable host memory (i.e., what Julia arrays are by default) are mostly synchronous. See How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Technical Blog, or the "Pinned memory" section of Learning/Courses/AdvancedCUDA/part1/2-2-memory_management.ipynb at main · JuliaGPU/Learning · GitHub. You want to make sure you use page-locked CPU memory, either by using CUDA's HostMemory, or by pinning the array.
Note that the CUDA.Mem submodule has been deprecated in recent versions of CUDA.jl: CUDA.jl 5.4: Memory management mayhem ⋅ JuliaGPU
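To make the pinning advice concrete, here is a minimal sketch of uploading a whole pinned array while the CPU does something else. It assumes a working CUDA.jl installation and GPU. The array size and the CPU-side `sum` are placeholders, and whether the copy really overlaps should be confirmed with a profiler rather than taken from this sketch.

```julia
using CUDA

# Host array; CUDA.pin page-locks its memory so that copies from it
# can be queued asynchronously instead of blocking the host.
A = rand(Float32, 4096, 4096)
CUDA.pin(A)

# Preallocate the destination on the device.
dA = CuArray{Float32}(undef, size(A))

# Queue the upload. With a pinned source this call is expected to
# return before the transfer has finished.
copyto!(dA, A)

# Placeholder for independent CPU work. It must not modify `A`
# while the transfer may still be in flight.
cpu_result = sum(abs2, 1:1_000_000)

# Wait for the queued GPU work before relying on the contents of `dA`.
CUDA.synchronize()
```

Nothing should write to `A` between `copyto!` and `CUDA.synchronize()`, since the transfer may still be reading from it.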
Thanks @maleadt. I've updated my example according to your comments, and it looks like it works now. It would probably still be good to check with an NVIDIA profiler, though.
Actually it looks like we still get unpinned memory when we index into M?
M = rand(1_000, 1_000)
CUDA.pin(M)
CUDA.@profile CuMatrix(M)
"""
┌──────────┬────────────┬───────┬────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                           │
├──────────┼────────────┼───────┼────────────────────────────────┤
│   99.35% │    6.46 ms │     1 │ [copy pinned to device memory] │
└──────────┴────────────┴───────┴────────────────────────────────┘
"""
This shows "copy pinned to device memory", but
CUDA.@profile CuMatrix(M[:, 1:100])
"""
┌──────────┬────────────┬───────┬──────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                             │
├──────────┼────────────┼───────┼──────────────────────────────────┤
│   21.83% │   64.13 µs │     1 │ [copy pageable to device memory] │
└──────────┴────────────┴───────┴──────────────────────────────────┘
"""
shows "copy pageable to device memory", and similarly
CUDA.@profile CuMatrix(@view M[:, 1:100])
"""
┌──────────┬────────────┬───────┬──────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                             │
├──────────┼────────────┼───────┼──────────────────────────────────┤
│   15.17% │   64.61 µs │     1 │ [copy pageable to device memory] │
└──────────┴────────────┴───────┴──────────────────────────────────┘
"""
So actually the memory pinning doesn't help if we have to index afterwards, even with @view?
That's a copying slice, not a view.
Yeah, that's unfortunate. Right now, we only allow CuArray construction from Arrays; all other types (e.g. the SubArray here) are first copied to an Array, losing the pin. Could be a good addition to CUDA.jl, but IIRC we removed this at some point because of the huge number of ambiguities we ran into.
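For readers unsure about the slice/view distinction above, here is a small CPU-only sketch in plain Julia (no CUDA.jl needed). It only shows which object owns its own memory. It does not show pin status, which is what the profiler output above is for.

```julia
M = rand(4, 4)

s = M[:, 1:2]          # indexing with ranges allocates a new Array (a copy)
v = @view M[:, 1:2]    # a SubArray that refers to M's own memory

s[1, 1] = -1.0         # changes only the copy
v[2, 1] = -2.0         # writes through to M

println(s isa Array)        # true
println(v isa SubArray)     # true
println(parent(v) === M)    # true
println(M[1, 1] == -1.0)    # false: the copy is independent of M
println(M[2, 1] == -2.0)    # true: the view shares memory with M
```

The copy `s` lives in freshly allocated, unpinned memory, which matches the "pageable" entry in the profile. The view `v` is not an `Array`, so per the explanation above it is first copied to one before the upload.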
I've found a way to copy slices of one big array in memory, through the use of CartesianIndices:
# host data, pinned to enable async transfer.
M_hst = rand(1_000, 1_000)
CUDA.pin(M_hst)
idx_hst = CartesianIndices((axes(M_hst, 1), 201:300));
# device buffer
M_dev = CuArray{eltype(M_hst)}(undef, size(M_hst, 1), 100);
idx_dev = CartesianIndices((axes(M_hst, 1), 1:100));
CUDA.@profile copyto!(M_dev, idx_dev, M_hst, idx_hst)
# yields
┌──────────┬────────────┬───────┬────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                           │
├──────────┼────────────┼───────┼────────────────────────────────┤
│   61.96% │   82.73 µs │     1 │ [copy pinned to device memory] │
└──────────┴────────────┴───────┴────────────────────────────────┘
Note: We're looking for this to read "copy pinned to device memory" instead of "copy pageable to device memory".
Moving data back the same way also works:
# we also check the backwards transfer
M_dst = zeros(eltype(M_hst), size(M_hst))
CUDA.pin(M_dst)
CUDA.@profile copyto!(M_dst, idx_hst, M_dev, idx_dev)
# yields
┌──────────┬────────────┬───────┬────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                           │
├──────────┼────────────┼───────┼────────────────────────────────┤
│   53.73% │    87.5 µs │     1 │ [copy device to pinned memory] │
└──────────┴────────────┴───────┴────────────────────────────────┘
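Putting the pieces together, the batched `D*M` example from earlier in the thread could be sketched with the CartesianIndices form of `copyto!`, so that every transfer goes through the pinned path. This is only a sketch under a few assumptions: the preallocated buffers `buf_in` and `buf_out`, the result matrix `R`, and the use of `mul!` from LinearAlgebra are my own choices, not something the posts above prescribe. The loop runs the batches one after another and does not by itself overlap transfers with compute.

```julia
using CUDA
using LinearAlgebra

D = CuMatrix(rand(1_000, 1_000))

# Pinned host input and output.
M = rand(1_000, 10 * 1024)
CUDA.pin(M)
R = zeros(size(M))
CUDA.pin(R)

# Reusable device buffers for one batch of columns (hypothetical names).
batch = 1024
buf_in = CuMatrix{Float64}(undef, size(M, 1), batch)
buf_out = CuMatrix{Float64}(undef, size(D, 1), batch)
idx_dev = CartesianIndices((axes(M, 1), 1:batch))

for cols in Iterators.partition(axes(M, 2), batch)
    idx_hst = CartesianIndices((axes(M, 1), cols))
    # Copy the batch straight out of the pinned matrix: no intermediate Array.
    copyto!(buf_in, idx_dev, M, idx_hst)
    # Multiply on the GPU into the preallocated output buffer.
    mul!(buf_out, D, buf_in)
    # Copy the result back into the matching columns of the pinned output.
    copyto!(R, idx_hst, buf_out, idx_dev)
end

# Wait for all queued GPU work before reading `R` on the CPU.
CUDA.synchronize()
```

The batch size divides the number of columns evenly here. A ragged final batch would need `idx_dev` and the buffers sized to match.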