Unusually Slow First Device-to-Host Copy on A100 GPU

Hi everyone,

I’m running into an issue where the first attempt to copy data from the device to the host is extremely slow. This problem persists even after running the function multiple times, so I don’t think it’s related to JIT compilation.

I’m using an A100 GPU, which should theoretically provide around 32 GB/s over its PCIe 4.0 x16 link. While I understand performance will be lower with pageable memory instead of pinned memory, I’m only seeing about 1.5 GiB/s, which seems unusually low. Below is the relevant part of the code, with timings for the different parts. These timings are about the same every time I test it.

# allocate the host and device arrays
result_complex = Matrix{ComplexF64}(undef, 38402, 38402)
complex_array = KernelAbstractions.allocate(backend, ComplexF64, size(result_complex))

# test transfer performance after allocation
GiB = prod(size(complex_array)) * sizeof(ComplexF64) / 2^30
time_to_transfer_with_copy = @elapsed begin
    copyto!(result_complex, complex_array)
end
@show time_to_transfer_with_copy  # time_to_transfer_with_copy = 13.971949017
println("GiB/s = ", GiB / time_to_transfer_with_copy)  # GiB/s = 1.572790245743567

KernelAbstractions.synchronize(backend)
time_to_transfer_with_copy = @elapsed begin
    copyto!(result_complex, complex_array)
end
@show time_to_transfer_with_copy  # time_to_transfer_with_copy = 1.51503066
println("GiB/s = ", GiB / time_to_transfer_with_copy)  # GiB/s = 14.504620736826553

# other part of my code where results get stored in gpu_array
...

# part to get the data to the CPU
make_complex(backend)(complex_array, gpu_array, ndrange = (numfunctions(test_functions), numfunctions(trial_functions)))
KernelAbstractions.synchronize(backend)
time_to_transfer_with_copy = @elapsed begin
    copyto!(result_complex, complex_array)
end
@show time_to_transfer_with_copy  # time_to_transfer_with_copy = 14.026707694
println("GiB/s = ", GiB / time_to_transfer_with_copy)  # GiB/s = 1.5666502508898734

KernelAbstractions.synchronize(backend)
time_to_transfer_with_copy = @elapsed begin
    copyto!(result_complex, complex_array)
end
@show time_to_transfer_with_copy  # time_to_transfer_with_copy = 1.496582015
println("GiB/s = ", GiB / time_to_transfer_with_copy)  # GiB/s ≈ 14.68

Some extra info that might be relevant: I’m running this on a cluster node (bare metal). There can be high CPU and memory usage from other users, but no one else is using the GPU. And even when the node is mostly idle, this still happens.

Besides this issue, the first kernel that is called takes around 10 seconds extra to execute. This might not be related and is only a minor issue, but if you know why this happens, any info is appreciated.

Try running under the CUDA profiler, CUDA.@profile trace=true ..., to see if you can confirm the slower speeds are actually happening, or whether something else is taking up the time.
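
For example, something like this (a minimal sketch, using the buffers and backend from your snippet):

using CUDA, KernelAbstractions

KernelAbstractions.synchronize(backend)  # make sure no kernel work is still queued
CUDA.@profile trace=true begin
    copyto!(result_complex, complex_array)  # the copy under investigation
end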

I ran with the profiler as you suggested and got the following results.

result_profile = Profiler ran for 15.8 s, capturing 26 events.

Host-side activity: calling CUDA APIs took 15.75 s (99.65% of the trace)
┌────┬──────────┬─────────┬─────────────────────┐
│ ID │ Start    │ Time    │ Name                │
├────┼──────────┼─────────┼─────────────────────┤
│  7 │ 54.8 ms  │ 4.05 µs │ cuStreamSynchronize │
│ 16 │ 54.81 ms │ 15.75 s │ cuMemcpyDtoHAsync   │
│ 22 │ 15.8 s   │ 9.3 µs  │ cuStreamSynchronize │
└────┴──────────┴─────────┴─────────────────────┘

Device-side activity: GPU was busy for 15.75 s (99.65% of the trace)
┌────┬──────────┬─────────┬────────────┬─────────────┬──────────────────────────────────┐
│ ID │ Start    │ Time    │ Size       │ Throughput  │ Name                             │
├────┼──────────┼─────────┼────────────┼─────────────┼──────────────────────────────────┤
│ 16 │ 54.84 ms │ 15.75 s │ 21.975 GiB │ 1.395 GiB/s │ [copy device to pageable memory] │
└────┴──────────┴─────────┴────────────┴─────────────┴──────────────────────────────────┘

result_profile = Profiler ran for 1.51 s, capturing 26 events.

Host-side activity: calling CUDA APIs took 1.51 s (100.00% of the trace)
┌────┬──────────┬───────────┬─────────────────────┐
│ ID │ Start    │ Time      │ Name                │
├────┼──────────┼───────────┼─────────────────────┤
│  7 │ 7.15 µs  │ 953.67 ns │ cuStreamSynchronize │
│ 16 │ 12.64 µs │ 1.51 s    │ cuMemcpyDtoHAsync   │
│ 22 │ 1.51 s   │ 5.96 µs   │ cuStreamSynchronize │
└────┴──────────┴───────────┴─────────────────────┘

Device-side activity: GPU was busy for 1.51 s (99.99% of the trace)
┌────┬──────────┬────────┬────────────┬──────────────┬──────────────────────────────────┐
│ ID │ Start    │ Time   │ Size       │ Throughput   │ Name                             │
├────┼──────────┼────────┼────────────┼──────────────┼──────────────────────────────────┤
│ 16 │ 28.37 µs │ 1.51 s │ 21.975 GiB │ 14.539 GiB/s │ [copy device to pageable memory] │
└────┴──────────┴────────┴────────────┴──────────────┴──────────────────────────────────┘

I also used NVIDIA Nsight Systems to see if it would give any additional information. First, the slow copy:
Begins: 78.843s
Ends: 94.8592s (+16.016 s)
DtoH memcpy 23,595,417,664 bytes
Source memory kind: Device
Destination memory kind: Pageable
Throughput: 1.37205 GiB/s
Launched from thread: 2284806
Latency: ←31.788 μs
Correlation ID: 3727
Stream: Stream 14

Call to cuMemcpyDtoHAsync
■ Memory copies
Begins: 78.843s
Ends: 94.8599s (+16.017 s)
Return value: 0
GPU: 0000:61:00.0 - NVIDIA A100 80GB PCIe
Stream: 14
Latency: 31.788 μs→
Correlation ID: 3727

And the second, faster copy:

Begins: 94.9076s
Ends: 97.4659s (+2.558 s)
DtoH memcpy 23,595,417,664 bytes
Source memory kind: Device
Destination memory kind: Pageable
Throughput: 8.58962 GiB/s
Launched from thread: 2284806
Latency: ←31.940 μs
Correlation ID: 3747
Stream: Stream 14

Call to cuMemcpyDtoHAsync
■ Memory copies
Begins: 94.9076s
Ends: 97.466s (+2.558 s)
Return value: 0
GPU: 0000:61:00.0 - NVIDIA A100 80GB PCIe
Stream: 14
Latency: 31.940 μs→
Correlation ID: 3747

That’s very bizarre; even though the actual copy is much slower, the time it takes to complete is dwarfed by the time cuMemcpyDtoHAsync blocks (50 ms ≪ 15 s). Are you perhaps doing anything concurrently? Even though the API is suffixed Async, it exhibits synchronous behavior in the way you’re invoking it, see CUDA Driver API :: CUDA Toolkit Documentation. It could be that the initial allocation of a non-pageable staging buffer takes long, but I’m not sure why; I haven’t run into this.

Everything is finished on the GPU side; this is run to collect all the results of my calculation. (Though, as mentioned before, the CPU may be in use by other users.)

I tried using the pin function from the CUDA package to do this manually and got similar behavior: it took 10 seconds to allocate and 1.7 seconds to transfer the results. Though I pin the entire array, which is not a true equivalent.
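
Roughly what I did (a minimal sketch, not my exact code):

using CUDA

result_complex = Matrix{ComplexF64}(undef, 38402, 38402)  # this allocation took ~10 s
CUDA.pin(result_complex)                # register the whole array as non-pageable
copyto!(result_complex, complex_array)  # the transfer itself then took ~1.7 s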

By using multiple threads and pinned buffers I get 1.8 seconds to allocate and 2.79 seconds to transfer. That is still only a throughput of 7.86 GiB/s, but it will have to suffice (it is no longer the major bottleneck).
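
Roughly the approach (a sketch with a hypothetical threaded_copy! helper, not my exact code; Julia needs to be started with multiple threads, e.g. julia -t 8):

using CUDA
using Base.Threads: @spawn, nthreads

# copy dev to host in nchunks contiguous slices, one task (and therefore
# one CUDA stream) per slice
function threaded_copy!(host::Matrix{T}, dev; nchunks = nthreads()) where {T}
    CUDA.pin(host)  # register the destination as non-pageable (ideally once, up front)
    len = length(host)
    step = cld(len, nchunks)
    @sync for c in 1:nchunks
        offs = (c - 1) * step + 1
        n = min(step, len - offs + 1)
        n > 0 && @spawn copyto!(host, offs, dev, offs, n)
    end
    return host
end

threaded_copy!(result_complex, complex_array)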

I’ll be going forward with pinned memory, since it performs better.
Thank you for your answers.

It took 10 seconds to pin, or to allocate the array? A simple allocation should never take that long. It’s expected, though, that pin + copy takes as long as an unpinned copy, because that’s what CUDA is doing in the background. The advantage (among other things) is that you can keep the pinned buffer around, or reuse it, to avoid paying the cost multiple times.
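
For example (a sketch; niterations and the kernel step are placeholders):

using CUDA

buf = Matrix{ComplexF64}(undef, size(complex_array))
CUDA.pin(buf)                    # pay the pinning cost once, up front
for i in 1:niterations           # hypothetical loop over batches of results
    # ... kernels fill complex_array ...
    copyto!(buf, complex_array)  # every later copy takes the fast pinned path
end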

The 10 seconds was just for allocating the memory. It was fairly consistent, so I don’t think it was waiting for resources to become available.