Hi everyone,
I’m running into an issue where the first copy of data from the device to the host after a kernel launch or allocation is extremely slow. This happens every time I run the function, not just on the first run, so I don’t think it’s related to JIT compilation.
I’m using an A100 GPU, which should theoretically provide around 32 GiB/s of bandwidth. I understand that performance might be lower because I’m using pageable memory instead of pinned memory, but I’m only seeing about 1.5 GiB/s, which seems unusually low. Below is the relevant part of the code, with the measured time for each part given in the comments. These times are about the same every time I test it.
#allocate array
result_complex = Matrix{ComplexF64}(undef, 38402, 38402)
complex_array = KernelAbstractions.allocate(backend, ComplexF64, size(result_complex))
#test performance after allocation
GiB = prod(size(complex_array)) * sizeof(ComplexF64) / 2^30
time_to_transfer_with_copy = @elapsed begin
copyto!(result_complex, complex_array)
end
@show time_to_transfer_with_copy #time_to_transfer_with_copy = 13.971949017
println("GiB/s = ", GiB / time_to_transfer_with_copy) #GiB/s = 1.572790245743567
KernelAbstractions.synchronize(backend)
time_to_transfer_with_copy = @elapsed begin
copyto!(result_complex, complex_array)
end
@show time_to_transfer_with_copy #time_to_transfer_with_copy = 1.51503066
println("GiB/s = ", GiB / time_to_transfer_with_copy) #GiB/s = 14.504620736826553
#other part of my code where results get stored in gpu_array
...
#part to get data to the CPU
make_complex(backend)(complex_array, gpu_array, ndrange = (numfunctions(test_functions), numfunctions(trial_functions)))
KernelAbstractions.synchronize(backend)
time_to_transfer_with_copy = @elapsed begin
copyto!(result_complex, complex_array)
end
@show time_to_transfer_with_copy #time_to_transfer_with_copy = 14.026707694
println("GiB/s = ", GiB / time_to_transfer_with_copy) # GiB/s = 1.5666502508898734
KernelAbstractions.synchronize(backend)
time_to_transfer_with_copy = @elapsed begin
copyto!(result_complex, complex_array)
end
@show time_to_transfer_with_copy #time_to_transfer_with_copy = 1.496582015
println("GiB/s = ", GiB / time_to_transfer_with_copy) #GiB/s ≈ 14.68 (computed from the time above)
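For anyone who wants to reproduce the timing pattern without my full code, here is a minimal sketch of how I measure repeated copies. `measure_copy` is a hypothetical helper name that is not in my actual code, and the example uses plain CPU arrays as a stand-in so it runs without a GPU. With a device array as `src` you would presumably want a `KernelAbstractions.synchronize(backend)` before the timing, as in the code above.

```julia
# Hypothetical helper (not part of my actual code): times `copyto!(dst, src)`
# `repeats` times and returns the elapsed seconds and the GiB/s of each attempt.
function measure_copy(dst, src; repeats = 3)
    gib = length(src) * sizeof(eltype(src)) / 2^30
    times = [@elapsed(copyto!(dst, src)) for _ in 1:repeats]
    return times, gib ./ times
end

# CPU-only stand-in for the device array, kept small so it runs anywhere.
src = rand(ComplexF64, 1024, 1024)
dst = similar(src)

times, rates = measure_copy(dst, src)
for (i, (t, r)) in enumerate(zip(times, rates))
    println("attempt ", i, ": ", t, " s, ", r, " GiB/s")
end
```

Comparing the first attempt with the later ones shows whether the slowdown is a one-off cost per allocation or kernel launch, which is the pattern I see above.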
Some extra info that might be relevant: I’m running this on a cluster node (bare metal). CPU and memory usage on the node might be high, but no one else is using the GPU. The problem still occurs even when few people are using the node.
Besides this issue, the first kernel that is called takes around 10 seconds extra to execute. This might not be related and is only a minor issue, but if you know why this happens, any info is appreciated.