I am new to GPU computing and want to start with Julia.
I am using CUDA.jl and wrote some code following the instructions.
However, the following simple code takes a very long time to run:
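module TestMyCUDA

using CUDA  # needed for CuArray, @cuda, blockIdx(), etc.

function test()
    CUDA.memory_status()
    num_c = 100
    num_x = 512
    # copy a small host array to the GPU, converting Float64 -> Float32
    @time array1 = CuArray{Float32}(rand(Float64, num_c, 3))
    # allocate the lattice on the GPU
    @time array2 = CUDA.zeros(Complex{Float32}, num_x, num_x, num_x)
    # launch the kernel (no threads/blocks given, so the defaults are used)
    @time @cuda dosomething!(array2, array1, num_x)
    return nothing
end

function dosomething!(array2, array1, num_x)
    # grid-stride loop over all num_x^3 lattice sites
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for N = index:stride:num_x^3
        # convert the linear index N into 3D indices (i, j, k)
        i = (N - 1) ÷ num_x^2 + 1
        j = ((N - 1) % num_x^2) ÷ num_x + 1
        k = ((N - 1) % num_x^2) % num_x + 1
        array2[i, j, k] = 1.0 + 1.0im
    end
    return nothing
end

end

import .TestMyCUDA
TestMyCUDA.test()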
In this example code, I simply do two things: allocate two CuArrays and assign a value to every element of array2.
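The index arithmetic in the kernel can be checked on the CPU without a GPU. The following is only a sketch: `decompose` is a hypothetical helper name introduced here for illustration (it is not part of CUDA.jl), and `n = 4` is an arbitrary small size standing in for num_x.

```julia
# Plain-CPU check of the linear-index -> (i, j, k) arithmetic used in the kernel.
# `decompose` is a hypothetical helper for illustration, not a CUDA.jl function.
decompose(N, n) = ((N - 1) ÷ n^2 + 1,
                   ((N - 1) % n^2) ÷ n + 1,
                   ((N - 1) % n^2) % n + 1)

n = 4                                   # small stand-in for num_x
triples = [decompose(N, n) for N in 1:n^3]

# every lattice site is visited exactly once and stays in bounds
@assert length(unique(triples)) == n^3
@assert all(t -> all(x -> 1 <= x <= n, t), triples)

# with this mapping the last index k varies fastest ...
@assert decompose(2, n) == (1, 1, 2)
# ... whereas Julia arrays are column-major, so the first index varies fastest
@assert Tuple(CartesianIndices((n, n, n))[2]) == (2, 1, 1)
```

So the mapping covers the whole lattice, but consecutive values of N touch elements that are not adjacent in memory, which may matter for performance once many threads are used.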
It works as expected when I run it for the first time:
Effective GPU memory usage: 19.83% (3.172 GiB/16.000 GiB)
Memory pool usage: 2.000 GiB (2.031 GiB reserved)
0.000130 seconds (14 allocations: 4.141 KiB)
0.014412 seconds (851 allocations: 33.094 KiB, 82.98% compilation time)
0.371885 seconds (86.76 k allocations: 7.021 MiB, 4.08% gc time, 19.74% compilation time)
However, when I run it a second time, the first allocation takes a surprisingly long time:
Effective GPU memory usage: 26.08% (4.172 GiB/16.000 GiB)
Memory pool usage: 2.000 GiB (3.031 GiB reserved)
63.046641 seconds (15 allocations: 4.156 KiB)
0.000133 seconds (42 allocations: 1.766 KiB)
0.579747 seconds (67.56 k allocations: 5.685 MiB, 4.38% compilation time)
Interestingly, it is not always the allocation that takes a very long time.
If I insert some other code, the dosomething! kernel becomes very slow and the allocation returns to normal speed.
Does anyone know why this happens?
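One thing that may be worth checking: `@cuda` launches are asynchronous, and without `threads`/`blocks` keywords the kernel runs with a single thread in a single block. In that case `@time @cuda ...` would mostly measure the launch, and the real kernel time could show up later in whichever operation first has to wait for the GPU. Below is a minimal sketch of how the timing could be tested under that assumption. The names `fill_kernel!` and `timed_fill`, the default size 64, and `threads = 256` are illustrative choices, not taken from the code above.

```julia
using CUDA

# Grid-stride kernel similar to dosomething!, using linear indexing for brevity.
# `fill_kernel!` is a hypothetical name for this sketch.
function fill_kernel!(a, n)
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x
    for N = index:stride:n
        @inbounds a[N] = Complex(1.0f0, 1.0f0)  # Float32 literals avoid Float64 math on the GPU
    end
    return nothing
end

# `timed_fill` is a hypothetical helper; n_side and threads are assumed values.
function timed_fill(n_side = 64; threads = 256)
    a = CUDA.zeros(ComplexF32, n_side, n_side, n_side)
    n = length(a)
    blocks = cld(n, threads)  # enough blocks to cover every element once
    # CUDA.@sync waits for the GPU to finish, so @time includes the kernel run
    @time CUDA.@sync @cuda threads=threads blocks=blocks fill_kernel!(a, n)
    return a
end
```

Calling `timed_fill(512)` and comparing it with the same launch without `threads`/`blocks` should show whether the long delay really belongs to the kernel rather than to the allocation.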