What is the optimal way of updating CuArray?

cudanative

#1

Hi,
I observed an interesting behavior during the optimisation of my code with CUDAnative.jl
I have the following code

using CUDAnative
using CUDAdrv
for a=1:100 
    for b=1:100 
        X = ... # rather complicated computation of X::Vector{Int64}, length(X) ~ 5,000
        @time cuX = CuArray(X)
        ...
    end
end

I observed that the time spent on the operation cuX = CuArray(X) is not uniform: in the first few loop iterations it is fast, but afterwards it takes more than 10 s per call, i.e. about five orders of magnitude longer.

My question is: what would be the best way to update cuX?
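For reference, one pattern that avoids a fresh GPU allocation on every iteration is to build the device array once and overwrite it in-place with copyto!. A minimal sketch, assuming length(X) is the same across iterations (the rand call here is just a placeholder for the real computation of X):

```julia
using CUDAnative
using CUDAdrv

N = 5_000
cuX = CuArray(zeros(Int64, N))   # allocate the device buffer once, outside the loops

for a = 1:100
    for b = 1:100
        X = rand(Int64, N)       # placeholder for the real computation of X
        copyto!(cuX, X)          # overwrite in-place, reusing the existing allocation
        # ... kernel launches using cuX ...
    end
end
```

This turns the per-iteration cost into a single host-to-device upload instead of an allocation plus an upload.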


#2
count=0
  0.000004 seconds (8 allocations: 1.188 KiB)
  0.000027 seconds (3 allocations: 39.438 KiB)
Array{Int64,1}
  0.000409 seconds (4 allocations: 96 bytes)
 31.677854 seconds (21.81 M allocations: 1.069 GiB, 1.63% gc time)
count=1
  0.000005 seconds (8 allocations: 1.188 KiB)
  0.000028 seconds (3 allocations: 39.438 KiB)
Array{Int64,1}
  0.000074 seconds (4 allocations: 96 bytes)
  0.000045 seconds (92 allocations: 4.141 KiB)
count=2
  0.000002 seconds (8 allocations: 1.188 KiB)
  0.000019 seconds (3 allocations: 39.438 KiB)
Array{Int64,1}
 18.613582 seconds (4 allocations: 96 bytes)
  0.000038 seconds (92 allocations: 4.141 KiB)
... ...
count=36
  0.000004 seconds (8 allocations: 1.188 KiB)
  0.000026 seconds (3 allocations: 39.438 KiB)
Array{Int64,1}
 18.646781 seconds (4 allocations: 96 bytes)
  0.000050 seconds (92 allocations: 4.141 KiB)
count=37
  0.000004 seconds (8 allocations: 1.188 KiB)
  0.000024 seconds (3 allocations: 39.438 KiB)
Array{Int64,1}
 18.647137 seconds (4 allocations: 96 bytes)
  0.000055 seconds (92 allocations: 4.141 KiB)

The longest elapsed time is for the operation cuX = CuArray(X).


#3

copyto!(cu_X, X[:]) gives the same performance.


#4

I can’t reproduce this. CUDA’s malloc is known to be slow, which is why CuArrays.jl (the more user-friendly CUDA array type) has a pooling memory allocator that reuses previous allocations, but that isn’t the kind of slowdown I’d expect. Also, copyto! doesn’t allocate but just uploads memory, so this looks like a problem with your set-up…

I tried with:

using CUDAnative
using CUDAdrv
arrays = []
for a=1:100 
    for b=1:100 
        X = randn(5000)
        @time cuX = CuArray(X)
        push!(arrays, cuX) # to make sure the arrays aren't freed
    end
end

#5

One possible issue is that the GPU is still finishing your previous operation (GPU operations are generally asynchronous), so what you are measuring is just the wait for that previous operation to complete.
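To illustrate the effect, here is a sketch (the kernel name some_kernel! is hypothetical; CUDAdrv.synchronize() is assumed to be available, as suggested further down the thread). Since @cuda returns as soon as the kernel is queued, the next timed statement can end up absorbing the kernel’s execution time:

```julia
using CUDAnative, CUDAdrv

# @cuda returns immediately after queueing the kernel, so...
@time @cuda threads=1024 some_kernel!(args...)   # looks instant
@time cuX = CuArray(X)                           # may absorb the kernel's run time

# An explicit synchronization makes the cost show up where it is incurred:
@time begin
    @cuda threads=1024 some_kernel!(args...)
    CUDAdrv.synchronize()                        # block until the kernel finishes
end
@time cuX = CuArray(X)                           # now reflects just the upload
```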


#6

Hi Tim,
thanks for the quick response!
I also tried this myself and reached the same conclusion.
Actually, the operation mentioned earlier is followed by further CUDAnative calls. Here is the entire piece of code (it is a frequency sum of two Green’s functions :stuck_out_tongue: ):

            count = 0
            for (iq,q_) in enumerate(Π_MKgrid.GRID[:])
                for (iω,ω_) in enumerate(Π_ω)
                    println("count=",count)
                    t0 = time_ns()
                    #----------------------BLOCK 1------------------------------
                    t0 = time_ns()
                    ωstart = (ω_+G2.ω[1]) - G2ext.ω[1] + 1
                    ωend = (ω_+G2.ω[end]) - G2ext.ω[1] + 1
                    IND = indG2ext[ωstart:ωend,transformed_index(q_,G1.K)][:]
                    t1 = time_ns()
                    #----------------------BLOCK 2------------------------------
                    t1 = time_ns()
                    cu_Ind2 = CuArray(IND)        # THE TROUBLESOME OPERATION
                    t2 = time_ns()
                    #----------------------BLOCK 3------------------------------
                    t2 = time_ns()
                    println("time1 = ", (t1-t0)/1e6, "ms")
                    println("time2 = ", (t2-t1)/1e6, "ms")
                    t2 = time_ns()
                    #----------------------BLOCK 4------------------------------
                    t2 = time_ns()
                    @cuda threads=(1024,1,1) blocks=(NG,dimX,dimY) shmem=1024 TrAXBY_CUDA!(cuTMP,cu_G1.ωk,cu_G2ext.ωk,cu_Ind2, IV1,JV1,VV1,IV2,JV2,VV2,CTRL_PARAM)
                    t3 = time_ns()
                    #----------------------BLOCK 5------------------------------
                    t3 = time_ns()
                    println("time3 = ", (t3-t2)/1e6, "ms")
                    t3 = time_ns()
                    #----------------------BLOCK 6------------------------------
                    t3 = time_ns()
                    @cuda threads=(1024,1) blocks=(dimX,dimY) shmem=1024 REDUCE_SPECIAL_VER!(iω,iq,NG,dimX,dimY,cuTMP,Π_ωk)
                    t4 = time_ns()
                    t4 = time_ns()
                    println("time4 = ", (t4-t3)/1e6, "ms")
                    println("------------------------------")
                    count += 1
                end
            end

It generates a steady timing profile as the following

count=0
time1 = 0.025915ms
time2 = 0.41788ms
time3 = 13158.3349ms
time4 = 18897.119132ms
------------------------------
count=1
time1 = 0.028788ms
time2 = 3.650909ms
time3 = 0.032666ms
time4 = 0.015446ms
------------------------------
count=2
time1 = 0.017196ms
time2 = 18585.719301ms
time3 = 0.038298ms
time4 = 0.0162ms
------------------------------
count=3
time1 = 0.025236ms
time2 = 18517.259823ms
time3 = 0.035288ms
time4 = 0.015438ms
------------------------------
count=4
time1 = 0.025769ms
time2 = 18517.42081ms
time3 = 0.035106ms
time4 = 0.01575ms
------------------------------
...
...

What troubles me is the line marked # THE TROUBLESOME OPERATION, which corresponds to the timing output time2 = ... ms.

I over-simplified the problem in the original question. You can see that it is anomalous: after a few loops, time2 always reaches ~18,000 ms, which is at least five orders of magnitude longer than expected.

I suspect that my timing method is wrong.

But, somehow magically, if I comment out the @cuda ... call in BLOCK 4 (an essential calculation that follows # THE TROUBLESOME OPERATION and takes its result cu_Ind2 as input), the timing of # THE TROUBLESOME OPERATION returns to a normal elapsed time (I haven’t copied that timing profile here).

I am so confused …


#7

Haven’t looked at your code in detail yet, but you can try adding some CUDAdrv.synchronize() calls to synchronize the GPU; those will “soak up” the time of the asynchronously running kernels. Also, if you run under nvprof you’ll see the actual execution times of the kernels on the GPU.
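For example, nvprof can be run as a wrapper around the Julia process (the script name here is hypothetical):

```shell
# Summary of GPU-side activity (kernels and memcpys), free of
# host-side timing artifacts:
nvprof julia my_script.jl

# Per-launch timeline of every kernel and memcpy:
nvprof --print-gpu-trace julia my_script.jl
```

The GPU trace shows when each kernel actually started and how long it ran, which makes it easy to spot time being misattributed to the next host-side operation.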


#8

CUDAdrv.synchronize() helps! Indeed it is an effect of asynchronous execution. @Keno is right: the operations hadn’t actually finished when they appeared to be done. I inserted CUDAdrv.synchronize() calls between the timing functions and now get understandable timing profiles. Thank you very much @maleadt @Keno!!