Questions about CUDA.dot() function

Hi,
I have two questions about the CUDA.dot() function. My understanding is that CUDA.dot() performs the computation on the GPU and then returns a CPU scalar. I measured the running time of CUDA.dot() and the time it takes to copy data from the GPU to the CPU.

using CUDA
using BenchmarkTools

a = CUDA.rand(Float64,10000)
b = CUDA.rand(Float64,10000)
gpuvalue = CuArray([1.0])

@btime CUDA.@sync value1 = CUDA.dot(a,b)  #68.30us
@btime CUDA.@sync value2 = Array(gpuvalue)  #46.5us
  • The two measured times are very close. Can I assume that the difference between them is the time CUDA.dot() spends on the actual GPU computation? In other words, in this case, is most of CUDA.dot()'s time spent copying the result from the GPU to the CPU?

The second question concerns the order of execution when using the CUDA.dot() function in more complex expressions.

using CUDA
using BenchmarkTools

a = CUDA.rand(Float64,10000)
b = CUDA.rand(Float64,10000)
c = CUDA.rand(Float64,10000)
d = CUDA.rand(Float64,10000)

value = CUDA.dot(a,b)/CUDA.dot(c,d)
  • In the above expression, does it first compute CUDA.dot(a,b), then compute CUDA.dot(c,d), and finally divide the two scalar results to get the final value? Or are CUDA.dot(a,b) and CUDA.dot(c,d) computed in parallel, i.e. the scalar values for the numerator and denominator are obtained concurrently before the division?

Thanks!

Not necessarily: part of the Array(gpuvalue) execution time may be Julia overhead that can happen in parallel with GPU computations. The best way to figure that out is to use Nsight Compute.
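
If you do not want to reach for Nsight right away, a first impression of how the time splits between host-side overhead and the device kernel can come from CUDA.jl's integrated profiler. A minimal sketch, assuming a recent CUDA.jl version where CUDA.@profile provides the integrated profiler:

using CUDA

a = CUDA.rand(Float64, 10000)
b = CUDA.rand(Float64, 10000)

# Warm up so compilation does not end up in the trace.
CUDA.dot(a, b)

# Reports host-side API calls and device kernel times separately,
# which helps attribute the measured ~68 µs.
CUDA.@profile CUDA.dot(a, b)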

They will not; within a Julia task, all CUDA operations are executed on the same stream and are therefore serialized. If you want operations to overlap, you should use separate tasks (e.g., you could wrap each operation in an @async task and fetch their results, provided you first synchronize after producing the inputs).
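
As a minimal sketch of that suggestion (assuming a CUDA.jl version where each Julia task runs on its own stream), the two dot products could be launched from separate tasks like this:

using CUDA

a = CUDA.rand(Float64, 10000)
b = CUDA.rand(Float64, 10000)
c = CUDA.rand(Float64, 10000)
d = CUDA.rand(Float64, 10000)

# Make sure the inputs are fully produced before other tasks start using them.
CUDA.synchronize()

# Each task gets its own stream, so the two dot kernels may overlap on the GPU.
t1 = @async CUDA.dot(a, b)
t2 = @async CUDA.dot(c, d)

value = fetch(t1) / fetch(t2)

Whether this actually pays off depends on the problem size; for vectors of length 10000 the kernels are so short that launch and task overhead can easily dominate.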


Thank you for your answer!

Hi, I tried to follow this method to overlap the two dot operations (function P below) and compared it against the single-stream version (function Q). I ran each of them 1000 times.

function Q(a, b, c)
    d1 = CUDA.dot(a, b)/CUDA.dot(a, c)

    return d1
end

function P(a, b, c)

    t1 = CUDA.@async begin
        CUDA.dot(a, b)
    end
    t3 = CUDA.@async begin
        CUDA.dot(a, c)
    end

    return fetch(t1)/fetch(t3)
end

The result given by @time is:

  0.052379 seconds (11.00 k allocations: 171.875 KiB)
  0.633777 seconds (85.01 k allocations: 4.734 MiB, 3.90% gc time, 17.16% compilation time)

which shows that P is much slower… Do you know the reason? Thank you very much!

dot requires a reduction, so even if we split the computation across many threads, in the end we need to sum the partial sums computed by each of them. Parallelism is limited in this operation, which makes it a bottleneck in many linear-algebra codes on the GPU.
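
For intuition, dot can be expressed as a map followed by a reduction. A minimal sketch (the two-array mapreduce method should work on CuArrays in recent CUDA.jl/GPUArrays versions; otherwise sum(x .* y) expresses the same thing at the cost of a temporary array):

using CUDA

a = CUDA.rand(Float64, 10000)
b = CUDA.rand(Float64, 10000)

# Elementwise products (the "map") followed by a tree reduction with +.
# The final stages of the reduction combine only a few partial sums,
# so they cannot keep the whole GPU busy.
mydot(x, y) = mapreduce(*, +, x, y)

mydot(a, b) ≈ CUDA.dot(a, b)   # sanity check against the CUBLAS result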