Questions about CUDA.dot() function

fred_wu · May 24, 2023, 8:51am

Hi,
I have two questions about the CUDA.dot() function. My understanding is that CUDA.dot() performs the calculation on the GPU and then returns a CPU scalar. I measured the running time of CUDA.dot() and the time needed to copy a single value from the GPU to the CPU.

using CUDA
using BenchmarkTools

a = CUDA.rand(Float64,10000)
b = CUDA.rand(Float64,10000)
gpuvalue = CuArray([1.0])

@btime CUDA.@sync value1 = CUDA.dot(a,b)  #68.30us
@btime CUDA.@sync value2 = Array(gpuvalue)  #46.5us

The two measured times are close. Can I assume that the difference between them is the time CUDA.dot() spends computing on the GPU? In other words, is most of the running time of CUDA.dot() in this case spent copying the result from the GPU to the CPU?

The second question concerns the order of execution when using the CUDA.dot() function in more complex expressions.

using CUDA
using BenchmarkTools

a = CUDA.rand(Float64,10000)
b = CUDA.rand(Float64,10000)
c = CUDA.rand(Float64,10000)
d = CUDA.rand(Float64,10000)

value = CUDA.dot(a,b)/CUDA.dot(c,d)

In the expression above, is CUDA.dot(a,b) calculated first, then CUDA.dot(c,d), and finally the two scalar results divided to get the final result? Or are CUDA.dot(a,b) and CUDA.dot(c,d) computed in parallel, so that the scalar values for the numerator and the denominator are obtained concurrently before the division?

Thanks!

maleadt · May 24, 2023, 10:28am

Not necessarily, as part of the Array(gpuvalue) execution time may be due to Julia overhead that could happen in parallel with GPU computations. The best way to figure that out is to use Nsight Compute.

They will not be performed in parallel; within a Julia task, all CUDA operations are executed on the same stream, and are thus executed serially. If you want operations to overlap, you should use separate tasks (e.g., you could wrap each operation in an @async and fetch their results, on the condition that you first synchronize after producing the inputs).
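A minimal sketch of the task-based approach described above, assuming a working CUDA device. The array sizes are taken from the question; calling `dot` through `LinearAlgebra` rather than as `CUDA.dot` is a choice made for this sketch. Whether the two operations actually overlap on a given GPU is not guaranteed, and for small vectors the task overhead may outweigh any gain.

```julia
using CUDA
using LinearAlgebra: dot

a = CUDA.rand(Float64, 10_000)
b = CUDA.rand(Float64, 10_000)
c = CUDA.rand(Float64, 10_000)
d = CUDA.rand(Float64, 10_000)

# The inputs were produced on the current task's stream. Synchronize first so
# that they are complete before other tasks start reading them.
CUDA.synchronize()

# Each Julia task runs its CUDA operations independently of the other task,
# so the two dot products are no longer forced to run one after the other.
t1 = @async dot(a, b)
t2 = @async dot(c, d)

# fetch waits for each task and returns its CPU scalar.
value = fetch(t1) / fetch(t2)
```

The division itself happens on the CPU, after both tasks have returned their scalars.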

fred_wu · May 25, 2023, 11:16am

Thank you for your answer!

Lhongpei · August 25, 2024, 7:07am

Hi, I tried to follow this method to overlap two dot operations (function P) and compared it with a single-stream version (function Q). I ran each of them 1000 times.

function Q(a, b, c)
    d1 = CUDA.dot(a, b)/CUDA.dot(a, c)

    return d1
end

function P(a, b, c)

    t1 = CUDA.@async begin
        CUDA.dot(a, b)
    end
    t3 = CUDA.@async begin
        CUDA.dot(a, c)
    end

    return fetch(t1)/fetch(t3)
end

The result given by @time is:

  0.052379 seconds (11.00 k allocations: 171.875 KiB)
  0.633777 seconds (85.01 k allocations: 4.734 MiB, 3.90% gc time, 17.16% compilation time)

This shows that P is really slow. Do you know the reason? Thank you very much!
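The @time output above reports that part of the second measurement is compilation time, so the two numbers are not directly comparable. A sketch of one way to measure both functions after a warm-up call, assuming P and Q behave as defined above (they are restated here with `LinearAlgebra.dot` and plain `@async` so the block is self-contained) and assuming vectors of length 10000, which the post does not specify:

```julia
using CUDA
using BenchmarkTools
using LinearAlgebra: dot

# Single-task version: both dot products are issued from the same task.
Q(a, b, c) = dot(a, b) / dot(a, c)

# Two-task version: each dot product is issued from its own task.
function P(a, b, c)
    t1 = @async dot(a, b)
    t3 = @async dot(a, c)
    return fetch(t1) / fetch(t3)
end

a = CUDA.rand(Float64, 10_000)  # length is an assumption for this sketch
b = CUDA.rand(Float64, 10_000)
c = CUDA.rand(Float64, 10_000)
CUDA.synchronize()

# Warm-up calls, so compilation is not included in the measurements below.
Q(a, b, c)
P(a, b, c)

# Interpolate the global arrays with $ so that only the call itself is timed.
@btime Q($a, $b, $c)
@btime P($a, $b, $c)
```

This only removes compilation from the comparison; it does not by itself explain any remaining gap between P and Q.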

amontoison · August 25, 2024, 5:05pm

dot requires a reduction, so even if we split the computation across different threads, in the end we still need to sum the partial sums computed by each of them. Parallelism is limited for this operation, which makes it a bottleneck in many linear algebra codes on GPU.
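The two-stage structure described here can be illustrated on the CPU. This is only a sketch of the idea, not how CUBLAS implements dot; the function name `chunked_dot` and the `nchunks` keyword are hypothetical names introduced for the illustration.

```julia
# CPU illustration of a dot product computed as a reduction.
function chunked_dot(x::AbstractVector, y::AbstractVector; nchunks::Int = 4)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have the same length"))
    n = length(x)
    T = promote_type(eltype(x), eltype(y))

    # Stage 1: each chunk computes its own partial sum independently,
    # so this part can run in parallel.
    tasks = map(1:nchunks) do k
        lo = div((k - 1) * n, nchunks) + 1
        hi = div(k * n, nchunks)
        Threads.@spawn begin
            s = zero(T)
            for i in lo:hi
                s += x[i] * y[i]
            end
            s
        end
    end

    # Stage 2: the partial sums have to be combined into one scalar.
    # This step waits for every chunk and cannot be split up the same way.
    return sum(fetch.(tasks))
end

x = collect(1.0:8.0)
y = ones(8)
result = chunked_dot(x, y)
```

However many chunks are used in the first stage, the final combination still has to wait for all of them, which is the limit on parallelism mentioned above.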
