Hi,

I have two questions about the CUDA.dot() function. My understanding is that CUDA.dot() performs the computation on the GPU and then returns the result as a scalar on the CPU. I measured the run time of CUDA.dot() and, for comparison, the time needed to copy a one-element array from the GPU to the CPU.

```
using CUDA
using BenchmarkTools
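a = CUDA.rand(Float64, 10000)
b = CUDA.rand(Float64, 10000)
gpuvalue = CuArray([1.0])  # one-element array, used to time a GPU-to-CPU copy
# Interpolate globals with `$` so that @btime does not also measure global-variable access
@btime CUDA.@sync value1 = CUDA.dot($a, $b)   # 68.30 μs
@btime CUDA.@sync value2 = Array($gpuvalue)   # 46.5 μs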
```

- The two measured times are close to each other. Can I assume that their difference (roughly 22 μs) is the time CUDA.dot() spends on the actual computation on the GPU? In other words, is most of the run time of CUDA.dot() in this case spent copying the result from the GPU to the CPU?
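
One way I thought of to check this is to time a reduction whose result stays on the GPU and compare it with the dot product. This is only a rough sketch: `gpu_only_dot` is a name I made up, it is a stand-in and not the kernel that the dot product actually runs, and I am assuming that `LinearAlgebra.dot` on two `CuArray`s does the same work as `CUDA.dot()` above.

```julia
using CUDA
using LinearAlgebra
using BenchmarkTools

# Stand-in reduction whose one-element result stays on the GPU (no copy to the CPU).
# Assumption: this is NOT the kernel that dot() runs, so it only gives a rough idea.
gpu_only_dot(x, y) = sum(x .* y; dims=1)

if CUDA.functional()  # skip on machines without a usable GPU
    a = CUDA.rand(Float64, 10000)
    b = CUDA.rand(Float64, 10000)
    @btime CUDA.@sync gpu_only_dot($a, $b)  # GPU work only, result stays on the device
    @btime CUDA.@sync dot($a, $b)           # GPU work plus transfer of the scalar
end
```

If the two timings turn out similar, the copy is probably not the dominant cost, and kernel launch and synchronization overhead would be the more likely explanation.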

The second question concerns the order of execution when using the CUDA.dot() function in more complex expressions.

```
using CUDA
using BenchmarkTools
a = CUDA.rand(Float64,10000)
b = CUDA.rand(Float64,10000)
c = CUDA.rand(Float64,10000)
d = CUDA.rand(Float64,10000)
value = CUDA.dot(a,b)/CUDA.dot(c,d)
```

- In the expression above, is CUDA.dot(a,b) computed first, then CUDA.dot(c,d), and finally the two scalar results divided to get the final value? Or are CUDA.dot(a,b) and CUDA.dot(c,d) computed in parallel, so that the numerator and the denominator are obtained concurrently before the division?
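
To make the two alternatives concrete, this is how I picture them. It is a sketch only: `ratio_sequential` and `ratio_tasks` are names I made up, I use `LinearAlgebra.dot` on the assumption that it does the same work as `CUDA.dot()`, and I am assuming that CUDA.jl gives each Julia task its own stream. Whether the two tasks really overlap on the GPU is exactly what I am unsure about.

```julia
using CUDA
using LinearAlgebra

# Alternative 1: the plain expression. As far as I understand, Julia evaluates
# the operands left to right, and each dot() returns a CPU scalar.
ratio_sequential(a, b, c, d) = dot(a, b) / dot(c, d)

# Alternative 2: one Julia task per dot product, so that the two calls may overlap.
function ratio_tasks(a, b, c, d)
    t1 = @async dot(a, b)
    t2 = @async dot(c, d)
    return fetch(t1) / fetch(t2)  # wait for both scalars, then divide on the CPU
end

if CUDA.functional()  # skip on machines without a usable GPU
    a, b, c, d = (CUDA.rand(Float64, 10000) for _ in 1:4)
    CUDA.synchronize()  # make sure the inputs are ready before other tasks use them
    @show ratio_sequential(a, b, c, d) ≈ ratio_tasks(a, b, c, d)
end
```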

Thanks!