# Peculiar GPU behavior: zero performance cost for math functions?

I have a use case where the performance bottleneck is in computing trigonometric functions. I thought it might benefit from more parallelism, and realized that having access to a university computing cluster makes this a good time to learn GPU computing for the first time.

I found that, surprisingly, not only do I benefit from the parallelism, but also that on the GPU the cost of computing trigonometric functions appears to be negligible compared to trivial arithmetic. See the MWE below, where `cos` takes approximately the same time as `x -> x^2` to map over on the GPU, but ~8 times as long on the CPU.

Could anyone shed some light on why that happens, and whether I can "backport" this behavior to my present CPU implementation? My one guess is that on the GPU the calculation is memory-bandwidth-limited in both cases, but the memory bandwidth for this device is 900 GB/s, and `(N*sizeof(eltype(x))) / (900 * 2^30)` predicts approximately 70 microseconds, so bandwidth doesn't seem to be the predominant bottleneck.
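For reference, here is that back-of-the-envelope estimate spelled out (a sketch using the figures above; note that an in-place `map!` both reads and writes the array, so the real traffic is roughly double):

```julia
# Back-of-envelope: time to stream N Float32 values once at 900 GiB/s.
N = 2^24                        # number of elements
bytes = N * sizeof(Float32)     # 64 MiB read per pass
bw = 900 * 2^30                 # 900 GiB/s device bandwidth
t_us = bytes / bw * 1e6         # one-way streaming time in microseconds
println(t_us)                   # ≈ 69 µs; read + write doubles this
```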

```julia
using CUDA, BenchmarkTools, Test

println(CUDA.name(device()))
#   Tesla V100-SXM2-32GB

N = 2^24;
x_d = CUDA.rand(N);
x = collect(x_d);

CUDA.@sync map!(cos, x_d, x_d); map!(cos, x, x);
println(isapprox(x, collect(x_d)))
#   true

@btime CUDA.@sync map!(x -> x^2, $(x_d), $(x_d));
#   196.627 μs (65 allocations: 2.95 KiB)
@btime map!(x -> x^2, $(x), $(x));
#   11.456 ms (0 allocations: 0 bytes)

@btime CUDA.@sync map!(cos, $(x_d), $(x_d));
#   203.606 μs (65 allocations: 2.95 KiB)   ---> basically the same as arithmetic
@btime map!(cos, $(x), $(x));
#   87.118 ms (0 allocations: 0 bytes)      ---> considerably slower than arithmetic
```

Try using `CUDA.@profile` to see more accurate timings of just the kernel. On my RTX A6000, I'm getting:

```julia
julia> f = x -> x^2

julia> CUDA.@profile map!(f, x_d, x_d)
Profiler ran for 300.17 µs, capturing 6 events.

Host-side activity: calling CUDA APIs took 234.37 µs (78.08% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name           │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────┤
│   77.68% │ 233.17 µs │     1 │ 233.17 µs │ 233.17 µs │ 233.17 µs │ cuLaunchKernel │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────┘

Device-side activity: GPU was busy for 30.52 µs (10.17% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬──────────────────────────────────────────────────────
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                                                ⋯
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼──────────────────────────────────────────────────────
│   10.17% │ 30.52 µs │     1 │ 30.52 µs │ 30.52 µs │ 30.52 µs │ _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Flo ⋯
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴──────────────────────────────────────────────────────
                                                                                                      1 column omitted

julia> CUDA.@profile map!(cos, x_d, x_d)
Profiler ran for 276.8 µs, capturing 6 events.

Host-side activity: calling CUDA APIs took 200.27 µs (72.35% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name           │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────┤
│   71.83% │ 198.84 µs │     1 │ 198.84 µs │ 198.84 µs │ 198.84 µs │ cuLaunchKernel │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────┘

Device-side activity: GPU was busy for 36.48 µs (13.18% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬──────────────────────────────────────────────────────
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                                                ⋯
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼──────────────────────────────────────────────────────
│   13.18% │ 36.48 µs │     1 │ 36.48 µs │ 36.48 µs │ 36.48 µs │ _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Flo ⋯
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴──────────────────────────────────────────────────────
```

i.e. about 20% slower when doing `cos` instead of squaring.


Interesting! My point is that those 20% are a far cry from the factor of ~8x on the CPU. In fact, I'm seeing an even smaller delta, around ~4%, on my setup. Thoughts? I'm having some issues pasting the formatted output from my remote session, so here are just the pertinent lines:

```julia
julia> CUDA.@profile map!(f, x_d, x_d);
Device-side activity: GPU was busy for 189.54 µs (23.23% of the trace)
| 23.23% | 189.54 µs | 1 | 189.54 µs | 189.54 µs | 189.54 µs | _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Float32...

julia> CUDA.@profile map!(cos, x_d, x_d);
Device-side activity: GPU was busy for 196.7 µs (24.09% of the trace)
| 24.09% | 196.7 µs | 1 | 196.7 µs | 196.7 µs | 196.7 µs | _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Float32...
```

The answer is generally latency hiding: the GPU can cheaply context-switch away from threads that are waiting for memory to other threads doing compute. But you're right that in this case the bandwidth numbers don't seem to make sense. The kernel should be reading/writing 128 MiB, which in 30 µs gives ~4 TB/s, while my device only does 960 GB/s. I'd need to take a closer look. Still, it's very likely that such a simple kernel is entirely bandwidth-bound, and thus ideally suited for hiding the latency of expensive arithmetic.
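As a quick sketch, the effective bandwidth implied by the profiled A6000 kernel time above works out to:

```julia
# Effective bandwidth implied by the measured device-side kernel time.
traffic = 2 * 2^24 * sizeof(Float32)   # read + write of 2^24 Float32s = 128 MiB
t_kernel = 30.52e-6                    # profiled device time, in seconds
bw_eff = traffic / t_kernel / 1e12     # effective bandwidth in TB/s
println(round(bw_eff; digits=2))       # ≈ 4.4 TB/s, well above the 0.96 TB/s spec
```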


To get full performance on a CPU you may need more than one thread to saturate memory bandwidth, plus some help for SIMD trig functions. I think the LoopVectorization package can help with both. (Paging @Oscar_Smith, who's been informative on related questions.)
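For the multithreading part, a minimal sketch using only the built-in `Threads` module (start Julia with, e.g., `julia -t 8`; `threaded_map!` is just a hypothetical helper name):

```julia
# Split the elementwise map across CPU threads to use more memory bandwidth.
function threaded_map!(f, y, x)
    Threads.@threads for i in eachindex(x, y)
        @inbounds y[i] = f(x[i])
    end
    return y
end

x = rand(Float32, 2^20)
y = similar(x)
threaded_map!(cos, y, x)
```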


You can probably get an extra factor of 2 or so by using LoopVectorization for the `cos` code (since it will use a vectorized `cos` implementation). Other than that, there's not much to do here. The difference between CPU and GPU here is that the GPU spends its time entirely moving memory around, while on the CPU the computation can also be a bottleneck. That said, on both you generally want to use a larger kernel than a single `cos` call, since the fewer passes over memory you make, the faster your code will (usually) be.
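A sketch along those lines, assuming LoopVectorization.jl is installed (`turbo_cos!` is a hypothetical helper name; `@turbo` substitutes a SIMD-friendly `cos` implementation for the scalar one):

```julia
using LoopVectorization  # third-party package, assumed installed

# @turbo vectorizes the loop and swaps in a SIMD cos implementation.
function turbo_cos!(y, x)
    @turbo for i in eachindex(x)
        y[i] = cos(x[i])
    end
    return y
end
```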