I have a use case where the performance bottleneck is computing trigonometric functions. I thought it might benefit from more parallelism, and realized that having access to a university computing cluster makes this a good time to learn GPU computing for the first time.
Surprisingly, I found that not only do I benefit from the parallelism, but also that on the GPU the cost of computing trigonometric functions seems to be negligible compared to trivial arithmetic. See the MWE below, where cos takes approximately the same time as x -> x^2 to map over on the GPU, but ~8 times as long on the CPU.
Could anyone shed any light on why that happens, and on whether I can maybe "backport" this behavior to my current CPU-implemented use case? My one guess is that on the GPU the calculation is memory-bandwidth-limited in both cases, but the memory bandwidth for this device is 900 GB/s, and (N*sizeof(eltype(x))) / (900 * 2^30) predicts approximately 70 microseconds, so bandwidth doesn't seem to be the predominant bottleneck.
using CUDA, BenchmarkTools, Test
println(CUDA.name(device()))
# Tesla V100-SXM2-32GB
N = 2^24;
x_d = CUDA.rand(N);
x = collect(x_d);
CUDA.@sync map!(cos, x_d, x_d); map!(cos, x, x);
println(isapprox(x, collect(x_d)))
# true
@btime CUDA.@sync map!(x -> x^2, $(x_d), $(x_d));
# 196.627 µs (65 allocations: 2.95 KiB)
@btime map!(x -> x^2, $(x), $(x));
# 11.456 ms (0 allocations: 0 bytes)
@btime CUDA.@sync map!(cos, $(x_d), $(x_d));
# 203.606 µs (65 allocations: 2.95 KiB) ---> basically the same as arithmetic
@btime map!(cos, $(x), $(x));
# 87.118 ms (0 allocations: 0 bytes) ---> considerably slower than arithmetic
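For reference, here is how the back-of-the-envelope bandwidth estimate from the question works out (a sketch, using the 900 GB/s figure quoted for the V100; note that an in-place map! both reads and writes every element, which roughly doubles the traffic):

```julia
N = 2^24
bytes = N * sizeof(Float32)   # 64 MiB streamed one way
t = bytes / (900 * 2^30)      # ≈ 6.9e-5 s, i.e. ~70 µs
# An in-place map! reads and writes each element, so ~140 µs is
# arguably the fairer lower bound for this kernel.
```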
Try using CUDA.@profile to see more accurate timings of just the kernel. On my RTX A6000, I'm getting:
julia> f = x -> x^2
julia> CUDA.@profile map!(f, (x_d), (x_d))
Profiler ran for 300.17 µs, capturing 6 events.
Host-side activity: calling CUDA APIs took 234.37 µs (78.08% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────┐
│ Time (%) │ Time      │ Calls │ Avg time  │ Min time  │ Max time  │ Name           │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────┤
│   77.68% │ 233.17 µs │     1 │ 233.17 µs │ 233.17 µs │ 233.17 µs │ cuLaunchKernel │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────┘
Device-side activity: GPU was busy for 30.52 µs (10.17% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬─────────────────────────────────────────────────────┐
│ Time (%) │ Time     │ Calls │ Avg time │ Min time │ Max time │ Name                                               ⋯
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼─────────────────────────────────────────────────────┤
│   10.17% │ 30.52 µs │     1 │ 30.52 µs │ 30.52 µs │ 30.52 µs │ _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Flo ⋯
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴─────────────────────────────────────────────────────┘
                                                                                                      1 column omitted
julia> CUDA.@profile map!(cos, (x_d), (x_d))
Profiler ran for 276.8 µs, capturing 6 events.
Host-side activity: calling CUDA APIs took 200.27 µs (72.35% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────┐
│ Time (%) │ Time      │ Calls │ Avg time  │ Min time  │ Max time  │ Name           │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────┤
│   71.83% │ 198.84 µs │     1 │ 198.84 µs │ 198.84 µs │ 198.84 µs │ cuLaunchKernel │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────┘
Device-side activity: GPU was busy for 36.48 µs (13.18% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬─────────────────────────────────────────────────────┐
│ Time (%) │ Time     │ Calls │ Avg time │ Min time │ Max time │ Name                                               ⋯
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼─────────────────────────────────────────────────────┤
│   13.18% │ 36.48 µs │     1 │ 36.48 µs │ 36.48 µs │ 36.48 µs │ _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Flo ⋯
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴─────────────────────────────────────────────────────┘
i.e. about 20% slower when doing cos instead of squaring.
Interesting! My point is that those 20% are a far cry from the factor of ~8x on the CPU. I think I'm seeing an even smaller delta, around ~4%, on my setup. Thoughts? I'm having some issues pasting the formatted output from my remote session, so here are just the pertinent lines:
julia> CUDA.@profile map!(f, x_d, x_d);
Device-side activity: GPU was busy for 189.54 µs (23.23% of the trace)
| 23.23% | 189.54 µs | 1 | 189.54 µs | 189.54 µs | 189.54 µs | _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Float32...
julia> CUDA.@profile map!(cos, x_d, x_d);
Device-side activity: GPU was busy for 196.7 µs (24.09% of the trace)
| 24.09% | 196.7 µs | 1 | 196.7 µs | 196.7 µs | 196.7 µs | _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Float3....
The answer is generally latency hiding: the GPU can cheaply context-switch away from threads that are waiting for memory to other threads doing compute. But you're right that in this case the bandwidth numbers don't seem to make sense. The kernel should be reading/writing 128 MiB, which in 30 µs gives about 4 TB/s, while my device only does 960 GB/s. I'd need to take a closer look. Still, it's very likely that such a simple kernel is entirely bandwidth-bound, and thus ideally suited for hiding the latency of expensive arithmetic.
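A quick sanity check on the numbers above (using the 30.52 µs device-side kernel time from the earlier profile):

```julia
N = 2^24
bytes = 2 * N * sizeof(Float32)   # in-place map!: read + write ≈ 128 MiB
t = 30.52e-6                      # measured kernel time
bytes / t / 1e9                   # ≈ 4400 GB/s, well above the ~960 GB/s spec
```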
To get full performance on a CPU you may need more than one thread to saturate memory bandwidth, and some help for SIMD trig functions. I think the LoopVectorization package can help with both. (Paging @Oscar_Smith, who's been informative on related questions.)
You can probably get an extra factor of 2 or so by using LoopVectorization for the cos code (since it will use a vectorized cos implementation). Other than that, there's not much to do here. The difference between CPU and GPU here is that the GPU spends its time entirely moving memory around, while on the CPU the computation can also be a bottleneck. That said, on both you generally want a larger kernel than a single cos call, since the fewer passes over memory you make, the faster your code will (usually) be.
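A minimal sketch of the LoopVectorization approach suggested above (untested here; assumes LoopVectorization.jl is installed, and that Julia was started with multiple threads for @tturbo to use):

```julia
using LoopVectorization

# SIMD (and, with @tturbo, multithreaded) elementwise cos.
function cos_lv!(y, x)
    @tturbo for i in eachindex(x)
        y[i] = cos(x[i])
    end
    return y
end

x = rand(Float32, 2^24)
y = similar(x)
cos_lv!(y, x)          # SIMD + threads
# vmap!(cos, y, x)     # single-threaded SIMD alternative exported by the package
```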