Peculiar GPU behavior: zero performance cost for math functions?

I have a use case where the performance bottleneck is computing trigonometric functions. I thought it might benefit from more parallelism, and realized that having access to a university computing cluster makes this a good time to learn GPU computing for the first time :slight_smile:.

I found, surprisingly, that not only do I get the expected benefit from parallelism, but on the GPU the cost of computing trigonometric functions also appears to be negligible compared to trivial arithmetic. See the MWE below, where mapping cos over the array takes approximately the same time as x -> x^2 on the GPU, but ~8 times as long on the CPU.

Could anyone shed some light on why that happens, and whether I can "backport" this behavior to my present CPU implementation? My one guess is that on the GPU the calculation is memory-bandwidth-limited in both cases, but the memory bandwidth of this device is 900 GB/s, and (N*sizeof(eltype(x))) / (900 * 2^30) predicts approximately 70 microseconds, so bandwidth doesn't seem to be the predominant bottleneck.
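
For concreteness, here is that back-of-the-envelope estimate written out (just a sketch restating the numbers above):

N = 2^24
bytes_per_pass = N * sizeof(Float32)   # one read of the array, ≈ 64 MiB
bytes_per_pass / (900 * 2^30)          # ≈ 7.0e-5 s, i.e. ~70 µs at 900 GB/s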

using CUDA, BenchmarkTools, Test

println(CUDA.name(device()))
# 	Tesla V100-SXM2-32GB

N = 2^24;
x_d = CUDA.rand(N) ;
x = collect(x_d);

CUDA.@sync map!(cos, x_d, x_d); map!(cos, x, x);
println(isapprox(x, collect(x_d)))
# 	true

@btime  CUDA.@sync map!(x -> x^2, $(x_d), $(x_d));
#   196.627 μs (65 allocations: 2.95 KiB)
@btime  map!(x -> x^2, $(x), $(x));
#   11.456 ms (0 allocations: 0 bytes)

@btime  CUDA.@sync map!(cos, $(x_d), $(x_d));
#   203.606 μs (65 allocations: 2.95 KiB)    ---> basically the same as the arithmetic
@btime  map!(cos, $(x), $(x));
#   87.118 ms (0 allocations: 0 bytes)       ---> considerably slower than the arithmetic

Try using CUDA.@profile to see some more accurate timings of just the kernel. In the case of my RTX A6000, I’m getting:

julia> f = x -> x^2
julia> CUDA.@profile map!(f, (x_d), (x_d))
Profiler ran for 300.17 µs, capturing 6 events.

Host-side activity: calling CUDA APIs took 234.37 µs (78.08% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name           │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────┤
│   77.68% │ 233.17 µs │     1 │ 233.17 µs │ 233.17 µs │ 233.17 µs │ cuLaunchKernel │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────┘

Device-side activity: GPU was busy for 30.52 µs (10.17% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬──────────────────────────────────────────────────────
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                                                ⋯
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼──────────────────────────────────────────────────────
│   10.17% │ 30.52 µs │     1 │ 30.52 µs │ 30.52 µs │ 30.52 µs │ _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Flo ⋯
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴──────────────────────────────────────────────────────
                                                                                                      1 column omitted


julia> CUDA.@profile map!(cos, (x_d), (x_d))
Profiler ran for 276.8 µs, capturing 6 events.

Host-side activity: calling CUDA APIs took 200.27 µs (72.35% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name           │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼────────────────┤
│   71.83% │ 198.84 µs │     1 │ 198.84 µs │ 198.84 µs │ 198.84 µs │ cuLaunchKernel │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴────────────────┘

Device-side activity: GPU was busy for 36.48 µs (13.18% of the trace)
┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬──────────────────────────────────────────────────────
│ Time (%) │     Time │ Calls │ Avg time │ Min time │ Max time │ Name                                                ⋯
├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼──────────────────────────────────────────────────────
│   13.18% │ 36.48 µs │     1 │ 36.48 µs │ 36.48 µs │ 36.48 µs │ _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Flo ⋯
└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴──────────────────────────────────────────────────────

i.e. about 20% slower when computing cos instead of squaring.


Interesting! My point, though, is that those 20% are a far cry from the ~8x factor on the CPU. In fact I'm seeing an even smaller delta, around ~4%, on my setup. Thoughts? I'm having some issues pasting the formatted output from my remote session, so here are just the pertinent lines:

julia> CUDA.@profile map!(f, x_d, x_d);
Device-side activity: GPU was busy for 189.54 µs (23.23% of the trace)
| 23.23% | 189.54 µs |     1 | 189.54 µs | 189.54 µs | 189.54 µs | _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Float32...

julia> CUDA.@profile map!(cos, x_d, x_d);
Device-side activity: GPU was busy for 196.7 µs (24.09% of the trace)
| 24.09% | 196.7 µs |     1 | 196.7 µs | 196.7 µs | 196.7 µs | _Z10map_kernel15CuKernelContext13CuDeviceArrayI7Float3....

The answer is generally latency hiding: the GPU can cheaply context-switch away from threads that are waiting on memory to other threads doing compute. But you're right that in this case the bandwidth numbers don't seem to add up: the kernel should be reading and writing 128 MiB, which in 30 µs implies roughly 4.5 TB/s, while my device only does 960 GB/s. I'd need to take a closer look. Still, it's very likely that such a simple kernel is entirely bandwidth-bound, and thus ideally suited for hiding the latency of expensive arithmetic.
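
For reference, the arithmetic behind those figures (nothing measured here, it just restates the numbers above):

bytes_moved = 2 * 2^24 * sizeof(Float32)   # one read + one write of 2^24 Float32s ≈ 128 MiB
bytes_moved / 30e-6                        # ≈ 4.5e12 B/s, i.e. ~4.5 TB/s implied effective bandwidth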


To get full performance on a CPU you may need more than one thread for memory bandwidth and some help for SIMD trig functions. I think the LoopVectorization package can help with both. (Paging @Oscar_Smith who’s been informative on related questions.)
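
A minimal sketch of what that might look like, assuming LoopVectorization.jl is installed (vmapt! is its threaded, SIMD-vectorized map, and @tturbo is the threaded loop macro):

using LoopVectorization

x = rand(Float32, 2^24)
y = similar(x)

vmapt!(cos, y, x)   # threaded + SIMD map using a vectorized cos

# equivalent explicit-loop form
@tturbo for i in eachindex(x, y)
    y[i] = cos(x[i])
end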


You can probably get an extra factor of 2 or so by using LoopVectorization for the cos code, since it will use a vectorized cos implementation. Other than that, there's not much to do here. The difference between CPU and GPU here is that the GPU spends its time entirely moving memory around, while on the CPU the computation can also be a bottleneck. That said, for both, you generally want each kernel to do more work than a single cos call, since the fewer passes over memory you make, the faster your code will (usually) be.
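
To illustrate that last point with a made-up example (the abs2 step is just a stand-in for whatever else the real pipeline does with each element):

# two passes over memory
map!(cos, y, x)
map!(abs2, y, y)

# one fused pass: same result, roughly half the memory traffic
map!(xi -> abs2(cos(xi)), y, x)
# or, with broadcasting: y .= abs2.(cos.(x))

The same idea applies on the GPU: a single fused map!/broadcast over a CuArray launches one kernel instead of several.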