Why is 'trace' slow on CUDA matrices?

Hello,

I am kind of surprised to see this benchmark:

julia> using BenchmarkTools, CUDA, LinearAlgebra

julia> A = rand(128, 128);

julia> Ad = cu(A);

julia> @btime tr($A)
  70.092 ns (0 allocations: 0 bytes)
64.9944482344768

julia> @btime tr($Ad)
  84.400 μs (76 allocations: 3.72 KiB)
64.994446f0

Computing the trace of a matrix is 1000 times slower with CUDA. Is there any way around this?

This is just a memory-latency benchmark, and CPUs have lower memory latency.

So if I use traces of CuMatrix objects in my subroutines, it won't create a performance dip?

It will, if your real use case is dominated by operations like this example.

128×128 inputs are tiny; just launching a kernel costs about 20 μs, and (our current implementation of) tr requires two kernels. But even with larger inputs the GPU won't be faster here, as the hardware needs some computational complexity to hide memory latency. You're doing essentially no compute at all, so you're just benchmarking the memory.
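One way to shave off one of those two launches, sketched below under the assumption that CUDA.jl's array interface supports views with the linear indices produced by LinearAlgebra's diagind (the helper name tr_gpu is made up for illustration): summing a view of the diagonal fuses extraction and reduction into a single mapreduce kernel. Note that returning the scalar to the host still forces a synchronization, so for tiny matrices the CPU will remain faster regardless.

```julia
using CUDA, LinearAlgebra

# Hypothetical one-kernel trace: `diagind(A)` gives the linear
# indices of the diagonal, and summing the view should dispatch
# to a single GPU reduction instead of two separate kernels.
tr_gpu(A) = sum(view(A, diagind(A)))

Ad = CUDA.rand(128, 128)
tr_gpu(Ad)  # still pays one launch plus a device-to-host sync
```

Even so, this only halves the fixed overhead; it does not change the fact that the operation is memory-bound with no arithmetic to hide latency behind.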
