Why is 'trace' slow on CUDA matrices?

HenriDeh · February 16, 2023, 1:51pm

Hello,

I am kind of surprised to see this benchmark:

julia> using BenchmarkTools, CUDA, LinearAlgebra

julia> A = rand(128, 128);

julia> Ad = cu(A);

julia> @btime tr($A)
  70.092 ns (0 allocations: 0 bytes)
64.9944482344768

julia> @btime tr($Ad)
  84.400 μs (76 allocations: 3.72 KiB)
64.994446f0

Computing the trace of a matrix is about 1,200 times slower with CUDA (84.4 μs vs. 70 ns). Is there any way around this?

Oscar_Smith · February 16, 2023, 2:39pm

This is just a memory latency benchmark and CPUs have lower memory latency.

HenriDeh · February 16, 2023, 2:46pm

So if I use traces of CuMatrices in my subroutines it will not create a performance dip?

jling · February 16, 2023, 3:00pm

It will, if your real use case is dominated by operations like the one in this example.

maleadt · February 16, 2023, 4:05pm

128×128 inputs are tiny; just launching a kernel costs about 20 μs, and (our current implementation of) tr requires two kernels. But even with larger inputs the GPU won’t be faster here, as the hardware needs some computational complexity to hide memory latency. You’re essentially doing no compute at all, hence you’re just benchmarking the memory.
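
As a rough illustration of the point above, the launch overhead can be compared against the timings posted at the top of the thread. This is only a back-of-envelope sketch: the variable names are made up for the example, the 20 μs launch cost is the approximate figure quoted above, and the element size assumes the Float32 data that `cu` produced in the original benchmark.

```julia
# Back-of-envelope estimate using only numbers quoted in this thread.
launch_cost_us = 20.0     # approximate cost of one kernel launch (quoted above)
kernels_per_tr = 2        # kernels launched by the current tr implementation (quoted above)
gpu_time_us    = 84.400   # measured: @btime tr($Ad)
cpu_time_ns    = 70.092   # measured: @btime tr($A)

# Time spent on kernel launches alone, before any data is touched.
launch_floor_us = launch_cost_us * kernels_per_tr

# Share of the measured GPU time that launch overhead would explain.
launch_share = launch_floor_us / gpu_time_us

# Measured GPU/CPU ratio (convert μs to ns first).
slowdown = gpu_time_us * 1000 / cpu_time_ns

# Data the trace actually needs: one Float32 per diagonal entry of a 128×128 matrix.
diag_bytes = 128 * sizeof(Float32)

println("launch floor:  ", launch_floor_us, " μs")
println("launch share:  ", round(launch_share; digits = 2))
println("slowdown:      ", round(Int, slowdown), "x")
println("diagonal data: ", diag_bytes, " bytes")
```

With these inputs, launches alone account for about 40 μs, a bit under half of the measured 84.4 μs, while the trace itself only needs 512 bytes of data. Since the launch cost is approximate, the share is an order-of-magnitude estimate rather than a measurement.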
