128x128 inputs are tiny: just launching a kernel costs about 20 µs, and our current implementation of `tr` requires two kernels, so launch overhead alone dominates at this size. But even with larger inputs the GPU won’t be faster here, because the hardware needs enough computation per byte loaded to hide memory latency. This operation does essentially no compute at all, so you’re really just benchmarking the memory.
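
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. Only the ~20 µs launch cost and the two kernels come from the explanation above; everything else is an assumption for illustration: that `tr` is the matrix trace (so it reads the n diagonal elements and does n - 1 additions), that the elements are 4-byte floats, and that the GPU has a hypothetical 500 GB/s of memory bandwidth. It also ignores that strided diagonal reads fetch whole memory sectors, so real traffic would be higher.

```python
# Back-of-the-envelope cost model. Only LAUNCH_US and NUM_KERNELS come
# from the post above; the remaining constants are assumptions.
LAUNCH_US = 20.0          # per-kernel launch cost quoted above
NUM_KERNELS = 2           # kernels used by the current tr implementation
BANDWIDTH_GB_S = 500.0    # hypothetical GPU memory bandwidth (assumption)
BYTES_PER_ELEMENT = 4     # assuming 4-byte float elements


def launch_overhead_us():
    """Total launch cost, independent of the input size."""
    return LAUNCH_US * NUM_KERNELS


def memory_time_us(n):
    """Time to stream the n diagonal elements at the assumed bandwidth."""
    bytes_read = n * BYTES_PER_ELEMENT
    return bytes_read / (BANDWIDTH_GB_S * 1e9) * 1e6


def flops_per_byte(n):
    """Arithmetic intensity: n - 1 additions per n * BYTES_PER_ELEMENT bytes."""
    return (n - 1) / (n * BYTES_PER_ELEMENT)


for n in (128, 4096, 1_000_000):
    print(
        f"n={n}: launch {launch_overhead_us():.1f} us, "
        f"memory {memory_time_us(n):.4f} us, "
        f"{flops_per_byte(n):.3f} FLOP/byte"
    )
```

Under these assumptions the launch overhead is fixed at about 40 µs, while the arithmetic intensity never reaches 0.25 FLOP/byte no matter how large n gets. That is the sense in which a bigger input doesn't help: the work per byte stays constant, so the run time is set by launches and memory traffic rather than by compute.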