Fastest way to compute adjoint(x)*A*x in CUDA?

It can definitely be faster on the GPU. You should verify with benchmarks on the machine you're using, with something that looks like your problem, but here's what I get on my laptop with a dense random matrix (remember to use CUDA.@sync!):

julia> using CUDA, LinearAlgebra, BenchmarkTools

julia> x = CUDA.rand(ComplexF32, 1024);

julia> A = CUDA.rand(ComplexF32, 1024, 1024);

julia> tmp = similar(x);

julia> x_h, A_h, tmp_h = Array(x), Array(A), Array(tmp);  # copy to CPU
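The CUDA.@sync matters: GPU operations in CUDA.jl are asynchronous, so without it you would mostly be timing the kernel launch rather than the computation. A minimal sketch of the pitfall:

@btime mul!($tmp, $A, $x);             # only times the asynchronous launch
@btime CUDA.@sync mul!($tmp, $A, $x);  # waits for the GPU kernel to finish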

julia> @btime CUDA.@sync mul!($tmp, $A, $x);  # GPU
  101.661 μs (18 allocations: 288 bytes)

julia> @btime mul!($tmp_h, $A_h, $x_h);  # CPU
  236.681 μs (0 allocations: 0 bytes)

The caveat is specifically about reductions such as the dot product of two vectors, x' * tmp or equivalently dot(x, tmp), which can indeed be much slower on the GPU than on the CPU:

julia> @btime CUDA.@sync dot($x, $tmp);  # GPU
  23.040 μs (14 allocations: 224 bytes)

julia> @btime dot($x_h, $tmp_h);  # CPU
  307.195 ns (0 allocations: 0 bytes)

But for large N, the expectation value x' * A * x is completely dominated by the matrix-vector product A * x. So if the matrix-vector product gets a significant speedup on the GPU, it's worth using the GPU even if the final reduction is a little slower there.
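To see why, a rough flop count (assuming roughly 8 real flops per complex multiply-add; the exact constant doesn't matter):

N = 1024
flops_matvec = 8 * N^2    # A * x touches every matrix entry: O(N^2)
flops_dot    = 8 * N      # the final reduction is only O(N)
flops_matvec ÷ flops_dot  # ≈ N, so the matvec dominates for large N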

julia> @btime CUDA.@sync dot($x, mul!($tmp, $A, $x));  # GPU
  112.486 μs (32 allocations: 512 bytes)

julia> @btime dot($x_h, mul!($tmp_h, $A_h, $x_h));  # CPU
  244.372 μs (0 allocations: 0 bytes)
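If you compute this quantity repeatedly, it might be worth wrapping it in a small helper that reuses the preallocated buffer (a sketch; expval! is just an illustrative name):

function expval!(tmp, A, x)
    mul!(tmp, A, x)    # tmp = A * x, on whatever device the arrays live on
    return dot(x, tmp)
end

Since mul! and dot dispatch on the array type, the same function works unchanged for CuArrays and plain Arrays.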

You can also try a best-of-both-worlds strategy: do A * x on the GPU, then copy the result back to the CPU and compute the dot product there. On my machine, though, this is slower due to the overhead of moving data from the GPU to the CPU:

julia> @btime begin
           mul!($tmp, $A, $x)  # GPU
           dot(Array($x), Array($tmp))  # CPU
       end;
  131.419 μs (30 allocations: 16.69 KiB)
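A variant you could also benchmark is to reuse the CPU copy x_h from above, so only the result vector crosses the bus (whether this helps depends on your transfer overhead):

mul!(tmp, A, x)       # GPU matvec
dot(x_h, Array(tmp))  # single device-to-host copy, then CPU reduction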