Fastest way to compute adjoint(x)*A*x in CUDA?

It can definitely be faster on the GPU. You should verify with benchmarks on the machine you're using, with something that looks like your problem, but here's what I get on my laptop with a dense random matrix (remember to use CUDA.@sync!):

julia> using CUDA, LinearAlgebra, BenchmarkTools

julia> x = CUDA.rand(ComplexF32, 1024);

julia> A = CUDA.rand(ComplexF32, 1024, 1024);

julia> tmp = similar(x);

julia> x_h, A_h, tmp_h = Array(x), Array(A), Array(tmp);  # copy to CPU
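The CUDA.@sync matters: GPU operations in CUDA.jl are asynchronous, so without it you would mostly be timing the kernel launch rather than the computation. A minimal sketch of the pitfall:

@btime mul!($tmp, $A, $x);             # only times the asynchronous launch
@btime CUDA.@sync mul!($tmp, $A, $x);  # waits for the GPU kernel to finish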

julia> @btime CUDA.@sync mul!($tmp, $A, $x);  # GPU
  101.661 μs (18 allocations: 288 bytes)

julia> @btime mul!($tmp_h, $A_h, $x_h);  # CPU
  236.681 μs (0 allocations: 0 bytes)

The caveat is specifically about reductions such as the dot product of two vectors, x' * tmp or equivalently dot(x, tmp), which can indeed be much slower on the GPU than on the CPU:

julia> @btime CUDA.@sync dot($x, $tmp);  # GPU
  23.040 μs (14 allocations: 224 bytes)

julia> @btime dot($x_h, $tmp_h);  # CPU
  307.195 ns (0 allocations: 0 bytes)

But for large N, the expectation value x' * A * x is completely dominated by the matrix-vector product A * x. So if the matrix-vector product gets a significant speedup on the GPU, it's worth using the GPU even if the final reduction is a little slower there.
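To see why, a rough flop count (assuming roughly 8 real flops per complex multiply-add; the exact constant doesn't matter):

N = 1024
flops_matvec = 8 * N^2    # A * x touches every matrix entry: O(N^2)
flops_dot    = 8 * N      # the final reduction is only O(N)
flops_matvec ÷ flops_dot  # ≈ N, so the matvec dominates for large N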

julia> @btime CUDA.@sync dot($x, mul!($tmp, $A, $x));  # GPU
  112.486 μs (32 allocations: 512 bytes)

julia> @btime dot($x_h, mul!($tmp_h, $A_h, $x_h));  # CPU
  244.372 μs (0 allocations: 0 bytes)
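If you compute this quantity repeatedly, it might be worth wrapping it in a small helper that reuses the preallocated buffer (a sketch; expval! is just an illustrative name):

function expval!(tmp, A, x)
    mul!(tmp, A, x)    # tmp = A * x, on whatever device the arrays live on
    return dot(x, tmp)
end

Since mul! and dot dispatch on the array type, the same function works unchanged for CuArrays and plain Arrays.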

You can also try a best-of-both-worlds strategy: do A * x on the GPU, then copy the result back to the CPU and compute the dot product there. On my machine, though, this is slower due to the overhead of moving data from the GPU to the CPU:

julia> @btime begin
           mul!($tmp, $A, $x)  # GPU
           dot(Array($x), Array($tmp))  # CPU
       end;
  131.419 μs (30 allocations: 16.69 KiB)
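A variant you could also benchmark is to reuse the CPU copy x_h from above, so only the result vector crosses the bus (whether this helps depends on your transfer overhead):

mul!(tmp, A, x)       # GPU matvec
dot(x_h, Array(tmp))  # single device-to-host copy, then CPU reduction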