It can definitely be. You should verify with benchmarks on the machine you’re using, with something that looks like your problem, but here’s my laptop with a dense random matrix. (Remember to use CUDA.@sync, otherwise you only time the asynchronous kernel launch rather than the computation itself!)
julia> using CUDA, LinearAlgebra, BenchmarkTools
julia> x = CUDA.rand(ComplexF32, 1024);
julia> A = CUDA.rand(ComplexF32, 1024, 1024);
julia> tmp = similar(x);
julia> x_h, A_h, tmp_h = Array(x), Array(A), Array(tmp); # copy to CPU
julia> @btime CUDA.@sync mul!($tmp, $A, $x); # GPU
101.661 μs (18 allocations: 288 bytes)
julia> @btime mul!($tmp_h, $A_h, $x_h); # CPU
236.681 μs (0 allocations: 0 bytes)
The exception is reductions such as the dot product between two vectors, x' * tmp (equivalently dot(x, tmp)), which can be slower on the GPU than on the CPU:
julia> @btime CUDA.@sync dot($x, $tmp); # GPU
23.040 μs (14 allocations: 224 bytes)
julia> @btime dot($x_h, $tmp_h); # CPU
307.195 ns (0 allocations: 0 bytes)
But for large N, the expectation value x' * A * x is completely dominated by the matrix-vector product A * x. So if the matrix-vector product gets a significant speedup on the GPU, it’s worth using the GPU even if the final reduction is a little slower there.
julia> @btime CUDA.@sync dot($x, mul!($tmp, $A, $x)); # GPU
112.486 μs (32 allocations: 512 bytes)
julia> @btime dot($x_h, mul!($tmp_h, $A_h, $x_h)); # CPU
244.372 μs (0 allocations: 0 bytes)
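If you compute this in more than one place, you could wrap the pattern in a small helper. This is just a sketch (the name expectation and the optional preallocated buffer are my own choices, not anything from CUDA.jl); it works unchanged for Arrays and CuArrays because mul! and dot have methods for both:
julia> function expectation(A, x, tmp=similar(x))
           mul!(tmp, A, x)   # matrix-vector product on whichever device holds the data
           dot(x, tmp)       # x' * (A * x); dot conjugates its first argument
       end;
julia> expectation(A, x, tmp);        # runs on the GPU
julia> expectation(A_h, x_h, tmp_h);  # runs on the CPU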
You can also try a best-of-both-worlds strategy where you do A * x on the GPU and then copy the result to the CPU to compute the dot product, but on my machine this is slower due to the overhead of moving data from the GPU to the CPU:
julia> @btime begin
mul!($tmp, $A, $x) # GPU
dot(Array($x), Array($tmp)) # CPU
end;
131.419 μs (30 allocations: 16.69 KiB)
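A variation you could try (untimed here, so treat it as a sketch) is to reuse the host copy x_h and transfer only the result vector, so a single 1024-element vector crosses from GPU to CPU per evaluation:
julia> mul!(tmp, A, x);        # GPU matrix-vector product
julia> dot(x_h, Array(tmp))    # copy only tmp to the CPU; x_h already lives there
Whether that helps depends on how the copy overhead compares to the reduction itself, so it’s also worth benchmarking on your machine.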