Fastest way to compute adjoint(x)*A*x in CUDA?

I am trying to compute an expectation value of an observable, which just comes down to computing adjoint(x) * A * x, where x is an N-dimensional vector with ComplexF32 entries and A is an N×N matrix.

Initially, A and x are ordinary CPU arrays. I have been converting them to CUDA arrays via

using CUDA

N = 1024
x_d = CuArray{ComplexF32}(undef, N)
A_d = CuArray{ComplexF32}(undef, N, N)
copyto!(x_d, x)
copyto!(A_d, A)

What would be the fastest way to compute adjoint(x) * A * x?

I have tried

adjoint(x_d) * A_d * x_d

This seems to be the slowest of my options; I get the impression that chaining the multiplications like this is slower, and I also found that adjoint(x_d) has a large overhead cost.

I have settled for:

tmp = A_d * x_d
Exp = dot(x_d, tmp)

Is there a faster way to do this? Also, is there any way to use sparse arrays, and would that necessarily be faster?

There’s a 3-argument dot(y, A, x) function in LinearAlgebra, you could try it and see if it is overloaded for CuArrays?
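For reference, on plain CPU arrays the three-argument form computes the same quantity as dot(y, A * x), just without materializing the intermediate vector. A quick (untimed) sanity check:

using LinearAlgebra

y = rand(ComplexF32, 4)
A = rand(ComplexF32, 4, 4)
x = rand(ComplexF32, 4)

dot(y, A, x) ≈ dot(y, A * x)   # true: y' * A * x without the temporary A * x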

It looks like dot(y, A, x) has a huge overhead cost and is quite slow. I was comparing the following:

using CUDA, LinearAlgebra, BenchmarkTools

d = 1048
x = rand(ComplexF32, d)
A = rand(ComplexF32, d, d)
x_d = CuArray{ComplexF32}(undef, d)
A_d = CuArray{ComplexF32}(undef, d, d)
copyto!(x_d, x)
copyto!(A_d, A)

@time tmp = A_d * x_d
@time res1 = dot(x_d, tmp)
@time res2 = adjoint(x_d) * A_d * x_d
@time res3 = dot(x_d, A_d, x_d)
println()
@btime tmp = $A_d * $x_d
@btime res1 = dot($x_d, $tmp)
@btime res2 = adjoint($x_d) * $A_d * $x_d
@btime res3 = dot($x_d, $A_d, $x_d)

yielding:

0.000144 seconds (23 allocations: 496 bytes)
 0.001023 seconds (15 allocations: 240 bytes)
 0.000358 seconds (39 allocations: 752 bytes)
26.118309 seconds (15.39 M allocations: 301.947 MiB, 0.15% gc time)

 5.660 µs (23 allocations: 496 bytes)
 20.800 µs (14 allocations: 224 bytes)
 87.400 µs (37 allocations: 720 bytes)
 26.185 s (15390956 allocations: 301.95 MiB)

I don’t think three-arg dot is implemented for CuArrays (at least it didn’t work for me), so you’re probably falling back to a generic loop with scalar indexing, hence the massive overhead.

Try adding some parentheses to reassociate, i.e., try x' * (A * x). That’s the fastest version on my end, unless you go down the route of preallocating a temp vector.
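If you want those generic fallbacks to fail loudly instead of silently crawling through the array, you can also disallow scalar indexing. A minimal sketch (untimed; exactly how the last line fails may depend on your CUDA.jl version):

using CUDA, LinearAlgebra

CUDA.allowscalar(false)   # error on element-by-element indexing of GPU arrays

x_d = CUDA.rand(ComplexF32, 1024)
A_d = CUDA.rand(ComplexF32, 1024, 1024)

x_d' * (A_d * x_d)        # stays on the fast path: matvec, then a single reduction
# dot(x_d, A_d, x_d)      # should now throw a scalar-indexing error instead of taking seconds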

Btw. note that

tmp = A * x
dot(x, tmp)

can be simplified to the equivalent dot(x, A * x).

1 Like

I’m still finding that

tmp = A * x
dot(x, tmp)

computes in the same order of magnitude as

dot(x, A * x)

with @time, and the former is faster with @btime.

I don’t know if I fully understand how GPU computing works. When I write a line like y * (A * x), does my computer interpret this as: compute A * x first, then y * (Ax)?

That’s strange. Here’s what I get:

julia> using CUDA, LinearAlgebra, BenchmarkTools

julia> N = 1024;

julia> x = CUDA.rand(ComplexF32, N);

julia> A = CUDA.rand(ComplexF32, N, N);

julia> @btime begin
           tmp = $A * $x
           dot($x, tmp)
       end;
  118.237 µs (37 allocations: 720 bytes)

julia> @btime dot($x, $A * $x);
  116.874 µs (37 allocations: 720 bytes)

Yes. This is just what parentheses do, regardless of whether A or x are GPU arrays.

1 Like

FYI, I don’t think CUDA itself supports x^T A x directly as a single operation. So regardless of how you implement it in Julia, in the end you’ll have to compute something like x^T (A x) anyway.

Also x^T y will always be “slow” on GPUs as you’re reducing a vector, which GPUs can’t do particularly well. If you need more speedup, you may look into whether you can batch multiple of those products together (if your program works that way).
I’m not sure at what size it’s worth it to transfer to CPU and reduce there…
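If batching does fit your problem, one way to sketch it is to stack the state vectors as columns of a matrix and get all expectation values from a single matrix-matrix product plus one broadcasted reduction. The names below (X, nbatch, expvals) are just for illustration, and I haven’t benchmarked this:

using CUDA, LinearAlgebra

N, nbatch = 1024, 64
A = CUDA.rand(ComplexF32, N, N)
X = CUDA.rand(ComplexF32, N, nbatch)        # each column is one state vector

Y = A * X                                   # one big matrix-matrix product (GEMM)
expvals = vec(sum(conj.(X) .* Y; dims=1))   # expvals[j] ≈ X[:, j]' * A * X[:, j]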

Is this true? A GPU can do parallel tree reduction with complexity O(log N), while on a CPU this is always an O(N) operation (tree reduction can still reduce accumulated roundoff error, but not complexity). Sure you probably need a pretty large N to overcome the slower clock speed of the GPU, but it seems to have the advantage of scaling. (Caveat: haven’t run any benchmarks.)

I redid the benchmarking: doing the multiplication in two steps is just as fast as dot($x, $A * $x), so I must have messed something up beforehand.

I was also under the impression that any type of matrix multiplication, dot product included, was faster on GPU. Though admittedly I am quite new to GPU computing and have no idea what the details are.

Is this true?

Let me be more clear: reducing on the GPU is slow compared to other operations on the GPU. Let’s compare summing a large matrix along one dimension to summing (reducing) a vector on a GPU.
Even though summing the matrix takes many more arithmetic operations, it is faster than reducing the vector. The reason for this is what you mentioned. (EDIT: redacted, see the correction below.)

using CUDA
x = cu(rand(10_000))
A = cu(rand(10_000, 10_000))

# EDIT: These numbers seem to be misleading.
julia> CUDA.@profile sum(x)
Host-side activity: calling CUDA APIs took 80.11 µs (38.44% of the trace)
# 80.11 µs
[...]
julia> CUDA.@profile sum(A; dims=1)
Host-side activity: calling CUDA APIs took 36.48 µs (0.79% of the trace)
# 36.48 µs
[...]

So the reduction is “slow” in the sense that it doesn’t compare favorably to other operations on the GPU.

Not sure this is how to read the output of CUDA.@profile, unless I’m misunderstanding your claim. Here’s what plain timings show:

julia> x = CUDA.rand(10000);

julia> A = CUDA.rand(10000, 10000);

julia> @btime CUDA.@sync sum($x);
  91.865 µs (95 allocations: 2.88 KiB)

julia> @btime CUDA.@sync sum($A; dims=1);
  110.844 ms (34 allocations: 976 bytes)

julia> @btime CUDA.@sync sum($A; dims=2);
  5.980 ms (34 allocations: 976 bytes)

Yes, you’re right, I am getting the same as you when measuring another way. Thanks for the correction. I suppose CUDA.@profile doesn’t report what I thought, and I also had a thinking error.
When reducing an N×N matrix along one dimension, the GPU has to perform N separate reductions, whereas reducing a single vector takes only on the order of log2(N) steps. So we would expect sum(x) to be faster.

However, it remains true that 1D reductions on the GPU are “slow” in the sense that GPUs can’t exploit their full parallelism there.

I’ve also tried to benchmark x' * A * x vs collect(x)' * collect(A * x), where A and x are GPU arrays. However, I’m not sure if the results I’m getting are meaningful. (My benchmark indicates that they take about the same time across different sizes.)

N = 10_000
x=cu(rand(Float32, N)); A=cu(rand(Float32, N, N))
julia> @btime CUDA.@sync x' * (A*x);
  2.379 ms (39 allocations: 752 bytes)

julia> @btime CUDA.@sync collect(x)' * collect(A*x);
  2.435 ms (44 allocations: 157.12 KiB)

1 Like

At N = 10000, my dinky laptop GPU certainly isn’t of any help compared to the CPU. Continuing from the above:

julia> x_h = Array(x);

julia> @btime sum($x_h);  # Compare GPU: 91.865 µs
  1.364 µs (0 allocations: 0 bytes)

Probably because the timings are completely dominated by the matrix-vector product at every size you’re checking?

1 Like

Yeah that seems about right.

1 Like

For an N x N matrix you can do N tree reductions in parallel (or some interpolation between the two extremes). I don’t know what sum is actually doing here, but it may be limited more by the number of threads that can actually run in parallel, rather than algorithmic limitations, depending on hardware.

1 Like

I wanted to circle back to this.

  1. Sparse arrays are generally only useful if they are really sparse, which typically means that only a few percent of the entries are nonzero. If your problem can be instantiated at multiple sizes N, the number of nonzero elements in the matrix should probably scale more slowly than N^2.
  2. You can use sparse arrays both on the CPU and the GPU. I’ve never tried it on the GPU, but the relevant CUDA.jl submodule is CUDA.CUSPARSE (rough sketch below). However, I wouldn’t expect GPU-over-CPU speedups as often for sparse problems: GPUs really excel at large, dense matrix-matrix multiplication, and the more elements and the more work per element, the better.
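In case it’s useful, here is roughly what that route looks like. A minimal sketch assuming ~1% nonzeros (I haven’t benchmarked it, and CSR is just one of the storage formats CUSPARSE offers):

using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

N = 1024
A_h = sprand(ComplexF32, N, N, 0.01)   # sparse CPU matrix, ~1% nonzeros
x_h = rand(ComplexF32, N)

A_d = CuSparseMatrixCSR(A_h)           # sparse GPU matrix (CSR storage)
x_d = CuArray(x_h)

dot(x_d, A_d * x_d)                    # sparse matvec on the GPU, then the usual reduction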
1 Like

Getting back to the actual problem: If you’re calculating lots of expectation values at the same system size, you can save some time by preallocating a temporary array to hold A * x, so you don’t have to allocate a new one every time. It’s not a huge speedup, and relatively speaking it gets smaller as the system size increases (since A * x scales as N^2 while the allocation scales as N), but it makes a measurable difference at N = 1024 on my machine:

julia> using CUDA, LinearAlgebra, BenchmarkTools

julia> x = CUDA.rand(ComplexF32, 1024);

julia> A = CUDA.rand(ComplexF32, 1024, 1024);

julia> tmp = similar(x);

julia> @btime dot($x, $A * $x);  # not using tmp
  118.572 µs (37 allocations: 720 bytes)

julia> @btime dot($x, mul!($tmp, $A, $x));  # using tmp
  111.612 µs (32 allocations: 512 bytes)

1 Like

Thanks for the input everyone. So my takeaways are the following:

  • Matrix multiplication is faster on the GPU (of course, for small N it is probably not worthwhile). I can be smart about my parentheses etc. to tell my computer what order to do things in. (I am assuming A*x is also still faster on the GPU even when x is a vector? Though maybe I am misinterpreting the benchmarking above.)
  • I’m not sure if I fully understood the discussion above, but I take it that sums over vectors (i.e. x^T A x type calculations) won’t be that beneficial on a GPU.

It can definitely be. You should verify with benchmarks on the machine you’re using, with something that looks like your problem, but here’s my laptop with a dense random matrix (remember to use CUDA.@sync!):

julia> using CUDA, LinearAlgebra, BenchmarkTools

julia> x = CUDA.rand(ComplexF32, 1024);

julia> A = CUDA.rand(ComplexF32, 1024, 1024);

julia> tmp = similar(x);

julia> x_h, A_h, tmp_h = Array(x), Array(A), Array(tmp);  # copy to CPU

julia> @btime CUDA.@sync mul!($tmp, $A, $x);  # GPU
  101.661 µs (18 allocations: 288 bytes)

julia> @btime mul!($tmp_h, $A_h, $x_h);  # CPU
  236.681 µs (0 allocations: 0 bytes)

This is specifically about linear reductions such as the dot product between two vectors, x' * tmp or equivalently dot(x, tmp), which might be slower on the GPU than the CPU.

julia> @btime CUDA.@sync dot($x, $tmp);  # GPU
  23.040 µs (14 allocations: 224 bytes)

julia> @btime dot($x_h, $tmp_h);  # CPU
  307.195 ns (0 allocations: 0 bytes)

But for large N, the expectation value x^T A x is completely dominated by the matrix-vector product A * x. So if the matrix-vector product gets a significant speedup on GPU, it’s worth using the GPU even if the final step is a little slower there.

julia> @btime CUDA.@sync dot($x, mul!($tmp, $A, $x));  # GPU
  112.486 µs (32 allocations: 512 bytes)

julia> @btime dot($x_h, mul!($tmp_h, $A_h, $x_h));  # CPU
  244.372 µs (0 allocations: 0 bytes)

You can also try a best-of-both-worlds strategy where you do A * x on the GPU and then copy the result to the CPU to compute the dot product, but on my machine this is slower due to the overhead of moving data from the GPU to the CPU:

julia> @btime begin
           mul!($tmp, $A, $x)  # GPU
           dot(Array($x), Array($tmp))  # CPU
       end;
  131.419 µs (30 allocations: 16.69 KiB)
2 Likes