GPU kernel?

Hi everyone, I need help understanding how to get the most out of my GPU.

I have a large Hermitian matrix A that is very sparse (zeros everywhere except for exactly 3 entries equal to 1 in each row, so the total number of nonzero entries is 3n).
I want to use the function f: \mathbb{R}^n \to \mathbb{R}^n defined by f(x)_i = \sum_{j=1}^n A_{i,j}\sin(x_i - x_j).
For those who know, it's the flow of the Kuramoto model, and my end goal is to solve \dot{x} = f(x) as efficiently as possible.

As you see, each component is a simple operation involving 3 entries of x.
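
To make that concrete, here is a per-component sketch (js is a hypothetical n×3 matrix holding, for each row i, the column indices of its three 1s; it isn't part of my current code):

function f_loop(x, js)
    out = similar(x)
    for i in eachindex(x)
        # each component only touches the 3 neighbours listed in row i of js
        out[i] = sin(x[i] - x[js[i, 1]]) + sin(x[i] - x[js[i, 2]]) + sin(x[i] - x[js[i, 3]])
    end
    return out
end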

My naive implementation

f(x) = sum(sin.((x .- transpose(x)) .* A), dims=2)

Here, A is stored as a (dense) Array on CPU or a CuArray on GPU. Hereafter n=1000. I’m testing f(x) for x = rand(n).
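
For reference, the timings below come from something like this harness (a sketch: BenchmarkTools is assumed for @btime, the construction of A is only illustrative with 3 random 1s per row, and I pass A explicitly here instead of using a global):

using BenchmarkTools, CUDA, Random

n = 1000
x = rand(n)
js = permutedims(reduce(hcat, [shuffle(1:n)[1:3] for _ in 1:n]))  # 3 distinct columns per row
A = zeros(n, n)
for i in 1:n
    A[i, js[i, :]] .= 1
end

f(x, A) = sum(sin.((x .- transpose(x)) .* A), dims=2)

@info "Testing on CPU"
@btime f($x, $A)
@info "Testing on GPU"
x_d, A_d = CuArray(x), CuArray(A)
@btime CUDA.@sync f($x_d, $A_d)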

[ Info: Testing on CPU
  4.487 ms (7 allocations: 7.64 MiB)
[ Info: Testing on GPU
  449.815 μs (111 allocations: 7.52 KiB)

Then I wanted to check what happens if A is stored as a SparseMatrixCSC{Int64, Int64} matrix on the CPU, and as a CUDA.CUSPARSE.CuSparseMatrixCSC{Int64, Int32} on the GPU. To my surprise, I got a small degradation:

[ Info: Testing on CPU
  6.190 ms (35 allocations: 15.39 MiB)
[ Info: Testing on GPU
  473.239 μs (199 allocations: 11.73 KiB)

Sparsity?

The implementation of f obviously does not take the sparsity of the matrix A into account and performs a whole bunch of useless operations (e.g. computing x_i - x_j for all pairs i, j). Storing A as a sparse array does not improve the performance of f, which makes me think I could vastly optimize f by leveraging the sparsity of A. Could anyone help me in this direction?

Why don't you expand the sin and do two matrix-vector products?
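
If I read this correctly, the idea is the angle-difference identity \sin(x_i - x_j) = \sin(x_i)\cos(x_j) - \cos(x_i)\sin(x_j), which turns f into two matrix-vector products with A. A minimal sketch (f_expanded is my own name for it):

f_expanded(x, A) = sin.(x) .* (A * cos.(x)) .- cos.(x) .* (A * sin.(x))

With A stored as a sparse matrix, each A * v is a cheap sparse matrix-vector product.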

Hi, thanks, this is a good remark. However, I'm actually planning to study more general models than Kuramoto's, defined for example by f(x)_i = \sum_{j=1}^n A_{i,j}\sigma(x_i - x_j) for a very general \sigma, for which trig identities are not available.

I would use the matrix as I, J, V. The matrix product should then be quite a bit faster, no?
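
A minimal sketch of that I, J, V form, assuming A holds only 0/1 entries as in the original post (findnz is from SparseArrays; the loop f_coo is my own illustration):

using SparseArrays

I, J, V = findnz(sparse(A))   # row indices, column indices, values (all 1 here, so V is unused below)

function f_coo(x, I, J)
    out = zeros(eltype(x), length(x))
    for k in eachindex(I)     # visit only the 3n stored entries
        out[I[k]] += sin(x[I[k]] - x[J[k]])
    end
    return out
end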

There ought to be a huge speedup from sparsity: N^2 down to 3N terms. A quick attempt:

using Tullio, Random, BenchmarkTools  # BenchmarkTools for @btime
f1(x, A) = sum(sin.((x .- transpose(x)) .* A), dims=2) |> vec  # as above
f2(x, A) = @tullio out[i] := A[i,j] * sin(x[i] - x[j])  # same alg
f3(x, js) = @tullio out[i] := sin(x[i] - x[js[i,k]])  # uses sparsity

let n = 1000
  x = randn(n)
  # js = rand(1:n, n, 3)   # locations of the 1s... but this has repeats!
  js = reduce(hcat, [shuffle(1:n)[1:3] for _ in 1:n]) |> permutedims
  A = zeros(n,n)
  foreach(eachrow(A), eachrow(js)) do a, j
    a[j] .= 1
  end
  y1 = @btime f1($x, $A)
  y2 = @btime f2($x, $A)
  y3 = @btime f3($x, $js)  # 200x quicker at n=1000
  y1 ≈ y2 ≈ y3
end

using LoopVectorization, KernelAbstractions, CUDAKernels  # next...

I suppose you mean I, J are the head/tail edge matrices? Will give it a try!

Thanks, I'll give it a try! I didn't know about your Tullio.jl package, and I'll definitely try it. Could you explain what the @tullio macro does?

Edit: I also got a 200x speedup on my machine just by running your code. Impressive.

Right. In fact, I would sort the I. It should make cache misses rarer.
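
For example, continuing the I, J, V sketch above (the sortperm reordering is my own illustration):

p = sortperm(I)             # group the nonzeros by row
I, J, V = I[p], J[p], V[p]  # accesses to out[I[k]] now move through memory in order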

On the CPU, your einsum-like solution using @tullio is really performant. However, I'm having trouble understanding how it translates to the GPU. As in Tullio's docs, I call using CUDAKernels, KernelAbstractions, CUDA before defining my function f(x, L) = @tullio ..., but I see no difference between CPU and GPU.

It should probably look something like @btime CUDA.@sync f1($(cu(x)), $(cu(A)));. This appears to be slower than CPU at n=1000 (and much slower if you include the time to transfer).

Tullio is not always fast at GPU stuff. You can also make use of the sparsity without it, e.g. with f4(x, js) = sum(sin.(x .- @view x[js]); dims=2) |> vec
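
A quick sketch of how I would try f4 on the GPU (an untested guess on my part: move x and the index matrix js to the device and wrap the timing in CUDA.@sync):

using CUDA, BenchmarkTools

x_d  = CuArray(x)
js_d = CuArray(js)   # integer index matrix on the device
@btime CUDA.@sync f4($x_d, $js_d)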
