GPU kernel?

Hi everyone, I need help understanding how to get the most out of my GPU.

I have a large Hermitian matrix A that is very sparse (zeros everywhere except for exactly 3 entries equal to 1 in each row, so the total number of nonzero entries is 3n).
I want to use the function f: \mathbb{R}^n \to \mathbb{R}^n defined by f(x)_i = \sum_{j=1}^n A_{i,j}\sin(x_i - x_j).
For those who know, it's the flow of the Kuramoto model, and my end goal is to solve \dot{x} = f(x) as efficiently as possible.

As you see, each component is a simple operation involving 3 entries of x.
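
To make that concrete, here is a per-component sketch (js is a hypothetical n×3 matrix holding, for each row i, the column indices of its three 1s; it isn't part of my current code):

function f_loop(x, js)
    out = similar(x)
    for i in eachindex(x)
        # each component only touches the 3 neighbours listed in row i of js
        out[i] = sin(x[i] - x[js[i, 1]]) + sin(x[i] - x[js[i, 2]]) + sin(x[i] - x[js[i, 3]])
    end
    return out
end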

My naive implementation

f(x) = sum(sin.((x .- transpose(x)) .* A), dims=2)

Here, A is stored as a (dense) Array on CPU or a CuArray on GPU. Hereafter n=1000. I’m testing f(x) for x = rand(n).
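
For reference, the timings below come from something like this harness (a sketch: BenchmarkTools is assumed for @btime, the construction of A is only illustrative with 3 random 1s per row, and I pass A explicitly here instead of using a global):

using BenchmarkTools, CUDA, Random

n = 1000
x = rand(n)
js = permutedims(reduce(hcat, [shuffle(1:n)[1:3] for _ in 1:n]))  # 3 distinct columns per row
A = zeros(n, n)
for i in 1:n
    A[i, js[i, :]] .= 1
end

f(x, A) = sum(sin.((x .- transpose(x)) .* A), dims=2)

@info "Testing on CPU"
@btime f($x, $A)
@info "Testing on GPU"
x_d, A_d = CuArray(x), CuArray(A)
@btime CUDA.@sync f($x_d, $A_d)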

[ Info: Testing on CPU
  4.487 ms (7 allocations: 7.64 MiB)
[ Info: Testing on GPU
  449.815 μs (111 allocations: 7.52 KiB)

Then I wanted to check what happens if A is stored as a SparseMatrixCSC{Int64, Int64} matrix on the CPU, and as a CUDA.CUSPARSE.CuSparseMatrixCSC{Int64, Int32} on the GPU. To my surprise, I got a small degradation:

[ Info: Testing on CPU
  6.190 ms (35 allocations: 15.39 MiB)
[ Info: Testing on GPU
  473.239 μs (199 allocations: 11.73 KiB)

Sparsity?

The implementation of f obviously does not take the sparsity of the matrix A into account and performs a whole bunch of useless operations (e.g. computing x_i - x_j for all pairs i, j). Storing A as a sparse array does not improve the performance of f, which makes me think I could vastly optimize f by leveraging the sparsity of A. Could anyone help me in this direction?

Why don't you expand the sin and do two matrix-vector products?
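
If I read this correctly, the idea is the angle-difference identity \sin(x_i - x_j) = \sin(x_i)\cos(x_j) - \cos(x_i)\sin(x_j), which turns f into two matrix-vector products with A. A minimal sketch (f_expanded is my own name for it):

f_expanded(x, A) = sin.(x) .* (A * cos.(x)) .- cos.(x) .* (A * sin.(x))

With A stored as a sparse matrix, each A * v is a cheap sparse matrix-vector product.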

Hi, thanks, this is a good remark. However, I'm actually planning to study more general models than Kuramoto's, defined for example by f(x)_i = \sum_{j=1}^n A_{i,j}\sigma(x_i - x_j) for a very general \sigma, for which trig identities are not available.

I would use the matrix as I, J, V. The matrix product should then be quite a bit faster, no?
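
A minimal sketch of that I, J, V form, assuming A holds only 0/1 entries as in the original post (findnz is from SparseArrays; the loop f_coo is my own illustration):

using SparseArrays

I, J, V = findnz(sparse(A))   # row indices, column indices, values (all 1 here, so V is unused below)

function f_coo(x, I, J)
    out = zeros(eltype(x), length(x))
    for k in eachindex(I)     # visit only the 3n stored entries
        out[I[k]] += sin(x[I[k]] - x[J[k]])
    end
    return out
end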

There ought to be a huge speedup from sparsity: N^2 down to 3N terms. A quick attempt:

using Tullio, Random, BenchmarkTools  # BenchmarkTools for @btime
f1(x, A) = sum(sin.((x .- transpose(x)) .* A), dims=2) |> vec  # as above
f2(x, A) = @tullio out[i] := A[i,j] * sin(x[i] - x[j])  # same alg
f3(x, js) = @tullio out[i] := sin(x[i] - x[js[i,k]])  # uses sparsity

let n = 1000
  x = randn(n)
  # js = rand(1:n, n, 3)   # locations of the 1s... but this has repeats!
  js = reduce(hcat, [shuffle(1:n)[1:3] for _ in 1:n]) |> permutedims
  A = zeros(n,n)
  foreach(eachrow(A), eachrow(js)) do a, j
    a[j] .= 1
  end
  y1 = @btime f1($x, $A)
  y2 = @btime f2($x, $A)
  y3 = @btime f3($x, $js)  # 200x quicker at n=1000
  y1 ≈ y2 ≈ y3
end

using LoopVectorization, KernelAbstractions, CUDAKernels  # next...

I suppose you mean I, J are the head/tail edge matrices? Will give it a try!

Thanks, I'll give it a try! I didn't know about your Tullio.jl package, and I'll definitely try it. Could you explain what the @tullio macro does?

Edit: I also got a 200x speedup on my machine just by running your code. Impressive.

Right. In fact, I would sort the I. It should make cache misses rarer.
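
For example, continuing the I, J, V sketch above (the sortperm reordering is my own illustration):

p = sortperm(I)             # group the nonzeros by row
I, J, V = I[p], J[p], V[p]  # accesses to out[I[k]] now move through memory in order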

On the CPU, your einsum-like solution using @tullio is really performant. However, I'm having trouble understanding how it translates to the GPU. As in Tullio's docs, I call using CUDAKernels, KernelAbstractions, CUDA before defining my function f(x, L) = @tullio ..., but I see no difference between CPU and GPU.

It should probably look something like @btime CUDA.@sync f1($(cu(x)), $(cu(A)));. This appears to be slower than CPU at n=1000 (and much slower if you include the time to transfer).

Tullio is not always fast at GPU stuff. You can also make use of the sparsity without it, e.g. with f4(x, js) = sum(sin.(x .- @view x[js]); dims=2) |> vec
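
A quick sketch of how I would try f4 on the GPU (an untested guess on my part: move x and the index matrix js to the device and wrap the timing in CUDA.@sync):

using CUDA, BenchmarkTools

x_d  = CuArray(x)
js_d = CuArray(js)   # integer index matrix on the device
@btime CUDA.@sync f4($x_d, $js_d)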
