Sparse matrix multiplication for Metal

I need to perform a multiplication between two fairly large sparse matrices C = A * B and I would like to try using the GPU for that. The problem arises in a context where A represents a fixed 2D convolution kernel, while B represents a list of 2D images (each of which is a column of B). Therefore, if necessary, I can easily change the storage scheme of A without a significant penalty.

I can easily do this operation with CUDA, but I could not find any way to perform sparse matrix products with Metal. Assuming there is no official implementation available, does anyone know of a generic implementation of sparse matrix multiplication (SpGEMM) for GPUs? Ideally, one could perhaps write a suitable kernel using KernelAbstractions.
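To make the setup concrete, here is a minimal CPU sketch of the operation I have in mind (the stencil, sizes, and densities below are made up for illustration; the real A comes from my actual convolution kernel):

```julia
using SparseArrays

n = 256    # image side length (made up for illustration)
m = 1000   # number of images (made up)

# hypothetical separable 3-point smoothing stencil as a banded 1D operator
K1 = spdiagm(-1 => fill(0.25, n - 1), 0 => fill(0.5, n), 1 => fill(0.25, n - 1))

# 2D convolution operator on vectorized images: vec(K1 * X * K1') == A * vec(X)
A = kron(K1, K1)

# each column of B is a vectorized image
B = sprand(n^2, m, 0.05)

C = A * B   # the product I would like to offload to the GPU
```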


Did you try AppleAccelerate?

@LaurentPlagne, as far as I know AppleAccelerate does not use the GPU.

For sparse arrays you can try Sparse Arrays · The Julia Language.

Why don’t you try performing the convolutions in the Fourier domain?
You precompute the FFT of the padded kernel once; then, in a loop (or stream), you multiply it element-wise with the FFT of each image and take the inverse FFT of each result.

It should be faster for 5×5 and larger kernels, even for a single image.
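Something along these lines (a minimal sketch with FFTW.jl; the kernel and sizes are placeholders, and note that a plain FFT gives circular convolution, so the padding matters):

```julia
using FFTW

n = 256                           # padded image side (placeholder)
kernel = zeros(n, n)
kernel[1:3, 1:3] .= 1 / 9         # hypothetical 3×3 box kernel, zero-padded

Kf = fft(kernel)                  # precompute the kernel spectrum once

# circular convolution of one image with the precomputed kernel spectrum
conv_fft(img) = real(ifft(Kf .* fft(img)))

imgs = [rand(n, n) for _ in 1:4]
results = map(conv_fft, imgs)
```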

My question was about Metal, i.e. the Apple GPU library. In other words, I want to use the GPU to speed up the computation.

Thank you @pitsianis. The problem with this approach is that I need to perform a matrix-matrix multiplication, not a matrix-vector multiplication. The latter could be done with a 2D FFT, but the former requires either a 3D FFT or repeated 2D FFTs, and I am not sure I will gain much given the sparsity of the matrix (but I will give it a try). Additionally, I am not sure Metal.jl has FFT (probably not?).
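For completeness, the repeated 2D FFT can at least be expressed as a single batched call on the CPU, since FFTW transforms only along the requested dimensions (sizes and kernel are again placeholders):

```julia
using FFTW

n, m = 256, 100                  # placeholder sizes
imgs = rand(n, n, m)             # m images stacked along the third dimension

kernel = zeros(n, n)
kernel[1:3, 1:3] .= 1 / 9        # hypothetical padded kernel

Kf = fft(kernel)
F = fft(imgs, (1, 2))            # batched 2D FFT: one transform per slice
out = real(ifft(Kf .* F, (1, 2)))
```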

Maybe take a look at this issue on Metal.jl.


There are some hardware-agnostic sparse matrix types and kernels by @AntoineBut in GitHub - AntoineBut/GPUGraphs.jl: Workspace for my semester project in HPNALGS @ EPFL

Just for the record: I tried FFT, but on the CPU it is significantly slower (by a factor of ~30) than the sparse matrix approach. It also requires a lot of data transfer, so I do not think it is sensible to try this path on the GPU.
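The comparison was roughly of this shape (a sketch with BenchmarkTools.jl, reusing the names from my snippets above; my actual sizes differ):

```julia
using BenchmarkTools, FFTW, SparseArrays

# sparse route: A, B as in the SparseArrays sketch above
@btime $A * $B

# FFT route: Kf, imgs as in the batched FFT sketch above
@btime real(ifft($Kf .* fft($imgs, (1, 2)), (1, 2)))
```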

Thank you @gdalle. I had a look and could only find matrix-vector kernels, but perhaps I am missing something.


Maybe MLX has it. I'd say MLX.jl does not have it yet, though.

Oh right, I had missed the matmat aspect. Perhaps these basic matvec kernels can be a starting point, though.
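For instance, extending a row-per-thread matvec kernel to sparse-times-dense is fairly direct (a minimal untested sketch with a CSR layout; a true SpGEMM, with a sparse B and a sparse output C, is a much harder problem):

```julia
using KernelAbstractions

# CSR sparse A times dense B: one work-item per entry of C
@kernel function spmm_csr!(C, rowptr, colval, nzval, B)
    i, j = @index(Global, NTuple)
    acc = zero(eltype(C))
    for k in rowptr[i]:(rowptr[i+1] - 1)
        acc += nzval[k] * B[colval[k], j]
    end
    C[i, j] = acc
end

# launch on whatever backend the arrays live on (e.g. MtlArray buffers):
# backend = get_backend(C)
# spmm_csr!(backend)(C, rowptr, colval, nzval, B; ndrange = size(C))
# KernelAbstractions.synchronize(backend)
```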


I am surprised by your comment. We use the Fourier-domain approach in FastLocalCorrelationCoefficients.jl.

Also, regarding the data transfers, you can work with sub-images (at the added computational cost of recomputing the sub-image borders), e.g. as in the sketch below.
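Roughly along these lines (a hypothetical helper; `halo` should be at least the kernel half-width, and the clipped tiles at the image edges still need the usual boundary handling):

```julia
# split an image into tile × tile blocks, each extended by a halo of overlap
function tiles(img, tile, halo)
    n, m = size(img)
    [img[max(1, i - halo):min(n, i + tile - 1 + halo),
         max(1, j - halo):min(m, j + tile - 1 + halo)]
     for i in 1:tile:n, j in 1:tile:m]
end
```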

If you do not mind, please share the code of your experiment; at least one of us is going to learn something new 🙂

Thank you. I checked MLX and it has matrix multiplication. However, I did not find any indication that it has a sparse matrix type. Do you know how to use sparse matrices with MLX?

Nope. I am not at my computer, but MLX is where I would look if Metal.jl does not have what I want.
