Hi, I’m new to GPU programming in general and Julia GPU programming in particular, so sorry if my question turns out to be naive.
I have a largish sparse matrix; given a sorted array of indices, I need to extract the corresponding submatrix into a dense matrix. I need to do this for a lot of submatrices (ranging from tiny to ~1000 x 1000; due to structure, the requested submatrices are all relatively dense-ish, even though the large sparse matrix is really sparse), and I care about throughput only, not latency (i.e. I can foresee which submatrices I will need, and when). Along the way, it is helpful to do some minor elementwise computations, but I can also do those afterwards. [*]
Horribly slow REPL example:
using SparseArrays
a = sprand(100, 100, 0.1);
idxSet = [1, 5, 7, 8, 9, 10, 11, 75];   # sorted indices to extract
mat = Matrix(a[idxSet, idxSet]);        # dense submatrix
The large sparse matrix would fit into GPU memory. Values and indices are 32 bit. I am free with respect to memory layout; e.g. I currently use something like CSR, but with indices and values interleaved as

struct matEntry
    idx::Int32
    val::Int32
end

The matrix happens to be symmetric, if that helps.
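For concreteness, the surrounding container looks roughly like this (the name and fields here are made up purely to illustrate what I mean by “something like CSR”):

struct InterleavedSparse                 # hypothetical name, just to illustrate the layout
    colptr::Vector{Int32}                # entries for column j live at colptr[j] : colptr[j+1]-1
    entries::Vector{matEntry}            # (idx, val) pairs, sorted by idx within each column
end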
Any pointers on whether and how to accelerate this on GPU? To repeat, I am free to use any layout I want; and if there is some library like cuSPARSE that already has similar functionality, I’m happy to read the relevant code and adapt it.
Unfortunately I could not find dense submatrix extraction in the cuSPARSE API docs, nor could I find anything like this on Google, so either I failed at RTFM, or this is difficult, or it is a niche problem.
Currently everything runs on the CPU, and the dense submatrix extraction is my main bottleneck (I need to do this often!). The current CPU algorithm works merge-style: for each column, iterate over the stored entries and the selected indices simultaneously; when both match, write the value to the output and advance both; when selectedIndex < entry.idx, write a zero and advance the selected index; when selectedIndex > entry.idx, advance the entry. This looks like it could be done on GPU; it is basically the merge step of mergesort, and there are plenty of GPU mergesort implementations lying around.
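For reference, here is that algorithm as a simplified sketch against a plain SparseMatrixCSC rather than my interleaved layout (assuming idxSet is sorted):

using SparseArrays

function extract_submatrix(a::SparseMatrixCSC{Tv}, idxSet::Vector{<:Integer}) where {Tv}
    n = length(idxSet)
    out = zeros(Tv, n, n)       # "write zero" cases are handled by the zero fill
    rows = rowvals(a)
    vals = nonzeros(a)
    for (jout, j) in enumerate(idxSet)       # selected columns
        r = nzrange(a, j)
        k = first(r)                         # position in column j's stored entries
        i = 1                                # position in idxSet
        while k <= last(r) && i <= n
            row = rows[k]
            sel = idxSet[i]
            if row == sel                    # match: copy value, advance both
                out[i, jout] = vals[k]
                k += 1; i += 1
            elseif sel < row                 # selected index has no entry: leave zero
                i += 1
            else                             # stored entry not selected: skip it
                k += 1
            end
        end
    end
    return out
end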
[*] “Minor elementwise computations” means that I really only need a boolean or bitmatrix result, à la mat .> threshold. I know how to do the bitpacking + padding as desired on the CPU with SIMD, via the most awesome pmovmskb, in order to get the desired end result; and this sort of thing is fast enough on CPU. It would still be nice to compress the 32 bits down to 8 bits or even 1 bit on the GPU side, before transferring back to the CPU.
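If I understand the CUDA.jl broadcasting machinery correctly, the 32-to-8-bit step at least looks trivial, since a broadcast comparison already yields a CuArray{Bool} (one byte per element) that can be copied back directly; the 1-bit packing would presumably need a small custom kernel on top. Untested sketch:

using CUDA
d_mat = CUDA.rand(Float32, 1000, 1000)   # stand-in for an extracted dense submatrix on the GPU
threshold = 0.5f0
d_bool = d_mat .> threshold              # CuArray{Bool}: 1 byte per element, computed on the GPU
h_bool = Array(d_bool)                   # transfer back 1 byte per element instead of 4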