Extracting dense submatrix from sparse matrix

Hi, I’m new to GPU programming in general and Julia GPU programming in particular, so apologies if my question turns out to be naive.

I have a largish sparse matrix; given a sorted array of indices, I need to extract the corresponding submatrix into a dense matrix. I need to do this for a lot of submatrices (ranging from tiny to ~1000 x 1000; due to structure, the requested submatrices are all relatively dense-ish, even though the large sparse matrix is really sparse), and I care about throughput only, not latency (i.e. I can foresee which submatrices I will need when). Along the way, it is helpful to do some minor elementwise computations, but I can also do those afterwards. [*]

Horribly slow REPL example:

using Random, SparseArrays
a = sprand(100, 100, 0.1);
idxSet = sort!(shuffle(1:100)[1:20]);  # sorted selection of indices
mat = Matrix(a[idxSet, idxSet]);

The large sparse matrix would fit into GPU memory. Values and indices are 32 bit. I am free with respect to memory layout (e.g. I currently use something like CSR, but with indices and values interleaved, as in struct matEntry idx::Int32; val::Int32 end). The matrix happens to be symmetric, if that helps.

Any pointers on whether and how to accelerate this on a GPU? To repeat, I am free to use any layout I want; and if there is some library like cuSPARSE that already has similar functionality, I’m happy to read the relevant code and adapt it.

Unfortunately I could not find dense submatrix extraction in the cuSPARSE API docs, nor anything like it via Google, so either I failed at RTFM, or this is difficult, or it is a niche problem.

Currently everything runs on the CPU; the dense submatrix extraction is my main bottleneck (I need to do this often!). The current CPU algorithm works merge-style (for each column, iterate over the stored entries and the selected indices in parallel; when both match, write the value and advance both; when selectedIndex < entry.index, write a zero and advance the selected index; when selectedIndex > entry.index, advance the entry). This looks doable on a GPU – it is essentially the same access pattern as mergesort, and there are plenty of GPU mergesort implementations lying around.

[*] “Minor elementwise computations” means that I really only need a boolean or bitmatrix result, a la mat .> threshold. I know how to do the bitpacking + padding as desired on the CPU with SIMD, via the most awesome pmovmskb, in order to get the desired end result; and this style of thing is fast enough on CPU. Would still be nice to compress the 32 bit down to 8 bit or even 1 bit on the GPU side, before transferring back to CPU.
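For concreteness, here is a scalar (non-SIMD) sketch of that packing step — a plain Bool-to-UInt64 chunk reduction, which is what pmovmskb-style SIMD (or a GPU warp ballot) does in bulk. The name packbits is illustrative, not a library function:

```julia
# Pack a Vector{Bool} into 64-bit chunks, little-endian within each chunk.
# Scalar sketch only; SIMD/ballot versions produce the same layout in bulk.
function packbits(bits::AbstractVector{Bool})
    nchunks = cld(length(bits), 64)
    out = zeros(UInt64, nchunks)
    for i in eachindex(bits)
        if bits[i]
            chunk, pos = divrem(i - 1, 64)
            out[chunk + 1] |= UInt64(1) << pos
        end
    end
    return out
end
```

This produces the same chunk layout as Julia's own BitVector. On the GPU side, the per-warp equivalent would be a ballot intrinsic, which collapses 32 per-thread predicates into a single UInt32 before the transfer back.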

Not familiar with sparse matrices, but extended BSR seems to have something:

The advantage of the BSRX format is that the developer can specify a submatrix in the original BSR format by modifying bsrRowPtrA and bsrEndPtrA while keeping bsrColIndA and bsrValA unchanged.

Maybe that maps well onto your dense subblocks?

Alternatively, maybe you could use CSR/CSC and write a kernel that uses a thread per compressed row or column, uncompresses the row/column indices, and then loops over the wanted column/row values? The key here would be to have threads perform memory operations on consecutive memory addresses so that they coalesce, but I guess that would work since the submatrices are somewhat dense. There are probably better options though, just spitballing here :slight_smile:

Unfortunately that doesn’t help me: I have no block structure :frowning:

The setting is more like: I have a big sparse weighted graph (density ~ 1e-3 to 1e-2), and need to look at local neighborhood subgraphs (i.e. the subgraph induced by the set of all neighbors of some node). These local neighborhood subgraphs tend to be much denser (>10%) than the entire graph, so it makes sense to extract a dense representation.
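As a CPU reference for that access pattern, the per-neighborhood extraction against a plain SparseMatrixCSC can be written with rowvals/nzrange directly (a sketch, not tuned; extract_dense is an illustrative name):

```julia
using SparseArrays

# Extract the dense submatrix A[idx, idx] (idx sorted ascending) column by
# column, merging each selected column's stored row indices against idx.
# Equivalent to Matrix(A[idx, idx]), written out as the explicit merge.
function extract_dense(A::SparseMatrixCSC{Tv}, idx::AbstractVector{<:Integer}) where {Tv}
    n = length(idx)
    out = zeros(Tv, n, n)
    rv, nz = rowvals(A), nonzeros(A)
    for (j, col) in enumerate(idx)
        r = nzrange(A, col)
        p, pend = first(r), last(r) + 1
        i = 1
        while p < pend && i <= n
            row = rv[p]
            if row == idx[i]
                out[i, j] = nz[p]
                p += 1; i += 1
            elseif row < idx[i]
                p += 1          # stored entry not in the selected set: skip it
            else
                i += 1          # selected row has no entry: stays zero
            end
        end
    end
    return out
end
```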

On the CPU, the kernel loop would look something like this (pseudo-code):

struct Entry
  rowIdx::Int32
  val::UInt32
end

function rowKernel!(output::AbstractVector{Bool}, outPtr::Int, entries::AbstractVector{Entry}, entryPtr::Int, entryEnd::Int, threshold::UInt32, rows::AbstractVector{Int32})
  rowPtr = 1
  nRows = length(rows)
  while entryPtr < entryEnd && rowPtr <= nRows
    if entries[entryPtr].rowIdx == rows[rowPtr]
      output[outPtr] = entries[entryPtr].val > threshold
      outPtr += 1
      rowPtr += 1
      entryPtr += 1
    elseif entries[entryPtr].rowIdx > rows[rowPtr]
      # selected row has no entry here: leave the zero, advance output
      outPtr += 1
      rowPtr += 1
    else # entries[entryPtr].rowIdx < rows[rowPtr]
      # stored entry is not in the selected set: skip it
      entryPtr += 1
    end
  end
end

function extract!(output::AbstractVector{Bool}, entries::AbstractVector{Entry}, rowPtrs::AbstractVector{Int}, threshold::UInt32, rows::AbstractVector{Int32})
  fill!(output, false)
  # pad each output row to a whole number of 64-bit chunks (for later bitpacking)
  rowStride = 64 * Base.num_bit_chunks(length(rows))
  # assertions are only for documentation/discourse
  @assert length(output) >= rowStride * length(rows)
  @assert (UInt(pointer(output)) & 0x3f) == 0  # 64-byte (cacheline) aligned
  # this guarantees that two different rowKernel! invocations never write into the same cacheline
  # start a bunch of thread-groups here
  for idx = 1:length(rows)
    rowKernel!(output, 1 + (idx - 1) * rowStride, entries, rowPtrs[rows[idx]], rowPtrs[rows[idx] + 1], threshold, rows)
  end
end

with obvious modifications: @inbounds and manual LICM where needed; refactor until the main kernel loop is branch-free (on the CPU it should use only CMOVs, plus a single branch to check whether the loop is finished).
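One way to get there is to express each merge step with ifelse, so all pointer advances become selects rather than branches. A sketch (mergestep is an illustrative name; semantics match the row kernel above):

```julia
# One branch-free step of the merge. `e` is the current entry's row index,
# `r` the current selected row index. Returns updated (entryPtr, rowPtr, outPtr).
@inline function mergestep(e::Int32, r::Int32, val::UInt32, threshold::UInt32,
                           output::AbstractVector{Bool},
                           entryPtr::Int, rowPtr::Int, outPtr::Int)
    if e == r
        @inbounds output[outPtr] = val > threshold
    end
    entryPtr = ifelse(e <= r, entryPtr + 1, entryPtr)  # consume the entry
    rowPtr   = ifelse(e >= r, rowPtr + 1, rowPtr)      # consume the selected row
    outPtr   = ifelse(e >= r, outPtr + 1, outPtr)      # output slot used (hit or zero)
    return entryPtr, rowPtr, outPtr
end
```

The remaining store is conditional, but it can also be made unconditional (always store the computed predicate into a select-chosen slot) if the target architecture prefers that.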

The work size for one rowKernel! call would be far too small for something heavyweight like a Julia Task; but one can probably spin up an entire group at once? (i.e. on the CPU, one would parallelize one level up, between different extract! calls)

On the CPU, the kernel would choke on the latency of the CMOVs for the conditional pointer advancement (the worst possible dependency chain), so one would need to interleave multiple rows in order to saturate all the superscalar execution ports.

Is the same necessary for CUDA? Or is the architecture for each thread simple enough that this does not matter? Or is the compiler smart enough to recognize this and perform the transformation automatically? (since the semantics are much more restrictive than on a CPU, this could conceivably be a valid program transformation)

On the CPU, this also has a memory bandwidth problem: the rows array is shared between all rows over which we run the kernel, and the output is still warm in cache; but each relevant slice of entries is streamed over only once, with no chance of reuse. This looks like a perfect match for the high memory bandwidth of GPUs.

The way I understand it, each group of 32 threads (a warp) executes in lockstep, i.e. I pay for the longest loop length in the group? That is probably OK; worst case, one could sort the work by size first.
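That sorting step is cheap on the host side: ordering a batch of neighborhoods by their entry count is a one-liner against the CSR row pointers (a sketch; nnz_of and order_by_work are illustrative names, rowPtrs as in the pseudo-code above):

```julia
# Number of stored entries in row r, from CSR row pointers.
nnz_of(rowPtrs, r) = rowPtrs[r + 1] - rowPtrs[r]

# Order work items by decreasing entry count, so that threads grouped
# into the same warp see similar loop lengths.
order_by_work(rowPtrs, rows) = sort(rows; by = r -> nnz_of(rowPtrs, r), rev = true)
```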

Me being a complete neophyte on GPUs: is this kind of code a good match for a GPU? Or is it too branchy (e.g. the loop length for a rowKernel! call is not known at compile time, since it depends on entryEnd - entryPtr)?