Help speeding up submatrix extraction from SparseCSC

Hello again,
I’d like to ask for help optimizing the extraction of a dense submatrix from a SparseCSC matrix. In other words,

submatrix_reference(spmat, rows, cols) = Matrix(spmat[rows, cols])

where spmat is a sparse matrix, and rows and cols are ordered lists of row and column indices.

I have the following code:

using SparseArrays
using BenchmarkTools

function intersect_kernel!(res, rows, rowval, nzval, resOffset, rowvalPtr, rowvalEnd, default)
    rowUntil = length(rows)
    rowIdx = 1
    while rowvalPtr <= rowvalEnd && rowIdx <= rowUntil
        @inbounds row = rows[rowIdx] 
        @inbounds rowval_ = rowval[rowvalPtr]
        @inbounds res[resOffset + rowIdx] = ifelse(row == rowval_, nzval[rowvalPtr], default)
        rowIdx = ifelse(row <= rowval_, rowIdx + 1, rowIdx)
        rowvalPtr = ifelse(rowval_ <= row, rowvalPtr + 1, rowvalPtr)

function submatrix!(res, spmat::SparseMatrixCSC, rows, cols, default)
    fill!(res, default)
    nrows = length(rows)
    nzval = spmat.nzval
    rowval = spmat.rowval
    colptr = spmat.colptr
    for colIdx in 1:length(cols)
        resOffset = (colIdx-1)*nrows
        @inbounds col = cols[colIdx]
        @inbounds rowvalPtr = Int(colptr[col])
        @inbounds rowvalEnd = colptr[col+1]-1
        intersect_kernel!(res, rows, rowval, nzval, resOffset, rowvalPtr, rowvalEnd, default)
function submatrix(spmat::SparseMatrixCSC{V,T}, rows, cols) where {V,T}
    res = zeros(V, length(rows), length(cols))
    submatrix!(res, spmat, rows, cols, zero(T))

submatrix_reference(spmat, rows, cols) = Matrix(spmat[rows, cols])

I can test that with

spmat = sprand(Int32, 1000, 1000, 0.1); rows = collect(1:10:1000); cols = collect(2:10:1000); res = zeros(eltype(spmat), length(rows), length(cols));
@assert submatrix_reference(spmat, rows, cols) == submatrix(spmat, rows, cols)
@btime submatrix!(res, spmat, rows, cols, zero(eltype(spmat)))
(bel = @belapsed submatrix!(res, spmat, rows, cols, zero(eltype(spmat)));); bel * 1e9 / length(res)

which yields

julia> @btime submatrix!(res, spmat, rows, cols, zero(eltype(spmat)))
  39.777 μs (0 allocations: 0 bytes)
julia> (bel = @belapsed submatrix!(res, spmat, rows, cols, zero(eltype(spmat)));); bel * 1e9 / length(res)

In other words, ~4ns / extracted entry. Not too shabby! The generated x86 assembly looks reasonably tight as well.

That’s ~16 cycles per entry of output on my machine; in the above example, ~8 cycles per examined entry (typically need to examine two entries of input per entry of output with above stats). The latencies in the critical path (load from L1, compare, SETLE, ADD) should add to roughly this ballpark (but maybe someone clever on this forum can shave off some cycles? I would not be surprised if one could go down to 6 cycles?)

In order to go faster on CPU, one needs to vectorize/parallelize. The idea would be to parallelize over SIMD lanes: In other words, each simd-lane would run a different column. Is there a working package for that, a la ISPC (eg now defunct ISPC.jl)? I.e. a package that does the necessary plumbing to broadcast a kernel that includes some control flow (interserct_branchless!) over SIMD lanes (using masked moves in lieu of ifelse and gather/scatter in lieu of getindex / setindex!)? Or do I need to do that by hand, presumably via SIMD.jl?

Second, this is a imo pretty natural operation that appears to be unfortunately missing from both SparseArrays and CuSparse. Does anyone know whether this is an operation that has a “standard name” in other sparse array packages (possibly on other languages)?

Do you think that functionality belongs into stdlib SparseArrays?

Do you think this is a good fit for GPU and are there any initial comments for GPU-naive people like myself?

PS. In real-life, the matrix is quite sparse, but the extracted submatrix is quite dense (there’s a structural reason for extracting the specific submatrices). In real life, I care about SparseCSC{Int32, Int32}. I need to extract a lot of different submatrices of the same medium-sized (much larger than cache, much smaller than main memory) constant sparse matrix. I care primarily about throughput, not latency: I can saturate all cores by doing multiple matrices in parallel; hence multi-threading does not help here.

Just came across this, as far as I know we don’t have anything like ISPC, closest is SIMD.jl and also VectorizationBase.jl with more automation. If you know exactly what SIMD instructions you are going call, you can use llvmcall or ccall to call them directly.