Hello again,

I’d like to ask for help optimizing the extraction of a dense submatrix from a SparseCSC matrix. In other words,

```
submatrix_reference(spmat, rows, cols) = Matrix(spmat[rows, cols])
```

where `spmat`

is a sparse matrix, and `rows`

and `cols`

are ordered lists of row and column indices.

I have the following code:

```
using SparseArrays
using BenchmarkTools
function intersect_kernel!(res, rows, rowval, nzval, resOffset, rowvalPtr, rowvalEnd, default)
rowUntil = length(rows)
rowIdx = 1
while rowvalPtr <= rowvalEnd && rowIdx <= rowUntil
@inbounds row = rows[rowIdx]
@inbounds rowval_ = rowval[rowvalPtr]
@inbounds res[resOffset + rowIdx] = ifelse(row == rowval_, nzval[rowvalPtr], default)
rowIdx = ifelse(row <= rowval_, rowIdx + 1, rowIdx)
rowvalPtr = ifelse(rowval_ <= row, rowvalPtr + 1, rowvalPtr)
end
end
function submatrix!(res, spmat::SparseMatrixCSC, rows, cols, default)
fill!(res, default)
nrows = length(rows)
nzval = spmat.nzval
rowval = spmat.rowval
colptr = spmat.colptr
for colIdx in 1:length(cols)
resOffset = (colIdx-1)*nrows
@inbounds col = cols[colIdx]
@inbounds rowvalPtr = Int(colptr[col])
@inbounds rowvalEnd = colptr[col+1]-1
intersect_kernel!(res, rows, rowval, nzval, resOffset, rowvalPtr, rowvalEnd, default)
end
end
function submatrix(spmat::SparseMatrixCSC{V,T}, rows, cols) where {V,T}
res = zeros(V, length(rows), length(cols))
submatrix!(res, spmat, rows, cols, zero(T))
res
end
submatrix_reference(spmat, rows, cols) = Matrix(spmat[rows, cols])
```

I can test that with

```
spmat = sprand(Int32, 1000, 1000, 0.1); rows = collect(1:10:1000); cols = collect(2:10:1000); res = zeros(eltype(spmat), length(rows), length(cols));
@assert submatrix_reference(spmat, rows, cols) == submatrix(spmat, rows, cols)
@btime submatrix!(res, spmat, rows, cols, zero(eltype(spmat)))
(bel = @belapsed submatrix!(res, spmat, rows, cols, zero(eltype(spmat)));); bel * 1e9 / length(res)
```

which yields

```
julia> @btime submatrix!(res, spmat, rows, cols, zero(eltype(spmat)))
39.777 μs (0 allocations: 0 bytes)
julia> (bel = @belapsed submatrix!(res, spmat, rows, cols, zero(eltype(spmat)));); bel * 1e9 / length(res)
3.9779
```

In other words, ~4ns / extracted entry. Not too shabby! The generated x86 assembly looks reasonably tight as well.

That’s ~16 cycles per entry of output on my machine; in the above example, ~8 cycles per examined entry (typically need to examine two entries of input per entry of output with above stats). The latencies in the critical path (load from L1, compare, SETLE, ADD) should add to roughly this ballpark (but maybe someone clever on this forum can shave off some cycles? I would not be surprised if one could go down to 6 cycles?)

In order to go faster on CPU, one needs to vectorize/parallelize. The idea would be to parallelize over SIMD lanes: In other words, each simd-lane would run a different column. Is there a working package for that, a la ISPC (eg now defunct ISPC.jl)? I.e. a package that does the necessary plumbing to broadcast a kernel that includes some control flow (`interserct_branchless!`

) over SIMD lanes (using masked moves in lieu of `ifelse`

and gather/scatter in lieu of getindex / setindex!)? Or do I need to do that by hand, presumably via `SIMD.jl`

?

Second, this is a imo pretty natural operation that appears to be unfortunately missing from both SparseArrays and CuSparse. Does anyone know whether this is an operation that has a “standard name” in other sparse array packages (possibly on other languages)?

Do you think that functionality belongs into stdlib `SparseArrays`

?

Do you think this is a good fit for GPU and are there any initial comments for GPU-naive people like myself?

PS. In real-life, the matrix is quite sparse, but the extracted submatrix is quite dense (there’s a structural reason for extracting the specific submatrices). In real life, I care about `SparseCSC{Int32, Int32}`

. I need to extract a lot of different submatrices of the same medium-sized (much larger than cache, much smaller than main memory) constant sparse matrix. I care primarily about throughput, not latency: I can saturate all cores by doing multiple matrices in parallel; hence multi-threading does not help here.