# How to get a sub matrix of a large sparse matrix / array efficiently?

Hi all,

I have a large sparse matrix / 2-d-array, some random indices and want to obtain a sub matrix of the large sparse matrix for these random indices. I noticed that a `view()` is quite fast for doing this (order of `~ 500 ns`), however I struggle to get a copy of the sub matrix at a similar performance (order of `~ 5 ms`). Please see my benchmark below. The matrices are quadratic in my specific case. Is there a way to get the sub matrix more efficiently (as a copy)? I kind of need a copy/SparseMatrixCSC again, because I cannot use sparse arithmetics on the view (see Slow arithmetic on views of sparse matrices and https://github.com/JuliaLang/julia/issues/21796). I’m very happy for any help!

``````using SparseArrays
using Random
using BenchmarkTools

const n = 1000000
const sparr = sprand(n, n, 0.0001) # a large sparse matrix
const subinds = unique(sort(rand(1:n, 1000))) # indices for the sub matrix

const sparr_sub_view = view(sparr, subinds, subinds) # sub matrix view type
const sparr_sub_copy = sparr[subinds, subinds] # sub matrix copy/SparseMatrixCSC type
sparr_sub_copy == SparseMatrixCSC(sparr_sub_view) # this is 'true' as a check

# benchmarks:
@btime view(sparr, subinds, subinds) # 566.207 ns (3 allocations: 112 bytes)
@btime sparr[subinds, subinds] # 4.749 ms (7 allocations: 7.64 MiB)
``````

Note that you are benchmarking instantiating a view object, which is a very cheap operation.

Possibly the relevant benchmark is using the result (of view or the submatrix), and you may find that making a copy is faster in the end, but this depends on what you are doing.

2 Likes

yes, working with the copy appears faster for me; also due to the fact that a view object does not have sparse methods such as `findnz()`, `rowvals()` and `nonzeros()`. (Maybe there is a way to get this for a view?? I just didn’t find it yet.)

But is this plain slicing to get the copy really the fastest way here (`sparr_sub_copy = sparr[subinds, subinds]`)? I hoped that there is a more clever solution using sparse matrix iteration, but I was not successfull to improve this so far.

These could be implemented, but I don’t think it has been done (yet).

I stumbled upon the same problem and it is not super easy to get what you want because of the compressed format. I think the simplest is to work at the level of the vectors `I,J,K` given by `findnz`.

yeah, I don’t know, what I tried was:

``````function getsubmat(sparr, subinds)
sparr_sub = spzeros(length(subinds), length(subinds))
lookup = Dict(subinds .=> 1:length(subinds))

rows = rowvals(sparr)
vals = nonzeros(sparr)

for j=1:size(sparr_sub)
for i in nzrange(sparr, subinds[j])
row = rows[i]
val = vals[i]
if row in subinds
sparr_sub[lookup[row], j] = val
end
end
end
sparr_sub
end

const sparr_sub_copy2 = getsubmat(sparr, subinds)
sparr_sub_copy == sparr_sub_copy2 # is 'true' as a check

@btime getsubmat(sparr, subinds) # 52.227 ms (36 allocations: 120.47 KiB)
``````

This makes use of the sparse matrix iteration scheme using `rowvals()` and `nonzeros()`. It allocates less memory compared to the slicing version, but is terribly slow. All quite unsatisfactory. It seems to me that there must be a better way to “just” read out a submatrix in a below milli seconds range, but I cannot see it at the moment…

Could you elaborate on this a bit more?

You loop over I,J to find if the indices are in your subindices. You get new Isub, Jsub and Ksub and give this to `sparse`

Something like this? I just sketched the code; the `for` loop is fast (`~100 ns`), however the `if`’s make it extremely slow.

``````const I, J, K = findnz(sparr)

function getsubmat(I, J, K, subinds)
Isub = Int[]
Jsub = Int[]
Ksub = Float64[]

for (i, j, k) in zip(I, J, K)
if j ∈ subinds && i ∈ subinds
# do stuff
end
end
# sparse(Isub, Jsub, Ksub)
end

@btime getsubmat(I, J, K, subinds) # 53.698 s (3 allocations: 240 bytes)
``````

you should propably loop over subinds instead and take advantage of I being ordered

I think I don’t see what you mean… This is then basically what I tried above already, isn’t it?

This is a loop over subinds to get the columns of `sparr`

1 Like

To extend my previous answer a bit, another version would be

``````function getsubmat(sparr, subinds)
Isub = Int[]
Jsub = Int[]
Ksub = Float64[]

lookup = Dict(subinds .=> 1:length(subinds))

rows = rowvals(sparr)
vals = nonzeros(sparr)

for j=1:length(subinds)
for i in nzrange(sparr, subinds[j])
row = rows[i]
val = vals[i]

if row in subinds
push!(Isub, lookup[row])
push!(Jsub, j)
push!(Ksub, val)
end
end
end
sparse(Isub, Jsub, Ksub)
end

@btime getsubmat(sparr, subinds) # 55.237 ms (54 allocations: 141.97 KiB)
``````

which has again quite similar (terrible) performance.