Is working with SubArrays of a SparseMatrix a bad idea?

I’ve been trying to work with views of a SparseMatrix for a while now. The goal is to avoid extra copies when you want to operate inplace on a subsection of the sparse matrix. I’ve run into enough issues that I’m starting to think that this is simply a bad idea. The issues fall into two categories: 1) a valid method doesn’t exist for a SubArray of a SparseArray or 2) some valid method exists, but the result for SparseArray is terribly slow which more than negates the advantages of using a sparse array in the first place.

Examples of 1) are #286 and #16. Obviously these can be or have been fixed.

There are lots of examples of 2) but a very simple one is that there is no optimized method for sum for SubArray{T, N, AbstractSparseArray}, which has a major performance impact.

using SparseArrays, BenchmarkTools
a = sprand(1000, 1000, .01)
aview = view(a, :, :)
amat = Matrix(a)
sum2(x::SubArray{T, N, <:AbstractSparseArray}) where {T, N} = sum(nonzeros(x))

@btime sum($a)        # 1.604 μs (0 allocations: 0 bytes)
@btime sum($amat)     # 182.875 μs (0 allocations: 0 bytes)
@btime sum($aview)    # 4.646 ms (0 allocations: 0 bytes)
@btime sum2($aview)   # 1.642 μs (0 allocations: 0 bytes)

I don’t intend this to be criticism of the SparseArrays developers - they’ve been awesome. The package itself is quite good and mature. With enough time and developer effort I know all this can be fixed.

However, supporting operations on SubArray in Julia in general seems extremely cumbersome for developers since you don’t get things like SubArrray{T, 1, SomeArray} <: SomeArray{T, 1} by default. This seems to result in the need for tons of custom methods, type unions, etc that I know have been challenging for package developers to keep up with. E.g. “Why don’t you support operation X efficiently on a view of a reshape of a view of an UpperTriangular of your special array?”

I’m curious to know how others have approached these sorts of issues. I’m also interested in general advice for working with Julia in this context - even if it’s just to keep posting issues until it all works.

Sparse matrix internal structure is optimized for operations, especially multiplication, rather than arbitrary indexing Sparse matrix - Wikipedia and Sparse matrix - Wikipedia. Creating and/or using a subarray requires such arbitrary indexing. Operations on the subarray can no longer take advantage of the efficient structure of the original sparse matrix. They are likely to pass through an indexing stage. If you have several operations to perform with the subarray, then you should create a new sparse matrix for the subarray.

1 Like

Forget about the timing! I think there is a correctness bug here:

julia> sum(@view a[:,1:5])
25.236381449783398

julia> sum(nonzeros(@view a[:,1:5]))
5140.889898161122

The bug itself is quite obvious, in sparsematrix.jl:

nonzeros(S::SparseMatrixCSCView)  = nonzeros(S.parent)

nonzeros goes directly to parent without filtering output to view.
(and the next two nonzeros specializations for Triangular matrices look fishy as well).

Is this true if you limit yourself to views that include whole columns or groups of adjacent columns?

That depends on the internals of the view, which I don’t know about, but those could be fast(er).

I see that now too.

The right code should be more along the lines of:

import SparseArrays: nonzeros

function nonzeros(S::SparseArrays.SparseMatrixCSCView)
    s = S.parent.colptr[first(S.indices[2])]
    e = S.parent.colptr[last(S.indices[2])+1]
    if S.indices[1]==axes(S.parent,1)
        return @view S.parent.nzval[s:e-1]
    else
        res = eltype(S.parent)[]
        for i in s:e-1
            if S.parent.rowval[i] in S.indices[1]
                push!(res, S.parent.nzval[i])
            end
        end
        return res
    end
end

And as some other nonzeros definitions are suspicious, we have for example:

julia> ut = UpperTriangular(sprand(5,5,0.5))
5×5 UpperTriangular{Float64, SparseMatrixCSC{Float64, Int64}}:
 0.0  0.0       0.919615  0.364471  0.0
  ⋅   0.268043  0.0       0.254744  0.441155
  ⋅    ⋅        0.0       0.0       0.0
  ⋅    ⋅         ⋅        0.935346  0.792978
  ⋅    ⋅         ⋅         ⋅        0.0

julia> sum(ut)
3.97635045952992

julia> sum(nonzeros(ut))
7.379883839287259

The current package does appears to work for 1D views:

julia> sum(a[:,1])
5.826271341749841

julia> sum(view(a, :, 1))
5.826271341749841

julia> sum(nonzeros(view(a, :, 1)))
5.826271341749841

The problem is the specializations in SparseArray package.
So it would pop-up only on sparse matrix views.

Here is a version for sparse upper triangular matrices of nonzeros:

function nonzeros(S::SparseArrays.UpperTriangular{<:Any,<:SparseArrays.SparseMatrixCSCUnion})
    res = eltype(S.data)[]
    c = 1
    nextc = S.data.colptr[c+1]
    for i in eachindex(S.data.nzval)
        while i >= nextc
            c += 1
            nextc = S.data.colptr[c+1]
        end
        if S.data.rowval[i]<=c
            push!(res, S.data.nzval[i])
        end
    end
    return res
end

Lower triangular version is identical with <=c replaced with >=c.
While the Triangular versions are fast. In case of sparse matrices, there should be a debate about forcing triangle structure in the underlying representation (as performance considerations different than dense matrices).

Looks related to https://github.com/JuliaSparse/SparseArrays.jl/issues/64