Non-sorted SparseMatrixCSC

Adding elements to a sparse one after another is slow due to sorting, allocating and copying. Is there another AbstractSparseMatrix that is more friendly to bit-by-bit construction. Or even one that can leverage the fact that elements are added in the correct (sorted) sequence.

My quick and dirty prototyping led me to a performance bottleneck that is appending to existing vectors. Can someone point me to a Julia vector data structure that grows in chunks, maybe even without the need to copy old elements?

Can you show a little bit more about how you are appending these vectors?

push! and append! do grow the underlying vector storage in chunks. Perhaps sizehint! will fit your needs. It is difficult to say without more context.

Assuming you don’t need to use the matrix until it is fully constructed, then you could add your data to (I,J,V) arrays element-by-element in an arbitrary order and then call the sparse(I, J, V) constructor.

2 Likes

Assuming you don’t need to use the matrix until it is fully constructed, then you could add your data to (I,J,V) arrays element-by-element in an arbitrary order and then call the sparse(I, J, V) constructor.

That was my plan.

@platawiec thanks for the advice:

using SparseArrays
mutable struct MySparseMatrix{Tv,Ti<:Integer} <: SparseArrays.AbstractSparseMatrixCSC{Tv,Ti}
    I::Array{Ti,1} # row indices
    J::Array{Ti,1} # column indices
    V::Array{Tv,1} # value
    Imax::Ti # number of rows
    Jmax::Ti # number of columns
    count::Ti # number of non-zero
    maxcount::Ti # counter that triggers the allocation of more memory
end

import Base.setindex!
function setindex!(A::MySparseMatrix, x, i1::Int64, i2::Int64, I::Int64...)
    if abs(x) > 0
        if A.count == A.maxcount
            A.maxcount += 1000000
            sizehint!(A.I, A.maxcount)
            sizehint!(A.J, A.maxcount)
            sizehint!(A.V, A.maxcount)
        end
        append!(A.I, i1)
        append!(A.J, i1)
        append!(A.V, i1)
        A.count += 1
    end
end

I have a test that allocates 7 million entries. It spends most time in sizehint! and resize! and takes 1.6s. Do I use sizehint! correctly?

What I had was:


using SparseArrays
mutable struct MySparseMatrix{Tv,Ti<:Integer} <: SparseArrays.AbstractSparseMatrixCSC{Tv,Ti}
    I::Array{Ti,1}
    J::Array{Ti,1}
    V::Array{Tv,1}
    Imax::Ti
    Jmax::Ti
    count::Ti
    maxcount::Ti
end

import Base.setindex!
function setindex!(A::MySparseMatrix, x, i1::Int64, i2::Int64, I::Int64...)
    @inbounds if abs(x) > 1e-12
        if A.count == A.maxcount
            A.maxcount += 1000000
            append!(A.I, Array{Int64}(undef, 1000000))
            append!(A.J, Array{Int64}(undef, 1000000))
            append!(A.V, Array{Float64}(undef, 1000000))
        end
        A.I[A.count] = i1
        A.J[A.count] = i2
        A.V[A.count] = x
        A.count += 1
    end
end

Which takes 1s and spends a lot of time in copyto!.

Hence, my question about more clever vector data structures.

Note that you can already construct SparseMatrix instances with the I,J,V you dont’ need to reinvent the wheel here.

A litte context: I want to provide ForwardDiff.jacobian! with a sparse matrix instead of a dense matrix, because the Jacobian is very large but sparse.

In this line the elements of the resulting Jacobian matrix are set one-by-one.

Sounds like https://github.com/JuliaDiff/SparseDiffTools.jl might be useful

True! But what I had in mind does not rely on any a priori sparsity, that is sparsity of the Jacobian that holds globally over the space of input variables.
Image a function which has a Jacobian that is a priori dense, because every element of the Jacobian has a spot in the space of input variables where it is non-zero. But ex post, when evaluating the function’s Jacobian at a certain point in the state space, it comes out extremely sparse.
Maybe the dense Jacobian does not fit into memory, but the chunks that ForwardDiff work on do. Or the extreme sparsity gives a performance advantage when using the Jacobian for some calculation.

1 Like

It seems you are trying to re-implement the already built-in array growing yourself. There is no need. Julia itself already does that, just use append! on its own. You can use a single sizehint! if you know approximately how many non-zero entries the final arrays will have but calling it like you are doing now is pointless (in fact it can be pessimising).

2 Likes

I see, but what happens when I append beyond the size hint given? Do I read you correctly that Julia grows (allocate, copy, maybe garbage collect) the vector by a large chunk when necessary so that the next thousands of append!s are cheap?

I do not know how many elements the vector may have.

With the vectors becoming very large I imaging that it is increasingly difficult to store them continuously in memory. So I will experiment with Vector{Vector{Int64}}. Give a size hint whenever I add another inner vector and add another when it is full. Does that make sense?

Yes, except that the amount of over-allocation is proportional to the current size of the array, rather than being some fixed amount. This turns out to be important to achieve amortized constant time for push!. This is the same behavior of pretty much any other dynamic array type (like std::vector in C++).

This seems like unnecessary complexity. Try just using push! or append! first and see if that is sufficient.

5 Likes

Hi you may have a look at Extendable Sparse.jl.

Disclaimer: I’m the author…

3 Likes

This is perfect! ExtendableSparse.jl is about as fast as my prototype in a test where I pre-allocate vectors that fit all elements that are successively added!

When this issue is resolved, one should be able to provide an ExtendableSparseMatrix to ForwardDiff.jacobian!, right?

In the moment I doubt this. I’ll investigate this as it indeed would be intriguing if it would work.
The way I use it is in the assembly of coupled nonlinear PDEs based on edge callbacks (FEM with cell callbacks should work in a similar way). From each edge or cell I get a dense local matrix from ForwardDiff.jacobian! and sort these entries into the global matrix of type ExtendableSparse. So I don’t make any assumptions on sparsity handling in ForwardDiff . In a way this is not optimal if I have sparse couplings between the PDEs.

Ok, with v0.3.0 ExtendableSparse cooperates well with ForwardDiff.jl . Thanks for asking - I never found the time to try this out before.

I can confirm that a ExtendableSparseMatrix works as Jacobian argument in a ForwardDiff.jacobian! call. However, the performance is much worse than when using a SparseMatrix:

function f!(dx, x)
    dx .= x[1]
end

using ForwardDiff, SparseArrays, ExtendableSparse, BenchmarkTools
n = 100
m = 100000
jac_ext = ExtendableSparseMatrix(m, n);
@btime ForwardDiff.jacobian!(jac_ext, f!, zeros(m), zeros(n));

jac = spzeros(m, n);
@btime ForwardDiff.jacobian!(jac, f!, zeros(m), zeros(n));

all(jac .== jac_ext)

Using the code from ExtendableSparses README yields the same results.

This is despite a ExtendableSparseMatrix being much faster than a SparseMatrix in a test where I successively add elements, like ForwardDiff does.

@j-fu have you done performance tests?

Well, not for this kind of use with ForwardDiff. In fact, the performance very much depends on the way things are called internally in ForwardDiff which I did not investigate.

Your function is kind of the worst case for the linked list internal format used in ExtendableSparse, as it generates a full row in the matrix (will have to mention this caveat in the readme: I assume that “sparse” means that all rows/columns have << n entries). But it seems that this is not the problem here.

However, this brings me back to my previous post: “atomic” assembly of the finite difference operator without too much assumptions on the inner workings of ForwardDiff. So let me extend your example a bit (VoronoiFVM is my package):

using ExtendableSparse
using ForwardDiff
using LinearAlgebra
using SparseArrays
using VoronoiFVM

#
# Finite difference operator for
# -\Delta u^2 +  u^2 = 1 
#
function ffd(y,x)
    n=length(x)
    h=1/(n-1)
    y[1]=(x[1]^2-1)*0.5*h
    for i=2:n-1
        y[i]=(x[i]^2-1.0)*h
    end
    y[end]=(x[end]^2-1)*0.5*h

    for i=1:n-1
        dx=(x[i+1]^2-x[i]^2)/h
        y[i+1]+=dx
        y[i]-=dx
    end
end

function flux!(f,u,edge)
    f[1]=u[1,1]^2-u[1,2]^2
end

## Storage term
function storage!(f,u,node)
    f[1]=u[1]
end

function reaction!(f,u,node)
    f[1]=u[1]^2-1.0
end

function runtest(n)
    
    jac_ext = ExtendableSparseMatrix(n, n);
    @time begin
        ForwardDiff.jacobian!(jac_ext, ffd, ones(n), ones(n));
        flush!(jac_ext)
    end
    
    jac = spzeros(n, n);
    @time ForwardDiff.jacobian!(jac, ffd, ones(n), ones(n));
    all(jac .== jac_ext)

    # Set up stuff for VoronoiFVM which uses atomic assembly
    h=1.0/convert(Float64,n-1)
    X=collect(0:h:1)
    grid=VoronoiFVM.Grid(X)
    physics=VoronoiFVM.Physics(flux=flux!, storage=storage!,reaction=reaction!)
    sys=VoronoiFVM.System(grid,physics,unknown_storage=:dense)
    enable_species!(sys,1,[1])
    solution=unknowns(sys,inival=1.0)
    oldsol=unknowns(sys,inival=1.0)
    residual=unknowns(sys,inival=1.0)
    tstep=1.0e100 # large timestep aproximates stationary problem
    
    @time begin
        VoronoiFVM.eval_and_assemble(sys,solution,oldsol,residual,tstep)
        flush!(sys.matrix)
    end
    
    all(sys.matrix.≈jac_ext)
end

second run:

julia> runtest(20000)
  5.806884 seconds (32 allocations: 7.174 MiB)
  3.925591 seconds (41 allocations: 6.275 MiB)
  0.025753 seconds (117.03 k allocations: 3.849 MiB)
true

I intentionally use @time here, as @btime probably won’t be dominated by
the initial structure buildup phase. This is the kind of things I also benchmarked
before (see the benchmarking stuff with fdrand!() in ExtendableSparse).

In VoronoiFVM, I call ForwardDiff with flux!, reaction!, storage! and get local matrices which I then assemble into the global matrix.

Just an add-on here: here is a timing plot:

compare

Seemingly, ForwardDiff seems to consider all possible matrix entries due to being O(n^2), and I guess this can be improved only with a priori information on sparsity.
OTOH, “atomic” assembly in VoronoiFVM automatically takes the sparsity into account in an implicit manner.

The explantion of the worse performance of ExtenableSparse with ForwardDiff ist then the following:

function Base.setindex!(ext::ExtendableSparseMatrix{Tv,Ti}, v, i,j) where{Tv,Ti<:Integer}
    k=findindex(ext.cscmatrix,i,j)
    if k>0
        ext.cscmatrix.nzval[k]=v
    else
        if ext.lnkmatrix==nothing
            ext.lnkmatrix=SparseMatrixLNK{Tv, Ti}(ext.cscmatrix.m, ext.cscmatrix.n)
        end
        ext.lnkmatrix[i,j]=v
    end
end

I assume the matrix is always a sum of a cscmatrix and a lnkmatrix. This allows to update entries and flush! multiple times. If we try to insert a zero into a non-existing slot, we have to look up both, so searching twice. This IMHO explains the performance difference between ExtendableSparse and SparseMatrixCSC if we try to insert mostly zero values.

Inserting a zero into SparseMatrixCSC is connected with a search as well, see
sparsematrix.jl

I guess if this case would be catched in ForwardDiff, we would see a different picture. As far as I can see in jacobian.jl this is not done.

Thank you, for that detailed explanation!

I think the only way forward is to make adjustments in ForwardDiff. And it seems an to be an easy fix!