Repeated Solving of Linear System with the Same Structure

UPDATE WITH MORE COMPLETE EXAMPLE
The advice already provided has been extremely helpful, and I am adding this update to provide greater detail and a more concrete example in case it spurs any other thoughts/advice. The following should provide a bit more clarity as to the specific sparse, banded structure of the system.

I hope to solve a large number of sparse, banded, linear system of the form M*x=v, where M=I - B .*P, P is a sparse stochastic (i.e. transition) matrix, and B is a discount vector that is a vector of numbers close to 1 (between .999 and 1). The code below creates an example of such a matrix M that is ordered in a way to make the bandwidth small. The sparsity structure of M never changes between needing to solve M*x=v, but its values do. It occurred to me that the solver being called is likely spending a good bit of time determining the sparsity structure and appropriate algorithm for the system. Since the structure of the system never changes, I’m hoping that I can improve the speed of my solves by doing this part of the work of solving the system only once.

The code below is a bit long since it needs to accurately generate the sparsity structure of M, however one can simply cut and paste without needing to understand or worry about the lengthy function get_M.

For the curious, the code is intended to solve for the present discounted values of future cash flows in an inventory management problem. An inventory manager has a given inventory capacity that can be filled by three different types of products. Products arrive and leave according to rates that depend on the current inventory level. To see that the system solves for the present discounted value of cash flow, observe that if v is a vector representing the cash flow received at any inventory level that x=v + B.*P*v .+ (B.*P)^2*v .+ (B.*P)^3*v .+ ..., the present-discounted cash flow.

CODE

using SparseArrays, LinearAlgebra, BenchmarkTools
const Φ=3

function get_M(capacity)
    function add_index!(w,Ď•,ww_test,ww_fill,Vstates,plus_inds,minus_inds)
        w[Ď•]+=1
        for ww=(ww_test+1):(ww_fill-1)
            if Vstates[ww]==w
                w[Ď•]-=1
                plus_inds[ww_test,Ď•] = ww
                minus_inds[ww,Ď•] = ww_test
                return ww_fill
            end
        end
        Vstates[ww_fill].= w
        w[Ď•]-=1
        plus_inds[ww_test,Ď•] = ww_fill
        minus_inds[ww_fill,Ď•] = ww_test
        return ww_fill+1
    end

    function get_state_order(capacity::Int64)
        num_states = binomial(capacity+Φ,Φ)
        Vstates=[zeros(Int64,Φ) for bb=1:num_states]
        plus_inds = zeros(Int64,num_states,Φ)
        minus_inds = zeros(Int64,num_states,Φ)
        ww_test=1
        ww_fill=2
        while ww_test<=num_states
            w=Vstates[ww_test]
            if (sum(w)<capacity)
                for ϕ=1:Φ
                    ww_fill=add_index!(w,Ď•,ww_test,ww_fill,Vstates,plus_inds,minus_inds)
                end
            end
            ww_test +=1
        end
        return Vstates, plus_inds, minus_inds
    end

    function makeM(plus_inds,minus_inds,B,P_plus,P_minus)
        N_w = size(plus_inds,1)
        I_A = repeat(collect(1:N_w),Φ)
        J_A = vec(plus_inds)
        V_A = vec(B .* P_plus)
        I_D = repeat(collect(1:N_w),Φ)
        J_D = vec(minus_inds)
        V_D = vec(B .* P_minus)
        Is = vcat(I_A,I_D)
        Js = vcat(J_A,J_D)
        Vs = vcat(V_A,V_D)
        has_J = (Js .!= 0)
        Is=Is[has_J]
        Js=Js[has_J]
        Vs=Vs[has_J]

        #Actual premult matrix.
        BxP = sparse(Is,Js,Vs)
        M = sparse(I,N_w,N_w) .-   (BxP)

        return M
    end

    #Get the states and the adjacency map.
    Vstates, plus_inds, minus_inds = get_state_order(capacity)

    #Create transition probabilities.
    P_plus = rand(Float64,size(plus_inds))
    P_minus = rand(Float64,size(plus_inds))
    #Ensure that transition probabilities are zero when at capacity or empty.
    P_plus[plus_inds.==0] .= 0
    P_minus[minus_inds.==0] .= 0
    #Normalize transition probabilities to sum to 1.
    P_norm = sum(P_plus,dims=2) .+ sum(P_minus,dims=2)
    P_plus = P_plus ./ (sum(P_plus,dims=2) .+ sum(P_minus,dims=2))

    discount_rate = .999 .+ rand(Float64,size(P_plus,1)) ./1000.

    return makeM(plus_inds,minus_inds,discount_rate,P_plus,P_minus)
end

capacity = 40
@time M = get_M(capacity)
v = randn(size(M,1))

@btime $M\$v

ORIGINAL POST BELOW
I have a large, very sparse banded matrix M. I need to compute M\v many times with slight changes to the values (but not the sparsity structure) of M. It occurs to me that much of the time spent in M\v or other solvers is spent determining and exploiting the sparsity structure. I wonder whether there is a way not to repeat that work. In my case, the code below produces times that are similar for “solve once” and “solve twice simultaneously” but twice as long for “solve twice.” This suggests to me that if there were a way to do the first part of the solver’s work in exploiting the sparsity structure just once that it would improve my computation time by an order of magnitude or more. Does anyone know whether there is a way to do this (without writing my own linear systems solver in C,C++,or Fortran)? Thank you!

#Time 1:Solve once.
v1 = rand(size(M,1))
@btime $M\$v1 

#Time 2:Solve twice sequentially.
function tf(M,v)
	out = M\v
	out .= M\v
	return out
end
@btime tf($M,$v1)

#Time 3: Solve twice simultaneously.
v2 = rand(size(M,1),2)
@btime $M\$v2 

@dlfivefifty, thank for the suggestion! Unfortunately, the bands are rather wide and sparse. In the mean case, the dimension of M is approximately 12_000 and the bandwidth is approximately 300. Many cases are much worse. Given that I have <=7 nonzero elements per column, banded matrices would require holding at least 50x the data in memory, which is prohibitive in my case.

Sounds like you are solving finite difference/finite element problems. Here is the solution I worked up for using UMFPack (the sparse direct solver that the backslash operator uses):

The data is stored here. Note that the A matrix and the fac object share the the same underlying arrays. This requires changing the integer arrays back and forth between 1 based (what Julia expect) and 0 based (what UMFPack expects).

The actual solve is done here. It reuses the symbolic factorization (the part where UMFPack figures out the sparsity pattern of the factored matrix) unless the SKYLAKE_STACKSMASH variable is true (I ran into a weird problem on an Intel Skylake processor), and only repeats the numeric factorization.

I wrote a new ccall wrapper for the UMFPack solve routine here. Looking back at the code, I can’t remember why I did this exactly, It might have been necessary to avoid repeating the factorization at the time (this code was originally written on Julia 0.4), but it may not be necessary anymore.

Hope this helps.

2 Likes
1 Like

Being able to seed an initial condition to an iterative solver would be one way that could help under certain conditions.

2 Likes

To give a little background on @JaredCrean2’s answer, sparse direct solvers (what A\b is calling under the hood when A is a sparse matrix) have 3 phases

  1. Symbolic factorization
  2. Numerical factorization
  3. Solve (forward and backward substitution)

If you have a bunch of matrices with the same sparsity structure but different numerical values, then you can reuse the symbolic factorization for all the matrices but you’ll need to redo the numerical factorization every time. You can save a little bit of time this way but the numerical factorization is by far the most computationally intensive of the three stages so the savings won’t be anywhere near as large as the OP would hope.

However, in the benchmarks of the OP, you seem to be looking at a different problem. You’re looking at the time to solve two systems of linear equations that both have the same matrix (sparsity structure and numerical values) but different right hand sides. In this case, you can reuse the same numerical factorization to solve for multiple right hand sides and this will be a great deal faster than calling A\b repeatedly in a loop. For example, on 1.0:

using BenchmarkTools, SuiteSparse, SparseArrays
julia> A = sprandn(10000,10000,1e-4);
julia> A = A'*A + 10*I;
julia> b = randn(10000);
julia> @btime L = cholesky($A);
  2.467 ms (33 allocations: 2.37 MiB)

julia> @btime x = $L\$b;
  114.755 ÎĽs (10 allocations: 547.24 KiB)

Sparse direct solvers are pretty amazing. If you want to really dig into how they work, check out Tim Davis’ book

https://epubs.siam.org/doi/book/10.1137/1.9780898718881

1 Like

Thank you all for your responses!

@dlfivefifty, thanks for the suggestion. Unfortunately, the bands are not well-described by blocks either :-(.

@JaredCrean2 it sounds like what you’ve done is very similar to what I was hoping to do! I will try to study your code closely to see if I can build on it. Thanks!

@ChrisRackauckas, that definitely does help. I didn’t mention it here, but I’m currently using a well-seeded IterativeSolver since I change the values in M and v each time by relatively small amounts, so the previous solution is often a good starting point.

@Pbellive, thanks for the intuitive explanation. You’re right that my OP benchmarks don’t include two different M with the same sparsity structure. This was mainly to make it concise and to avoid ugly code modifying M.nzval. I will definitely be modifying both M and v in my actual application and was hoping that (1) actually took much of the time and that (2) was comparatively fast. (It sounds like this is not the case, however, which is rather unfortunate.)

It also occurs to me that in my actual use-case the linear system solves substantially more slowly than in the example provided by @PBellive. This suggests to me that \ is calling a sub-optimal solving algorithm for my use-case. I will modify try to modify my OP to better reflect my actual use-case in case the easiest/fastest thing is to change the algorithm that \ is telling the linear algebra solver to use a different algorithm.