How to optimize simple matrix equation solving

Hello !

I implemented a method which aims at detecting change-point in categorical time-serie, which gives correct results but is way too slow (slower than the python version already available, which is itself already slow). I’ve prealocated memory everywhere I could and done my maximum but the code is still 4x slower than python.
I have determined that the bottleneck here is this step (it’s embeded in a for loop):

theta = (H + lambda*Matrix{Float64}(I, b, b))\h

H is a square matrix of dimension 300x300 and h is a vector of length 300. So in the end it comes down to the following question : How to solve the equation (A + Lambda*I)X = B the most eficiently ?

Do you guys have an idea how I could it faster than with the \ operator ?

Cheers !

One slightly more efficient way would be not to materialize I into a matrix like this:

theta = (H + lambda * I)\h

But are you really sure that this part of your code is the problem? If all you’re trying to do is calculating eigenvectors, it’s probably better to just use the built-in methods. Without further information about the rest of your code, I don’t think you can get much better performance for this example. There are some inplace BLAS methods to get rid of some unnecessary allocations, but I think one would have to take a look at your actual code to get rid of these performance bottlenecks.

Well when I use the profiler, it shows that almost 2/3 of the run time, if not more, is spend on doing this task repeatedly so I think it’s where I should try to optimize.

I will try what you suggested. What do you mean with the build-in methods ?
Should I post the code ? It’s 80 lines-long and a bit messy (I think ?) when one has not read the paper which describes the method used.

Thanks for your help !

Is your matrix H special in any way? How many right hand sides (or shifts) do you need to solve? What problem are you ultimately trying to solve?

H is a square real-valued matrix, but apart from that i’m not aware of it having other specific characteristics.

The problem I am trying to solve is that of change-point detection : finding out when, in a time-series, the probability distribution of the values changes. Meaning we approximate the time-series as a piece-wise stationary process and we want to find out where the points of “discontinuity” are.
For that, I slide along the time-series and take samples before and after each given point of interest and see if the distribution have similar properties or not.

Estimating the distribtion directly is difficult and unreliable, so it is better to estimate the difference or the division of the distribution. with my code I want to estimate the difference.
The paper Density-Difference Estimation from Masashi Sugiyama et al. (2012) dscribes the method, basically one modelizes the density difference g(x) with a linear-in-parameter model Capture
with theta the parameters to find. Cl are the (determined in advance) centers of the gaussian kernels and x is the point at which you want to estimate the difference in density. After a bit of algebra one finds that : Capture%40
with Capture2 .
Lambda is a regularization parameter.
What is done is looping through all the possible values of sigma and lambda, (calculating the corresponding theta) and then see which combination of sigma and theta gives you a minimum for the cost function (described in the paper).
This is what I want to solve faster. Currently Python is roughly 2 - 3 time faster than my Julia implementation.

Are you solving the equations repeatedly with the same H and λ? Did you try factorizing the matrix, e.g.

Hlu = lu(H + λ*I)

and then reusing that factorization in the loop:

θ = Hlu \ h

For a random 300x300 H on my laptop working with the LU-factorization gives a 40x speedup. If λ varies, there might be some way to quickly update Hlu, though that’s outside my area.

1 Like

No I actually loop through all the possible values of sigma and lambda (i forgot to mention it, it just edited my last reply), in order to find the value of theta that will minimize a cost function.

If you find the Hessenberg factorization of H first then you can solve each new lambda in O(n^2) time rather than O(n^3). You can also preallocate the solution vector and use ldiv! to save an allocation. And I hope you are not benchmarking in global scope.

(Ultimately it sounds like all of your time here is spent in dense matrix algorithms, so if there is a difference between Python and Julia here for the same algorithm it is probably just what BLAS library you are linking, e.g. OpenBLAS vs. MKL.)

But I’m confused about a more fundamental point. You speak about “looping over all parameters” to minimize a function. Why not use an optimization library instead?


Alright I will check that out !
As for the optimization package, I didn’t use them because I didn’t think i would be possible to use them for my problem (I’m new to julia) and i have never used them previously to be honest

I have been looking at a similar problem, so here is an example of how this can be solved quite a bit quicker

H = randn(100,100)
b = randn(100)
λs = 1.0:1:1000

@btime testslow(H, λs, b) #677.874 ms (7001 allocations: 154.48 MiB)
@btime testfast(H, λs, b)  #48.968 ms (3018 allocations: 670.38 KiB)

with the code

using LinearAlgebra
# Solve H\rhs and store result in rhs, where H is on upper hessenberg form
function backsolve_hess!(H::AbstractMatrix{T}, rhs::AbstractArray{T}) where {T}
    # Make H UpperTriangular and apply givens to rhs
    n = size(H,1)
    for i = 1:n-1
        Gi, r = givens(H, i, i+1, i)
        # Note: slow indexing
        lmul!(Gi, H)
        lmul!(Gi, rhs)
    # H is now upper triangular
    ldiv!(UpperTriangular(H), rhs)
    return rhs

# Solve tmp = (Q*H*Q'+λI)\rhs, hiven hessenberg H, unitary Q, and b = Q'*rhs 
function hessenberg_solve!(tmp, H, Q, λ, b, tmpH)
    tmpH .= H
    # tmp = Q'*rhs
    tmp .= b
    for i = 1:size(tmpH,1)
        tmpH[i,i] += λ
    # tmp = (H+λI)⁻¹Q'*rhs
    backsolve_hess!(tmpH, tmp)
    # tmp = Q*(H+λI)⁻¹Q'*rhs
    lmul!(Q, tmp)
    return tmp

function testslow(A, λs, b)
    s = zero(eltype(A))
    for λ in λs
        tmp = (A + λ*I)\b
        s += norm(tmp)
    return s

function testfast(A, λs, b)
    # A = Q*H*Q'
    HF = hessenberg(A)
    # Type insability here
    H = HF.H::typeof(A)
    Q = HF.Q::LinearAlgebra.HessenbergQ{eltype(A), typeof(A)}
    tmpH = similar(A)
    tmp = similar(b)
    # b2 = Q⁻¹b=Q'*b
    b2 = Q'*b
    s = zero(eltype(HF))
    for λ in λs
        hessenberg_solve!(tmp, H, Q, λ, b2, tmpH)
        s += norm(tmp)
    return s

It is noteworthy that the line lmul!(Gi, H) takes most of the time. It actually accesses H roughly like H[k,:], so it is probably possible to make the code even faster by working with the transpose some way.

1 Like

It should be pretty straightforward to use an optimization library like to solve this. For example, here’s a toy problem which does not exactly match your formulation, but still shows that you can do things like optimize functions which include operations like matrix inverses or linear solves. All you need to do is write a function which takes a vector of parameters (your theta and lambda) and returns a scalar cost:

julia> using LinearAlgebra

julia> using Optim

julia> function H(θ, λ)
         fill(θ, 3, 3) .+ λ * I
H (generic function with 1 method)

julia> function cost(x)
         θ = x[1]
         λ = x[2]

         # an arbitrary function I picked at random
         norm(ones(3) \ H(θ, λ))
cost (generic function with 1 method)

Then you can let Optim optimize those parameters for you:

julia> guess = [rand(), rand()]
2-element Array{Float64,1}:

julia> results = optimize(cost, guess, autodiff=:forward)
Results of Optimization Algorithm
 * Algorithm: Nelder-Mead
 * Starting Point: [0.02156578662131392,0.8464637430482358]
 * Minimizer: [0.05512968234098739,-0.16538904906902221, ...]
 * Minimum: 1.181293e-09
 * Iterations: 40
 * Convergence: true
   *  √(Σ(yᵢ-ȳ)²)/n < 1.0e-08: true
   * Reached Maximum Number of Iterations: false
 * Objective Calls: 78

julia> θ, λ = Optim.minimizer(results)
2-element Array{Float64,1}:

This was a good idea, but the optimization is taking way too long (10 times slower than the loop version) and since it’s roughly known in which range the parameters have to be I think i’ll stick with that and check what @mfalt suggested

Then you are using the optimization incorrectly, or are using the wrong algorithm. Properly used, optimization should essentially never be slower than a simple grid search.

(If you roughly know in what range the parameters have to be, you can specify bound constraints to many optimization algorithms to tell it only to search that region.)

1 Like

So if I understood correctly, what you are doing is adding I*λ directly to the Hessenberg decomposition of the matrix and then you solve ? Does that mean that the Hessenberg decomposition leaves the diagonal of a matrix invariant ? Please correct me if I’m wrong, the world of algorithmic optimisation of linear algebra is very new to me.
Also, are there any limitation on your approach, in the way the matrix A has to be, or is it general ?

The Hessenberg factorization, which I belive works for any square matrix creates A=Q*H*Q' here Q is unitary and H is upper triangular, with one subdiagonal (hessenberg). Since A+Iλ=A+λQ*Q'=Q*H*Q'+Q*λ*Q'=Q*(H+Iλ)*Q' we can add the diagonal directly to H. The only way I know this could fail is if the matrix A+Iλ is not invertable, then the givens rotations in backsolve_hess will fail.

As a fix to that backsolve_hess! works on rows instread of columns, the following, quite ugly, but a bit faster code seems to improve it by another 10-20%:

function backsolve_hess!(H::AbstractMatrix{T}, rhs::AbstractArray{T}) where {T}
    # Make H UpperTriangular and apply givens to rhs
    n = size(H,1)
    Ht = H'
    rotations = LinearAlgebra.Givens{T}[]
    for i = n:-1:2
        Gi, r = givens(Ht, i, i-1, i)
        push!(rotations, Gi')
        rmul!(H, Gi')
    # H is now upper triangular
    ldiv!(UpperTriangular(H), rhs)
    for i in n-1:-1:1
        lmul!(rotations[i], rhs)
    return rhs
@btime testfast(H, λs, b) #48.689 ms (2017 allocations: 654.73 KiB)
@btime testfast2(H, λs, b) # 40.799 ms (9017 allocations: 8.65 MiB)

some work can probably be done on the allocations, and proper testing of this new code is probably needed.

Update on an old thread: in Julia 1.3, if you compute H = hessenberg(...), it will now compute (H + λ*I) \ b with a much more efficient algorithm, and you can do it in-place in b with ldiv!(H + λ*I, b) with only O(m) storage and O(m²) time (for an m×m matrix), with the shifted matrix H + λ*I formed implicitly (no matrix allocations) for each shift λ. See: