Review: Schur-Pade matrix powers speedup?

Hello,

tl;dr: can anyone compare the algorithm below to its Matlab equivalent, or suggest improvements to the \ bottleneck?

I tried to re-implement the Higham-Lin algorithm for fractional matrix powers, code here.

The basic idea of the computation is to take square roots until the matrix norm approaches 1, evaluate a Pade approximant, then repeatedly square the result back up again. Evaluating the approximant seems to be a bottleneck, and overall this sophisticated algorithm is about 10x to 100x slower than the brute-force method in Base, which simply computes the eigenvalues and eigenvectors, applies the fractional power to the eigenvalues, and recombines everything.
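For orientation, the structure I am aiming for is roughly the following (a simplified sketch, not the actual SchurPade.powerm code; the 0.1 tolerance and the pade_power helper are stand-ins for the real degree-selection logic from the paper):

# Simplified sketch of the inverse scaling-and-squaring idea (Julia 0.5 syntax).
# `pade_power` is a placeholder for the degree-m Pade evaluation; the real algorithm
# chooses m and the number of square roots from a table of constants and also
# recomputes the diagonal entries during the squaring phase for accuracy.
function powerm_sketch(A::Matrix, p::Real)
    F = schurfact(complex(A))
    T, Q = F[:T], F[:Z]               # A = Q*T*Q' with T upper triangular
    s = 0
    while norm(T - eye(T), 1) > 0.1   # each square root roughly halves the distance to I
        T = sqrtm(T)
        s += 1
    end
    X = pade_power(T, p)              # ≈ T^p, accurate because T is now close to I
    for _ = 1:s
        X = X*X                       # undo the square roots: ((T^(1/2^s))^p)^(2^s) = T^p
    end
    return Q*X*Q'
end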

using BenchmarkTools

A = rand(100,100)
A = A'*A
m = 7/39

@benchmark A^m
memory estimate:  512.56 kb
allocs estimate:  40
--------------
minimum time:     2.364 ms (0.00% GC)
median time:      2.430 ms (0.00% GC)
mean time:        2.687 ms (0.84% GC)
maximum time:     50.587 ms (0.00% GC)

@benchmark SchurPade.powerm(A,m)
memory estimate:  58.03 mb
allocs estimate:  3365420
--------------
minimum time:     120.568 ms (4.00% GC)
median time:      122.703 ms (5.25% GC)
mean time:        123.034 ms (4.86% GC)
maximum time:     129.946 ms (2.87% GC)

unfortunately. Using ProfileView, I see the following (flamegraph screenshot omitted):

So clearly most of the time is spent in the _powerm_cf function, calculating the Pade approximation via ((I+S)\A). I am not an expert on this; does anyone have advice on analysing it further? If someone has access to Matlab and could compare timings against the original implementation, that would be instructive as well.

Most likely a type stability problem. A quick look suggests that S in _powerm_cf starts out as a scalar and is changed to a matrix in the loop.

function _powerm_cf{T<:Number}(A::Array{T,2},pfrac::Real,m::Int)
    k = 2*m
    n = size(A,1)
    I = eye(A)
    S = _coeff(pfrac,k)
    for j = (k-1):-1:1 # bottom-up
        S = _coeff(pfrac,j) * ((I+S)\A)
    end
    return I + S
end
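
One quick way to confirm it is @code_warntype; the non-concrete type inferred for S should show up highlighted (the argument values below are just an example, and this assumes the SchurPade module from the linked code is loaded):

# Hypothetical check: S should come out as a Union/Any rather than a concrete Matrix.
A = rand(100, 100); A = A'*A
@code_warntype SchurPade._powerm_cf(A, 7/39, 6)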

When I try to reproduce your results, I get similar overall timings, but ProfileView shows most of the time spent in sqrtm, which is what I would expect since sqrtm isn’t optimized and has an O(n^3) loop. (This is with Julia 0.5). If it were optimized, based on op counts the Schur factorization should dominate, leading to something just a few times as expensive as the diagonalization scheme.

If profiling shows 95% in LAPACK, a type-stability issue shouldn't matter much for runtime.

My impression was that the advantage of the Pade methods is not performance but accuracy: if the matrix is close to defective (i.e. if the eigenvector matrix is ill-conditioned), the diagonalization approach can be very inaccurate.
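
A toy example of what "close to defective" means (the 2×2 matrix below is just the standard textbook case; for the symmetric A'*A in the benchmark this doesn't bite, since symmetric matrices have orthogonal, perfectly conditioned eigenvectors):

# A nearly defective matrix: eigenvalues 1 ± √δ, eigenvectors nearly parallel.
δ = 1e-8
B = [1.0 1.0; δ 1.0]
λ, V = eig(B)    # Julia 0.5 syntax; `eigen` on current versions
cond(V)          # ≈ 1/√δ = 1e4 here; errors in an eig-based B^p can be amplified
                 # by roughly this factor, and it blows up as δ → 0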

Perhaps I wasn't explicit enough. If profiling emphasizes LAPACK for this code, then there is something wrong with the profiler. I would ask @felix to report the version of Julia he is using (presumably ProfileView is faithful to the underlying data), and for anyone else to chime in on the status of profiling on master, if that is the relevant codebase.

On the other hand, there is inefficiency in the triangular sqrtm, partly due to type instability (at least in Julia 0.5, the unions of complex and real percolate pretty far down). Using small internal functions to handle the inner loops can improve the performance of these, and would be a good project. Maybe @felix can sneak one in if he is preparing a PR. Here is a draft:

# Inner kernel of the triangular sqrtm recurrence: returns
# x - sum_{k=i+1}^{j-1} R[i,k]*R[k,j], kept in a small function so the
# loop specializes on concrete element types.
function floop(x,R,i,j)
    r = x
    @inbounds begin
        @simd for k = i+1:j-1
            r -= R[i,k]*R[k,j]
        end
    end
    r
end

# context omitted (this replaces the inner i-loop of Base's triangular sqrtm)
        for i = j-1:-1:1
            r = floop(A[i,j],R,i,j)
            r==0 || (R[i,j] = r / (R[i,i] + R[j,j]))
        end
 

And while I'm at it, of course @stevengj is right that this algorithm is meant for accuracy. Higham and Lin emphasize applications in finance and health care. So I applaud the OP if he is suggesting that an accurate scheme should be the default here. I also see that Higham and Lin improved their algorithm further (citation); perhaps that is amenable to implementation as well.

If profiling shows 95% in LAPACK, a type-stability issue shouldn't matter much for runtime.

You might be surprised. In the presence of type instability all kinds of funny things can happen. Maybe those particular LAPACK calls won't even happen after the type instability is fixed. Or they may. Either way, I wouldn't pay much attention to what the profiler says until the type-instability issues have been fixed.

1 Like

Type inference is an optimization and should not change the result of any code. There were some exceptions previously, like array comprehensions, but at least that case is fixed now. Could you give an example where a type instability causes a completely different LAPACK method (or any other method) to be called?

Type inference is an optimization and should not change the result of any code. There were some exceptions previously, like array comprehensions, but at least that case is fixed now. Could you give an example where a type instability causes a completely different LAPACK method (or any other method) to be called?

Well, this is funny. I saw exactly that happening when I fixed the type instability I pointed out in this code:

S = _coeff(pfrac,k)
for j = (k-1):-1:1 # bottom-up
    S = _coeff(pfrac,j) * ((I+S)\A)
end

to

S = _coeff(pfrac,k) * I
for j = (k-1):-1:1 # bottom-up
    S = _coeff(pfrac,j) * ((I+S)\A)
end

However, it turns out I not only fixed the type instability but also a bug in the code, allowing the computation to stay in the much faster triangular code path (naturally involving different LAPACK calls).

So I'll admit I got this wrong and revise my recommendation to: fix the type instabilities and whatever bugs that uncovers, then pay attention to the profiling results.

@GunnarFarneback Many thanks Gunnar, your comment highlighting the type instability put me on the same track, and you found a real mistake as well. That was very insightful :slight_smile: Going by Higham-Lin, I believe the correct initialisation is S = _coeff(pfrac,k) * A.

@Ralph_Smith With the above bug eliminated, I now also see most of the time being spent in sqrtm. I had the blocking optimisation of Deadman, Higham & Ralha on my radar, but would also be interested in the further improvements you suggest. I will take a second look at floop to work out what it does, but thank you already for the interesting pointer.

@stevengj you’re right of course, but not having yet figured out how to measure the accuracy nicely (any suggestions?), I wanted to improve the other metric :slight_smile:

I am currently running this on juliabox.com on Julia 0.5.0

The easiest thing is to do the computation in both single and double precision, since the difference is usually a good measure of the single-precision error. Then you can try making matrices closer and closer to defective and see what happens.
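
In code, roughly (assuming SchurPade.powerm accepts a Float32 matrix; A and m are from the benchmark above):

# Rough accuracy estimate: compare single- and double-precision results.
A32 = map(Float32, A)                     # Float32.(A) also works on 0.5
X64 = SchurPade.powerm(A, m)
X32 = SchurPade.powerm(A32, Float32(m))
norm(X32 - X64) / norm(X64)               # ≈ relative error of the Float32 computation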

@Ralph_Smith The floop optimisation speeds up sqrtm by a dizzying 15x for me. That seems worthwhile enough to get its own PR, so I'll do that now.

That also gets the Schur-Pade function into a nice place performance-wise. :tada: Thank you!!

@benchmark A^m
memory estimate:  512.56 kb
allocs estimate:  40
--------------
minimum time:     2.244 ms (0.00% GC)
median time:      4.545 ms (0.00% GC)
mean time:        5.403 ms (0.93% GC)
maximum time:     78.446 ms (0.00% GC)

@benchmark SchurPade.powerm(A,m)
memory estimate:  8.28 mb
allocs estimate:  156063
--------------
minimum time:     20.370 ms (0.00% GC)
median time:      27.327 ms (0.00% GC)
mean time:        35.272 ms (3.63% GC)
maximum time:     87.963 ms (0.00% GC)

Small comment: you should consider using I instead of I = eye(n). The I in Base is a special identity matrix (UniformScaling) that should be much faster than eye. It might not make a big difference in this example, though.
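
For example (note that the local I = eye(A) inside _powerm_cf shadows the Base constant, so that binding would need to be removed or renamed first):

# Quick illustration: UniformScaling avoids materializing an n×n identity.
A = rand(100, 100)
S = 0.3 * A
X1 = (eye(A) + S) \ A   # allocates and adds a dense identity matrix
X2 = (I + S) \ A        # `I` just adds 1.0 to the diagonal of a copy of S
X1 ≈ X2                 # true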

1 Like

@andreasnoack Sounds interesting! … How? I can’t find it by searching in base :sweat_smile:

sqrtm/floop is now PR#20214

1 Like

I get

help?> I
search: I IO if is Int in im Inf isa Int8 inv IPv6 IPv4 Int64 Int32 Int16 info imag ifft idct Inf64

  I

  An object of type UniformScaling, representing an identity matrix of any size.

on 0.5 and

help?> I
search: I IO if Int is in im Inf isa Int8 inv IPv6 IPv4 Int64 Int32 Int16 info imag ifft idct Inf64

  I

  An object of type UniformScaling, representing an identity matrix of any size.

     Example
    ≡≡≡≡≡≡≡≡≡

  julia> ones(5, 6) * I == ones(5, 6)
  true

  julia> [1 2im 3; 1im 2 3] * I
  2×3 Array{Complex{Int64},2}:
   1+0im  0+2im  3+0im
   0+1im  2+0im  3+0im

on master

4 Likes

Glad it worked so well. Also glad my alarmism about the profile was misplaced; the matrices that I tried never hit the branch that gave your original profile chart. You were fortunate that you had an unusual matrix which shouted about the error highlighted by Gunnar!

That serves as a reminder that an implementation of this algorithm needs careful testing; I did find that negative real eigenvalues trigger a domain error in _powerm2x2, for example. I think you need to force the working matrix to be complex somewhere (if any eigenvalues are not positive reals). Matlab might be more forgiving, at the cost of efficiency. But don't give up! The current matrix power function in Base looks even harder to fix for these cases.
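
A possible guard, just as a sketch (the helper name is only for illustration, and in the actual algorithm the eigenvalues are already sitting on the diagonal of the Schur factor, so the eigvals call below wouldn't be needed):

# Sketch: promote to a complex working matrix when any eigenvalue is not a positive real,
# so that fractional powers of negative or complex eigenvalues stay defined.
function maybe_complexify(T::Matrix)
    λ = eigvals(T)
    if any(x -> imag(x) == 0 && real(x) <= 0, λ)
        return complex(T)     # complex.(T) on newer Julia versions
    end
    return T
end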

1 Like

Thanks, @Ralph_Smith. I worked on the typing logic, so I hope the domain error is eliminated (and the performance improved; the Schur factorisation is now the bottleneck!). The Schur-Pade code is updated.

Next I'll see about implementing the improved algorithm and measuring the accuracy.

I’ve been looking at the improved Schur-Pade algorithm for a while. I have a couple of detailed questions arising from the paper that I don’t feel confident in answering on my own.

A key step is the analysis of how many expensive square roots have to be taken before the Pade approximant is good enough (better than machine epsilon). It involves a hard-coded table of constants θ. The key comparison is between the various θ and a certain matrix norm, which roughly halves with each square root. I have excerpted the paper here; I hope that is OK with the authors.

  • “we will allow at most two extra square roots” → is this an ad-hoc judgement call, or is there an obvious reason for this design? With the rule of thumb that square roots halve the norm, one square root should suffice, but in any case taking repeated square roots should reduce the norm enough without it being necessary to count how often it has been done.

  • Is the above selection of the Pade degree m logically equivalent to the following (also written out as a code sketch below)?

    1. if α_3 or α_4 < θ_6, let m be the smallest s.t. α_3 < θ_m;
    2. if 1/2*α_3 < θ_5, take another square root and start again;
    3. if α_3 or α_4 < θ_7, let m=7.

I realise these are very detailed questions. As this is outside of my expertise, I don’t feel comfortable going ahead without review. (I can implement their algorithm, but it’s no fun if I don’t believe it …) Any comment appreciated!
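
To make the question concrete, here is my reading written out as code; alpha(T, p) is a placeholder for computing α_p(T), θ stands for the hard-coded table, and the candidate range for m in step 1 is my guess:

# Sketch of my interpretation above, NOT the paper's algorithm as written.
# (The "at most two extra square roots" cap from the first question is left out,
# and the not-found case of findfirst is ignored.)
function choose_degree_sketch(T, θ)
    α3, α4 = alpha(T, 3), alpha(T, 4)
    if α3 < θ[6] || α4 < θ[6]
        return findfirst(m -> α3 < θ[m], 1:7)   # step 1: smallest m with α3 < θ[m]
    elseif α3/2 < θ[5]
        return :take_another_square_root        # step 2: square root again, start over
    elseif α3 < θ[7] || α4 < θ[7]
        return 7                                # step 3
    end
end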

The criterion is based on an approximation, so if the approximation is found to be inadequate the limit is incremented. The reasoning for this and the degree selection is explained more clearly in the matrix logarithm paper.

Note that the improved matrix logarithm algorithm is implemented in base here, and most of it is identical to the powers algorithm. If you want to use estimated norms (instead of “exact” ones as in triangular.jl), you can easily modify the code in normestinv to work with positive integer powers instead of an inverse.

1 Like

Oh that’s brilliant. I’d skipped past the logm code and paper previously, but it looks very similar indeed. Maybe some of the common sections can be factored out?