Review: Schur-Pade matrix powers speedup?

Hello,

tl;dr: can anyone compare the algorithm below to its Matlab equivalent, or suggest improvements to the \ bottleneck?

I tried to re-implement the Higham-Lin algorithm for fractional matrix powers, code here.

The basic idea of the computation is to take square roots until the matrix norm approaches 1, evaluate a Pade approximant, then repeatedly square the result back up again. Evaluating the approximant seems to be a bottleneck, and overall this sophisticated algorithm is about 10x to 100x slower than the brute-force method in Base, which simply computes the eigenvalues and eigenvectors, applies the fractional power to the eigenvalues, and recombines everything.
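For orientation, the structure I am aiming for is roughly the following (a simplified sketch, not the actual SchurPade.powerm code; the 0.1 tolerance and the pade_power helper are stand-ins for the real degree-selection logic from the paper):

# Simplified sketch of the inverse scaling-and-squaring idea (Julia 0.5 syntax).
# `pade_power` is a placeholder for the degree-m Pade evaluation; the real algorithm
# chooses m and the number of square roots from a table of constants and also
# recomputes the diagonal entries during the squaring phase for accuracy.
function powerm_sketch(A::Matrix, p::Real)
    F = schurfact(complex(A))
    T, Q = F[:T], F[:Z]               # A = Q*T*Q' with T upper triangular
    s = 0
    while norm(T - eye(T), 1) > 0.1   # each square root roughly halves the distance to I
        T = sqrtm(T)
        s += 1
    end
    X = pade_power(T, p)              # ≈ T^p, accurate because T is now close to I
    for _ = 1:s
        X = X*X                       # undo the square roots: ((T^(1/2^s))^p)^(2^s) = T^p
    end
    return Q*X*Q'
end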

using BenchmarkTools

A = rand(100,100)
A = A'*A
m = 7/39

@benchmark A^m
memory estimate:  512.56 kb
allocs estimate:  40
--------------
minimum time:     2.364 ms (0.00% GC)
median time:      2.430 ms (0.00% GC)
mean time:        2.687 ms (0.84% GC)
maximum time:     50.587 ms (0.00% GC)

@benchmark SchurPade.powerm(A,m)
memory estimate:  58.03 mb
allocs estimate:  3365420
--------------
minimum time:     120.568 ms (4.00% GC)
median time:      122.703 ms (5.25% GC)
mean time:        123.034 ms (4.86% GC)
maximum time:     129.946 ms (2.87% GC)

unfortunately. Using ProfileView, I see the following (flamegraph screenshot omitted):

So clearly most of the time is spent in the _powerm_cf function, calculating the Pade approximation via ((I+S)\A). I am not an expert on this; does anyone have advice on analysing it further? If someone has access to Matlab and could compare timings against the original implementation, that would be instructive as well.

Most likely a type stability problem. A quick look suggests that S in _powerm_cf starts out as a scalar and is changed to a matrix in the loop.

function _powerm_cf{T<:Number}(A::Array{T,2},pfrac::Real,m::Int)
    k = 2*m
    n = size(A,1)
    I = eye(A)
    S = _coeff(pfrac,k)
    for j = (k-1):-1:1 # bottom-up
        S = _coeff(pfrac,j) * ((I+S)\A)
    end
    return I + S
end
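
One quick way to confirm it is @code_warntype; the non-concrete type inferred for S should show up highlighted (the argument values below are just an example, and this assumes the SchurPade module from the linked code is loaded):

# Hypothetical check: S should come out as a Union/Any rather than a concrete Matrix.
A = rand(100, 100); A = A'*A
@code_warntype SchurPade._powerm_cf(A, 7/39, 6)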

When I try to reproduce your results, I get similar overall timings, but ProfileView shows most of the time spent in sqrtm, which is what I would expect since sqrtm isn’t optimized and has an O(n^3) loop. (This is with Julia 0.5). If it were optimized, based on op counts the Schur factorization should dominate, leading to something just a few times as expensive as the diagonalization scheme.

If profiling shows 95% in LAPACK, a type-stability issue shouldn't matter much for runtime.

My impression was that the advantage of the Pade methods is not performance but accuracy: if the matrix is close to defective (i.e. if the eigenvector matrix is ill-conditioned), the diagonalization approach can be very inaccurate.
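
A toy example of what "close to defective" means (the 2×2 matrix below is just the standard textbook case; for the symmetric A'*A in the benchmark this doesn't bite, since symmetric matrices have orthogonal, perfectly conditioned eigenvectors):

# A nearly defective matrix: eigenvalues 1 ± √δ, eigenvectors nearly parallel.
δ = 1e-8
B = [1.0 1.0; δ 1.0]
λ, V = eig(B)    # Julia 0.5 syntax; `eigen` on current versions
cond(V)          # ≈ 1/√δ = 1e4 here; errors in an eig-based B^p can be amplified
                 # by roughly this factor, and it blows up as δ → 0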

Perhaps I wasn't explicit enough. If profiling emphasizes LAPACK for this code, then there is something wrong with the profiler. I would ask @felix to report the version of Julia he is using (presumably ProfileView is faithful to the underlying data), and for anyone else to chime in on the status of profiling on master, if that is the relevant codebase.

On the other hand, there is inefficiency in the triangular sqrtm, partly due to type instability (at least in Julia 0.5, the unions of complex and real percolate pretty far down). Using small internal functions to handle the inner loops can improve the performance of these, and would be a good project. Maybe @felix can sneak one in if he is preparing a PR. Here is a draft:

# Inner kernel of the triangular sqrtm recurrence: returns
# x - sum_{k=i+1}^{j-1} R[i,k]*R[k,j], kept in a small function so the
# loop specializes on concrete element types.
function floop(x,R,i,j)
    r = x
    @inbounds begin
        @simd for k = i+1:j-1
            r -= R[i,k]*R[k,j]
        end
    end
    r
end

# context omitted (this replaces the inner i-loop of Base's triangular sqrtm)
        for i = j-1:-1:1
            r = floop(A[i,j],R,i,j)
            r==0 || (R[i,j] = r / (R[i,i] + R[j,j]))
        end
 

And while I'm at it, of course @stevengj is right that this algorithm is meant for accuracy. Higham and Lin emphasize applications in finance and health care. So I applaud the OP if he is suggesting that an accurate scheme should be the default here. I also see that Higham and Lin improved their algorithm further (citation); perhaps that is amenable to implementation as well.

If profiling shows 95% in LAPACK, a type-stability issue shouldn't matter much for runtime.

You might be surprised. In the presence of type instability all kinds of funny things can happen. Maybe those particular LAPACK calls won't even happen after the type instability is fixed. Or they may. Either way, I wouldn't pay much attention to what the profiler says until the type-instability issues have been fixed.

1 Like

Type inference is an optimization and should not change the result of any code. There were some exceptions previously, like array comprehensions, but at least that case is fixed now. Could you give an example where a type instability causes a completely different LAPACK method (or any other method) to be called?

Type inference is an optimization and should not change the result of any code. There were some exceptions previously, like array comprehensions, but at least that case is fixed now. Could you give an example where a type instability causes a completely different LAPACK method (or any other method) to be called?

Well, this is funny. I saw exactly that happening when I fixed the type instability I pointed out in this code:

S = _coeff(pfrac,k)
for j = (k-1):-1:1 # bottom-up
    S = _coeff(pfrac,j) * ((I+S)\A)
end

to

S = _coeff(pfrac,k) * I
for j = (k-1):-1:1 # bottom-up
    S = _coeff(pfrac,j) * ((I+S)\A)
end

However, it turns out I not only fixed the type instability but also a bug in the code, allowing the computation to stay in the much faster triangular code path (naturally involving different LAPACK calls).

So I'll admit I got this wrong and revise my recommendation to: fix the type instabilities and whatever bugs that uncovers, then pay attention to the profiling results.

@GunnarFarneback Many thanks Gunnar, your comment highlighting the type instability put me on the same track, and you found a real mistake as well. That was very insightful :slight_smile: Going by Higham-Lin, I believe the correct initialisation is S = _coeff(pfrac,k) * A.

@Ralph_Smith With the above bug eliminated, I now also see most of the time being spent in sqrtm. I had the blocking optimisation of Deadman, Higham & Ralha on my radar, but would also be interested in the further improvements you suggest. I will take a second look at floop to work out what it does, but thank you already for the interesting pointer.

@stevengj you’re right of course, but not having yet figured out how to measure the accuracy nicely (any suggestions?), I wanted to improve the other metric :slight_smile:

I am currently running this on juliabox.com on Julia 0.5.0

The easiest thing is to do the computation in both single and double precision, since the difference is usually a good measure of the single-precision error. Then you can try making matrices closer and closer to defective and see what happens.
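
In code, roughly (assuming SchurPade.powerm accepts a Float32 matrix; A and m are from the benchmark above):

# Rough accuracy estimate: compare single- and double-precision results.
A32 = map(Float32, A)                     # Float32.(A) also works on 0.5
X64 = SchurPade.powerm(A, m)
X32 = SchurPade.powerm(A32, Float32(m))
norm(X32 - X64) / norm(X64)               # ≈ relative error of the Float32 computation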

@Ralph_Smith The floop optimisation speeds up sqrtm by a dizzying 15x for me. That seems worthwhile enough to get its own PR, so I'll do that now.

That also gets the Schur-Pade function into a nice place performance-wise. :tada: Thank you!!

@benchmark A^m
memory estimate:  512.56 kb
allocs estimate:  40
--------------
minimum time:     2.244 ms (0.00% GC)
median time:      4.545 ms (0.00% GC)
mean time:        5.403 ms (0.93% GC)
maximum time:     78.446 ms (0.00% GC)

@benchmark SchurPade.powerm(A,m)
memory estimate:  8.28 mb
allocs estimate:  156063
--------------
minimum time:     20.370 ms (0.00% GC)
median time:      27.327 ms (0.00% GC)
mean time:        35.272 ms (3.63% GC)
maximum time:     87.963 ms (0.00% GC)

Small comment: you should consider using I instead of I = eye(n). The I in Base is a special identity matrix (UniformScaling) that should be much faster than eye. It might not make a big difference in this example, though.
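
For example (note that the local I = eye(A) inside _powerm_cf shadows the Base constant, so that binding would need to be removed or renamed first):

# Quick illustration: UniformScaling avoids materializing an n×n identity.
A = rand(100, 100)
S = 0.3 * A
X1 = (eye(A) + S) \ A   # allocates and adds a dense identity matrix
X2 = (I + S) \ A        # `I` just adds 1.0 to the diagonal of a copy of S
X1 ≈ X2                 # true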

1 Like

@andreasnoack Sounds interesting! … How? I can’t find it by searching in base :sweat_smile:

sqrtm/floop is now PR#20214

1 Like

I get

help?> I
search: I IO if is Int in im Inf isa Int8 inv IPv6 IPv4 Int64 Int32 Int16 info imag ifft idct Inf64

  I

  An object of type UniformScaling, representing an identity matrix of any size.

on 0.5 and

help?> I
search: I IO if Int is in im Inf isa Int8 inv IPv6 IPv4 Int64 Int32 Int16 info imag ifft idct Inf64

  I

  An object of type UniformScaling, representing an identity matrix of any size.

     Example
    ≡≡≡≡≡≡≡≡≡

  julia> ones(5, 6) * I == ones(5, 6)
  true

  julia> [1 2im 3; 1im 2 3] * I
  2×3 Array{Complex{Int64},2}:
   1+0im  0+2im  3+0im
   0+1im  2+0im  3+0im

on master

4 Likes

Glad it worked so well. Also glad my alarmism about the profile was misplaced; the matrices that I tried never hit the branch that gave your original profile chart. You were fortunate that you had an unusual matrix which shouted about the error highlighted by Gunnar!

That serves as a reminder that an implementation of this algorithm needs careful testing; I did find that negative real eigenvalues trigger a domain error in _powerm2x2, for example. I think you need to force the working matrix to be complex somewhere (if any eigenvalues are not positive reals). Matlab might be more forgiving, at the cost of efficiency. But don't give up! The current matrix power function in Base looks even harder to fix for these cases.
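
A possible guard, just as a sketch (the helper name is only for illustration, and in the actual algorithm the eigenvalues are already sitting on the diagonal of the Schur factor, so the eigvals call below wouldn't be needed):

# Sketch: promote to a complex working matrix when any eigenvalue is not a positive real,
# so that fractional powers of negative or complex eigenvalues stay defined.
function maybe_complexify(T::Matrix)
    λ = eigvals(T)
    if any(x -> imag(x) == 0 && real(x) <= 0, λ)
        return complex(T)     # complex.(T) on newer Julia versions
    end
    return T
end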

1 Like

Thanks, @Ralph_Smith. I worked on the typing logic, so I hope the domain error is eliminated (and the performance improved; the Schur factorisation is now the bottleneck!). The Schur-Pade code is updated.

Next I'll see about implementing the improved algorithm and measuring the accuracy.

I’ve been looking at the improved Schur-Pade algorithm for a while. I have a couple of detailed questions arising from the paper that I don’t feel confident in answering on my own.

A key step is the analysis of how many expensive square roots have to be taken before the Pade approximant is good enough (better than machine epsilon). It involves a hard-coded table of constants θ. The key comparison is between the various θ and a certain matrix norm, which roughly halves with each square root. I have excerpted the paper here; I hope that is OK with the authors.

  • “we will allow at most two extra square roots” → is this an ad-hoc judgement call, or is there an obvious reason for this design? With the rule of thumb that square roots halve the norm, one square root should suffice, but in any case taking repeated square roots should reduce the norm enough without it being necessary to count how often it has been done.

  • Is the above selection of the Pade degree m logically equivalent to the following (also written out as a code sketch below)?

    1. if α_3 or α_4 < θ_6, let m be the smallest s.t. α_3 < θ_m;
    2. if 1/2*α_3 < θ_5, take another square root and start again;
    3. if α_3 or α_4 < θ_7, let m=7.

I realise these are very detailed questions. As this is outside of my expertise, I don’t feel comfortable going ahead without review. (I can implement their algorithm, but it’s no fun if I don’t believe it …) Any comment appreciated!
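
To make the question concrete, here is my reading written out as code; alpha(T, p) is a placeholder for computing α_p(T), θ stands for the hard-coded table, and the candidate range for m in step 1 is my guess:

# Sketch of my interpretation above, NOT the paper's algorithm as written.
# (The "at most two extra square roots" cap from the first question is left out,
# and the not-found case of findfirst is ignored.)
function choose_degree_sketch(T, θ)
    α3, α4 = alpha(T, 3), alpha(T, 4)
    if α3 < θ[6] || α4 < θ[6]
        return findfirst(m -> α3 < θ[m], 1:7)   # step 1: smallest m with α3 < θ[m]
    elseif α3/2 < θ[5]
        return :take_another_square_root        # step 2: square root again, start over
    elseif α3 < θ[7] || α4 < θ[7]
        return 7                                # step 3
    end
end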

The criterion is based on an approximation, so if the approximation is found to be inadequate the limit is incremented. The reasoning for this and the degree selection is explained more clearly in the matrix logarithm paper.

Note that the improved matrix logarithm algorithm is implemented in base here, and most of it is identical to the powers algorithm. If you want to use estimated norms (instead of “exact” ones as in triangular.jl), you can easily modify the code in normestinv to work with positive integer powers instead of an inverse.

1 Like

Oh that’s brilliant. I’d skipped past the logm code and paper previously, but it looks very similar indeed. Maybe some of the common sections can be factored out?