Recursive-blocked matrix square root

Hello, this is about a recursive implementation of the sqrtm algorithm. The idea is that, instead of finding the solution X to X^2 = A on a pointwise basis, calculating each entry starting with the diagonal and working outwards, A is split up into [A11 A12; 0 A22], smaller solutions X11, X22 are found as before, and X12 is the result of the Sylvester equation. This follows Deadman Higham Ralha and should produce a sizeable speedup. … But it’s not as fast as I would like it to be.

Any improvements that can be made to recsqrtm? Also in memory usage

function recsqrtm{T}(A::AbstractArray{T})
    S = schurfact(A)
    S[:vectors] * _recsqrtm(S[:Schur]) * S[:vectors]'
end
function _recsqrtm{T}(A::AbstractArray{T})
    n = Base.LinAlg.checksquare(A)
    t = typeof(sqrt(zero(T)))
    R = zeros(t,n,n)

    @inbounds if n == 1
        R[1,1] = sqrt(A[1,1])
    elseif n == 2
        R[1,1] = r11 = sqrt(A[1,1])
        R[2,2] = r22 = sqrt(A[2,2])
        R[1,2] = A[1,2] / (r11 + r22)
    else
        n1 = Int(ceil(n/2))
        r1 = 1:n1
        r2 = (n1+1):n
        R[r1,r1] = _recsqrtm(A[r1,r1])
        R[r2,r2] = _recsqrtm(A[r2,r2])
        R[r1,r2] = sylvester(R[r1,r1],R[r2,r2],-A[r1,r2])::Matrix{t}
    end
    R
end

From ProfileView most of the time is spent in trsyl!.

How does one calibrate this kind of thing? See below for block-wise recursive code. How does one choose an appropriate block size? The block size (the point at which one reverts to the pointwise sqrtm method) doesn’t seem to matter much, whether 1 (as above, in recsqrtm) or 64 as below. Is there a standard set of matrices/matrix types to test against, or systems to test on? Looks to me like recursive performance is an improvement from matrices of dimension 1000 - 2000, so should that be hard-coded? What about parallel computation?

function _blksqrtm{T}(A::AbstractArray{T})
    n = Base.LinAlg.checksquare(A)
    
    @inbounds if n < 65
        return sqrtm(UpperTriangular(A),Val{true})
    else
        t = typeof(sqrt(zero(T)))
        R = zeros(t,n,n)
        n1 = Int(ceil(n/2))
        if !iszero(A[n1+1,n1]) # hit a 2x2-block on diagonal, avoid
            n1 -= 1
        end
        r1 = 1:n1
        r2 = (n1+1):n
        R[r1,r1] = _blksqrtm(A[r1,r1])
        R[r2,r2] = _blksqrtm(A[r2,r2])
        R[r1,r2] = sylvester(R[r1,r1],R[r2,r2],-A[r1,r2])::Matrix{t}
    end
    R
end

Try putting @views in front of function (requires Compat in Julia 0.5).