Is half precision a killer app for RecursiveFactorization?

@YingboMa you and your collaborators have done good and noble work with this.

I’ve been working on half precision for a while and one of the pain points is that LAPACK/BLAS does not support Float16, even though Apple silicon does. I just tried RecursiveFactorizations and it’s a far better choice for lu in half.

To my surprise the difference is still present on my 2019 Intel Mac. Open BLAS is doing better on the M2 MAC, but is still far from good.

While half is still 3-6x slower than double, it is far better than what I was getting with OpenBlas in Float16. I made no attempt to tune anything in RecursiveFactorization.

The backward error in the factorization seems to be better with RecursiveFactorization. The SIAM paper doesn’t say anything about that. My (limited) experiments with half precision moderately ill-conditioned matrices consistently look better with RF than with Open Blas.

Herewith a couple tables (code below). I compare LU factorizations of random NxN matrices in half (both Open Blas (OB) and RecursiveFactorization (RF) to each other and to an OB factorization in double precision. I also tabulate the ratios of the half precision factorization times to the double precision times (the last two columns).

MAC M2 Pro, 8 performance cores

N    Double    HalfOB    HalfRF   RatioOB   RatioRF 
1024  4.20e-03  1.28e-01  1.83e-02  3.05e+01  4.37e+00 
2048  2.35e-02  1.03e+00  5.84e-02  4.40e+01  2.48e+00 
4096  1.59e-01  8.10e+00  4.79e-01  5.09e+01  3.01e+00 
8192  1.17e+00  6.44e+01  6.61e+00  5.52e+01  5.67e+00 

Intel Core i9, 8 cores

N    Double    HalfOB    HalfRF   RatioOB   RatioRF 
1024  5.21e-03  7.37e-01  2.37e-02  1.41e+02  4.55e+00 
2048  3.12e-02  6.10e+00  1.51e-01  1.95e+02  4.84e+00 
4096  2.07e-01  4.75e+01  1.03e+00  2.29e+02  4.97e+00 
8192  1.34e+00  3.78e+02  1.02e+01  2.82e+02  7.63e+00 

My testing ran like this

using Random
using BenchmarkTools
using RecursiveFactorization
using LinearAlgebra
using Printf

julia> Half_Time_Show();

and used these functions

Compare LU for Blas (double, half) and RecursiveFactorization (half)
function Half_Time_Show(p = 4, T=Float16)
    pv = collect(1:1:p)
    ppv = 2 .^ pv
    nv = 512 .* ppv
    Time_LU = zeros(p, 6)
    for p in pv
        n = nv[p]
        AD = rand(n, n)
        AH = T.(AD)
# Time the factorizations
        td = @belapsed lu($AD)
        tob16 = @belapsed lu($AH)
        trf16 = @belapsed rlu($AH)
# Rations of factorization time half/double
        rtrf = trf16 / td
        rtob = tob16 / td
        Time_LU[p, :] = [n, td, tob16, trf16, rtob, rtrf]
    return Time_LU

function repltab(TabMac)
    headers = ["N", "Double", "HalfOB", "HalfRF", "RatioOB", "RatioRF"]
    formats = "%d  %7.2e  %7.2e  %7.2e  %7.2e  %7.2e \n"
    fprintREPL(headers, formats, TabMac)
function fprintREPL(headers, formats, data)
    printf(fmt::String, args...) = @eval @printf($fmt, $(args...))
    (mr, mc) = size(data)
    @printf("%s ", headers[1])
    for i = 2:mc
        @printf("%9s ", headers[i])
    bigform = string(formats)
    for i = 1:mr
        printf(bigform, data[i, :]...)

# LU with RecursiveFactorization

function rlu!(C)
    CF =!(C, Vector{Int}(undef, size(C, 2)))
    return CF 
function rlu(C)
    B = copy(C)
    CF =!(B, Vector{Int}(undef, size(B, 2)))
    return CF

LoopVectorization doesn’t have proper Float16 support, and TriangularSolve.jl explicitly only supports Float64 and Float32, because that is all I was thinking about at the time.

PRs to improve Float16 support are welcome, but I’d also suggest profiling to get an idea of how much improvement could be expected from (for example) adding Float16 support to TriangularSolve.jl.

Thanks for the flag on this. I looked and did see some promotions from half to single in VectorizationBase.jl . That could explain the occasional bump in backward errors and is something I need to mention if/when I publish results based on RecursiveFactorization

I took a deeper look and found that the promotion from half to single happens very early, which explains some of the inconsistencies in my results. Getting RecursiveFactorization to work with Float16 without upcasting to Float32 is very much above my pay grade/skill set/time left to live. So … I finally stumbled on generic_lu! deep in the Julia source and hacked it to do ok (not good). I will make a new post on that after I get my facts straight.