Is half precision a killer app for RecursiveFactorization?

@YingboMa you and your collaborators have done good and noble work with this.

I’ve been working on half precision for a while, and one of the pain points is that LAPACK/BLAS does not support Float16, even though Apple silicon does. I just tried RecursiveFactorization and it’s a far better choice for lu in half precision.

To my surprise, the difference is still present on my 2019 Intel Mac. OpenBLAS does better on the M2 Mac, but is still far from good.

While half precision is still 3-6x slower than double, that is far better than what I was getting with OpenBLAS in Float16. I made no attempt to tune anything in RecursiveFactorization.

The backward error of the factorization also seems to be better with RecursiveFactorization. The SIAM paper doesn’t say anything about that. My (limited) experiments with moderately ill-conditioned half-precision matrices consistently look better with RF than with OpenBLAS.
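
For reference, here is a minimal sketch of the kind of backward-error check I have in mind. The helper backward_error is mine, not from any package, and the residual is accumulated in double precision so the check itself does not add Float16 rounding:

using LinearAlgebra
using RecursiveFactorization

# Hypothetical helper: relative backward error ||A[p,:] - L*U|| / ||A||,
# with everything promoted to Float64 so the check itself is accurate.
function backward_error(A, F)
    L64 = Float64.(Matrix(F.L))
    U64 = Float64.(Matrix(F.U))
    A64 = Float64.(A[F.p, :])            # row-permuted A
    return opnorm(L64 * U64 - A64, 1) / opnorm(Float64.(A), 1)
end

A   = Float16.(rand(256, 256))
Fob = lu(copy(A))                                                  # stock lu
Frf = RecursiveFactorization.lu!(copy(A), Vector{Int}(undef, 256)) # RF
@show backward_error(A, Fob) backward_error(A, Frf)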

Herewith a couple of tables (code below). I compare LU factorizations of random NxN matrices in half precision (both OpenBLAS (OB) and RecursiveFactorization (RF)) to each other and to an OB factorization in double precision. I also tabulate the ratios of the half-precision factorization times to the double-precision times (the last two columns).

MAC M2 Pro, 8 performance cores

N    Double    HalfOB    HalfRF   RatioOB   RatioRF 
1024  4.20e-03  1.28e-01  1.83e-02  3.05e+01  4.37e+00 
2048  2.35e-02  1.03e+00  5.84e-02  4.40e+01  2.48e+00 
4096  1.59e-01  8.10e+00  4.79e-01  5.09e+01  3.01e+00 
8192  1.17e+00  6.44e+01  6.61e+00  5.52e+01  5.67e+00 

Intel Core i9, 8 cores

N    Double    HalfOB    HalfRF   RatioOB   RatioRF 
1024  5.21e-03  7.37e-01  2.37e-02  1.41e+02  4.55e+00 
2048  3.12e-02  6.10e+00  1.51e-01  1.95e+02  4.84e+00 
4096  2.07e-01  4.75e+01  1.03e+00  2.29e+02  4.97e+00 
8192  1.34e+00  3.78e+02  1.02e+01  2.82e+02  7.63e+00 

My testing ran like this:

using Random
using BenchmarkTools
using RecursiveFactorization
using LinearAlgebra
using Printf


julia> Half_Time_Show();

and used these functions:

"""
Compare LU for BLAS (double, half) and RecursiveFactorization (half)
"""
function Half_Time_Show(p = 4, T = Float16)
    pv = collect(1:1:p)
    ppv = 2 .^ pv
    nv = 512 .* ppv
    Time_LU = zeros(p, 6)
    for ip in pv
        Random.seed!(46071)
        n = nv[ip]
        AD = rand(n, n)
        AH = T.(AD)
        # Time the factorizations
        td = @belapsed lu($AD)
        tob16 = @belapsed lu($AH)
        trf16 = @belapsed rlu($AH)
        # Ratios of factorization time half/double
        rtrf = trf16 / td
        rtob = tob16 / td
        Time_LU[ip, :] = [n, td, tob16, trf16, rtob, rtrf]
    end
    repltab(Time_LU)
    return Time_LU
end

function repltab(TabMac)
    headers = ["N", "Double", "HalfOB", "HalfRF", "RatioOB", "RatioRF"]
    formats = "%d  %7.2e  %7.2e  %7.2e  %7.2e  %7.2e \n"
    fprintREPL(headers, formats, TabMac)
end 
    
function fprintREPL(headers, formats, data)
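    # @printf requires a literal format string, so build and eval the call at
    # runtime to allow a format string passed in as an argument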
    printf(fmt::String, args...) = @eval @printf($fmt, $(args...))
    (mr, mc) = size(data)
    @printf("%s ", headers[1])
    for i = 2:mc
        @printf("%9s ", headers[i])
    end 
    @printf("\n")
    bigform = string(formats)
    for i = 1:mr
        printf(bigform, data[i, :]...)
    end 
end 

# LU with RecursiveFactorization

function rlu!(C)
    CF = RecursiveFactorization.lu!(C, Vector{Int}(undef, size(C, 2)))
    return CF 
end 
    
function rlu(C)
    B = copy(C)
    CF = RecursiveFactorization.lu!(B, Vector{Int}(undef, size(B, 2)))
    return CF
end 

LoopVectorization doesn’t have proper Float16 support, and TriangularSolve.jl explicitly only supports Float64 and Float32, because that is all I was thinking about at the time.

PRs to improve Float16 support are welcome, but I’d also suggest profiling to get an idea of how much improvement could be expected from (for example) adding Float16 support to TriangularSolve.jl.
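
In case it’s useful, here is a minimal profiling sketch along those lines, using the stdlib Profile and the rlu wrapper from the post above (the matrix size is arbitrary):

using Profile
using Random

Random.seed!(46071)
AH = Float16.(rand(2048, 2048))

rlu(AH)                       # compile first so the profile isn't dominated by compilation
Profile.clear()
@profile rlu(AH)
Profile.print(mincount = 50)  # look for time in the triangular solves vs. the LU kernel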

Thanks for flagging this. I looked and did see some promotions from half to single in VectorizationBase.jl. That could explain the occasional bump in backward errors, and it is something I need to mention if/when I publish results based on RecursiveFactorization.

I took a deeper look and found that the promotion from half to single happens very early, which explains some of the inconsistencies in my results. Getting RecursiveFactorization to work with Float16 without upcasting to Float32 is very much above my pay grade/skill set/time left to live. So … I finally stumbled on generic_lufact! deep in the Julia source and hacked it to do OK (not good). I will make a new post on that after I get my facts straight.
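
For anyone who wants to poke at the same thing, a minimal sketch of calling that generic fallback directly; it is internal, non-exported API, so the name and signature may change between Julia versions:

using LinearAlgebra
using Random

Random.seed!(46071)
AH = Float16.(rand(1024, 1024))

# Pure-Julia LU with partial pivoting; it works in the element type of the
# input, so the factorization stays in Float16 throughout.
# Internal routine: not part of the public API.
F = LinearAlgebra.generic_lufact!(copy(AH), RowMaximum())
@show eltype(F.factors)   # Float16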