Making Float16 LU better in generic_lu.jl

ctkelley · August 8, 2023, 7:29pm

I finally stumbled across generic_lufact! here. This seems to be where Float16 goes to die.

I made a copy and threaded the critical loop and things got a lot better for half precision work on my Mac.

Threading this function gave me almost perfect parallelism in my half precision lu work. Adding an @simd before the inner loop made the improvement about 10x what I had a couple weeks ago. I made no attempt to connect the number of threads (8 in my case) to the size of the problem.

I used Polyester.@batch for this. That was faster for me than FLoops.@floop or Threads.@threads. But even they were 5-10x faster than doing nothing when coupled with @simd on the inner loop.

Is there a reason why things like generic_lufact! are not threaded?

All I did was change this

# Update the rest
            for j = k+1:n
                for i = k+1:m
                    A[i,j] -= A[i,k]*A[k,j]
                end
            end

to

# Update the rest
@batch for j = k+1:n
       @simd ivdep for i = k+1:m
            @inbounds    A[i,j] -= A[i,k]*A[k,j]
       end
end

Even with this, my Float16 LU takes 2.5–6x longer than LAPACK’s
Float64 LU. If anyone sees a way I can get some more speed out of this,
I’d be glad to try it.

Short of coding the Toledo algorithm, like LAPACK and RecursiveFactorization.jl (neither of which support Float16) do, I don’t see any way to make much more progress. When LAPACK/BLAS realize the mixed-precision dream, this will not be necessary.

Elrod · August 8, 2023, 7:53pm

Some libraries use it as a base case for small problems. It is the fastest there, and these optimizations probably hurt that.

ctkelley · August 8, 2023, 9:12pm

That makes sense. Optimization for me: pessimizaton for others.

ctkelley · August 26, 2023, 3:11pm

I played with it and found that minbatch = 16 was best for small problems and no worse for the larger ones.

Topic		Replies	Views
Apples to apples comparison of A\b with Float64 and Float16 on A64FX Performance question , linearalgebra	12	788	May 2, 2022
Is half precision a killer app for RecursiveFactorization? Numerics	2	402	August 4, 2023
Performance gotcha in linear algebra lu() General Usage performance , linearalgebra	33	3616	February 11, 2020
Cholesky promotes Float16 Numerics question	16	1023	February 21, 2022
How to parallelize `lu!` factorization? Performance parallel , linearalgebra	15	194	July 22, 2025

Making Float16 LU better in generic_lu.jl

Related topics