I finally stumbled across
generic_lufact! here. This seems to be where
Float16 goes to die.
I made a copy and threaded the critical loop, which gave me almost perfect parallelism for half-precision LU work on my Mac. Adding an
@simd before the inner loop brought the total improvement to about 10x what I had a couple of weeks ago. I made no attempt to connect the number of threads (8 in my case) to the size of the problem.
I used
Polyester.@batch for this, which was faster for me than
Threads.@threads, but either one was 5-10x faster than doing nothing when coupled with
@simd on the inner loop.
Is there a reason why things like
generic_lufact! are not threaded?
All I did was change this

```julia
# Update the rest
for j = k+1:n
    for i = k+1:m
        A[i,j] -= A[i,k]*A[k,j]
    end
end
```

to this

```julia
# Update the rest
@batch for j = k+1:n
    @simd ivdep for i = k+1:m
        @inbounds A[i,j] -= A[i,k]*A[k,j]
    end
end
```
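For context, here is a minimal sketch of how that update sits inside a full right-looking LU with partial pivoting. The function name `my_lufact!` is hypothetical, and I use the stdlib `Threads.@threads` rather than `Polyester.@batch` so it runs without extra packages; swapping in `@batch` is a one-line change.

```julia
using LinearAlgebra

# Sketch of a generic right-looking LU with partial pivoting.
# Works for any element type, including Float16; returns the
# combined L/U factors in-place plus the pivot vector.
function my_lufact!(A::AbstractMatrix{T}) where T
    m, n = size(A)
    ipiv = Vector{Int}(undef, min(m, n))
    for k = 1:min(m, n)
        # Partial pivoting: largest |A[i,k]| for i >= k
        p = k
        amax = abs(A[k, k])
        for i = k+1:m
            a = abs(A[i, k])
            if a > amax
                p, amax = i, a
            end
        end
        ipiv[k] = p
        if p != k
            for j = 1:n  # swap rows k and p
                A[k, j], A[p, j] = A[p, j], A[k, j]
            end
        end
        # Scale the column below the pivot
        if !iszero(A[k, k])
            Akkinv = inv(A[k, k])
            for i = k+1:m
                A[i, k] *= Akkinv
            end
        end
        # Update the trailing submatrix -- the hot loop
        Threads.@threads for j = k+1:n
            @simd ivdep for i = k+1:m
                @inbounds A[i, j] -= A[i, k] * A[k, j]
            end
        end
    end
    return A, ipiv
end
```

On a small Float64 test matrix this reproduces the same combined factors as `lu` from LinearAlgebra, since both use partial pivoting.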
Even with this, my
Float16 LU takes 2.5–6x longer than LAPACK’s
Float64 LU. If anyone sees a way I can get some more speed out of this,
I’d be glad to try it.
Short of implementing the Toledo (recursive blocked) algorithm, as LAPACK and RecursiveFactorization.jl do (neither of which supports Float16), I don't see a way to make much more progress. Once LAPACK/BLAS realize the mixed-precision dream, this will no longer be necessary.