I finally stumbled across generic_lufact! here. This seems to be where Float16 goes to die.
I made a copy and threaded the critical loop, and half-precision LU got a lot better on my Mac: threading alone gave me almost perfect parallelism, and adding @simd to the inner loop brought the overall improvement to about 10x over what I had a couple of weeks ago. I made no attempt to connect the number of threads (8 in my case) to the size of the problem.
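(In case it matters for reproducing this: the thread count is just whatever Julia was started with, which Polyester then draws on.)

```julia
# I start Julia with `julia --threads=8`; nothing in the code picks the count.
Threads.nthreads()   # 8 on my machine
```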
I used Polyester.@batch for this; it was faster for me than FLoops.@floop or Threads.@threads, but even those were 5–10x faster than the untouched serial loop when coupled with @simd on the inner loop (see the sketch after the code change below).
Is there a reason why things like generic_lufact! are not threaded?
All I did was change this:

```julia
# Update the rest
for j = k+1:n
    for i = k+1:m
        A[i,j] -= A[i,k]*A[k,j]
    end
end
```
to this:

```julia
# Update the rest
@batch for j = k+1:n
    @simd ivdep for i = k+1:m
        @inbounds A[i,j] -= A[i,k]*A[k,j]
    end
end
```
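For anyone who wants to run the macro comparison themselves, here is roughly what I timed, written as three stand-alone functions (the function names are just mine for illustration):

```julia
using Polyester, FLoops

# Trailing-submatrix update at elimination step k, written three ways.
# Only the outer-loop macro differs; the @simd ivdep inner loop is identical.

function update_batch!(A, k)              # Polyester
    m, n = size(A)
    @batch for j in k+1:n
        @simd ivdep for i in k+1:m
            @inbounds A[i, j] -= A[i, k] * A[k, j]
        end
    end
    return A
end

function update_floop!(A, k)              # FLoops
    m, n = size(A)
    @floop ThreadedEx() for j in k+1:n
        @simd ivdep for i in k+1:m
            @inbounds A[i, j] -= A[i, k] * A[k, j]
        end
    end
    return A
end

function update_threads!(A, k)            # Base Threads
    m, n = size(A)
    Threads.@threads for j in k+1:n
        @simd ivdep for i in k+1:m
            @inbounds A[i, j] -= A[i, k] * A[k, j]
        end
    end
    return A
end
```

My guess is that @batch wins here mostly because of lower per-call overhead, which matters since this update runs once per elimination step.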
Even with this, my Float16 LU takes 2.5–6x longer than LAPACK's Float64 LU. If anyone sees a way I can get some more speed out of this, I'd be glad to try it.
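For reference, the comparison I'm making is essentially this (the matrix size is just an example; the Float16 call is the one that lands in generic_lufact!):

```julia
using BenchmarkTools, LinearAlgebra

n = 2048                      # size is just an example
A16 = rand(Float16, n, n)
A64 = Float64.(A16)

@btime lu!(copy($A16));       # hits generic_lufact! (no Float16 BLAS/LAPACK kernel)
@btime lu!(copy($A64));       # hits LAPACK's dgetrf
```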
Short of coding the Toledo recursive algorithm, as LAPACK and RecursiveFactorization.jl do (neither of which supports Float16), I don't see a way to make much more progress. When LAPACK/BLAS realize the mixed-precision dream, none of this will be necessary.
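For the record, the core of the recursive idea is simple; here is a minimal sketch, without pivoting (so not usable as-is, and not what LAPACK or RecursiveFactorization.jl actually ship), just to show how it pushes almost all the work into triangular solves and matrix-matrix multiplies:

```julia
using LinearAlgebra

# Recursive (Toledo-style) LU without pivoting: split the columns in half,
# factor the left panel, do one triangular solve and one matrix-multiply
# update, then recurse on the trailing block. Almost all flops land in mul!.
function rec_lu!(A::AbstractMatrix{T}) where {T}
    m, n = size(A)
    if n == 1
        piv = A[1, 1]
        @simd ivdep for i in 2:m
            @inbounds A[i, 1] /= piv
        end
        return A
    end
    n1 = n ÷ 2
    A11 = view(A, 1:n1, 1:n1)
    A21 = view(A, n1+1:m, 1:n1)
    A12 = view(A, 1:n1, n1+1:n)
    A22 = view(A, n1+1:m, n1+1:n)
    rec_lu!(view(A, :, 1:n1))                 # factor the left block column
    ldiv!(UnitLowerTriangular(A11), A12)      # U12 = L11 \ A12  (TRSM-like)
    mul!(A22, A21, A12, -one(T), one(T))      # A22 -= L21 * U12 (GEMM-like)
    rec_lu!(A22)                              # recurse on the Schur complement
    return A
end
```

As far as I can tell, mul! for Float16 still falls back to the generic matmul, so even this would not close the whole gap without real low-precision kernels underneath.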