Making Float16 LU better in generic_lu.jl

I finally stumbled across generic_lufact! here. This seems to be where Float16 goes to die.

I made a copy, threaded the critical loop, and things got a lot better for half-precision work on my Mac.

Threading this function gave me almost perfect parallelism in my half-precision LU work. Adding an @simd before the inner loop made the improvement about 10x what I had a couple of weeks ago. I made no attempt to connect the number of threads (8 in my case) to the size of the problem.

I used Polyester.@batch for this. That was faster for me than FLoops.@floop or Threads.@threads. But even those were 5–10x faster than doing nothing when coupled with @simd on the inner loop.

Is there a reason why things like generic_lufact! are not threaded?

All I did was change this

# Update the rest
for j = k+1:n
    for i = k+1:m
        A[i,j] -= A[i,k]*A[k,j]
    end
end

to

# Update the rest
@batch for j = k+1:n
    @simd ivdep for i = k+1:m
        @inbounds A[i,j] -= A[i,k]*A[k,j]
    end
end
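
If it helps anyone reproduce this, here is a simplified, self-contained sketch of that loop in context. The name lu_threaded! and the surrounding structure are illustrative only, not the actual generic_lufact! source, which also handles pivoting strategies and singularity checks:

using Polyester

# Illustrative sketch: unblocked LU with partial pivoting, with the trailing
# update threaded as in the snippet above. Not the stdlib implementation.
function lu_threaded!(A::AbstractMatrix{T}) where {T}
    m, n = size(A)
    ipiv = Vector{Int}(undef, min(m, n))
    for k = 1:min(m, n)
        # find the pivot row for column k
        p = k
        amax = abs(A[k, k])
        for i = k+1:m
            a = abs(A[i, k])
            if a > amax
                p, amax = i, a
            end
        end
        ipiv[k] = p
        # swap rows k and p (no singularity check here, unlike generic_lufact!)
        if p != k
            for j = 1:n
                A[k, j], A[p, j] = A[p, j], A[k, j]
            end
        end
        # scale the column below the pivot
        Akkinv = inv(A[k, k])
        for i = k+1:m
            A[i, k] *= Akkinv
        end
        # update the trailing submatrix: the threaded loop from above
        if k < n
            @batch for j = k+1:n
                @simd ivdep for i = k+1:m
                    @inbounds A[i, j] -= A[i, k] * A[k, j]
                end
            end
        end
    end
    return A, ipiv
end

Swapping @batch for Threads.@threads or FLoops.@floop in that update loop is a one-word change if you want to redo the comparison above.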

Even with this, my Float16 LU takes 2.5–6x longer than LAPACK’s
Float64 LU. If anyone sees a way I can get some more speed out of this,
I’d be glad to try it.
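
For what it’s worth, a comparison along those lines can be set up with BenchmarkTools. The size, the lu_threaded! name (from the sketch above), and whatever numbers you get are all machine-dependent and illustrative, not the figures from this post:

using BenchmarkTools, LinearAlgebra

n = 2048                      # arbitrary test size
A16 = rand(Float16, n, n)
A64 = Float64.(A16)

@btime lu_threaded!(B) setup=(B = copy($A16)) evals=1   # threaded generic Float16 LU
@btime lu!(B) setup=(B = copy($A64)) evals=1            # LAPACK Float64 LU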

Short of coding the Toledo algorithm, as LAPACK and RecursiveFactorization.jl do (neither of which supports Float16), I don’t see any way to make much more progress. When LAPACK/BLAS realize the mixed-precision dream, this will not be necessary.
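
For readers who haven’t seen it, the recursive idea looks roughly like the following. This is a deliberately stripped-down sketch (square input, no pivoting, trivial base case), purely to show where the work goes: the trailing update becomes a matrix-matrix product, which is what blocked and recursive factorizations exploit.

using LinearAlgebra

# Rough sketch of a recursive (Toledo-style) LU: square, unpivoted, and
# purely illustrative. Real implementations pivot and block the recursion.
function recursive_lu!(A::AbstractMatrix)
    n = size(A, 1)
    n == 1 && return A
    k = n ÷ 2
    A11 = view(A, 1:k, 1:k);    A12 = view(A, 1:k, k+1:n)
    A21 = view(A, k+1:n, 1:k);  A22 = view(A, k+1:n, k+1:n)
    recursive_lu!(A11)                     # A11 = L11 * U11 in place
    ldiv!(UnitLowerTriangular(A11), A12)   # A12 := L11 \ A12  (this is U12)
    rdiv!(A21, UpperTriangular(A11))       # A21 := A21 / U11  (this is L21)
    mul!(A22, A21, A12, -1, 1)             # A22 := A22 - L21 * U12
    recursive_lu!(A22)                     # recurse on the Schur complement
    return A
end

Without pivoting this is only sensible for well-conditioned test matrices; the point is the structure, not numerical robustness.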

Some libraries use generic_lufact! as a base case for small problems. It is the fastest option there, and these optimizations would probably hurt that case.

That makes sense. Optimization for me; pessimization for others.

I played with it and found that minbatch = 16 was best for small problems and no worse for the larger ones.
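
For anyone else trying this, minbatch is a keyword of Polyester’s @batch (roughly, a minimum number of outer-loop iterations per thread), so inside the sketch above the update loop would read:

@batch minbatch=16 for j = k+1:n
    @simd ivdep for i = k+1:m
        @inbounds A[i,j] -= A[i,k]*A[k,j]
    end
end

That keeps small trailing updates from being split into tiny per-thread chunks, so the small-problem case stays close to the serial code.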