Cholesky promotes Float16

I’m seeing Cholesky promote half precision to single on its own. Is this
supposed to happen? lu does not do this.

Running 1.7.2 on an M1 MacBook. Same results on an Intel Mac with MKL BLAS.

Bug or feature?

julia> R=rand(Float16,3,3)
3×3 Matrix{Float16}:
 4.00391e-01  8.24219e-01  2.50000e-01
 2.34375e-02  2.90527e-01  1.09375e-01
 7.85156e-01  9.42383e-01  3.76465e-01

julia> P=R'*R
3×3 Matrix{Float16}:
 7.77344e-01  1.07617e+00  3.98438e-01
 1.07617e+00  1.65234e+00  5.92773e-01
 3.98438e-01  5.92773e-01  2.16187e-01

julia> PF=lu(P)
LU{Float16, Matrix{Float16}}
L factor:
3×3 Matrix{Float16}:
 1.00000e+00  0.00000e+00  0.00000e+00
 7.22168e-01  1.00000e+00  0.00000e+00
 3.70117e-01  1.58325e-01  1.00000e+00
U factor:
3×3 Matrix{Float16}:
 1.07617e+00   1.65234e+00   5.92773e-01
 0.00000e+00  -1.17188e-01  -2.95410e-02
 0.00000e+00   0.00000e+00   1.50299e-03

julia> QF=cholesky(P)
Cholesky{Float32, Matrix{Float32}}
U factor:
3×3 UpperTriangular{Float32, Matrix{Float32}}:
 8.81671e-01  1.22060e+00  4.51912e-01
      ⋅       4.03073e-01  1.02135e-01
      ⋅            ⋅       3.91250e-02

It looks like it’s done on purpose. It promotes to at least Float32, based on the type of the square root of an element of the matrix:
https://github.com/JuliaLang/julia/blob/623ceb7834de47538eddeadfa84a8bf2d9741248/stdlib/LinearAlgebra/src/cholesky.jl#L181
I’m not sure why, but it could be that they wanted to ensure some minimum precision for the result. It might be worth debating whether that promotion should be made optional, or at least documenting that it happens.
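For reference, the linked definition is (roughly, paraphrasing that commit) something like

@inline choltype(A) = promote_type(typeof(sqrt(oneunit(eltype(A)))), Float32)

so any element type whose square root is narrower than Float32, such as Float16, gets widened. You can see the effect directly in the factors above:

julia> eltype(lu(P).U), eltype(cholesky(P).U)
(Float16, Float32)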

Nice work finding that.

My view is that lu does the right thing. If you want to compute in Float16, the language should let you. I think this is likely an unintended consequence of that line. It would
seem that

@inline choltype(A) = typeof(sqrt(oneunit(eltype(A))))

would make more sense.

However, I’m not sure whether it would break something else.
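A quick sanity check of the type computation involved (REPL only, not a patch):

julia> typeof(sqrt(oneunit(Float16)))
Float16

julia> promote_type(typeof(sqrt(oneunit(Float16))), Float32)  # what the current definition effectively does
Float32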


I can guess at the reason, and if my guess is right, this is a bug. In the cholesky.jl code,

# The dispatch structure in the cholesky, and cholesky! methods is a bit
# complicated and some explanation is therefore provided in the following
#
# In the methods below, LAPACK is called when possible, i.e. StridedMatrices with Float32,
# Float64, ComplexF32, and ComplexF64 element types. 

My guess is that choltype was originally intended to ensure at least Float32 for the LAPACK branch, but is accidentally used even for the types that don’t get sent to LAPACK. In fact, later in the same comments,

# FixMe? The dispatch below seems overly complicated. One simplification could be to
# merge the two Cholesky types into one. 

So I believe the correct fix would be to keep both your version and the original (e.g. by naming the latter choltypeforlapack, or by dispatch), as sketched below. As you say, it’s not clear whether your fix would break something else, but I would call the current non-LAPACK behavior a bug.
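A minimal sketch of that split (names here are illustrative, not the actual LinearAlgebra internals):

# generic (non-LAPACK) path: keep whatever type sqrt gives, e.g. Float16
@inline choltype(A) = typeof(sqrt(oneunit(eltype(A))))

# LAPACK path: widen to at least Float32, since LAPACK has no Float16 routines
@inline choltypeforlapack(A) = promote_type(choltype(A), Float32)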

I raised an issue. We’ll see what the compiler people think.

I think there are two issues at play. The first is that the LAPACK implementation Julia uses doesn’t have Float16 routines, which is probably why the matrix gets promoted; the other is that Julia currently has a compiler pass that promotes Float16 computations to Float32.
One thing that could be done is to truncate the result after the LAPACK routine so the user gets a Float16 back.
I think I gave more details in a previous thread you opened: Half precision
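In the meantime, a hand-rolled version of that truncation might look like this (just a sketch, reusing the P from the first post; it only recovers the factor, not a full Cholesky object):

julia> U16 = Float16.(cholesky(P).U);  # compute in Float32 via the promoted path, then truncate back to Float16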

Strange. Everything works as I would expect with lu. In particular, lu! takes a long time in Float16 because Float16 computations are done in software.

julia> A=rand(Float32, 1000,1000); A16=Float16.(A);

julia> @btime lu!($A);
  4.907 ms (1 allocation: 7.94 KiB)

julia> @btime lu!($A16);
  218.729 ms (1 allocation: 7.94 KiB)

However, cholesky seems to do all of its work in Float32, so the matrix gets promoted on the way in before LAPACK. You can see that the timings are much closer than for lu. That promotion makes some sense given LAPACK’s limits. Why is lu different? Somehow I can’t get cholesky! to work right with @btime.

julia> A=rand(Float32,1000,1000); A ./= opnorm(A);

julia> B=I + Hermitian(A*A); B16=Float16.(B);

julia> @btime cholesky($B);
  3.025 ms (4 allocations: 3.81 MiB)

julia> @btime cholesky($B16);
  3.386 ms (4 allocations: 3.81 MiB)
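One likely reason cholesky! misbehaves with @btime: the benchmark re-runs the call on the same buffer, and after the first in-place factorization that buffer no longer holds the original Hermitian positive definite matrix, so later evaluations fail. A fresh copy per evaluation avoids that (sketch, assuming BenchmarkTools; setup and evals are standard keywords there):

julia> @btime cholesky!(C) setup=(C = copy($B16)) evals=1;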

Probably LU has a simple fallback implementation in Julia while cholesky doesn’t.

The type promotion to at least Float32 is indeed to make use of LAPACK. Converting back down to Float16 was implemented in https://github.com/JuliaLang/julia/pull/41352 and is going to be available in v1.8.

Just what I wanted to hear. Thanks.

A related issue is that lowrankupdate!(C,v) fails if v is a vector of integers.
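If the failure is just the integer element type (lowrankupdate! overwrites v during the update, which an integer vector can’t absorb), a workaround sketch is to convert the vector first:

julia> using LinearAlgebra

julia> C = cholesky(Hermitian([4.0 1.0; 1.0 3.0]));

julia> v = [1, 2];  # integer vector: lowrankupdate!(C, v) is the failing case

julia> lowrankupdate!(C, float.(v));  # a floating-point copy of v can be overwritten safely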

That’s good news. But @ctkelley note that this still promotes to and computes in Float32, then converts the output back down to Float16. It still promotes in choltype, whether for the LAPACK or the generic path, so in this sense it’s still different from lu. It does ensure the right type is returned, but if what you want is a true Float16 emulation, the precision here may actually be a bit higher.
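For comparison, here is a minimal sketch of an unblocked Cholesky that stays entirely in the element type, so a Float16 input is factored in Float16 arithmetic throughout. This is just an illustration of "true Float16 emulation", not the LinearAlgebra implementation, and a borderline matrix may hit a non-positive pivot in Float16 where the promoted path would not.

using LinearAlgebra

# In-place, unblocked, lower-triangular Cholesky with no type promotion anywhere.
function naive_cholesky!(A::AbstractMatrix{T}) where {T}
    n = size(A, 1)
    for j in 1:n
        for k in 1:j-1                      # subtract contributions of earlier columns
            A[j, j] -= A[j, k]^2
        end
        A[j, j] = sqrt(A[j, j])             # diagonal entry of L
        for i in j+1:n
            for k in 1:j-1
                A[i, j] -= A[i, k] * A[j, k]
            end
            A[i, j] /= A[j, j]              # below-diagonal entries of column j
        end
    end
    return LowerTriangular(A)
end

L16 = naive_cholesky!(copy(P))  # P is the Float16 matrix from the first post; eltype stays Float16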

The main reason for using Float16 and the like is performance.
Converting to Float32, paying the run-time tax, and then converting back is the worst of both worlds. Something like not having the cake and not eating it either.

If this is an intermediate phase because the current implementation (LAPACK) doesn’t support Float16, that’s OK. But the end goal should be that using Float16 is faster than Float32, possibly at the cost of some accuracy.

Is there a plan for native, across-the-board linear algebra support for Float16?

Float16 can’t be faster on most processors, since most don’t have hardware Float16 support.


Apple silicon supports Float16, as do the leadership class supercomputers. There is some pretty wide and serious interest in this.


See https://github.com/JuliaLang/julia/issues/40216#issuecomment-1046411397 for progress on getting better FP16 support for Julia.

You forget one important element of performance: throughput.
While the execution ports of most CPUs don’t support Float16 natively, for most linear algebra operations the issue isn’t the execution rate but memory bandwidth, which is effectively multiplied, i.e. doubled relative to Float32 (unless Float16 arrays in Julia aren’t stored densely in memory).

But what I wrote was the end goal. We’re in an intermediate phase on CPUs as well.
And what I wrote would hold on CPUs with Float16 support, so the point stands.
