Cholesky promotes Float16

ctkelley · February 19, 2022, 7:07pm

I’m seeing Cholesky promote half precision to single on its own. Is this
supposed to happen? lu does not do this.

Running 1.7.2 on an M1 Macbook. Same results on Intel Mac with MKL Blas.

Bug or feature?

julia> R=rand(Float16,3,3)
3×3 Matrix{Float16}:
 4.00391e-01  8.24219e-01  2.50000e-01
 2.34375e-02  2.90527e-01  1.09375e-01
 7.85156e-01  9.42383e-01  3.76465e-01

julia> P=R'*R
3×3 Matrix{Float16}:
 7.77344e-01  1.07617e+00  3.98438e-01
 1.07617e+00  1.65234e+00  5.92773e-01
 3.98438e-01  5.92773e-01  2.16187e-01

julia> PF=lu(P)
LU{Float16, Matrix{Float16}}
L factor:
3×3 Matrix{Float16}:
 1.00000e+00  0.00000e+00  0.00000e+00
 7.22168e-01  1.00000e+00  0.00000e+00
 3.70117e-01  1.58325e-01  1.00000e+00
U factor:
3×3 Matrix{Float16}:
 1.07617e+00   1.65234e+00   5.92773e-01
 0.00000e+00  -1.17188e-01  -2.95410e-02
 0.00000e+00   0.00000e+00   1.50299e-03

julia> QF=cholesky(P)
Cholesky{Float32, Matrix{Float32}}
U factor:
3×3 UpperTriangular{Float32, Matrix{Float32}}:
 8.81671e-01  1.22060e+00  4.51912e-01
      ⋅       4.03073e-01  1.02135e-01
      ⋅            ⋅       3.91250e-02

artkuo · February 19, 2022, 8:02pm

It looks like it’s done on purpose. It promotes to at least Float32, based on the type of the square root of an element of the matrix:
https://github.com/JuliaLang/julia/blob/623ceb7834de47538eddeadfa84a8bf2d9741248/stdlib/LinearAlgebra/src/cholesky.jl#L181
I’m not sure why, but it could be that they wanted to ensure some minimum precision for the result. It might be worth debating whether that promotion should be made optional, or at least document that it does so.

ctkelley · February 19, 2022, 8:16pm

Nice work to find that.

My view is that lu does the right thing. If you want to compute in Float16, the language should let you. I think this is likely an unintended consequence of that line. It would
seem that

@inline choltype(A) = typeof(sqrt(oneunit(eltype(A)))
would make more sense.

However, I’m not clear that it would not break something else.

artkuo · February 19, 2022, 9:17pm

I can make a guess for the reason, which makes this a bug. In the cholesky.jl code,

# The dispatch structure in the cholesky, and cholesky! methods is a bit
# complicated and some explanation is therefore provided in the following
#
# In the methods below, LAPACK is called when possible, i.e. StridedMatrices with Float32,
# Float64, ComplexF32, and ComplexF64 element types.

My guess it that choltype was originally intended to ensure at least Float32 for the LAPACK branch, but is accidentally used even for the types that don’t get sent to LAPACK. In fact, later in the same comments,

# FixMe? The dispatch below seems overly complicated. One simplification could be to
# merge the two Cholesky types into one.

So I believe the correct fix would be to keep both yours and the original version (e.g. by naming choltypeforlapack or by dispatch). As you say, it’s not clear if your fix would break something else, but I would label the current non-LAPACK behavior a bug.

ctkelley · February 19, 2022, 9:51pm

I raised an issue. We’ll see what the complier people think.

gbaraldi · February 20, 2022, 1:47am

I think there are two issues at play. First thing is that the LAPACK implementation that julia uses doesn’t have float16 routines which is probably why it has to get promoted, the other is that julia currently has a compiler pass to promote Float16 computations to Float32.
One thing that could be done is to truncate the result after doing the LAPACK routine so the user gets a Float16 back.
I think in a previous thread you opened I gave more details Half precision

ctkelley · February 20, 2022, 12:42pm

Strange. Everything works as I would expect with lu. In particular, lu! takes a long time in Float16 because Float16 computations are done in software.

julia> A=rand(Float32, 1000,1000); A16=Float16.(A);

julia> @btime lu!($A);
  4.907 ms (1 allocation: 7.94 KiB)

julia> @btime lu!($A16);
  218.729 ms (1 allocation: 7.94 KiB)

However, cholesky seems to do all of its work in Float32, so the matrix gets promoted on the way in before LAPACK. You can see that the timings are much closer than for lu. That promotion makes some sense given LAPACK’s limits. Why is lu different? Somehow I can’t get cholesky! to work right with @btime.

julia> A=rand(Float32,1000,1000); A ./= opnorm(A);

julia> B=I + Hermitian(A*A); B16=Float16.(B);

julia> @btime cholesky($B);
  3.025 ms (4 allocations: 3.81 MiB)

julia> @btime cholesky($B16);
  3.386 ms (4 allocations: 3.81 MiB)

gbaraldi · February 20, 2022, 12:56pm

Probably LU has a simple fallback implementation in Julia while cholesky doesn’t.

dkarrasch · February 20, 2022, 5:35pm

The type promotion to at least Float32 is indeed to make use of LAPACK. Converting back down to Float16 was implemented in https://github.com/JuliaLang/julia/pull/41352 and is going to be available in v1.8.

ctkelley · February 20, 2022, 6:15pm

Just what I wanted to hear. Thanks.

Paul_Soderlind · February 20, 2022, 7:30pm

A related issue is that lowrankupdate!(C,v) fails if v is a vector of integers.

artkuo · February 20, 2022, 7:31pm

That’s good news. But @ctkelley note that this still promotes and computes in Float32 and then converts the output down to Float16. It still promotes in choltype, whether for LAPACK or generic. In this sense it’s still different from lu. It does ensures the right type is returned, but if what you want is a true Float16 emulation, the precision here may actually be a bit higher.

RoyiAvital · February 20, 2022, 8:10pm

The main reason for using Float16 and such is performance.
Converting it to Float32, paying the run time tax and then going back is the worse.
Something like not having the cake and not eating it.

If it is an intermediate phase as the current implementations (LAPACK) doesn’t support Float16 it is OK. But the end goal should be that using Float16 must be faster than Float32 and might be less accurate.

Is there a plan for native Linear Algebra wide support for Float16?

Oscar_Smith · February 20, 2022, 8:13pm

float16 can’t be faster for most processors since most don’t have hardware float16 support.

ctkelley · February 20, 2022, 9:27pm

Apple silicon supports Float16, as do the leadership class supercomputers. There is some pretty wide and serious interest in this.

Oscar_Smith · February 21, 2022, 2:40am

See https://github.com/JuliaLang/julia/issues/40216#issuecomment-1046411397 for progress on getting better FP16 support for Julia.

RoyiAvital · February 21, 2022, 5:24am

You forget one important element for performance - Throughput.
While the execution ports of most CPU’s doesn’t support Float16 natively, for most Linear Algebra operation the issue isn’t the rate of execution but the memory bandwidth which is effectively multiplied (Unless in memory in Julia array of Float16 isn’t dense).

But what I wrote was the end goal. We’re also during intermediate phase in CPU’s as well.
But what I wrote would have happened on CPU’s with Float16 support, so it holds.

Topic		Replies	Views
Apples to apples comparison of A\b with Float64 and Float16 on A64FX Performance question , linearalgebra	12	788	May 2, 2022
How to use SparseArrays single precision (Float32)? General Usage	2	836	August 4, 2019
How to promote a factorization? Numerics question	5	458	March 10, 2022
Massive performance penalty for Float16 compared to Float32 Performance performance	17	8063	June 20, 2022
Making Float16 LU better in generic_lu.jl Numerics	3	234	August 26, 2023

Cholesky promotes Float16

Related topics