The Julia docs say Float16 is implemented in software (presumably since most CPUs don’t support it). What about newer Xeons (with AVX512-FP16) that support f16?
In those cases it uses the hardware (although most of the libm-type functions are written under the assumption that Float16 is slow, and as such will often do the math in Float32 and convert back).
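For example, many of those libm-style wrappers follow roughly this pattern (an illustrative sketch with a hypothetical name, not the exact Base source):

julia> mylog1p(x::Float16) = Float16(log1p(Float32(x)))  # do the math in Float32, truncate back to Float16
mylog1p (generic function with 1 method)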
julia> code_llvm(+, NTuple{2,Float16})
; @ float.jl:408 within `+`
define half @"julia_+_362"(half %0, half %1) #0 {
top:
  %2 = fpext half %0 to float
  %3 = fpext half %1 to float
  %4 = fadd float %2, %3
  %5 = fptrunc float %4 to half
  ret half %5
}
julia> code_llvm(+, NTuple{2,Float32})
; @ float.jl:408 within `+`
define float @"julia_+_364"(float %0, float %1) #0 {
top:
  %2 = fadd float %0, %1
  ret float %2
}
As you can see here, to add 2 Float16 (aka “half”) numbers on my system, which lacks native Float16, it first converts them to Float32 (aka “float”), adds those, then truncates back to Float16. When I add two Float32, no such conversion is necessary.

So if your Float16 version looks like mine, you are not using native Float16. If your Float16 looks like my Float32 (with “half” instead of “float”), you are.
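If you would rather automate that check than eyeball the IR, here is a minimal sketch, assuming that the presence of an fpext promotion in the IR is a reliable sign of software emulation (as in my output above):

julia> using InteractiveUtils  # provides code_llvm outside the REPL

julia> ir = sprint(io -> code_llvm(io, +, NTuple{2,Float16}));

julia> occursin("fpext", ir)  # true means + promotes to Float32, i.e. no native Float16
true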
Thank you both! I think some laptop ARM chips (e.g. in Windows-on-ARM laptops) also support f16.
From somewhere around Julia v1.6 or so (I don’t remember exactly), on CPUs with native support for fp16, Julia does use native Float16 arithmetic without intermediate conversions (unless a function explicitly does that internally).
The Apple M* chips support Float16 and you don’t need to do anything to use it in any recent version of Julia. I’ve been using this for a few years.
The bad news is that tools like LAPACK do not support half precision.
Are you sure? I mean, I get the same as you, but I thought LLVM bitcode was independent of the hardware, so you might get the same IR even if half is supported natively? In that case this could still get you one instruction (plus a little boilerplate), rather than 15+ vs. 1:
julia> code_native(+, NTuple{2,Float16})
        .text
        .file   "+"
        .globl  "julia_+_2121"                  # -- Begin function julia_+_2121
        .p2align        4, 0x90
        .type   "julia_+_2121",@function
"julia_+_2121":                         # @"julia_+_2121"
; ┌ @ float.jl:409 within `+`
# %bb.0:                                # %top
        push    rbp
        mov     rbp, rsp
        vpextrw eax, xmm1, 0
        vpextrw ecx, xmm0, 0
        movzx   ecx, cx
        vmovd   xmm0, ecx
        vcvtph2ps       xmm0, xmm0
        movzx   eax, ax
        vmovd   xmm1, eax
        vcvtph2ps       xmm1, xmm1
        vaddss  xmm0, xmm0, xmm1
        vcvtps2ph       xmm0, xmm0, 4
        vmovd   eax, xmm0
        vpinsrw xmm0, xmm0, eax, 0
        pop     rbp
        ret
Also note that we have two papers out describing use cases of Julia’s Float16 running natively on Fujitsu’s A64FX chip:
- Giordano M, M Klöwer and V Churavy, 2022. Productivity meets Performance: Julia on A64FX, 2022 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, 10.1109/CLUSTER51413.2022.00072
- Klöwer M, S Hatfield, M Croci, PD Düben and TN Palmer, 2021. Fluid simulations accelerated with 16 bits: Approaching 4x speedup on A64FX by squeezing ShallowWaters.jl into Float16, Journal of Advances in Modeling Earth Systems, 14, 10.1029/2021MS002684
We do generate different Julia (and hence LLVM) code depending on certain processor features. For example, if you look at less(Base.Math._hypot, NTuple{2,Float64}), you will see it has a branch depending on Core.Intrinsics.have_fma(typeof(h)) that uses a slightly less accurate calculation when native FMA is not available (because emulated FMA is extremely slow).
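You can query that intrinsic yourself; a quick sketch (the true/false result of course depends on your CPU):

julia> Core.Intrinsics.have_fma(Float64)  # whether native fused multiply-add is available for Float64
true

julia> less(Base.Math._hypot, NTuple{2,Float64})  # opens the source, showing the have_fma branch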
I’m afraid I don’t have such an example for Float16 support, and less(+, NTuple{2,Float16}) is no help since it just calls the intrinsic add_float. So somewhere deep in the guts of Julia (or LLVM, really I’m out of my depth at this point) add_float is being converted to the promote+demote version. I cannot guarantee that native Float16 would generate something different without a better understanding of the internals or access to a system with native support. However, I find it unlikely that LLVM generates the promoting/demoting code and then optimizes those conversions away during machine compilation.

One can always check the code_native to get a more decisive answer, but I didn’t want to do that here because it is (more) architecture-dependent and is harder to read.
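To confirm that + on Float16 really does bottom out in that intrinsic, a quick check (output abbreviated; the exact CodeInfo printing varies between Julia versions):

julia> @code_lowered Float16(1) + Float16(2)
CodeInfo(
1 ─ %1 = Base.add_float(x, y)
└──      return %1
)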
I’m afraid I didn’t understand this part of your post. Are you meaning to compare instruction counts? Indeed, emulated Float16 is slower and the conversions add many instructions. When I compare code_native(+, NTuple{2,Float16}) vs code_native(+, NTuple{2,Float32}) on my machine without native Float16, I get 13 vs 5 instructions (native support would almost certainly reduce the 13 to the same 5).

But counting instructions out of context to determine the performance of a function like + can be misleading, since it is virtually always inlined into something larger. In practice, it will usually be 1 instruction with native support and 2+ without, depending on how many operations it can merge with neighboring operations. For example, some mov or cvt instructions may become unnecessary when the compiler controls the preceding/following instructions. I count 6 instructions per add on my machine in the context of code_native(+, NTuple{8,Float16}; debuginfo=:none) (look for the repeating pattern to remove the boilerplate), but it could be fewer in a calculation that didn’t need to read every input from an input register or convert every argument to every operation. To this end, code_native(x -> +(x,x,x,x,x,x), Tuple{Float16}; debuginfo=:none) uses only 3 instructions per add on my machine.
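If you want to reproduce those rough instruction counts without eyeballing the listings, here is a small heuristic sketch; it just counts indented lines that begin with a mnemonic, so labels, directives, and comments are skipped (the regex is only an approximation, not a real disassembler):

julia> using InteractiveUtils

julia> asm = sprint(io -> code_native(io, +, NTuple{2,Float16}; debuginfo=:none));

julia> count(l -> occursin(r"^\s+[a-z]", l), split(asm, '\n'))  # 13 here, matching the count above
13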
LLVM IR is very much not independent of the hardware, not just in Julia, but probably with any frontend. There are backend-specific intrinsics which can be used, and of course they make sense only with specific targets. Also, all the vectorisation is totally dependent on the target; as an extreme case, vscale can be used only with targets that support variable-length SIMD registers.
This is what you get for +(::Float16, ::Float16) on a CPU with native float16 (this is Nvidia Grace):
julia> code_llvm(+, NTuple{2,Float16})
; @ float.jl:409 within `+`
define half @"julia_+_137"(half %0, half %1) #0 {
top:
  %2 = fadd half %0, %1
  ret half %2
}
julia> code_native(+, NTuple{2,Float16})
        .text
        .file   "+"
        .globl  "julia_+_144"                   // -- Begin function julia_+_144
        .p2align        2
        .type   "julia_+_144",@function
"julia_+_144":                          // @"julia_+_144"
; ┌ @ float.jl:409 within `+`
// %bb.0:                               // %top
        stp     x29, x30, [sp, #-16]!           // 16-byte Folded Spill
        mov     x29, sp
        fadd    h0, h0, h1
        ldp     x29, x30, [sp], #16             // 16-byte Folded Reload
        ret
.Lfunc_end0:
        .size   "julia_+_144", .Lfunc_end0-"julia_+_144"
; └
                                        // -- End function
        .section        ".note.GNU-stack","",@progbits
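If you are unsure which microarchitecture your Julia session is targeting, you can also check the CPU name LLVM detected (a quick sketch; the string below is just an example of what a Grace system might report):

julia> Sys.CPU_NAME
"neoverse-v2"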
Can we surface this fp16 hardware check as a top-level Julia API? That would be useful for shipping code that chooses the optimal float type based on the user’s hardware.
I’m not really sure what you mean.
I think it’s a request for something like Core.Intrinsics.is_native(Float16) or Core.Intrinsics.native_f16() that returns true if the system supports native Float16 calculations and false otherwise.
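Purely for illustration, here is how calling code might use such a query if it existed. Neither name above is a real API today, so this sketch falls back to the LLVM-IR heuristic from earlier in the thread (assuming an fpext in the IR of Float16 + means software emulation):

julia> using InteractiveUtils

julia> native_f16() = !occursin("fpext", sprint(io -> code_llvm(io, +, NTuple{2,Float16})));

julia> best_float() = native_f16() ? Float16 : Float32;  # narrowest float type that should be fast here

julia> x = zeros(best_float(), 1024);  # element type chosen from the hardware guess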
Yep
I don’t think anything like that is exposed on the Julia side (it’s only in the internals of the C side of the compiler).
But my question would also be: what do you want to do with that?