Does float16 run natively on a compatible CPU?

The Julia docs say Float16 is implemented in software (presumably because most CPUs don’t support it). What about newer Xeons (e.g. with AVX512-FP16) that do support Float16?

In those cases it uses the hardware (although most of the libm-type functions are written under the assumption that Float16 is slow, and as such they will often do the math in Float32 and convert back).
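
For example, here is a rough sketch of that promote-compute-demote pattern (my own illustration, not necessarily Base’s exact definition):

# Sketch of the promote-compute-demote pattern: do the actual math in Float32,
# then round the result back down to Float16.
half_sin(x::Float16) = Float16(sin(Float32(x)))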

julia> code_llvm(+, NTuple{2,Float16})
;  @ float.jl:408 within `+`
define half @"julia_+_362"(half %0, half %1) #0 {
top:
  %2 = fpext half %0 to float
  %3 = fpext half %1 to float
  %4 = fadd float %2, %3
  %5 = fptrunc float %4 to half
  ret half %5
}

julia> code_llvm(+, NTuple{2,Float32})
;  @ float.jl:408 within `+`
define float @"julia_+_364"(float %0, float %1) #0 {
top:
  %2 = fadd float %0, %1
  ret float %2
}

As you can see here, to add 2 Float16 (aka “half”) numbers on my system, which lacks native Float16, it first converts them to Float32 (aka “float”), adds those, then truncates back to Float16. When I add two Float32, no such conversion is necessary.

So if your Float16 output looks like mine, you are not using native Float16. If your Float16 output looks like my Float32 output (but with “half” instead of “float”), you are.
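
If you’d rather check programmatically than by eye, one rough heuristic (my own sketch, not an official API) is to capture the IR and search for the promotion:

using InteractiveUtils  # provides code_llvm outside the REPL

# Heuristic: emulated Float16 addition promotes to Float32 ("fpext half ... to float"),
# so the absence of that promotion suggests native half-precision arithmetic.
function float16_add_is_emulated()
    io = IOBuffer()
    code_llvm(io, +, NTuple{2,Float16})
    return occursin("fpext half", String(take!(io)))
end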


Thank you both! I think some laptop ARM chips (e.g. in Windows-on-ARM laptops) also support Float16.

From somewhere around Julia v1.6 or so (I don’t remember exactly), on CPUs with native fp16 support, Julia does use native Float16 arithmetic without intermediate conversions (unless a function explicitly does the conversion internally).


The Apple M* chips support Float16 and you don’t need to do anything to use it in any recent version of Julia. I’ve been using this for a few years.

The bad news is that tools like LAPACK do not support half precision.


Are you sure? I mean, I get the same output as you, but I thought LLVM (bitcode) was independent of the hardware, so you might get the same IR even when half is supported natively. In that case the backend could still compile it down to one instruction (plus a little boilerplate), rather than the 15+ vs. 1 instructions seen here:

julia> code_native(+, NTuple{2,Float16})
	.text
	.file	"+"
	.globl	"julia_+_2121"                  # -- Begin function julia_+_2121
	.p2align	4, 0x90
	.type	"julia_+_2121",@function
"julia_+_2121":                         # @"julia_+_2121"
; ┌ @ float.jl:409 within `+`
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	vpextrw	eax, xmm1, 0
	vpextrw	ecx, xmm0, 0
	movzx	ecx, cx
	vmovd	xmm0, ecx
	vcvtph2ps	xmm0, xmm0
	movzx	eax, ax
	vmovd	xmm1, eax
	vcvtph2ps	xmm1, xmm1
	vaddss	xmm0, xmm0, xmm1
	vcvtps2ph	xmm0, xmm0, 4
	vmovd	eax, xmm0
	vpinsrw	xmm0, xmm0, eax, 0
	pop	rbp
	ret

Also note that we have two papers out describing use cases of Julia’s native Float16 on Fujitsu’s A64FX chip:

  • Giordano M, M Klöwer and V Churavy, 2022. Productivity meets Performance: Julia on A64FX, 2022 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, 10.1109/CLUSTER51413.2022.00072
  • Klöwer M, S Hatfield, M Croci, PD Düben and TN Palmer, 2021. Fluid simulations accelerated with 16 bits: Approaching 4x speedup on A64FX by squeezing ShallowWaters.jl into Float16, Journal of Advances in Modeling Earth Systems, 14, 10.1029/2021MS002684

We do generate different Julia (and hence LLVM) code depending on certain processor features. For example, if you look at less(Base.Math._hypot, NTuple{2,Float64}), you will see it has a branch depending on Core.Intrinsics.have_fma(typeof(h)) that uses a slightly less accurate calculation when native FMA is not available (because emulated FMA is extremely slow).
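
A minimal sketch of that pattern (my own illustration, not Base’s actual _hypot code) looks something like:

# Branch on whether native FMA exists for the element type, and fall back to a
# plain (slightly less accurate) formula when fma would have to be emulated.
function diff_of_squares(a::T, b::T) where {T<:AbstractFloat}
    if Core.Intrinsics.have_fma(T)
        return fma(a, a, -(b * b))   # the product a*a incurs no separate rounding
    else
        return a * a - b * b
    end
end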

I’m afraid I don’t have such an example for Float16 support, and less(+, NTuple{2,Float16}) is no help since it just calls the intrinsic add_float. So somewhere deep in the guts of Julia (or LLVM, really I’m out of my depth at this point) add_float is being converted to the promote+demote version. I cannot guarantee that native Float16 would generate something different without a better understanding of the internals or access to a system with native support. However, I find it unlikely that LLVM generates the promoting/demoting code and then optimizes those conversions away during machine compilation.

One can always check the code_native to get a more decisive answer, but I didn’t want to do that here because it is (more) architecture-dependent and is harder to read.

I’m afraid I didn’t understand this part of your post. Do you mean to compare instruction counts? Indeed, emulated Float16 is slower, and the conversions add many instructions. When I compare code_native(+, NTuple{2,Float16}) vs code_native(+, NTuple{2,Float32}) on my machine without native Float16, I get 13 vs 5 instructions (native support would almost certainly reduce the 13 to the same 5).

But counting instructions out of context to determine the performance of a function like + can be misleading, since it is virtually always inlined into something larger. In practice, it will usually be 1 instruction with native support and 2+ without, depending on how many operations it can merge with neighboring operations. For example, some mov or cvt instructions may become unnecessary when the compiler controls the preceding/following instructions. I count 6 instructions per add on my machine in the context of code_native(+, NTuple{8,Float16}; debuginfo=:none) (look for the repeating pattern to remove the boilerplate), but it could be fewer in a calculation that didn’t need to read every input from an input register or convert every operand of every operation. To this end, code_native(x -> +(x,x,x,x,x,x), Tuple{Float16}; debuginfo=:none) uses only 3 instructions per add on my machine.
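
If you want to see the in-context effect yourself, it is usually more informative to look at a small loop than at the bare +; something like this sketch works:

# Inspect a loop body instead of an isolated `+`: once inlined, the compiler can
# keep values in registers and drop redundant moves/conversions between adds.
function sum16(v::Vector{Float16})
    s = zero(Float16)
    @inbounds @simd for i in eachindex(v)
        s += v[i]
    end
    return s
end

# code_native(sum16, Tuple{Vector{Float16}}; debuginfo=:none)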

LLVM IR is very much not independent of the hardware, not just in Julia but probably with any frontend. There are backend-specific intrinsics which can be used, and of course they make sense only for specific targets. Also, all the vectorisation is totally dependent on the target; as an extreme case, vscale can be used only with targets that support variable-size SIMD registers.

This is what you get for +(::Float16, ::Float16) on a CPU with native Float16 (this is Nvidia Grace):

julia> code_llvm(+, NTuple{2,Float16})
;  @ float.jl:409 within `+`
define half @"julia_+_137"(half %0, half %1) #0 {
top:
  %2 = fadd half %0, %1
  ret half %2
}
julia> code_native(+, NTuple{2,Float16})
        .text
        .file   "+"
        .globl  "julia_+_144"                   // -- Begin function julia_+_144
        .p2align        2
        .type   "julia_+_144",@function
"julia_+_144":                          // @"julia_+_144"
; ┌ @ float.jl:409 within `+`
// %bb.0:                               // %top
        stp     x29, x30, [sp, #-16]!           // 16-byte Folded Spill
        mov     x29, sp
        fadd    h0, h0, h1
        ldp     x29, x30, [sp], #16             // 16-byte Folded Reload
        ret
.Lfunc_end0:
        .size   "julia_+_144", .Lfunc_end0-"julia_+_144"
; └
                                        // -- End function
        .section        ".note.GNU-stack","",@progbits

Can we float this fp16 hardware check up to a top-level Julia API? That would be useful for shipping code that chooses the optimal float type based on the user’s hardware.

I’m not really sure what you mean.

I think it’s a request for something like Core.Intrinsics.is_native(Float16) or Core.Intrinsics.native_f16() that returns true if the system supports native Float16 calculations and false otherwise.
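
Purely as a hypothetical sketch (the function name below is made up; nothing like it exists in Base today), shipping code could then pick a working precision once:

using InteractiveUtils

# Made-up stand-in for a native-Float16 check (there is no official API today);
# it reuses the IR heuristic from earlier: native hardware shows no `fpext half`.
has_native_float16() = !occursin("fpext half",
    sprint(io -> code_llvm(io, +, NTuple{2,Float16})))

# Shipping code could then pick its working precision once at load time:
const WorkingFloat = has_native_float16() ? Float16 : Float32

to_working(xs::AbstractArray) = WorkingFloat.(xs)   # convert inputs once, up front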


Yep

I don’t think anything like that is exposed on the Julia side (it’s only in the internals of the C side of the compiler).

But my question would also be: what do you want to do with that?