The Julia docs say Float16 is implemented in software (presumably since most CPUs don’t support it). What about newer Xeons (with AVX512-FP16) that support f16?
In those cases it uses the hardware (although most of the libm-type functions are written under the assumption that Float16 is slow, and as such will often do the math in Float32 and convert back).
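For example, many of those libm-style wrappers follow roughly this pattern (an illustrative sketch with a hypothetical name, not the exact Base source):

julia> mylog1p(x::Float16) = Float16(log1p(Float32(x)))  # do the math in Float32, truncate back to Float16
mylog1p (generic function with 1 method)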
julia> code_llvm(+, NTuple{2,Float16})
; @ float.jl:408 within `+`
define half @"julia_+_362"(half %0, half %1) #0 {
top:
  %2 = fpext half %0 to float
  %3 = fpext half %1 to float
  %4 = fadd float %2, %3
  %5 = fptrunc float %4 to half
  ret half %5
}
julia> code_llvm(+, NTuple{2,Float32})
; @ float.jl:408 within `+`
define float @"julia_+_364"(float %0, float %1) #0 {
top:
  %2 = fadd float %0, %1
  ret float %2
}
As you can see here, to add 2 Float16 (aka “half”) numbers on my system, which lacks native Float16, it first converts them to Float32 (aka “float”), adds those, then truncates back to Float16. When I add two Float32, no such conversion is necessary.

So if your Float16 version looks like mine, you are not using native Float16. If your Float16 looks like my Float32 (with “half” instead of “float”), you are.
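If you would rather automate that check than eyeball the IR, here is a minimal sketch, assuming that the presence of an fpext promotion in the IR is a reliable sign of software emulation (as in my output above):

julia> using InteractiveUtils  # provides code_llvm outside the REPL

julia> ir = sprint(io -> code_llvm(io, +, NTuple{2,Float16}));

julia> occursin("fpext", ir)  # true means + promotes to Float32, i.e. no native Float16
true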
Thank you both! I think some laptop ARM chips (e.g. in Windows-on-ARM laptops) also support f16.
From somewhere around Julia v1.6 or so (I don’t remember exactly), on CPUs with native support for fp16, Julia does use native Float16 arithmetic without intermediate conversions (unless a function explicitly does that internally).
The Apple M* chips support Float16 and you don’t need to do anything to use it in any recent version of Julia. I’ve been using this for a few years.
The bad news is that tools like LAPACK do not support half precision.
Are you sure? I mean, I get the same as you, but I thought LLVM bitcode was independent of the hardware, so you might get the same IR even if half is supported natively? In that case this could still get you one instruction (plus a little boilerplate), rather than 15+ vs. 1:
julia> code_native(+, NTuple{2,Float16})
        .text
        .file   "+"
        .globl  "julia_+_2121"                  # -- Begin function julia_+_2121
        .p2align        4, 0x90
        .type   "julia_+_2121",@function
"julia_+_2121":                         # @"julia_+_2121"
; ┌ @ float.jl:409 within `+`
# %bb.0:                                # %top
        push    rbp
        mov     rbp, rsp
        vpextrw eax, xmm1, 0
        vpextrw ecx, xmm0, 0
        movzx   ecx, cx
        vmovd   xmm0, ecx
        vcvtph2ps       xmm0, xmm0
        movzx   eax, ax
        vmovd   xmm1, eax
        vcvtph2ps       xmm1, xmm1
        vaddss  xmm0, xmm0, xmm1
        vcvtps2ph       xmm0, xmm0, 4
        vmovd   eax, xmm0
        vpinsrw xmm0, xmm0, eax, 0
        pop     rbp
        ret
Also note that we have two papers out describing use cases of Julia’s Float16 running natively on Fujitsu’s A64FX chip:
- Giordano M, M Klöwer and V Churavy, 2022. Productivity meets Performance: Julia on A64FX, 2022 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, 10.1109/CLUSTER51413.2022.00072
- Klöwer M, S Hatfield, M Croci, PD Düben and TN Palmer, 2021. Fluid simulations accelerated with 16 bits: Approaching 4x speedup on A64FX by squeezing ShallowWaters.jl into Float16, Journal of Advances in Modeling Earth Systems, 14, 10.1029/2021MS002684
We do generate different Julia (and hence LLVM) code depending on certain processor features. For example, if you look at less(Base.Math._hypot, NTuple{2,Float64}), you will see it has a branch depending on Core.Intrinsics.have_fma(typeof(h)) that uses a slightly less accurate calculation when native FMA is not available (because emulated FMA is extremely slow).
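You can query that intrinsic yourself; a quick sketch (the true/false result of course depends on your CPU):

julia> Core.Intrinsics.have_fma(Float64)  # whether native fused multiply-add is available for Float64
true

julia> less(Base.Math._hypot, NTuple{2,Float64})  # opens the source, showing the have_fma branch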
I’m afraid I don’t have such an example for Float16 support, and less(+, NTuple{2,Float16}) is no help since it just calls the intrinsic add_float. So somewhere deep in the guts of Julia (or LLVM, really I’m out of my depth at this point) add_float is being converted to the promote+demote version. I cannot guarantee that native Float16 would generate something different without a better understanding of the internals or access to a system with native support. However, I find it unlikely that LLVM generates the promoting/demoting code and then optimizes those conversions away during machine compilation.

One can always check the code_native to get a more decisive answer, but I didn’t want to do that here because it is (more) architecture-dependent and is harder to read.
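To confirm that + on Float16 really does bottom out in that intrinsic, a quick check (output abbreviated; the exact CodeInfo printing varies between Julia versions):

julia> @code_lowered Float16(1) + Float16(2)
CodeInfo(
1 ─ %1 = Base.add_float(x, y)
└──      return %1
)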
I’m afraid I didn’t understand this part of your post. Are you meaning to compare instruction counts? Indeed, emulated Float16 is slower and the conversions add many instructions. When I compare code_native(+, NTuple{2,Float16}) vs code_native(+, NTuple{2,Float32}) on my machine without native Float16, I get 13 vs 5 instructions (native support would almost certainly reduce the 13 to the same 5).

But counting instructions out of context to determine the performance of a function like + can be misleading, since it is virtually always inlined into something larger. In practice, it will usually be 1 instruction with native support and 2+ without, depending on how many operations it can merge with neighboring operations. For example, some mov or cvt instructions may become unnecessary when the compiler controls the preceding/following instructions. I count 6 instructions per add on my machine in the context of code_native(+, NTuple{8,Float16}; debuginfo=:none) (look for the repeating pattern to remove the boilerplate), but it could be fewer in a calculation that didn’t need to read every input from an input register or convert every argument to every operation. To this end, code_native(x -> +(x,x,x,x,x,x), Tuple{Float16}; debuginfo=:none) uses only 3 instructions per add on my machine.
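If you want to reproduce those rough instruction counts without eyeballing the listings, here is a small heuristic sketch; it just counts indented lines that begin with a mnemonic, so labels, directives, and comments are skipped (the regex is only an approximation, not a real disassembler):

julia> using InteractiveUtils

julia> asm = sprint(io -> code_native(io, +, NTuple{2,Float16}; debuginfo=:none));

julia> count(l -> occursin(r"^\s+[a-z]", l), split(asm, '\n'))  # 13 here, matching the count above
13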
LLVM IR is very much not independent of the hardware, not just in Julia, but probably with any frontend. There are backend-specific intrinsics which can be used, and of course they make sense only with specific targets. Also, all the vectorisation is totally dependent on the target; as an extreme case, vscale can be used only with targets that support variable-length SIMD registers.
This is what you get for +(::Float16, ::Float16) on a CPU with native float16 (this is Nvidia Grace):
julia> code_llvm(+, NTuple{2,Float16})
; @ float.jl:409 within `+`
define half @"julia_+_137"(half %0, half %1) #0 {
top:
  %2 = fadd half %0, %1
  ret half %2
}
julia> code_native(+, NTuple{2,Float16})
        .text
        .file   "+"
        .globl  "julia_+_144"                   // -- Begin function julia_+_144
        .p2align        2
        .type   "julia_+_144",@function
"julia_+_144":                          // @"julia_+_144"
; ┌ @ float.jl:409 within `+`
// %bb.0:                               // %top
        stp     x29, x30, [sp, #-16]!           // 16-byte Folded Spill
        mov     x29, sp
        fadd    h0, h0, h1
        ldp     x29, x30, [sp], #16             // 16-byte Folded Reload
        ret
.Lfunc_end0:
        .size   "julia_+_144", .Lfunc_end0-"julia_+_144"
; └
                                        // -- End function
        .section        ".note.GNU-stack","",@progbits
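If you are unsure which microarchitecture your Julia session is targeting, you can also check the CPU name LLVM detected (a quick sketch; the string below is just an example of what a Grace system might report):

julia> Sys.CPU_NAME
"neoverse-v2"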
Can we surface this fp16 hardware check as a top-level Julia API? That would be useful for shipping code that chooses the optimal float type based on the user’s hardware.
I’m not really sure what you mean.
I think it’s a request for something like Core.Intrinsics.is_native(Float16) or Core.Intrinsics.native_f16() that returns true if the system supports native Float16 calculations and false otherwise.
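Purely for illustration, here is how calling code might use such a query if it existed. Neither name above is a real API today, so this sketch falls back to the LLVM-IR heuristic from earlier in the thread (assuming an fpext in the IR of Float16 + means software emulation):

julia> using InteractiveUtils

julia> native_f16() = !occursin("fpext", sprint(io -> code_llvm(io, +, NTuple{2,Float16})));

julia> best_float() = native_f16() ? Float16 : Float32;  # narrowest float type that should be fast here

julia> x = zeros(best_float(), 1024);  # element type chosen from the hardware guess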
Yep
I don’t think anything like that is exposed on the Julia side (it’s only in the internals of the C side of the compiler).
But my question would also be: what do you want to do with that?