Float16 with AMDGPU

Does AMDGPU support optimised Float16 or BFloat16 operations? I am testing on an AMD Instinct MI100, which in theory has much better peak performance for Float16 (and BFloat16). However, when I try the sample vadd!() function, I see that the LLVM code converts to single precision and then back again. E.g.:

function vadd!(c, a, b)
    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
    if i ≤ length(c)
      @inbounds c[i] = a[i] + b[i]
    end
    return
end
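
For reference, the arrays and launch follow (roughly) the AMDGPU.jl quickstart; a minimal sketch, with group/grid sizes picked arbitrarily for illustration:

using AMDGPU

n = 1024
a = ROCArray(rand(Float16, n))   # likewise with Float32, or BFloat16.(rand(Float32, n))
b = ROCArray(rand(Float16, n))
c = similar(a)

groupsize = 256                  # workitems per workgroup
gridsize  = cld(n, groupsize)    # number of workgroups
@roc groupsize=groupsize gridsize=gridsize vadd!(c, a, b)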

If I inspect this when giving Float32 arrays, I get:

julia> @device_code_llvm @roc launch=false vadd!(c, a, b)
(...)
; ┌ @ float.jl:409 within `+`
   %22 = fadd float %18, %21
(...)

With Float16 arrays I get:

julia> @device_code_llvm @roc launch=false vadd!(c, a, b)
(...)
; ┌ @ float.jl:409 within `+`
   %22 = fpext half %18 to float
   %23 = fpext half %21 to float
   %24 = fadd float %22, %23
   %25 = fptrunc float %24 to half
(...)

And finally, with BFloat16:

julia> @device_code_llvm @roc launch=false vadd!(c, a, b)
(...)
; ┌ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:233 within `+`
; │┌ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:214 within `Float32`
; ││┌ @ boot.jl:788 within `UInt32`
; │││┌ @ boot.jl:751 within `toUInt32`
      %22 = zext i16 %18 to i32
; ││└└
; ││┌ @ int.jl:536 within `<<` @ int.jl:529
     %23 = shl nuw i32 %22, 16
; ││└
; ││┌ @ essentials.jl:581 within `reinterpret`
     %bitcast_coercion = bitcast i32 %23 to float
; ││└
; ││┌ @ boot.jl:788 within `UInt32`
; │││┌ @ boot.jl:751 within `toUInt32`
      %24 = zext i16 %21 to i32
; ││└└
; ││┌ @ int.jl:536 within `<<` @ int.jl:529
     %25 = shl nuw i32 %24, 16
; ││└
; ││┌ @ essentials.jl:581 within `reinterpret`
     %bitcast_coercion9 = bitcast i32 %25 to float
; │└└
; │ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:233 within `+` @ float.jl:409
   %26 = fadd float %bitcast_coercion, %bitcast_coercion9
; │ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:233 within `+`
; │┌ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:165 within `BFloat16`
; ││┌ @ /home/test/.julia/packages/AMDGPU/a1v0k/src/device/gcn/math.jl:48 within `#isnan`
     %27 = fcmp ord float %26, 0.000000e+00
; ││└
    br i1 %27, label %L85, label %L119

L85:                                              ; preds = %L45
; ││ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:168 within `BFloat16`
; ││┌ @ essentials.jl:581 within `reinterpret`
     %bitcast_coercion15 = bitcast float %26 to i32
; ││└
; ││ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:169 within `BFloat16`
; ││┌ @ int.jl:534 within `>>` @ int.jl:528
     %28 = lshr i32 %bitcast_coercion15, 16
; ││└
; ││┌ @ int.jl:1068 within `&` @ int.jl:347
     %29 = and i32 %28, 1
; ││└
; ││┌ @ int.jl:1068 within `+` @ int.jl:87
     %narrow = add nuw nsw i32 %29, 32767
     %30 = zext i32 %narrow to i64
; │││ @ int.jl:1066 within `+`
; │││┌ @ int.jl:551 within `rem`
; ││││┌ @ number.jl:7 within `convert`
; │││││┌ @ boot.jl:784 within `Int64`
; ││││││┌ @ boot.jl:708 within `toInt64`
         %31 = zext i32 %bitcast_coercion15 to i64
; │││└└└└
; │││ @ int.jl:1068 within `+` @ int.jl:87
     %32 = add nuw nsw i64 %30, %31
; ││└
; ││ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:170 within `BFloat16`
; ││┌ @ int.jl:534 within `>>` @ int.jl:527
     %33 = lshr i64 %32, 16
; ││└
; ││┌ @ int.jl:544 within `rem`
     %34 = trunc i64 %33 to i16
; ││└
    br label %L119
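
As far as I can tell, the extra code comes from BFloat16s.jl itself, which implements the arithmetic in pure Julia, roughly paraphrased as (not the exact package source):

# Widen to Float32, add, then round back to BFloat16; the NaN check and the
# shift/add-32767/truncate sequence in the IR above are that rounding step.
Base.:(+)(x::BFloat16, y::BFloat16) = BFloat16(Float32(x) + Float32(y))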

It does not look like it is optimising for Float16, and BFloat16 is even worse. Any ideas on how to get around this limitation?

1 Like

Forgot to mention that this is with AMDGPU 1.0.1, Julia 1.10.5:

julia> versioninfo()
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 96 × AMD EPYC 74F3 24-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 96 virtual cores)
Environment:
  JULIA_CPU_TARGET = generic
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_PYTHONCALL_EXE = /home/test/mamba/envs/py310/bin/python

ROCm version is 5.7.3.

Starting with Demote(B)Float16 pass: only keep enabled for PPC. by maleadt · Pull Request #55486 · JuliaLang/julia · GitHub, i.e. from Julia 1.12 onwards, Julia’s LLVM IR won’t be converting to and from single precision any more. However, I don’t know whether these chips actually support scalar bfloat ops (or whether the current version of LLVM supports emitting that code).
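
A quick way to check which behaviour a given Julia build has is to inspect the IR for a plain Float16 add; with the demote pass active you get the fpext/fadd float/fptrunc sequence (the same pattern as in the device code above), without it a single fadd half:

f(x, y) = x + y
# On 1.10 this prints fpext/fadd float/fptrunc; on builds without the
# demote pass it should print a single `fadd half`.
@code_llvm debuginfo=:none f(Float16(1), Float16(2))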

1 Like

Thank you for the clarification. I will keep this in mind for 1.12, maybe a year from now?

The AMD Instinct MI100 should support BFloat16, since AMD publishes peak TFLOPS figures for BF16 specifically (and they differ from the FP16 numbers).

Not necessarily. Often support for these datatypes is only available to tensor-like hardware, not to scalar kernels. But I’m not familiar with the AMDGPU specifics.

1 Like

For example: Compiler Explorer. Here, I have a scalar BFloat16 kernel, generating code for the latest GPU (MI300, GFX940). Again, I’m not familiar with AMD GPU hardware (cc @jpsamaroo), but the generated code does not perform native bfloat arithmetic. Instead, the back-end just converts to and from single precision again. I’m not sure whether the hardware doesn’t support it, or whether LLVM’s back-end doesn’t generate the correct code (I’m presuming the former).

1 Like

I’m also not that familiar with the AMDGPU specifics, but they have no tensor cores like NVIDIA’s. A look at the AMD MI100 instruction set reference guide does give many references to BF16, but all of them are related to FMAs. In comparison, the AMD MI300 instruction set has many more BF16 references outside of FMA, for example adding two BF16 values.

But even if the device has no native scalar BF16 ops, surely LLVM could still generate performant code for it? At least faster than F32.

1 Like

They do. Look at instructions like V_MFMA_F32_32X32X4_2B_BF16.

1 Like

The compiler is no magician. The only scalar-like instruction I can find in the MI300’s datasheet is DS_PK_ADD_BF16, which still requires two packed BFloat16 numbers. And indeed, doing that (atomically) using a <2 x bfloat> yields the expected instructions: Compiler Explorer. I’m not sure why it doesn’t pattern-match to that when doing non-atomic adds.

Bottom line, if you write code that emits atomic operations on a <2 x bfloat>, e.g., using SIMD.jl, you may get native bfloat instructions. But since the support for bfloat seems very limited, that’s unlikely to be usable. If you really want to use the bfloat hardware support, it’s probably better to look into using the MFMA instructions (which you’ll probably have to target explicitly; I don’t see LLVM pattern-matching a 4x4 matrix multiplication to these instructions).

1 Like

Thank you for the fascinating insights. My knowledge of LLVM IR is nearly zero. How would one go about using the MFMA instructions explicitly? Do you know of a resource that could explain how to do this? I don’t even know where to start.

There are target-specific intrinsics that lower to the instructions you want, e.g. @llvm.amdgcn.mfma.f32.32x32x4bf16.1k: llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx90a.ll at 923a1c1fc348f7c30ff4726b54ed63ce403dc3ce · llvm/llvm-project · GitHub. You can use those from Julia much like we do in CUDA.jl to support tensor cores: CUDA.jl/src/device/intrinsics/wmma.jl at master · JuliaGPU/CUDA.jl · GitHub. Those can then be used to build higher-level abstractions like GemmKernels.jl.
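
To make that concrete: a raw call from Julia could look something like the untested sketch below. The intrinsic name and the <4 x i16> / <32 x float> fragment types are taken from the linked LLVM test file; the wrapper name is made up here, and whether this lowers correctly for a given GPU and LLVM version would need to be verified.

# Hypothetical, untested sketch of calling a gfx90a MFMA intrinsic from Julia,
# in the style of CUDA.jl's WMMA wrappers. Meant to be called inside a GPU kernel:
# each lane of the wavefront supplies a 4-element A/B fragment (BFloat16 bit
# patterns stored as Int16) and carries a 32-element slice of the Float32 accumulator.
const BF16Frag = NTuple{4,VecElement{Int16}}    # lowers to <4 x i16>
const AccFrag  = NTuple{32,VecElement{Float32}} # lowers to <32 x float>

@inline mfma_f32_32x32x4bf16(a::BF16Frag, b::BF16Frag, c::AccFrag) =
    ccall("llvm.amdgcn.mfma.f32.32x32x4bf16.1k", llvmcall, AccFrag,
          (BF16Frag, BF16Frag, AccFrag, Int32, Int32, Int32),
          a, b, c, Int32(0), Int32(0), Int32(0))  # cbsz = abid = blgp = 0

The hard part is then deciding which matrix elements go into which lane’s fragments; that is exactly what CUDA.jl’s WMMA layer and GemmKernels.jl abstract away on the NVIDIA side.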

Now, although all this is very much possible in Julia (as demonstrated by CUDA.jl’s WMMA intrinsics → WMMA abstraction → GEMM implementation), it’s a lot of work. Projects like MLIR try to do things like this at the compiler level, because much more high-level information is retained in the IR. MLIR seems to support MFMA in the amdgpu dialect, so alternatively it may be worth looking into which dialects can target that, and whether it’s possible to do so from e.g. Reactant.jl.

3 Likes