Does AMDGPU support optimised Float16 or BFloat16 operations? I am testing on an AMD Instinct MI100, which in theory has much higher throughput for Float16 (and BFloat16). However, when I try the sample vadd!() function, I see the generated LLVM IR converting to single precision and back, e.g.:
function vadd!(c, a, b)
i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
if i ≤ length(c)
@inbounds c[i] = a[i] + b[i]
end
return
end
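For completeness, the surrounding test setup is roughly the following; the array size and launch configuration here are only illustrative, and the interpretation of gridsize (workgroups vs. total workitems) depends on the AMDGPU.jl version:

using AMDGPU, BFloat16s

n = 2^20
a = ROCArray(rand(Float32, n))   # element type swapped to Float16.(...) / BFloat16.(...) for the other runs
b = ROCArray(rand(Float32, n))
c = similar(a)

groupsize = 256
gridsize = cld(n, groupsize)     # the bounds check in the kernel covers any remainder
@roc groupsize=groupsize gridsize=gridsize vadd!(c, a, b)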
If I inspect this with Float32 arrays, I get:
julia> @device_code_llvm @roc launch=false vadd!(c, a, b)
(...)
; ┌ @ float.jl:409 within `+`
%22 = fadd float %18, %21
(...)
With Float16 arrays I get:
julia> @device_code_llvm @roc launch=false vadd!(c, a, b)
(...)
; ┌ @ float.jl:409 within `+`
%22 = fpext half %18 to float
%23 = fpext half %21 to float
%24 = fadd float %22, %23
%25 = fptrunc float %24 to half
(...)
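(As a side note, I have not yet checked whether the final ISA folds this promote/demote pattern back into native half instructions; if I remember the reflection macros correctly, something like the following should dump the GCN assembly, though the macro name may differ between AMDGPU.jl versions:)

julia> @device_code_gcn @roc launch=false vadd!(c, a, b)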
And finally, with BFloat16:
julia> @device_code_llvm @roc launch=false vadd!(c, a, b)
(...)
; ┌ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:233 within `+`
; │┌ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:214 within `Float32`
; ││┌ @ boot.jl:788 within `UInt32`
; │││┌ @ boot.jl:751 within `toUInt32`
%22 = zext i16 %18 to i32
; ││└└
; ││┌ @ int.jl:536 within `<<` @ int.jl:529
%23 = shl nuw i32 %22, 16
; ││└
; ││┌ @ essentials.jl:581 within `reinterpret`
%bitcast_coercion = bitcast i32 %23 to float
; ││└
; ││┌ @ boot.jl:788 within `UInt32`
; │││┌ @ boot.jl:751 within `toUInt32`
%24 = zext i16 %21 to i32
; ││└└
; ││┌ @ int.jl:536 within `<<` @ int.jl:529
%25 = shl nuw i32 %24, 16
; ││└
; ││┌ @ essentials.jl:581 within `reinterpret`
%bitcast_coercion9 = bitcast i32 %25 to float
; │└└
; │ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:233 within `+` @ float.jl:409
%26 = fadd float %bitcast_coercion, %bitcast_coercion9
; │ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:233 within `+`
; │┌ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:165 within `BFloat16`
; ││┌ @ /home/test/.julia/packages/AMDGPU/a1v0k/src/device/gcn/math.jl:48 within `#isnan`
%27 = fcmp ord float %26, 0.000000e+00
; ││└
br i1 %27, label %L85, label %L119
L85: ; preds = %L45
; ││ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:168 within `BFloat16`
; ││┌ @ essentials.jl:581 within `reinterpret`
%bitcast_coercion15 = bitcast float %26 to i32
; ││└
; ││ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:169 within `BFloat16`
; ││┌ @ int.jl:534 within `>>` @ int.jl:528
%28 = lshr i32 %bitcast_coercion15, 16
; ││└
; ││┌ @ int.jl:1068 within `&` @ int.jl:347
%29 = and i32 %28, 1
; ││└
; ││┌ @ int.jl:1068 within `+` @ int.jl:87
%narrow = add nuw nsw i32 %29, 32767
%30 = zext i32 %narrow to i64
; │││ @ int.jl:1066 within `+`
; │││┌ @ int.jl:551 within `rem`
; ││││┌ @ number.jl:7 within `convert`
; │││││┌ @ boot.jl:784 within `Int64`
; ││││││┌ @ boot.jl:708 within `toInt64`
%31 = zext i32 %bitcast_coercion15 to i64
; ││└└└└└
; ││┌ @ int.jl:1068 within `+` @ int.jl:87
%32 = add nuw nsw i64 %30, %31
; ││└
; ││ @ /home/test/.julia/packages/BFloat16s/u3WQc/src/bfloat16.jl:170 within `BFloat16`
; ││┌ @ int.jl:534 within `>>` @ int.jl:527
%33 = lshr i64 %32, 16
; ││└
; ││┌ @ int.jl:544 within `rem`
%34 = trunc i64 %33 to i16
; ││└
br label %L119
(...)
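For context, as far as I can tell all of that IR is just the software round-to-nearest conversion from Float32 back to BFloat16 that BFloat16s.jl does; a rough scalar equivalent (the function name and the NaN bit pattern below are my own, for illustration) is:

function float32_to_bfloat16_bits(f::Float32)
    # NaN inputs take the other branch (%L119) and produce a canonical NaN pattern
    isnan(f) && return 0x7fc0            # assumed NaN encoding, for illustration only
    h = reinterpret(UInt32, f)
    h += 0x7fff + ((h >> 16) & 0x1)      # round to nearest, ties to even
    return (h >> 16) % UInt16            # keep the upper 16 bits
end

So every BFloat16 addition expands into two shift-based widenings to Float32, a Float32 fadd, and this rounding narrowing, rather than anything the hardware could do natively.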
It does not look like Float16 is being optimised, and BFloat16 is even worse. Any ideas on how to get around this limitation?