Why is my kernel as slow in FP32 as in FP64 on A2000 Ada-based GPU?

Hello,
I am new to Julia and wanted to use Float32 to debug my code, expecting it to be faster since my GPU delivers only 1/64 of its FP32 performance in FP64.
I was surprised to see that Float32 was almost as slow as Float64, and I would like to know what I am doing wrong :frowning:
Here is one kernel I used to benchmark the performance difference. It computes a 2D velocity field from a force-balance equation.

using CUDA
block_dim = (16,16)
grid_dim = (63,63)
const N::Int32 = 16*63
const Δ2 = 1e-4   # Δ^2, the grid spacing squared, used inside the kernel

function kernel_comp_v!(vn, v, F)
    # Indexing (i,j)
    i::Int32 = (blockIdx().x - 1f0) * blockDim().x + threadIdx().x
    j::Int32 = (blockIdx().y - 1f0) * blockDim().y + threadIdx().y

    i_::Int32 = i== 1f0 ? N : i-1f0; ip::Int32 = i== N ? 1f0 : i+1f0
    jm::Int32 = j== 1f0 ? N : j-1f0; jp::Int32 = j== N ? 1f0 : j+1f0

    @inbounds begin
        vn[i,j,1] = 1/(1+6/Δ2)*(F[i,j,1] + (   2*(v[i_,j,1]+v[ip,j,1]) + v[i,jm,1]+v[i,jp,1] + 0.25*(v[ip,jp,2]+v[i_,jm,2]-v[i_,jp,2]-v[ip,jm,2])  )/Δ2)
        vn[i,j,2] = 1/(1+6/Δ2)*(F[i,j,2] + (   2*(v[i,jm,2]+v[i,jp,2]) + v[i_,j,2]+v[ip,j,2] + 0.25*(v[ip,jp,1]+v[i_,jm,1]-v[i_,jp,1]-v[ip,jm,1])  )/Δ2)
    end

    return nothing
end

For Float64 I do:

v = CUDA.zeros(Float64, N, N, 2)
v_temp = CUDA.zeros(Float64, N, N, 2)
F = CUDA.zeros(Float64, N, N, 2)
F[:,:,1] .= 1.0

comp_v! = @cuda launch=false kernel_comp_v!(v_temp, v, F)

using BenchmarkTools
@benchmark CUDA.@sync for _ = 1:10
   comp_v!($v_temp, $v, $F; threads = block_dim, blocks = grid_dim)
   comp_v!($v, $v_temp, $F; threads = block_dim, blocks = grid_dim)
end

For Float32, I just replace every 64 by 32 after restarting Julia.
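i.e. the array setup becomes:

v = CUDA.zeros(Float32, N, N, 2)
v_temp = CUDA.zeros(Float32, N, N, 2)
F = CUDA.zeros(Float32, N, N, 2)
F[:,:,1] .= 1f0
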
For both Float64 and Float32 I get approximately 9.5 ± 0.5 ms.

I really don’t understand what I am doing wrong.
Thank you in advance for your help.

Best regards.


Division is slow, and you write 1/…, which returns Float64 when the operands are integers (that is the default in Julia, at least on CPUs). I'm guessing Float32(1) / … would help (and would be appropriate for anything other than 1), or, even better in such cases, inv(whatever), since it preserves the type you put in. Or inv(Float16(whatever)), though there bfloat16 would be even better if you can substitute it in; note that it is similar to, but NOT the same as, Float16. I'm not up to speed on when, or whether, CUDA substitutes Float16 (or even Float32 or Float64) with bfloat16. I believe such magic may happen in other languages like Python; I'm not sure what the driver or CUDA.jl does.

I've never done any CUDA programming, so I'm just speaking from CPU experience (I've gotten some very good speedups just by using Float32 on CPUs, for an obscure pi benchmark), where division is the slowest operation.

I'm just guessing here, but GPUs might have many fewer division units than multiply units, and the 64× ratio might only apply to mul, add and fma?

Julia should optimize down to the best assembly instruction (on the CPU at least) if you're careful with types and SIMD:

This might apply only to CPUs, or even be the opposite on GPUs, or to both:

bfloat16 or other 16-bit (and even more so 8-bit) floats might be too little precision for you.

With type inference, if you get a wrong type by accident, like Float64 in the code above, it contaminates all the rest of your calculation at, I guess, 1/64th the speed, for divisions or otherwise. So it's good to check the types, e.g. of intermediate results, either by hand or with the tools that are available.
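For example (a minimal check in plain Julia on the CPU, no GPU needed; here x stands in for one of your Float32 array elements):

x = 1.0f0
typeof(0.25 * x)        # Float64: the Float64 literal 0.25 promotes the whole expression
typeof(0.25f0 * x)      # Float32
typeof(1 / 2)           # Float64: dividing two integers gives Float64
typeof(1 / (1 + 6 / x)) # Float32: integer literals follow the float operand
typeof(inv(x))          # Float32: inv preserves the input type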

But you got me intrigued about the latency (and throughput) of such different float types, on GPUs and CPUs:

Fundamental IEEE numeric: Addition, subtraction, multiplication, division, square root

Complex: Native complex-value multiply and fused-multiply operations

Off-topic, but I found this intriguing:

Our solution achieves up to 220× speedup with respect to a naïve C port of the Multi-Head Self Attention PyTorch kernel

[Project] BFLOAT16 on ALL hardware (>= 2009), up to 2000x faster ML algos, 50% less RAM usage for all old/new hardware - Hyperlearn Reborn.


Thank you! As you said, replacing the numbers by Float32(number) made the computation way faster: 1.4 ms.
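For example, the first line of the kernel now looks something like this (with every literal, and Δ2, forced to Float32):

vn[i,j,1] = 1f0/(1f0+6f0/Δ2)*(F[i,j,1] + (   2f0*(v[i_,j,1]+v[ip,j,1]) + v[i,jm,1]+v[i,jp,1] + 0.25f0*(v[ip,jp,2]+v[i_,jm,2]-v[i_,jp,2]-v[ip,jm,2])  )/Δ2)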

I tried with and without the prefactor 1/(1+6/Δ2). As I expected, the compiler folds the prefactor into a constant during compilation, so the number of divisions doesn't change:

  %127 = load float, float addrspace(1)* %126, align 4
  %128 = fsub float %124, %127
  %129 = fmul float %128, 2.500000e-01
  %130 = fadd float %104, %98
  %131 = fadd float %110, %130
  %132 = fadd float %131, %129
  only one per axis ---> %133 = fdiv float %132, 0x3F1A36E2E0000000
  %134 = fadd float %85, %133
  %135 = fmul float %134, 0x3EF179D980000000

I think it corresponds to the /Δ2; computing 1/Δ2 beforehand and multiplying by it instead effectively turns this line into an fmul:

  %127 = load float, float addrspace(1)* %126, align 4
  %128 = fsub float %124, %127
  %129 = fmul float %128, 2.500000e-01
  %130 = fadd float %104, %98
  %131 = fadd float %110, %130
  %132 = fadd float %131, %129
  ---> %133 = fmul float %132, 1.000000e+04
  %134 = fadd float %85, %133
  %135 = fmul float %134, 0x3EF179D980000000

With that change I am reaching 1.2 ms!!!
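The IR snippets above can be reproduced with something like this (the exact options may differ):

# dump the device-side LLVM IR that CUDA.jl generates for the kernel
CUDA.@device_code_llvm @cuda launch=false kernel_comp_v!(v_temp, v, F)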

Thank you for your help !


Thank you. In my case, as I am doing hydrodynamics, I still need at least Float32; to get correct results I even need Float64.

So a 6.8× speedup so far from a simple tip. Not bad.

I'm not sure where Δ2 comes from. If it's a constant, the division might already be turned into a multiplication, but you could try inv_Δ2 = inv(Δ2) somewhere outside your (implied?) loop and then multiply by it in your code. Also, I'm not sure whether it's a global variable; if so, that is likely bad on CPUs (worse on GPUs?), and also bad if it's non-constant. Even if you want Float32 or higher in general, you could selectively use lower-precision types in some places?
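Something like this is what I have in mind (just a sketch; I'm assuming Δ2 is a top-level Float32 constant, and pref is a made-up name):

const Δ2     = 1f-4                  # assumed value, for illustration only
const inv_Δ2 = inv(Δ2)               # reciprocal computed once, outside the kernel
const pref   = inv(1f0 + 6f0*inv_Δ2) # the 1/(1 + 6/Δ2) prefactor, also precomputed

Inside the kernel you would then only multiply, e.g. vn[i,j,1] = pref * (F[i,j,1] + (...) * inv_Δ2).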

Δ2 is a Float32. I just computed const Δ2i::Float32 = 1/Δ2 and used *Δ2i. I am a bit surprised that the compiler didn't do it automatically.

Δ2 is a const variable defined at a “higher level” (I don't have the correct vocabulary: at global scope). The consts are replaced by their values by the compiler during compilation.

fmul float %132, 1.000000e+04 ----> here is the *Δ2i
fmul float %134, 0x3EF179D980000000 ----> here is the *1/(1+6/Δ2)

Unfortunately, in some places it is impossible to lower the precision, but I do use A100 GPUs to run the “real” simulations.

The compiler can do it on its own, but I think only when you're dividing by a power of two (@fastmath probably allows it always, but I'm not sure). Otherwise there is a tiny bit of error (only in the last bit?). Say you did a bfloat16 division exactly vs. multiplying by inv(Float32(your_number)): the latter might be as accurate, or actually more accurate? And faster? And at some later point you can convert back to your smaller type.
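For instance (a CPU-side sketch of what I mean; I haven't checked what the GPU backend does):

f(x::Float32) = x / 4f0             # dividing by a power of two: safe to rewrite as an exact multiply by 0.25f0
g(x::Float32) = x / 3f0             # stays a real division, since 1/3 is not exactly representable
h(x::Float32) = @fastmath x / 3f0   # fast-math permits x * (1f0/3f0), with a possible difference in the last bit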

Did you get even more speedup already, and how much compared to the original?

You mean in some places, but conversely you then can lower it in (all the) other places, i.e. mixed precision is a thing. I'm not sure how automated it is; probably (usually) it isn't. And it takes skill to know when it's OK and when it isn't, so many people just use the highest precision throughout.

In my case Δ is my Δx (usually 10^(-2) or 10^(-3)), so once inverted it is still a round number. I always use a Δx such that 1/Δ and 1/Δ² are round (even if I never think in powers of 2 :downcast_face_with_sweat:).
I'll leave it like this for now. I am still trying to get my shared arrays working properly…
I’ll come back to this later, thank you for your help.

No idea if it matters for speed, but this looks odd: you seem to be computing the indices with Float32 arithmetic and then converting them to Int32.

julia> let
         @noinline blockDim() = (; x = Int32(rand(1:10)))
         @noinline blockIdx() = (; x = Int32(rand(1:10)))
         @noinline threadIdx() = (; x = Int32(rand(1:10)))
         f() = begin 
           i::Int32 = (blockIdx().x - 1f0) * blockDim().x + threadIdx().x
           i
         end
         @code_typed f()
       end
CodeInfo(
1 ─ %1  =    invoke var"#blockIdx#blockIdx##3"()()::@NamedTuple{x::Int32}
│   %2  =   builtin Base.getfield(%1, :x)::Int32
│   %3  = intrinsic Base.sitofp(Float32, %2)::Float32
│   %4  = intrinsic Base.sub_float(%3, 1.0f0)::Float32
│   %5  =    invoke var"#blockDim#blockDim##3"()()::@NamedTuple{x::Int32}
│   %6  =   builtin Base.getfield(%5, :x)::Int32
│   %7  = intrinsic Base.sitofp(Float32, %6)::Float32
│   %8  = intrinsic Base.mul_float(%4, %7)::Float32
│   %9  =    invoke var"#threadIdx#threadIdx##3"()()::@NamedTuple{x::Int32}
│   %10 =   builtin Base.getfield(%9, :x)::Int32
│   %11 = intrinsic Base.sitofp(Float32, %10)::Float32
│   %12 = intrinsic Base.add_float(%8, %11)::Float32
│   %13 = intrinsic Base.le_float(-2.1474836f9, %12)::Bool
└──       goto #3 if not %13
2 ─ %15 = intrinsic Base.lt_float(%12, 2.1474836f9)::Bool
└──       goto #4
3 ─       nothing::Nothing
4 ┄ %18 = φ (#2 => %15, #3 => false)::Bool
└──       goto #7 if not %18
5 ─ %20 = intrinsic Base.trunc_llvm(%12)::Float32
│   %21 = intrinsic Base.sub_float(%12, %20)::Float32
│   %22 = intrinsic Base.eq_float(%21, 0.0f0)::Bool
└──       goto #7 if not %22
6 ─ %24 = intrinsic Base.fptosi(Int32, %12)::Int32
└──       goto #8
7 ┄ %26 =    invoke Base.InexactError(:Int32::Symbol, Int32::Any, %12::Vararg{Any})::InexactError
│           builtin Base.throw(%26)::Union{}
└──       unreachable
8 ─       goto #9
9 ─       nothing::Nothing
└──       return %24
) => Int32

Oh thanks, I am so stupid: I wanted to be sure that (blockIdx().x - 1) stayed Int32, and I wrote 1f0 instead of Int32(1). I'll replace all the f0 in the indices by Int32().
Thank you

You could also use 1i32 as shorthand for Int32(1), after importing i32 from CUDA.jl (using CUDA: i32).
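The index lines would then read something like this (a sketch):

using CUDA: i32

i = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x
j = (blockIdx().y - 1i32) * blockDim().y + threadIdx().y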
