CUDA.jl kernel is half as fast as c++ Kernel

I have a kernel I wrote using CUDA.jl. It gets the correct answer, but is half as fast as the equivalent c++/cuda kernel I coded up. I was hoping someone could look at it and see if there are any optimizations to be done. Is it even resonable to assume similar performance to a c++ cuda kernel?

@views function mykernel(
    v0_range::CUDA.CuDeviceArray{ComplexF32,3,1}, # (1024,500,500)
    tr_array::CUDA.CuDeviceArray{Float32,3,1},# (3, 500, 500)
    voxel_list::CUDA.CuDeviceArray{Float32,2,1}, # (3, 125000)
    v1::CUDA.CuDeviceArray{ComplexF32,1}, # (125000)
    kc::Float32,
    range_res::Float32,
    range_max::Float32,
    bwinx::Bool,
    winx::CUDA.CuDeviceArray{Float32,1}, #(500)
    bwiny::Bool,
    winy::CUDA.CuDeviceArray{Float32,1}) # (500)
    @inbounds begin
        index = (blockIdx().x - 1CUDA.i32) * blockDim().x + threadIdx().x
        stride = gridDim().x * blockDim().x
        for i = index:stride:size(voxel_list, 2)
            tempsum = ComplexF32(0)
            # integrate along trajectories in the range domain
            for tr_yi = 1:size(tr_array, 3)
                for tr_xi = 1:size(tr_array, 2)
                    weight = 1.0f0
                    if bwinx
                        weight *= winx[tr_xi]
                    end
                    if bwiny
                        weight *= winy[tr_yi]
                    end
                    p1 = voxel_list[:, i]
                    p2 = tr_array[:, tr_xi, tr_yi]
                    r_theory = sqrt((p1[1] - p2[1])^2 + (p1[2] - p2[2])^2 + (p1[3] - p2[3])^2)
                    if r_theory < range_max
                        range_bin_float = r_theory / range_res + 1
                        range_bin_int = trunc(Int, range_bin_float)
                        delta = range_bin_float - range_bin_int
                        pp0 = v0_range[range_bin_int, tr_xi, tr_yi] * weight
                        pp1 = v0_range[range_bin_int+1, tr_xi, tr_yi] * weight
                        interp_range_respose = pp0 * (1.0f0 - delta) + pp1 * delta
                        temp = interp_range_respose * exp(+2.0f0im * kc * r_theory)
                        tempsum += temp
                    end
                end
            end
            v1[i] = tempsum
        end
    end
    return
end

it won’t fix everything, but you should probably use cis instead of exp(im*...). Another main difference between Julia and c++ here is that Julia doesn’t automatically reassociate your math, so you might want to use the MulAddMacro package that allows this.

1 Like

Yeah, it’s doable. You should inspect the LLVM IR to find out if there’s still any exceptions or bad conversions lurking. For example, it happens easily that you’re accidentally promoting to Int64 or Float64, inflating register usage. Have a look at Training/2-2-kernel_analysis_optimization.ipynb at master · JuliaComputing/Training · GitHub; you can inspect the number of registers by compiling the kernel with launch=false and calling CUDA.registers on it.

1 Like

Thanks for the suggestions. I’ll rework it and see waht I can do. With regards to In64 types when doing loops such as

for tr_yi = 1:size(tr_array, 3)

tr_yi will be 64 bit. For loops like this is there a straightforward way to control the index type? including eachindex() and CartesianIndices()?

thank you.

I guess map could be useful.

julia> typeof(2:3)
UnitRange{Int64}

julia> typeof(map(Int32, 2:3))
UnitRange{Int32}

Hi Maleadt. I looka th the notebook you sent me and there is indeed an exception. Though I don’t understand why. It has to do with the trunc(Int32, range_bin_float) line.

Here is the output:

; ┌ @ float.jl:781 within `trunc`
   call fastcc void @ijl_box_float32(float %106)
   call fastcc void @gpu_report_exception(i64 ptrtoint ([10 x i8]* @exception130 to i64))
   call fastcc void @gpu_signal_exception([1 x i64] %state)
   call void asm sideeffect "exit;", ""() #3
   unreachable

I want to take the float32 value and then convert and truncate to an integer. Is there a different way to do this?

Check out the trunc docs, it’s safe, so it throws in case the correct result isn’t possible:
https://docs.julialang.org/en/v1/base/math/#Base.trunc

Maybe you want unsafe_trunc instead.

1 Like

Thanks for your help.

I got it down to about ~30% slower than the c++ kernel. Which is much better. Perhaps with some more work I could get it about equal.

I think there is a bug with cis and sincos. looking at the LLVM IR there are some conversion to and from double that were slowing things down. calculating cos and sin independently and then makeing it complex fixed the issue.

here is the IR from cis in my kernel

__internal_trig_reduction_slowpath.exit.i.i.i:    ; preds = %164, %153
     %hi.i.i.i.1.i = phi i32 [ %169, %164 ], [ %159, %153 ]
     %lo.i.i.i.0.i = phi i32 [ %175, %164 ], [ %163, %153 ]
     %176 = lshr i32 %hi.i.i.i.1.i, 30
     %177 = call i32 @llvm.fshl.i32(i32 %hi.i.i.i.1.i, i32 %lo.i.i.i.0.i, i32 2) #4
     %178 = shl i32 %lo.i.i.i.0.i, 2
     %179 = lshr i32 %177, 31
     %180 = add nuw nsw i32 %179, %176
     %.not22.i = icmp eq i32 %154, 0
     %181 = sub nsw i32 0, %180
     %spec.select.i = select i1 %.not22.i, i32 %180, i32 %181
     %.not23.i = icmp sgt i32 %177, -1
     %182 = xor i32 %154, -2147483648
     %s.i.i.i.0.i = select i1 %.not23.i, i32 %154, i32 %182
     %not..not23.i = xor i1 %.not23.i, true
     %183 = sext i1 %not..not23.i to i32
     %hi.i.i.i.2.i = xor i32 %177, %183
     %lo.i.i.i.1.i = xor i32 %178, %183
     %184 = zext i32 %hi.i.i.i.2.i to i64
     %185 = shl nuw i64 %184, 32
     %186 = zext i32 %lo.i.i.i.1.i to i64
     %187 = or i64 %185, %186
     %188 = sitofp i64 %187 to double
     %189 = fmul double %188, 0x3BF921FB54442D19
     %190 = fptrunc double %189 to float
     %.not25.i = icmp eq i32 %s.i.i.i.0.i, 0
     %191 = fneg float %190
     %spec.select1.i = select i1 %.not25.i, float %190, float %191
     br label %__internal_trig_reduction_kernel.exit.i.i

my final kernel is:

@fastmath @views @muladd function mykernel(
    v0_range::CUDA.CuDeviceArray{ComplexF32,3,1},
    tr_array::CUDA.CuDeviceArray{Float32,3,1},
    voxel_list::CUDA.CuDeviceArray{Float32,2,1},
    v1::CUDA.CuDeviceArray{ComplexF32,1},
    kc::Float32,
    range_res::Float32,
    range_max::Float32,
    bwinx::Bool,
    winx::CUDA.CuDeviceArray{Float32,1},
    bwiny::Bool,
    winy::CUDA.CuDeviceArray{Float32,1})
    @inbounds begin
        index = (blockIdx().x - 1CUDA.i32) * blockDim().x + threadIdx().x
        stride = gridDim().x * blockDim().x
        for i = map(Int32, index:stride:size(voxel_list, 2))
            tempsum = ComplexF32(0)
            for tr_yi = map(Int32, 1:size(tr_array, 3))
                for tr_xi = map(Int32, 1:size(tr_array, 2))
                    weight = 1.0f0
                    if bwinx
                        weight *= winx[tr_xi]
                    end
                    if bwiny
                        weight *= winy[tr_yi]
                    end
                    p1 = voxel_list[:, i]
                    p2 = tr_array[:, tr_xi, tr_yi]
                    r_theory = sqrt((p1[1] - p2[1])^2 + (p1[2] - p2[2])^2 + (p1[3] - p2[3])^2)
                    if r_theory < range_max
                        range_bin_float = r_theory / range_res + 1CUDA.i32
                        range_bin_int = unsafe_trunc(Int32, range_bin_float)
                        delta = range_bin_float - range_bin_int
                        pp0 = v0_range[range_bin_int, tr_xi, tr_yi] * weight
                        pp1 = v0_range[range_bin_int+1CUDA.i32, tr_xi, tr_yi] * weight
                        interp_range_respose = pp0 * (1.0f0 - delta) + pp1 * delta
                        a = cos(2.0f0 * kc * r_theory)
                        b = sin(2.0f0 * kc * r_theory)
                        temp = interp_range_respose * complex(a, b)
                        tempsum = tempsum + temp
                    end
                end
            end
            v1[i] = tempsum
        end
    end
    return
end

Base.Math.rem_pio2_kernel(::Float32), which is, I believe, used by all trigonometric functions for range reduction, uses Float64 for some computation. So this might be the cause for the conversions that you’re seeing.

This means two things (if I’m right):

  1. you’d be better off using cis instead of separate sin and cos calls, because then you only pay for a single range reduction, instead of one each for sin and cos. This is partly why sincos and cis exist in the first place.
  2. Range reduction can be implemented in various ways, e.g. it’s even possible to do it with integer arithmetic as far as I remember, but in all cases there’s a further tradeoff between accuracy and speed (where standard library code must always prefer accuracy). My point is: you might be able to speed up your code by taking care of range reduction yourself. For some problems range reduction can be even eliminated, so it might be worth your time to think about range reduction a bit: what are your requirements of range reduction (the domains of your arguments), and what implementation strategy (if any is necessary in the end) is best suited for your hardware.

PS: if you do end up taking care of the range reduction yourself, it might also be fun to get your custom polynomials (with Float32 coefficients) for computing range-reduced sin and cos to the required accuracy. Then you could base your sin_kernel and cos_kernel (or even sincos_kernel directly) on these polynomials.

1 Like

Seems something is not working as you believe. My kernel is quite a bit faster if I have two separate calls to sin and cos instead of cis. Furthermore, when using cis there are references to double in the IR that aren’t there when using sin and cos.

I think i kind of understand what you are saying about about range reduction, but that doesn’t quite make sense to me. If I’m doing my computation using Float32 then I will accept the accuracy issues by doing so. Transparently converting to higher precision and back behind the scenes seems bad. Especially when using co-processors like with cuda kernels.

interesting thought about making my own kernel for sin and cos, but that seems a little overkill and kind of silly to have to do that to get the same performance as c++.

thanks for your thoughtful replies!

1 Like

Yes. Now I looked into what CUDA.jl was doing, and it seems they actually call NVIDIA’s LLVM bitcode library called libdevice. They do this for all of sincos, sin and cos. Libdevice does it’s own range reduction and everything else, so most of what I said doesn’t apply.
This however doesn’t explain the “references to double in the IR that aren’t there when using sin and cos”, that’s weird and interesting.

NB: this is the Base.sincos(::Float32) code for CUDA.jl:

It might be worth seeing if @fastmath makes a notable difference. I don’t know that it will, but it’s worth trying.